All of Habr in one database

Good afternoon. It has been 2 years since the last article about parsing Habr was written, and a few things have changed.

When I wanted to have my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and which errors I ran into, you can read below the cut.

TL;DR: link to the database

First version of the parser. One thread, many problems

To begin with, I decided to make a prototype script in which each article would be parsed and put into the database immediately after downloading. Without thinking twice, I used sqlite3, because it was less labor-intensive: no need to run a local server, just create, look, delete, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
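        try:
            r = requests.get(url)
        except requests.RequestException:
            # Record the id of the failed request so it can be retried later
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue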
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except Exception:
            # Typically soup.find() returned None because the element is missing
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is by the book: we use Beautiful Soup and requests, and a quick prototype is ready. Except that...

  • Pages are downloaded in a single thread.

  • If you interrupt the script, the whole database is lost, because the commit is executed only after all the parsing is done.
    Of course, you can commit changes to the database after every insert, but then the script's run time increases several-fold.

  • Parsing the first 100,000 articles took me hours.
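A middle ground between the two commit strategies in the list above is to commit once per batch of inserts. The following is only a minimal sketch, not part of the original script: the function name insert_in_batches and the batch size of 1000 are assumptions chosen for illustration, and the table mirrors the habr table from one_thread.py.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows, committing once per batch (hypothetical helper).

    Returns the number of commits performed.
    """
    cur = conn.cursor()
    commits = 0
    for n, row in enumerate(rows, start=1):
        cur.execute("INSERT INTO habr VALUES (?, ?, ?, ?, ?)", row)
        if n % batch_size == 0:
            conn.commit()  # at most batch_size rows are lost on interruption
            commits += 1
    conn.commit()  # flush the final, possibly partial, batch
    return commits + 1


# Demo on an in-memory database with placeholder rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE habr(id INT, author TEXT, title TEXT, content TEXT, tags TEXT)"
)
rows = [(i, "author", "title", "content", "tags") for i in range(2500)]
total_commits = insert_in_batches(conn, rows, batch_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
```

With this approach an interrupted run loses at most one batch, while the number of commits stays far below one per article.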

Then I found an article by the user cointegrated, which I read and in which I found a few life hacks for speeding up the process:

  • Using multithreading speeds up downloading several-fold.
  • You can fetch the mobile version of Habr rather than the full version.
    For example, if cointegrated's article weighs 378 KB in the desktop version, in the mobile version it is only 126 KB.

Second version. Many threads, a temporary ban from Habr

When I scoured the internet on the topic of multithreading in Python and picked the simplest option, multiprocessing.dummy, I noticed that problems appeared along with the multithreading.

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but that error is not the only one: when trying to insert into the database, errors sometimes occur that I was never able to solve.

So I decided to give up inserting articles straight into the database and, remembering cointegrated's solution, to use files, because there are no problems with multithreaded writing to files.
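For reference, one common way to share a single sqlite3 connection between threads is to combine check_same_thread=False with a lock that serializes every write. This is a sketch of that alternative, not what the script in this article does, and it is an assumption that it would have avoided the insert errors mentioned above. The names db_lock and save are hypothetical, and the download step is replaced by a placeholder string.

```python
import sqlite3
import threading
from multiprocessing.dummy import Pool as ThreadPool

# One connection shared by all worker threads
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE habr(id INTEGER, title TEXT)")
db_lock = threading.Lock()  # hypothetical name: guards every use of conn


def save(i):
    title = "article {}".format(i)  # placeholder for the downloaded data
    with db_lock:  # only one thread touches the connection at a time
        conn.execute("INSERT INTO habr VALUES (?, ?)", (i, title))


pool = ThreadPool(3)
pool.map(save, range(100))
pool.close()
pool.join()

conn.commit()
saved = conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
```

Writing one file per article, as done below, sidesteps the question entirely, since the threads never share a resource.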

Habr starts banning for using more than three threads.
Particularly zealous attempts to get through to Habr can end in an IP ban for a couple of hours. So you have to use only 3 threads, but that is already good: the time to go through 100 articles drops from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and on a large number of articles the download periodically fails.
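One way to make such a downloader less fragile is to retry a failed request a few times with a growing pause. The sketch below is an illustration under assumptions rather than part of the original script: fetch_with_retry, its parameters and the fake flaky_fetch function are hypothetical names, and the real worker would pass a function that calls requests.get.

```python
import time


def fetch_with_retry(fetch, i, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(i); after a failure wait base_delay * 2**attempt and retry."""
    for attempt in range(attempts):
        try:
            return fetch(i)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller log the id
            sleep(base_delay * 2 ** attempt)


# Demo with a fake fetch that fails twice before succeeding
calls = []


def flaky_fetch(i):
    calls.append(i)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return "page {}".format(i)


delays = []  # collect the pauses instead of actually sleeping
result = fetch_with_retry(flaky_fetch, 42, sleep=delays.append)
```

Passing the sleep function as a parameter keeps the demo instant; in a real run the default time.sleep would apply, which also slows the request rate after a 503.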

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)  # forward slash works on Windows and POSIX

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        # Record the id of the failed request so it can be retried later
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests that were blocked by the server
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Label showing that the post is a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except Exception:
        # Typically soup.find() returned None because the element is missing
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Parameters min and max are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Third version. Final

While debugging the second version, I discovered that Habr, surprisingly, has an API that the mobile version of the site uses. It loads faster than the mobile version, because it is just JSON, which does not even need to be parsed. In the end, I decided to rewrite my script once more.

So, once the API link has been found, you can start parsing it.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)  # forward slash works on Windows and POSIX

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

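    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        # Record the id of the failed request so it can be retried later
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2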

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Parameters min and max are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields related both to the article itself and to the author who wrote it.

API.png


I did not dump the full JSON of each article, but saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang: the language the article is written in. So far it contains only en and ru.
  • tags_string: all tags of the post
  • reading_count
  • author
  • score: the article's rating.

Thus, using the API, I reduced the script's run time to 8 seconds per 100 URLs.
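The three timings quoted so far (26, 12 and 8 seconds per 100 URLs) can be turned into a rough estimate for a full run. The sketch below assumes the ID range 1 to 490406 from one_thread.py and assumes that every ID costs the same time, which probably overstates the total, since requests for missing posts tend to return quickly. The labels and the function name estimated_hours are mine.

```python
TOTAL_IDS = 490406  # upper bound of the range used in one_thread.py

# Seconds per 100 URLs, as quoted in the text
timings = {
    "HTML, before threading": 26,
    "HTML, 3 threads": 12,
    "API, 3 threads": 8,
}


def estimated_hours(seconds_per_100, ids=TOTAL_IDS):
    """Scale a per-100-URL timing linearly to the whole ID range."""
    return ids / 100 * seconds_per_100 / 3600


estimates = {name: round(estimated_hours(s), 1) for name, s in timings.items()}

for name, hours in estimates.items():
    print("{}: about {} hours".format(name, hours))
```

Under these assumptions the API version would finish the whole range in roughly 11 hours, against about 35 hours before threading.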

After we have downloaded the data we need, we have to process it and insert it into the database. No problems arose with this either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds up inserts several-fold.
    c.execute("""CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT,
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)""")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                # The context manager closes the file even if loading fails
                with open(filename) as f:
                    data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For a more readable database, code readability can be sacrificed. Or can it?
                # If you think so, you can simply replace the tuple with the data argument. It is up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, and traditionally, to finish, we can extract some statistics from the data:

  • Of the expected 490,406 articles, only about 228 thousand were downloaded. It turns out that more than half of the articles on Habr have been hidden or deleted.
  • The whole database, consisting of almost half a million articles, weighs 2.95 GB. Compressed: 495 MB.
  • In total, 37,804 people are authors on Habr. A reminder that these statistics cover live articles only.
  • The most prolific author on Habr is alizar: 8,774 articles.
  • Highest-rated article: 1,448 upvotes
  • Most-read article: 1,660,841 views
  • Most-discussed article: 2,444 comments
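Figures like the ones above can be pulled from the articles table with a few aggregate queries. This is a sketch of how that might look, not the queries actually used for the article. It reuses the schema from parser.py and runs on three made-up rows in an in-memory database, so the sample data and variable names are purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE articles(id INTEGER, time_published TEXT, author TEXT,
    title TEXT, content TEXT, lang TEXT, comments_count INTEGER,
    reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)"""
)

# Made-up sample rows standing in for the real data
sample = [
    (1, "2019-01-01", "alice", "A", "", "ru", 10, 500, 7, 0, ""),
    (2, "2019-01-02", "alice", "B", "", "ru", 3, 900, 12, 1, ""),
    (3, "2019-01-03", "bob", "C", "", "en", 25, 100, -2, 0, ""),
]
conn.executemany("INSERT INTO articles VALUES (?,?,?,?,?,?,?,?,?,?,?)", sample)

# Number of distinct authors
author_count = conn.execute(
    "SELECT COUNT(DISTINCT author) FROM articles"
).fetchone()[0]

# Authors ranked by number of articles (top 15)
top_authors = conn.execute(
    "SELECT author, COUNT(*) AS n FROM articles "
    "GROUP BY author ORDER BY n DESC, author LIMIT 15"
).fetchall()

# Highest rating, view count and comment count across all articles
best_score, most_read, most_discussed = conn.execute(
    "SELECT MAX(score), MAX(reading_count), MAX(comments_count) FROM articles"
).fetchone()
```

Pointing sqlite3.connect at habr.db instead of ":memory:" and dropping the sample inserts would run the same queries against the real database.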

And in the form of top lists:
Top 15 authors
Top 15 by rating
Top 15 most read
Top 15 most discussed

Source: www.habr.com
