All of Habr in one database

Good afternoon. Two years have passed since the last article about parsing Habr was written, and some things have changed since then.

When I wanted my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it turned out and what errors I ran into, you can read under the cut.

TLDR - link to the database

The first version of the parser. One thread, many problems

To begin with, I decided to make a prototype script that would parse an article and put it into the database immediately on download. Without thinking twice, I went with sqlite3, because it takes less effort: no local server needed, create-look-delete and other niceties.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            with open("req_errors.txt") as file:
                file.write(i)
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is classic: Beautiful Soup, requests, and a quick prototype is ready. Except that…

  • Pages are downloaded in a single thread

  • If you abort the script's execution, the whole database goes nowhere, since the commit is only performed once all the parsing is finished.
    Of course, you can commit to the database after every insert, but then the execution time grows significantly (a batched middle ground is sketched right after this list).

  • Parsing the first 100,000 articles took me hours (at 26 seconds per 100 articles, that works out to over 7 hours).
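
A middle ground between a single final commit and a commit on every insert is to commit in batches. A minimal sketch of the idea (the batch size of 1000 and the in-memory database are my assumptions, not values from the original script):

import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory DB, just for the sketch
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), "
          "title VARCHAR(255), content TEXT, tags TEXT)")

BATCH = 1000  # assumed batch size, tune to taste

# Stand-in rows; in the real parser these come from BeautifulSoup.
rows = ((i, "author", "title", "content", "tags") for i in range(10000))

for n, row in enumerate(rows, start=1):
    c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
    if n % BATCH == 0:
        conn.commit()  # an interrupted run keeps every fully committed batch

conn.commit()  # flush the tail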

Next, I came across a related article by another user, from which I picked up a few life hacks to speed this process up:

  • Using multithreading speeds up downloading several times over.
  • You can fetch not the full version of Habr, but its mobile version.
    For example, if a linked article weighs 378 KB in the desktop version, in the mobile version it is already down to 126 KB. (A quick size check is sketched below.)
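
The saving is easy to verify by comparing the raw response sizes of the two versions. A quick sketch (the post id is an arbitrary example, and I assume both URL patterns still resolve):

import requests

post_id = 282552  # an arbitrary example id

for base in ("https://habr.com/post/{}", "https://m.habr.com/post/{}"):
    r = requests.get(base.format(post_id))
    print(base.format(post_id), len(r.content) // 1024, "KB")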

The second version. Many threads, and a temporary ban from Habr

After scouring the internet on the topic of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with multithreading.

SQLite3 does not want to work with more than one thread.
check_same_thread=False fixes that, but this error is not the only one: when trying to insert into the database, errors sometimes occur that I could not resolve.
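
For reference, the usual workaround looks like this: open the connection with check_same_thread=False and serialize all writes through a single lock. A minimal sketch of that approach (not the solution I ended up with):

import sqlite3
import threading

conn = sqlite3.connect('habr.db', check_same_thread=False)
db_lock = threading.Lock()

def insert_article(row):
    # One lock serializes all writes, so the threads never touch
    # the shared connection concurrently.
    with db_lock:
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        conn.commit()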

So I decided to abandon instant insertion of articles straight into the database and, remembering the solution from the linked article, decided to use files instead, since there are no problems with multithreaded writing to files.

Habr started banning for using more than three threads.
Especially zealous attempts to get through to Habr can end with an IP ban for a couple of hours. So you have to make do with just 3 threads, but even this is already good, since the time to iterate over 100 articles drops from 26 to 12 seconds.
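
If you do hit the limit anyway, a polite retry with a pause is cheaper than sitting out an IP ban. A sketch (the pause length and the retry count are my assumptions):

import time
import requests

def get_with_retry(url, retries=3, pause=30):
    # Back off when Habr answers 503 instead of hammering it further.
    for attempt in range(retries):
        r = requests.get(url)
        if r.status_code != 503:
            return r
        time.sleep(pause * (attempt + 1))  # lengthen the pause each retry
    return r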

It is worth noting that this version is rather unstable, and downloads periodically fail on large numbers of articles.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except:
        # Log failed request ids to a file (append mode).
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or was hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # A mark that the post is a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr bans the IP for a while
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The third version. Final

While reworking the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile version, since it is just JSON that does not even need to be parsed. In the end, I decided to rewrite my script yet again.

So, having found this link to the API, you can start parsing with it.
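
Before rewriting everything, it is worth poking one article by hand to see the shape of the response. A sketch (the post id is an arbitrary example, and I assume the endpoint still answers in the same format):

import requests

url = "https://m.habr.com/kek/v1/articles/282552/?fl=ru%2Cen&hl=ru"  # arbitrary id
data = requests.get(url).json()

print(data['success'])
if data['success']:
    # All the fields available for an article
    print(sorted(data['data']['article'].keys()))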

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt") as file:
            file.write(i)
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr bans the IP for a while
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields relating both to the article itself and to its author.

API.png (screenshot of the API response)

I did not keep the full json of every article; I saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang is the language the article is written in. So far it only contains en and ru.
  • tags_string - all of the post's tags
  • reading_count
  • author
  • score - the article's rating.

So, by using the API, I cut the scraping time down to 8 seconds per 100 urls.

Once the data we need has been downloaded, it has to be processed and put into the database. I had no problems with this either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds things up several times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files\{}.json".format(i)
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of database readability you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. It is up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, at the very end, we can pull some statistics out of the data:

  • Out of the expected 490,406 posts, only around 228 thousand were actually downloaded. It turns out that more than half of the articles on Habr have been hidden or deleted.
  • The entire database, with nearly half a million records, weighs 2.95 GB. Compressed: 495 MB.
  • In total, 37,804 people have published on Habr. Keep in mind that these statistics come from live posts only.
  • The most prolific author on Habr: alizar - 8,774 articles.
  • The top-rated article: +1448
  • The most-read article: 1,660,841 views
  • The most-discussed article: 2,444 comments
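
The numbers above fall out of simple aggregate queries over the articles table. A sketch of equivalent queries (not necessarily the exact ones I ran):

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# How many distinct authors there are
print(c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone())

# The most prolific author
print(c.execute("SELECT author, COUNT(*) AS n FROM articles "
                "GROUP BY author ORDER BY n DESC LIMIT 1").fetchone())

# The top article by rating, by views and by comments
for col in ("score", "reading_count", "comments_count"):
    print(c.execute("SELECT id, title, {0} FROM articles "
                    "ORDER BY {0} DESC LIMIT 1".format(col)).fetchone())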

Well, and in the form of top lists (shown as charts):

  • Top 15 authors
  • Top 15 by rating
  • Top 15 by reads
  • Top 15 by comments

Source: www.habr.com
