All of Habr in one database

Good afternoon. It has been 2 years since I wrote my last article about parsing Habr, and some things have changed.

When I wanted to have a copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into, you can read under the cut.

TL;DR - link to the database

First version of the parser. One thread, many problems

To begin with, I decided to make a prototype script in which an article would be parsed and placed in the database right after it was downloaded. Without thinking twice, I used sqlite3, because it took the least effort: you don't need a local server, you just create the database, look at it, delete it, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
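        try:
            r = requests.get(url)
        except requests.RequestException:
            # Record the id of the failed request; append mode keeps earlier entries.
            with open("req_errors.txt", "a") as file:
                file.write("{}\n".format(i))
            continue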
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except Exception:
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is by the book: we use Beautiful Soup and requests, and a quick prototype is ready. There's just one catch...

  • The pages are downloaded in a single thread

  • If you interrupt the script, the whole database is lost, because the commit is performed only after all the parsing is done.
    Of course, you can commit changes to the database after every insert, but then the script's run time grows significantly.

  • Parsing the first 100,000 articles took me hours.
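
A middle ground between a single commit at the very end and a commit after every insert is to commit in batches. The following is a minimal sketch of that idea, not part of the original script: the `insert_in_batches` helper and the batch size of 1000 are hypothetical and chosen only for illustration.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows and commit every batch_size rows (hypothetical helper).

    An interrupted run loses at most one uncommitted batch.
    Returns the number of commits performed.
    """
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), "
              "title VARCHAR(255), content TEXT, tags TEXT)")
    commits = 0
    pending = 0
    for row in rows:
        c.execute("INSERT INTO habr VALUES (?, ?, ?, ?, ?)", row)
        pending += 1
        if pending >= batch_size:
            conn.commit()
            commits += 1
            pending = 0
    # Commit whatever is left over from the last partial batch.
    if pending:
        conn.commit()
        commits += 1
    return commits
```

With a larger batch size the script approaches the speed of a single commit, while a smaller one limits how much work is lost on interruption.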

Then I found an article by the user cointegrated, read it, and picked up a few life hacks for speeding up the process:

  • Using multiple threads speeds up downloading significantly.
  • You can fetch the mobile version of Habr instead of the full one.
    For example, if cointegrated's article weighs 378 KB in the desktop version, in the mobile version it is only 126 KB.

Second version. Many threads, a temporary ban from Habr

Having scoured the Internet on the subject of multithreading in Python, I chose the simplest option, multiprocessing.dummy, and noticed that the problems arrived together with the multithreading.

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but that error is not the only one: when inserting into the database, errors sometimes occurred that I could not resolve.
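
One common way to share a single SQLite connection between threads is to pass `check_same_thread=False` and serialize every write with a lock. The sketch below is an assumption about how that could look, not the code the original script used; `save_article`, `db_lock` and the in-memory database are hypothetical.

```python
import sqlite3
import threading
from multiprocessing.dummy import Pool as ThreadPool

# check_same_thread=False lets other threads use this connection;
# the lock below is what keeps concurrent writes from interleaving.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS habr(id INT, title TEXT)")
db_lock = threading.Lock()


def save_article(i):
    """Hypothetical worker: only one thread at a time touches the connection."""
    with db_lock:
        conn.execute("INSERT INTO habr VALUES (?, ?)", (i, "title {}".format(i)))
        conn.commit()


pool = ThreadPool(3)
pool.map(save_article, range(20))
pool.close()
pool.join()
```

The lock removes the parallelism from the database step, so this only pays off when the download, not the insert, is the slow part.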

So I decided to give up on inserting articles straight into the database and, remembering cointegrated's solution, decided to use files, because writing to files from multiple threads causes no problems.

Habr starts banning you for using more than three threads.
Especially zealous attempts to reach Habr can end in an IP ban for a couple of hours. So you have to use only 3 threads, but that is already an improvement: the time to go through 100 articles dropped from 26 seconds to 12.
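
Since a 503 response signals that Habr has started refusing requests, one option is to pause and retry instead of skipping the article. This is a sketch only: `fetch_with_retry`, the retry count and the delays are hypothetical, and `fetch` stands in for any callable that returns a status code and a body (for example, a thin wrapper around `requests.get`).

```python
import time


def fetch_with_retry(fetch, i, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(i) until it stops returning 503 or retries run out.

    fetch is any callable returning (status_code, body); the names and
    default values here are illustrative assumptions.
    """
    for attempt in range(retries):
        status, body = fetch(i)
        if status != 503:
            return status, body
        # Wait longer after each refused attempt: delay, 2*delay, 4*delay...
        sleep(delay * 2 ** attempt)
    return 503, None
```

Passing `sleep` as a parameter makes the function easy to exercise without actually waiting.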

It is worth noting that this version is rather unstable, and on a large number of articles the download periodically fails.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        # Record the id of the failed request; append mode keeps earlier entries.
        with open("req_errors.txt", "a") as file:
            file.write("{}\n".format(i))
        return 2

    # Log requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Label showing that the post is a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except Exception:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Third version. Final

While debugging the second version, I discovered that Habr, it turns out, has an API that the mobile version of the site uses. It loads faster than the mobile version, since it is just json that does not even need to be parsed. In the end, I decided to rewrite my script once more.

So, having found this API, you can start parsing it.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

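    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        # Record the id of the failed request; append mode keeps earlier entries.
        with open("req_errors.txt", "a") as file:
            file.write("{}\n".format(i))
        return 2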

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The API response contains fields relating both to the article itself and to the author who wrote it.

API.png


I did not dump the full json of each article, but saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in. So far it contains only en and ru.
  • tags_string - all the tags of the post
  • reading_count
  • author
  • score - the article's rating.
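
Picking those fields out of a response can be sketched as a small function. The `sample` payload below is made up for illustration and only mirrors the keys the script reads; the real API response contains more fields. `extract_fields` is a hypothetical name.

```python
def extract_fields(data):
    """Return the tuple of saved fields, or None if the API reported failure."""
    if not data.get('success'):
        return None
    article = data['data']['article']
    return (
        article['id'],
        article['is_tutorial'],
        article['time_published'],
        article['title'],
        article['text_html'],
        article['comments_count'],
        article['lang'],
        article['tags_string'],
        article['reading_count'],
        article['author']['login'],
        article['voting']['score'],
    )


# Made-up payload with the same shape the script expects.
sample = {
    'success': True,
    'data': {'article': {
        'id': 1, 'is_tutorial': False, 'time_published': '2019-01-01T00:00:00+03:00',
        'comments_count': 2, 'lang': 'ru', 'tags_string': 'python, sqlite',
        'title': 'Example', 'text_html': '<p>Hello</p>', 'reading_count': 10,
        'author': {'login': 'someone'}, 'voting': {'score': 5},
    }},
}
fields = extract_fields(sample)
```

The order of the tuple matches the order in which the script writes the fields to the json file, so the loader can unpack it positionally.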

Thus, using the API, I reduced the script's run time to 8 seconds per 100 URLs.

After we have downloaded the data we need, we have to process it and insert it into the database. There were no problems with this either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds things up several times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                # The with block closes the file even if json.load fails.
                with open(filename) as f:
                    data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of a more readable database you can sacrifice the readability of the code. Or not?
                # If you think so, you can simply replace the tuple with the data argument. It's up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
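
Because the `articles` table has no primary key, running the script twice over the same range inserts every article twice. One possible guard, shown here as an assumption rather than the author's approach, is to declare `id` as the primary key and use `INSERT OR REPLACE`. The two-column table is a cut-down, hypothetical version of the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER PRIMARY KEY, title TEXT)")


def save(conn, article_id, title):
    """Insert the article, overwriting any earlier row with the same id."""
    conn.execute("INSERT OR REPLACE INTO articles VALUES (?, ?)", (article_id, title))


# Saving the same id twice leaves a single, updated row.
save(conn, 1, "first run")
save(conn, 1, "second run")
conn.commit()
rows = conn.execute("SELECT id, title FROM articles").fetchall()
```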

Statistics

Well, as tradition dictates, at the end you can pull some statistics out of the data:

  • Of the expected 490,406, only 228,512 articles were downloaded. It turns out that more than half (261,894) of the articles on Habr have been hidden or deleted.
  • The whole database, consisting of almost half a million articles, weighs 2.95 GB. Compressed: 495 MB.
  • In total, there are 37,804 authors on Habr. Let me remind you that these statistics come from live posts only.
  • The most prolific author on Habr is alizar, with 8,774 articles.
  • Top-rated article: 1,448 pluses
  • Most-read article: 1,660,841 views
  • Most-discussed article: 2,444 comments
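
Figures like these can be pulled from the `articles` table with plain SQL. This is a sketch that assumes a reduced subset of the columns created by the parser script; the rows inserted here are invented test data, not real Habr statistics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles(id INTEGER, author TEXT, title TEXT, "
             "score INTEGER, reading_count INTEGER, comments_count INTEGER)")
# Invented rows purely for demonstration.
conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "alice", "A", 10, 500, 3),
    (2, "bob", "B", 40, 200, 9),
    (3, "alice", "C", 25, 900, 1),
])

# Number of distinct authors.
authors_total = conn.execute(
    "SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]

# Authors ranked by number of articles (the basis of a "top 15 authors").
top_authors = conn.execute(
    "SELECT author, COUNT(*) AS n FROM articles "
    "GROUP BY author ORDER BY n DESC, author LIMIT 15").fetchall()

# Highest-rated article.
best = conn.execute(
    "SELECT title, score FROM articles ORDER BY score DESC LIMIT 1").fetchone()
```

Swapping `score` for `reading_count` or `comments_count` in the last query gives the most-read and most-discussed articles in the same way.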

Well, in the form of tops:

Top 15 authors
Top 15 by rating
Top 15 most read
Top 15 most discussed

Source: www.habr.com
