All of Habr in one database

Good afternoon. It has been 2 years since the last article about parsing Habr was written, and some things have changed.

When I wanted to have my own copy of Habr, I decided to write a parser that would save all the content written by its authors to a database. How it went and which errors I ran into, you can read below the cut.

TL;DR: link to the database

First version of the parser. One thread, many problems

At first I decided to write a prototype script that would parse each article and put it into the database right after downloading it. Without thinking twice, I used sqlite3, because it was the least effort: no need to run a local server, you create it, look at it, delete it, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            # Record the id of the failed request and move on
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except Exception:
            # The expected markup is missing: store a placeholder row
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is classic: we use Beautiful Soup and requests, and a quick prototype is ready. Only...

  • Pages are downloaded in a single thread

  • If you interrupt the script, the whole database is lost, because the commit happens only after all the parsing is done.
    Of course, you can commit changes to the database after every insert, but then the script's run time grows considerably.

  • Parsing the first 100,000 articles took me hours.
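A middle ground between one commit at the very end and a commit after every insert is to commit every N rows. Below is a minimal sketch of that idea, not part of the original script: the helper name `insert_in_batches` and the batch size of 1000 are assumptions, and an in-memory database stands in for habr.db.

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows into habr, committing every batch_size rows.

    Hypothetical helper: an interrupted run loses at most one batch.
    Returns the number of commits performed.
    """
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), "
              "title VARCHAR(255), content TEXT, tags TEXT)")
    commits = 0
    pending = 0
    for row in rows:
        c.execute("INSERT INTO habr VALUES (?, ?, ?, ?, ?)", row)
        pending += 1
        if pending >= batch_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:
        conn.commit()  # flush the last partial batch
        commits += 1
    return commits

# Made-up rows, only to exercise the helper
conn = sqlite3.connect(":memory:")
rows = [(i, "author", "title {}".format(i), "<div></div>", "tags") for i in range(2500)]
n_commits = insert_in_batches(conn, rows, batch_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
```

The batch size trades speed against how much work is lost on interruption; the right value for a real run would have to be measured.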

Then I found an article by the user cointegrated, read it, and picked up a few life hacks to speed the process up:

  • Using multithreading speeds up downloading several times over.
  • You can fetch the mobile version of Habr instead of the full one.
    For example, if cointegrated's article weighs 378 KB in the desktop version, in the mobile version it is only 126 KB.

Second version. Many threads, a temporary ban from Habr

After scouring the internet for material on multithreading in Python, I chose the simplest option, multiprocessing.dummy, and noticed that problems arrived together with multithreading.

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but that is not the only error: when trying to insert into the database, errors sometimes occurred that I could not resolve.
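For reference, one common workaround (not the one the author ended up using) is to share a single connection opened with check_same_thread=False and serialize every access to it with a lock. The sketch below assumes a hypothetical `save` worker and a two-column table, and uses an in-memory database:

```python
import sqlite3
import threading
from multiprocessing.dummy import Pool as ThreadPool

# One shared connection; check_same_thread=False lifts sqlite3's thread check,
# so access has to be serialized by hand with a lock.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE habr(id INT, title TEXT)")
db_lock = threading.Lock()

def save(i):
    # Hypothetical worker: a real one would download and parse the page first.
    with db_lock:
        conn.execute("INSERT INTO habr VALUES (?, ?)", (i, "title {}".format(i)))

pool = ThreadPool(3)
pool.map(save, range(100))
pool.close()
pool.join()

with db_lock:
    conn.commit()
    saved = conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
```

Only the database writes are serialized here; the downloads would still run in parallel, which is where most of the time goes.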

So I decided to give up inserting articles straight into the database and, remembering cointegrated's solution, to use files instead, since writing to separate files from several threads causes no problems.

Habr starts banning when more than three threads are used.
Especially zealous attempts to get through to Habr can end in an IP ban for a couple of hours. So you have to stick to 3 threads, but that is already an improvement: the time to go through 100 articles drops from 26 to 12 seconds.
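The scripts in this article simply log a 503 and move on. A gentler option is to pause and retry when the server answers 503. The sketch below is an assumption rather than the author's approach: the helper `fetch_with_backoff`, its defaults, and the idea that backing off helps avoid a ban are all hypothetical, and a stub replaces the real HTTP call so the example runs offline.

```python
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with a growing pause while it returns 503.

    fetch is any callable returning an object with a status_code attribute,
    for example requests.get. All names and defaults here are hypothetical.
    """
    response = fetch(url)
    for attempt in range(retries):
        if response.status_code != 503:
            break
        sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        response = fetch(url)
    return response


# Usage with a stub instead of a real HTTP call:
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

codes = iter([503, 503, 200])
delays = []  # collects the pauses instead of actually sleeping
result = fetch_with_backoff(lambda url: FakeResponse(next(codes)),
                            "https://m.habr.com/post/1",
                            sleep=delays.append)
```

In a real worker, `fetch` would be `requests.get` and `sleep` would be left at its default.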

It is worth noting that this version is rather unstable, and the download periodically fails on a large number of articles.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    # Skip articles that were already downloaded
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        # Record the id of the failed request
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Record requests blocked by the server
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Label marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except Exception:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to a JSON file
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # Make sure the output directory exists
    os.makedirs("files", exist_ok=True)

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Once all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Third version. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site talks to. It loads faster than the mobile version, since it is just JSON, which does not even need to be parsed. In the end, I decided to rewrite my script once more.

So, having found this API, we can start parsing it.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    # Skip articles that were already downloaded
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        # Record the id of the failed request
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # Make sure the output directory exists
    os.makedirs("files", exist_ok=True)

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Once all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields related both to the article itself and to the author who wrote it.

API.png


I did not dump the full JSON of every article; I saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in. So far it is only en and ru.
  • tags_string - all the tags of the post
  • reading_count
  • author
  • score - the article's rating.

So, by using the API, I reduced the script's run time to 8 seconds per 100 URLs.

Once the data we need has been downloaded, it has to be processed and inserted into the database. I had no problems with this either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds things up several times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                # The context manager closes the file even if loading fails
                with open(filename) as f:
                    data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of a more readable database, code readability can be sacrificed. Or can it?
                # If you think so, you can simply replace the tuple with the data argument. It's up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, as tradition dictates, to wrap up, here are some statistics pulled from the data:

  • Of the expected 490,406 articles, only about 228 thousand were downloaded. It turns out that more than half of the articles on Habr have been hidden or deleted.
  • The whole database, which holds almost half a million articles, weighs 2.95 GB. Compressed: 495 MB.
  • In total, 37,804 people are authors on Habr. Keep in mind that these statistics cover live posts only.
  • The most prolific author on Habr is alizar, with 8,774 articles
  • The highest-rated article: 1,448 upvotes
  • The most-read article: 1,660,841 views
  • The most-discussed article: 2,444 comments
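Figures like these can be obtained with a handful of SQL queries against the articles table created by parser.py. The queries below are a sketch of how that might look, not the ones actually used for the article; the three sample rows are made up so that the example runs against an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles(id INTEGER, time_published TEXT, author TEXT, "
             "title TEXT, content TEXT, lang TEXT, comments_count INTEGER, "
             "reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")

# Made-up sample rows, only to make the queries runnable
sample = [
    (1, "2019-01-01", "alice", "A", "", "ru", 10, 500, 7, 0, ""),
    (2, "2019-01-02", "alice", "B", "", "ru", 3, 900, 12, 1, ""),
    (3, "2019-01-03", "bob", "C", "", "en", 25, 100, 2, 0, ""),
]
conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", sample)

# Number of distinct authors
author_count = conn.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]

# Top 15 authors by number of articles
top_authors = conn.execute(
    "SELECT author, COUNT(*) AS n FROM articles "
    "GROUP BY author ORDER BY n DESC LIMIT 15").fetchall()

# Record values for rating, views and comments
best_score = conn.execute("SELECT MAX(score) FROM articles").fetchone()[0]
most_read = conn.execute("SELECT MAX(reading_count) FROM articles").fetchone()[0]
most_discussed = conn.execute("SELECT MAX(comments_count) FROM articles").fetchone()[0]
```

Pointing `sqlite3.connect` at habr.db instead of `:memory:` and dropping the sample rows would run the same queries on the real data.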

And in the form of top lists:
Top 15 authors
Top 15 by rating
Top 15 most read
Top 15 most discussed

Source: www.habr.com
