All of Habr in one database

Good afternoon. Two years have passed since the last article about parsing Habr was written, and some things have changed since then.

When I decided I wanted my own copy of Habr, I chose to write a parser that would save everything the authors had published into a database. How that went and what mistakes I ran into, you can read below the cut.

TL;DR: link to the database

The first version of the parser. One thread, many problems

To start with, I decided to make a prototype script that would parse an article right after downloading it and put it into the database. Without thinking twice, I went with sqlite3, because it was less work: no need for a local server, none of the create-look-delete routine, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            with open("req_errors.txt") as file:
                file.write(i)
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "При парсинге этой странице произошла ошибка."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is standard: Beautiful Soup, requests, and a quick prototype is ready. It's just that…

  • The pages are downloaded in a single thread

  • If you interrupt the script, the whole database goes nowhere, because the commit only happens after all the parsing has finished.
    Of course, you could commit to the database after each insert, but then the script's running time would grow considerably (a batched-commit compromise is sketched just after this list).

  • Parsing just the first 100,000 articles took hours: at the single-threaded speed of roughly 26 seconds per 100 articles quoted below, that works out to more than seven hours.
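
As a middle ground between one giant transaction and committing after every row, you can commit in batches. Below is a minimal sketch of that idea; the batch size and the placeholder values are my own and not part of the original script:

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()
c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

# Commit every BATCH_SIZE inserts instead of once at the very end:
# an interrupted run then loses at most one batch instead of everything.
BATCH_SIZE = 1000
for i in range(1, 10001):
    # Placeholder values stand in for the downloaded and parsed article.
    author, title, content, tags = "author", "title", "content", "tags"
    c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
    if i % BATCH_SIZE == 0:
        conn.commit()
conn.commit()
conn.close()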

Next, I came across another user's article, from which I picked up a couple of life hacks to speed the process up:

  • Using multithreading speeds up the download several times over.
  • You can fetch not the full version of Habr but its mobile version.
    For example, if an article weighs 378 KB in the desktop version, the mobile version of the same article is only 126 KB (a quick way to check this is sketched right after the list).
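
A quick way to see the difference for yourself. This is a sketch under the assumption that the desktop and mobile versions are served at habr.com/post/N and m.habr.com/post/N; only the mobile URL appears in the scripts above, the desktop one is my guess:

import requests

# Compare how much HTML the desktop and mobile versions of the same post return.
post_id = 1
desktop = requests.get("https://habr.com/post/{}".format(post_id))
mobile = requests.get("https://m.habr.com/post/{}".format(post_id))

print("desktop: {} KB".format(len(desktop.content) // 1024))
print("mobile:  {} KB".format(len(mobile.content) // 1024))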

The second version. Many threads, a temporary ban from Habr

After scouring the Internet on the topic of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with multithreading.

SQLite3 does not want to work with more than one thread.
The fix is check_same_thread=False, but this is not the only problem: when trying to insert into the database, errors sometimes occur that I could not solve.
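
For reference, that flag is passed straight to sqlite3.connect. A minimal sketch of what it looks like; note that it only silences the "same thread" check and does not make concurrent inserts reliable, which is exactly the problem described above:

import sqlite3
import threading

conn = sqlite3.connect('habr.db', check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

def insert(i):
    # With the check disabled this runs, but concurrent writes on a shared
    # connection can still raise errors, hence the switch to json files below.
    conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, "author", "title", "content", "tags"))

threads = [threading.Thread(target=insert, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
conn.commit()
conn.close()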

So I decided to drop the immediate insertion of articles straight into the database and, remembering that solution, to use files instead, since there are no problems with writing to a file from multiple threads.

Habr starts banning you for using more than three threads.
Particularly zealous attempts to reach Habr can end with an IP ban for a couple of hours. So you have to make do with just 3 threads, but even that is already good: the time to iterate over 100 articles drops from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and on a large number of articles the downloads fail from time to time.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try: r = requests.get(url)
    except:
        with open("req_errors.txt") as file:
            file.write(i)
        return 2

    # Log requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post doesn't exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Flag marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error" 
        content = "При парсинге этой странице произошла ошибка."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "n")

    # Write the article to a json file
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # If there are more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The third version. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile pages, since it returns plain json that doesn't even need to be parsed. In the end, I decided to rewrite my script once more.

So, having found this link, you can start parsing the API.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt") as file:
            file.write(i)
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # If there are more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

It contains fields related both to the article itself and to the author who wrote it.

API.png


I did not dump the full json of each article, but saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in. So far, only en and ru.
  • tags_string - all of the post's tags
  • reading_count
  • author
  • score - the article's rating.

As a result, using the API, I reduced the script's running time to 8 seconds per 100 urls.

Once we have downloaded the data we need, we have to process it and put it into the database. There were no problems with that either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds things up several times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files\{}.json".format(i)
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of a more readable database you can sacrifice code readability. Or not?
                # If you think so, you can simply pass data itself instead of the tuple. Up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))
                f.close()

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, at the end, we can pull some statistics out of the data (a sketch of the corresponding SQL queries follows the list):

  • Of the expected 490,406 articles, only around 228 thousand were actually downloaded. It turns out that more than half of the articles on Habr have been hidden or deleted.
  • The full database, covering almost half a million article IDs, weighs 2.95 GB. Archived - 495 MB.
  • In total, there are 37,804 authors on Habr. Keep in mind that these statistics only count live articles.
  • The most productive author on Habr is alizar, with 8,774 articles.
  • Top-rated article — 1,448 upvotes
  • Most read article — 1,660,841 views
  • Most discussed article — 2,444 comments
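
The numbers above can be pulled straight from the articles table that parser.py builds. A minimal sketch of the kind of queries involved; the exact queries are my guess, not the author's:

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# Number of distinct authors among the downloaded (live) articles.
c.execute("SELECT COUNT(DISTINCT author) FROM articles")
print("authors:", c.fetchone()[0])

# The most productive author.
c.execute("SELECT author, COUNT(*) AS n FROM articles GROUP BY author ORDER BY n DESC LIMIT 1")
print("top author:", c.fetchone())

# Record holders: best-rated, most read and most discussed articles.
for column in ("score", "reading_count", "comments_count"):
    c.execute("SELECT title, {} FROM articles ORDER BY {} DESC LIMIT 1".format(column, column))
    print("max {}:".format(column), c.fetchone())

conn.close()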

Well, and here they are as top-15 charts:

  • Top 15 authors
  • Top 15 by rating
  • Top 15 by reads
  • Top 15 by comments

Source: www.habr.com
