All of Habr in one database

Good afternoon and good evening. It has been two years since the last article about parsing Habr was written, and some things have changed since then.

When I wanted my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into, you can read under the cut.

TL;DR: link to the database

The first version of the parser. Single-threaded, many problems

To begin with, I decided to make a prototype script in which an article would be parsed immediately upon download and put straight into the database. Without thinking twice, I went with sqlite3, since it takes less effort: no local server needed, you just create the file, look at it, delete it, and that is that.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            # Remember ids whose requests failed so they can be retried later
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is fine: Beautiful Soup, requests, and a quick prototype is ready. Except that…

  • Pages are downloaded in a single thread.

  • If you interrupt the script, the whole database goes nowhere, because the commit only happens once all the parsing is finished.
    You could, of course, commit after every insert, but that would noticeably increase the script's running time (a compromise is sketched after this list).

  • Parsing the first 100 000 articles took me hours (at roughly 26 seconds per 100 pages, that alone works out to more than seven hours).
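
As a compromise between the two extremes above, one could commit every N inserts, so an interrupted run keeps most of its progress without paying for a commit on each row. This is a sketch of mine, not part of the original script; batch_size and the rows argument are purely illustrative:

import sqlite3

def insert_batched(rows, batch_size=1000):
    # rows: an iterable of (id, author, title, content, tags) tuples
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")
    for n, row in enumerate(rows, start=1):
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        if n % batch_size == 0:
            conn.commit()  # an interrupted run loses at most batch_size rows
    conn.commit()          # flush the last, possibly partial, batch
    conn.close()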

Then I came across an article by the user cointegrated, read it, and found a few life hacks that speed the whole process up:

  • Using multithreading speeds up downloading considerably.
  • You can fetch not the full version of Habr, but its mobile version (a quick size check is sketched below this list).
    For example, if an article weighs 378 KB in the desktop version, the mobile version is only 126 KB.
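
The size difference is easy to check for any post. A small sketch of mine: the post id is an arbitrary example, and it assumes the mobile site still answers at m.habr.com as it did at the time:

import requests

post_id = 1  # arbitrary example id
desktop = requests.get("https://habr.com/post/{}".format(post_id))
mobile = requests.get("https://m.habr.com/post/{}".format(post_id))
print("desktop: {} KB, mobile: {} KB".format(len(desktop.content) // 1024,
                                             len(mobile.content) // 1024))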

The second version. Many threads, a temporary ban from Habr

While scouring the internet on the subject of multithreading in Python, I chose the simplest option, multiprocessing.dummy, and noticed that problems arrived together with the extra threads.

SQLite3 does not want to work with more than one thread.
The check_same_thread=False fix helps, but it is not the only problem: when trying to insert into the database, errors sometimes occur that I could not resolve.
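
For completeness, that workaround looked roughly like this: open the connection with check_same_thread=False and serialize the writes with a lock. This is a reconstruction (the exact code is not preserved), and it is the approach that was ultimately abandoned:

import sqlite3
import threading

conn = sqlite3.connect('habr.db', check_same_thread=False)  # allow the connection to be shared between threads
db_lock = threading.Lock()

def save_article(row):
    # row: (id, author, title, content, tags)
    with db_lock:  # SQLite tolerates only one writer at a time
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        conn.commit()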

So I decided to give up inserting articles straight into the database and, remembering the cointegrated solution, decided to use files instead, since there are no problems with writing to files from multiple threads.

Habr started handing out bans for using more than three threads.
Particularly zealous attempts to get through to Habr can end with an IP ban for a couple of hours. So you have to limit yourself to three threads, but even that is good: the time to iterate over 100 articles dropped from 26 to 12 seconds.
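
If a 503 or a short ban does slip through, a simple retry with a growing pause keeps the crawl alive. This is my own addition rather than part of the scripts below; the retry count and delay values are arbitrary:

import time
import requests

def get_with_retry(url, retries=5, delay=30):
    # Retry the request a few times, sleeping longer after each failure,
    # so that a temporary 503 / ban does not kill the whole run.
    for attempt in range(retries):
        try:
            r = requests.get(url)
            if r.status_code != 503:
                return r
        except requests.RequestException:
            pass
        time.sleep(delay * (attempt + 1))
    return None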

It is worth noting that this version is rather unstable, and on a large number of articles the downloads periodically fall over.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        # Remember ids whose requests failed so they can be retried later
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Flag showing whether the post is a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to a json file
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads are closed, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The third version. Final

While debugging the second version, I discovered that Habr suddenly has an API, which the mobile version of the site uses. It loads faster than the mobile pages, since it is just json that does not even need to be parsed. In the end, I decided to rewrite my script yet again.

So, having found this API link, you can start parsing it.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads are closed, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

It returns fields relating both to the article itself and to the author who wrote it.

API.png


I did not dump the full json of each article; I saved only the fields I needed (the rough shape of the response is sketched after the list):

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in. So far only en and ru occur.
  • tags_string - all the tags of the post
  • reading_count
  • author
  • score - the article's rating.
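
Judging by the keys the script reads, the json returned for one article is shaped roughly like this. This is a reconstruction from the code above with placeholder values, not the full payload; the real response contains more fields:

example_response = {
    "success": True,
    "data": {
        "article": {
            "id": 490000,              # placeholder values throughout
            "is_tutorial": False,
            "time_published": "...",
            "comments_count": 0,
            "lang": "ru",
            "tags_string": "...",
            "title": "...",
            "text_html": "...",
            "reading_count": 0,
            "author": {"login": "..."},   # nested: the script takes author['login']
            "voting": {"score": 0},       # nested: the script takes voting['score']
        }
    }
}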

As a result, using the API, I got the script's running time down to 8 seconds per 100 urls.

After downloading the data we need, it has to be processed and loaded into the database. I had no problems with that either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Turn off write confirmation; this speeds things up many times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of the database's readability you can sacrifice the code's readability. Or not?
                # If you prefer, just pass data itself instead of the unpacked tuple. Up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))
                f.close()

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, a few statistics can be pulled out of the data (the queries behind them are sketched after the list):

  • Of the 490 406 posts planned for download, only around 228 thousand articles were actually downloaded. It turns out that more than half of the posts on Habr had been hidden or deleted.
  • The entire dataset, covering nearly half a million posts, weighs 2.95 GB. Archived, 495 MB.
  • In total, Habr has 37804 authors. Let me remind you that these statistics only count live posts.
  • The most prolific author on Habr is alizar, with 8774 articles.
  • The highest-rated article has 1448 upvotes
  • The most-read article has 1660841 views
  • The most-discussed article has 2444 comments
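
For reference, numbers like these can be obtained with queries against the articles table that parser.py creates. This is a sketch of mine; the original queries are not shown in the article:

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# Number of distinct authors among the downloaded (still live) posts
print(c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone())

# The most prolific authors: top 15 by article count
print(c.execute("SELECT author, COUNT(*) AS cnt FROM articles "
                "GROUP BY author ORDER BY cnt DESC LIMIT 15").fetchall())

# Best score, most views, most comments in a single row
print(c.execute("SELECT MAX(score), MAX(reading_count), MAX(comments_count) FROM articles").fetchone())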

Well, and in the form of top lists:
Top 15 authors (chart)
Top 15 by rating (chart)
Top 15 by views (chart)
Top 15 by comments (chart)

Source: www.habr.com
