All of Habr in one database

Good afternoon. It has been 2 years since the last article about parsing Habr was written, and some things have changed.

When I wanted to have a copy of Habr, I decided to write a parser that would save all the authors' content to a database. How it went and what errors I ran into, you can read below the cut.

TL;DR: link to the database

First version of the parser. One thread, many problems

To begin with, I decided to make a prototype script in which each article is parsed as soon as it is downloaded and then placed in the database. Without thinking twice, I used sqlite3, because it was less work: no need to run a local server, and a database can be created, inspected and deleted on the spot.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            # Record the id of the failed request and move on
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except Exception:
            # An expected element was missing from the page
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is classic: we use Beautiful Soup and requests, and a quick prototype is ready. Only...

  • Pages are downloaded in a single thread

  • If you interrupt the script, the whole database goes nowhere, because the commit is performed only once, after all the parsing is done.
    Of course, you can commit changes to the database after every insert, but then the script's running time grows significantly.

  • Parsing the first 100,000 articles took me several hours.
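
One possible middle ground between a single commit at the very end and a commit after every insert is to commit in batches. The sketch below is not part of the original script: the helper name `insert_in_batches` and the default batch size of 1000 are assumptions chosen for illustration. It reuses the `habr` table layout from one_thread.py.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows into the habr table, committing every batch_size rows.

    Returns the number of commits performed. An interruption loses at most
    the rows of the batch that was in progress.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS habr("
        "id INT, author TEXT, title TEXT, content TEXT, tags TEXT)"
    )
    pending = 0
    commits = 0
    for row in rows:
        cur.execute("INSERT INTO habr VALUES (?, ?, ?, ?, ?)", row)
        pending += 1
        if pending >= batch_size:
            conn.commit()
            commits += 1
            pending = 0
    if pending:
        # Commit the final, partially filled batch
        conn.commit()
        commits += 1
    return commits
```

With this shape the batch size trades speed against how much work is lost if the script is stopped; the right value depends on the machine and was not measured here.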

Then I found an article by the user cointegrated, read it, and picked up a few life hacks for speeding up the process:

  • Using several threads speeds up downloading several-fold.
  • You can fetch the mobile version of Habr instead of the full one.
    For example, cointegrated's article weighs 378 KB in the desktop version but only 126 KB in the mobile version.

Second version. Many threads, a temporary ban from Habr

After searching the Internet on the topic of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with the multithreading.

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but that is not the only error: when trying to insert into the database, errors sometimes occurred that I could not resolve.
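
For reference, one common pattern for sharing a single SQLite connection between threads is to pass check_same_thread=False and serialize every access with a lock. This is only a sketch under assumptions: the text does not say which insert errors occurred, so it is not claimed to fix them. The names `db_lock` and `save`, the two-column table and the in-memory database are illustrative.

```python
import sqlite3
import threading
from multiprocessing.dummy import Pool as ThreadPool

# An in-memory database keeps the sketch self-contained;
# a real run would open a file such as 'habr.db' instead.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE habr(id INTEGER, title TEXT)")

# Only one thread at a time may touch the shared connection
db_lock = threading.Lock()


def save(article_id):
    title = "article {}".format(article_id)  # placeholder for parsed data
    with db_lock:
        conn.execute("INSERT INTO habr VALUES (?, ?)", (article_id, title))


pool = ThreadPool(3)
pool.map(save, range(100))
pool.close()
pool.join()

with db_lock:
    conn.commit()

saved = conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
print(saved)
```

The lock makes the database the serial part of the pipeline, which is acceptable when, as here, the network requests dominate the running time.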

So I decided to give up inserting articles straight into the database and, remembering cointegrated's solution, decided to use files, because writing to separate files from several threads causes no problems.

Habr starts banning when more than three threads are used.
Especially persistent attempts to get through to Habr can end in an IP ban for a couple of hours. So you have to use only 3 threads, but that is already an improvement: the time to go through 100 articles drops from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and on a large number of articles the downloads periodically fail.
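
One detail worth considering with the file-per-article approach: the scripts skip an id when its file already exists, so a file left half-written by an interrupted run would be skipped forever. A minimal sketch of one way around that, assuming a hypothetical helper `save_article` that is not in the original scripts, is to write to a temporary file first and rename it into place only when it is complete.

```python
import json
import os
import tempfile


def save_article(directory, article_id, article):
    """Write article as JSON to <directory>/<article_id>.json atomically."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, "{}.json".format(article_id))

    # The temporary file lives in the same directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
            json.dump(article, tmp_file, ensure_ascii=False)
        # os.replace swaps the finished file into place in one step
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial file if anything went wrong, including Ctrl+C
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

Each worker thread writes to its own temporary name, so the threads do not interfere with each other.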
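
One way to soften such failures is to retry a request a few times with a growing pause before giving up. The sketch below is an assumption and not something the original script does: `fetch_with_retries`, `http_fetch`, the attempt count and the delays are illustrative names and values. It treats a 503 answer and a network error (requests raises subclasses of OSError) as reasons to wait and try again.

```python
import time


def fetch_with_retries(fetch, article_id, attempts=3, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(article_id) until it returns a status other than 503.

    fetch must return a (status_code, body) pair. The pause doubles after
    every failed attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    Returns the last (status_code, body) seen, or (None, None) if every
    attempt ended in a network error.
    """
    status, body = None, None
    for attempt in range(attempts):
        try:
            status, body = fetch(article_id)
        except OSError:
            # Network-level failure: treat it like a blocked request
            status, body = None, None
        if status is not None and status != 503:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body


def http_fetch(article_id):
    """Fetch one post from the mobile site, as the scripts here do."""
    # Imported here so the retry helper itself needs no third-party package
    import requests
    r = requests.get("https://m.habr.com/post/{}".format(article_id),
                     timeout=10)
    return r.status_code, r.text
```

A worker could then call `fetch_with_retries(http_fetch, i)` in place of a bare `requests.get`. Whether waiting a few seconds is enough to get past a temporary ban was not tested.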

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    # The "files" directory must exist before the script is started
    currentFile = os.path.join("files", "{}.json".format(i))

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Record requests that the server blocked
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Label marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except Exception:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to a json file
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr bans the IP for a while
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Third version. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site talks to. It loads faster than the mobile version, since it is just json, which does not even need to be parsed. In the end, I decided to rewrite my script once more.

So, with the API endpoint found (it is the URL used in the script below), we can start parsing.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    # The "files" directory must exist before the script is started
    currentFile = os.path.join("files", "{}.json".format(i))

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # The server blocked the request
    if r.status_code == 503:
        logging.critical("503 Error")
        return 503

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr bans the IP for a while
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields related both to the article itself and to the author who wrote it.

API.png


I did not dump the entire json of each article; I saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang: the language the article is written in. So far there are only en and ru.
  • tags_string: all the tags of the post
  • reading_count
  • author
  • score: the article's rating.
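
The field selection above can also be written as a small function that tolerates missing keys, so that one unusual response does not stop a worker with a KeyError. This is a sketch: the helper name `extract_fields` is hypothetical, and the payload shape is assumed from the keys that async_v2.py reads, not taken from any API documentation.

```python
# Top-level keys of the article object that are copied unchanged
FIELDS = ("id", "is_tutorial", "time_published", "title", "text_html",
          "comments_count", "lang", "tags_string", "reading_count")


def extract_fields(payload):
    """Return a dict with the saved fields, or None if there is no article."""
    if not payload.get("success"):
        return None
    article = (payload.get("data") or {}).get("article")
    if not article:
        return None

    row = {name: article.get(name) for name in FIELDS}
    # Stored under the name "content" in the database
    row["content"] = row.pop("text_html")
    # Nested values: the author's login and the voting score
    row["author"] = (article.get("author") or {}).get("login")
    row["score"] = (article.get("voting") or {}).get("score")
    return row
```

Missing fields come back as None instead of raising, which leaves the decision about incomplete articles to the code that fills the database.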

Thus, by using the API, I cut the script's running time to 8 seconds per 100 URLs.

Once the data we need has been downloaded, it has to be processed and inserted into the database. This did not cause any problems either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Turn off write confirmation; this speeds things up several-fold.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                # The with block closes the file even if loading fails
                with open(filename) as f:
                    data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of a more readable database, code readability can be sacrificed. Or can it?
                # If you think so, you can simply replace the tuple with the data argument. It's up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, as tradition dictates, at the end we can pull some statistics out of the data:

  • Of the expected 490,406 articles, fewer than half were actually downloaded: more than half of the articles on Habr (228 in the source, with digits evidently missing) turned out to be hidden or deleted.
  • The whole database, almost half a million articles, weighs 2.95 GB. Compressed: 495 MB.
  • In total, 37,804 people are authors on Habr. A reminder that these statistics cover live posts only.
  • The most productive author on Habr is alizar, with 8,774 articles.
  • Top-rated article: 1,448 upvotes
  • Most-read article: 1,660,841 views
  • Most-discussed article: 2,444 comments

And, in the form of tops:

Top 15 authors
Top 15 by rating
Top 15 most read
Top 15 most discussed
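
Figures and tops of this kind can be recomputed from the articles table that parser.py fills. The queries below are a sketch, not the ones actually used for the numbers in this article; the function name `habr_stats` and the shape of its result are assumptions.

```python
import sqlite3


def habr_stats(conn, top_n=15):
    """Return headline numbers and top-N lists from the articles table."""
    cur = conn.cursor()

    def top_by(column):
        # column is one of the fixed names used below, never user input
        query = ("SELECT id, title, {0} FROM articles "
                 "ORDER BY {0} DESC LIMIT ?").format(column)
        return cur.execute(query, (top_n,)).fetchall()

    return {
        "articles": cur.execute(
            "SELECT COUNT(*) FROM articles").fetchone()[0],
        "authors": cur.execute(
            "SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0],
        # Authors ordered by number of articles, ties broken by name
        "top_authors": cur.execute(
            "SELECT author, COUNT(*) AS n FROM articles "
            "GROUP BY author ORDER BY n DESC, author LIMIT ?",
            (top_n,)).fetchall(),
        "top_rated": top_by("score"),
        "top_read": top_by("reading_count"),
        "top_discussed": top_by("comments_count"),
    }
```

For example, `habr_stats(sqlite3.connect('habr.db'))["top_authors"][0]` would give the most productive author together with the article count.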

Source: will.com
