All of Habr in one database

Good afternoon. It has been two years since my last article about parsing Habr, and some things have changed since then.

When I wanted a full copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it turned out, and what mistakes I ran into, you can read below.

TL;DR: a link to the database

The first version of the parser. One thread, many problems

To begin with, I decided to make a prototype script that would parse an article and put it into the database immediately after downloading it. Without thinking twice, I used sqlite3, because it is less labor-intensive: no need to run a local server, or to create, inspect, and delete anything along the way.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except AttributeError:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything here follows the classics: Beautiful Soup and requests are used, and a quick prototype is ready. Except that...

  • The page is downloaded in a single thread

  • If you interrupt the script, the whole database goes nowhere, since the commit is only performed after all the parsing is finished.
    Of course, you can commit after each insert, but then the script's execution time increases significantly.

  • Parsing the first 100 000 articles took me many hours.
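Between one giant transaction and a commit after every insert there is a middle ground: commit every N rows, so an interrupted run loses at most one unfinished batch. A minimal sketch, assuming the same habr table as in the script above (the function name and the batch size of 1000 are my own illustrative choices):

```python
import sqlite3

def insert_batched(rows, db_path="habr.db", batch_size=1000):
    """Insert rows, committing every batch_size inserts so that an
    interrupted run loses at most one unfinished batch."""
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), "
              "title VARCHAR(255), content TEXT, tags TEXT)")
    for n, row in enumerate(rows, start=1):
        c.execute("INSERT INTO habr VALUES (?, ?, ?, ?, ?)", row)
        if n % batch_size == 0:
            conn.commit()  # checkpoint: everything up to row n is safe
    conn.commit()          # flush the tail of the last batch
    conn.close()
```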

Then I came across a user's article, from which I picked up several life hacks to speed this process up:

  • Using multithreading speeds up downloading significantly.
  • You can download not the full version of Habr but its mobile version.
    For example, if the article mentioned above weighs 378 KB in the desktop version, the mobile version is only 126 KB.

The second version. Many threads, a temporary ban from Habr

While scouring the Internet on the topic of multithreading in Python, I chose the simplest option, multiprocessing.dummy, and noticed that problems arrived together with the multithreading.

SQLite3 refuses to work with more than one thread.
check_same_thread=False fixes that, but that error is not the only one: when trying to insert into the database, errors sometimes occurred that I could not resolve.
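One common workaround for SQLite's thread restriction, which I did not end up taking, is to let the scraping threads put rows on a queue and have a single writer thread own the connection; a rough sketch (the table layout and all names here are my own simplifications):

```python
import queue
import sqlite3
import threading

def db_writer(db_path, jobs):
    """The only thread that touches SQLite: it owns the connection and
    drains the queue until it sees the None sentinel."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS habr")  # start fresh for this demo
    conn.execute("CREATE TABLE habr(id INT, title TEXT)")
    while True:
        row = jobs.get()
        if row is None:  # sentinel: no more work
            break
        conn.execute("INSERT INTO habr VALUES (?, ?)", row)
    conn.commit()
    conn.close()

jobs = queue.Queue()
writer = threading.Thread(target=db_writer, args=("habr_mt.db", jobs))
writer.start()

# Any number of scraping threads could safely do this part:
for i in range(3):
    jobs.put((i, "title {}".format(i)))

jobs.put(None)  # tell the writer to finish
writer.join()
```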

So I decided to drop inserting articles directly into the database on the fly and, remembering the file-based solution from that article, decided to use files, since there are no problems with multithreaded writing to a file.

Habr started banning me for using more than three threads.
Particularly zealous attempts to reach Habr could end with an IP ban for a couple of hours. So I had to use only 3 threads, but even that was already an improvement: the time to fetch 100 articles dropped from 26 to 12 seconds.
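To stay within such limits, each request can be wrapped in a small retry-with-backoff helper; a sketch under my own assumptions (fetch stands in for requests.get, and the delays are illustrative):

```python
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url); on a 503 wait and retry, doubling the delay
    each time so a temporary ban has room to expire."""
    delay = base_delay
    for _ in range(retries):
        response = fetch(url)
        if response.status_code != 503:
            return response
        time.sleep(delay)
        delay *= 2  # exponential backoff
    return response  # give up: the caller sees the last 503
```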

It is worth noting that this version is rather unstable, and downloads periodically failed on large numbers of articles.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Flag marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except (AttributeError, TypeError):
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # Make sure the output directory exists
    os.makedirs("files", exist_ok=True)

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The third version. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile version, since it is plain json that does not even need parsing. In the end, I decided to rewrite my script once more.

So, having discovered this API endpoint, we can start parsing it.
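For illustration, the handling of one API response can be factored into a small helper; it mirrors a subset of the fields used in async_v2.py below, but the function itself and the payload shape it expects are my own sketch:

```python
def extract_article(data):
    """Pull a subset of fields from one parsed API response.
    Returns None when the API reports failure (hidden or deleted post)."""
    if not data.get('success'):
        return None
    article = data['data']['article']
    return (article['id'],
            article['is_tutorial'],
            article['time_published'],
            article['title'],
            article['text_html'],
            article['author']['login'],
            article['voting']['score'])
```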

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt") as file:
            file.write(i)
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # Make sure the output directory exists
    os.makedirs("files", exist_ok=True)

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields relating both to the article itself and to its author.

API.png


I did not save the full json of each article, only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in. So far, only en and ru occur.
  • tags_string - all the tags of the post
  • reading_count
  • author
  • score - the article's rating.

As a result, using the API, I cut the script's execution time down to 8 seconds per 100 urls.

Once the data we need is downloaded, it has to be processed and entered into the database. No problems with that either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation: it speeds things up many times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                with open(filename) as f:
                    data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For better readability of the database you can sacrifice readability of the code. Or not?
                # If you prefer, just replace the tuple with the data argument. Up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()
        conn.close()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, to finish off, here are some statistics pulled out of the data:

  • Of the expected 490 406 posts, only about 228 thousand were actually downloaded. It turns out that more than half of the articles on Habr have been hidden or deleted.
  • The whole database, containing almost half a million articles, weighs 2.95 GB. Compressed: 495 MB.
  • In total, there are 37804 authors on Habr. A reminder: these statistics count only live posts.
  • The most prolific author on Habr is alizar, with 8774 articles.
  • Top rated article: 1448 pluses
  • Most read article: 1660841 views
  • Most discussed article: 2444 comments
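These figures can be reproduced with straightforward SQL over the articles table that parser.py builds; a sketch (the function name and the exact queries are my own):

```python
import sqlite3

def top_stats(db_path="habr.db", limit=15):
    """Return (top authors by article count, top articles by score,
    top articles by reading_count) from the articles table."""
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    authors = c.execute(
        "SELECT author, COUNT(*) AS n FROM articles "
        "GROUP BY author ORDER BY n DESC LIMIT ?", (limit,)).fetchall()
    top_rated = c.execute(
        "SELECT title, score FROM articles ORDER BY score DESC LIMIT ?",
        (limit,)).fetchall()
    most_read = c.execute(
        "SELECT title, reading_count FROM articles "
        "ORDER BY reading_count DESC LIMIT ?", (limit,)).fetchall()
    conn.close()
    return authors, top_rated, most_read
```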

Well, and in the form of tops:

Top 15 authors
Top 15 by rating
Top 15 by reads
Top 15 by comments

Source: www.habr.com
