All of Habr in one database

Good afternoon. It has been 2 years since the last article on parsing Habr was written, and some things have changed since then.

When I wanted to have my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into, you can read below the cut.

TL;DR: database link

The first version of the parser. One thread, many problems

To begin with, I decided to make a prototype script in which each article would be parsed immediately after downloading and put into the database. Without thinking twice, I used sqlite3, since it was less labor-intensive: no need for a local server, everything is created, inspected, and deleted on the spot, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            # Log request failures so the ids can be retried later
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "При парсинге этой странице произошла ошибка."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

It's all classic: we use BeautifulSoup and requests, and a quick prototype is ready. Except that...

  • Pages are downloaded in a single thread

  • If you interrupt the script's execution, the whole database goes nowhere, since the commit only happens after all the parsing is done.
    Of course, you can commit changes after every insert, but then the script's execution time grows significantly. A middle ground, committing in batches, is sketched right after this list.

  • Parsing the first 100,000 articles took me hours.
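
If losing a little work on an interrupt is acceptable, committing in batches is a reasonable middle ground. A minimal sketch, where BATCH_SIZE is an arbitrary choice and fetch_and_parse is a hypothetical helper standing in for the download-and-parse code above:

import sqlite3

BATCH_SIZE = 100  # hypothetical: at most this many rows are lost on an interrupt

conn = sqlite3.connect('habr.db')
c = conn.cursor()

for n, i in enumerate(range(1, 490406), start=1):
    row = fetch_and_parse(i)  # hypothetical helper standing in for the code above
    if row is None:
        continue
    c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
    if n % BATCH_SIZE == 0:
        conn.commit()  # persist everything inserted so far

conn.commit()  # flush the final partial batch
conn.close()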

Then I found an article by the user cointegrated, which I read and where I found a few life hacks to speed this process up:

  • Using multithreading speeds downloading up several times over.
  • You can fetch not the full version of Habr, but its mobile version.
    For example, if a cointegrated article weighs 378 KB in the desktop version, it is already down to 126 KB in the mobile version (this is easy to verify, see the sketch after this list).
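
To check the size difference yourself, fetch the same post from both hosts and compare the payload sizes. The post id below is just an example, not the specific article measured above:

import requests

post_id = 331834  # example id, hypothetical

for url in ("https://habr.com/post/{}".format(post_id),
            "https://m.habr.com/post/{}".format(post_id)):
    size_kb = len(requests.get(url).content) / 1024
    print("{} - {:.0f} KB".format(url, size_kb))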

The second version. Many threads, a temporary ban from Habr

After scouring the internet on the subject of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with the multithreading.

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but that error is not the only one: when trying to insert into the database, errors sometimes occur that I could not resolve.
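
For reference, a minimal sketch of that workaround: one shared connection opened with check_same_thread=False, plus a lock added on my side to serialize writes (a common remedy, not part of the original script; even so, this path proved unreliable for me):

import sqlite3
import threading

# check_same_thread=False lets other threads use this connection,
# but the connection itself is not thread-safe, so writes need a lock.
conn = sqlite3.connect('habr.db', check_same_thread=False)
db_lock = threading.Lock()

def insert_article(row):
    with db_lock:
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        conn.commit()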

So I decided to abandon inserting articles directly into the database on the spot and, remembering cointegrated's solution, decided to use files, since there are no problems with multithreaded writes to separate files.

Habr starts banning you for using more than three threads.
Especially zealous attempts to hammer Habr can end with an IP ban for a couple of hours. So you have to make do with just 3 threads, but that is already an improvement, since the time to iterate over 100 articles dropped from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and downloads periodically fail on a large number of articles; a retry pass over the failed ids is sketched right after the listing below.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except:
        # Log request failures so the ids can be retried later
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # A label marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "n")

    # Write the article to JSON
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads finish, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
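
Since the script already logs blocked ids to Error503.txt, the unstable runs can be patched up with a second pass over the failed ids. A minimal sketch, assuming async_v1.py sits alongside and is importable:

from multiprocessing.dummy import Pool as ThreadPool
from async_v1 import worker  # assumes async_v1.py is importable as a module

# Collect the ids of requests that got a 503 on the first pass
with open("Error503.txt") as f:
    failed_ids = [int(line) for line in f if line.strip()]

pool = ThreadPool(3)  # stay under Habr's three-thread limit
pool.map(worker, failed_ids)
pool.close()
pool.join()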

The third version. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile version, since it is plain JSON that does not even need to be parsed. In the end, I decided to rewrite my script once more.

So, having found this API link, we can start parsing it.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        # Log request failures so the ids can be retried later
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads finish, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

It contains fields relating both to the article itself and to the author who wrote it.

[API.png: screenshot of the API response]
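
Judging by the fields the script above reads, the response is shaped roughly like this (a reconstruction from the code, not the exact payload):

{
    "success": true,
    "data": {
        "article": {
            "id": 490000,
            "is_tutorial": false,
            "time_published": "...",
            "comments_count": 0,
            "lang": "ru",
            "tags_string": "...",
            "title": "...",
            "text_html": "...",
            "reading_count": 0,
            "author": { "login": "..." },
            "voting": { "score": 0 }
        }
    }
}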

I did not dump the full JSON of every article, but saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang is the language the article is written in. So far it only contains en and ru.
  • tags_string - all of the post's tags
  • reading_count
  • author
  • score - the article's rating.

So, using the API, I cut the script's execution time down to 8 seconds per 100 urls.

After downloading the data we need, it has to be processed and put into the database. I had no problems with this either:

parser.py

import os
import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds things up several times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files\{}.json".format(i)
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of a more readable database you can sacrifice code readability. Or not?
                # If you think so, you can simply replace the tuple with the data argument. Up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))
                f.close()

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()
        conn.close()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, to finish things off, we can extract some statistics from the data (the queries behind the numbers are sketched after this list):

  • Of the 490,406 expected downloads, only about 228 thousand articles were actually downloaded. It turns out that more than half of the articles on Habr have been hidden or deleted.
  • The entire database, assembled from almost half a million articles, weighs 2.95 GB. Compressed - 495 MB.
  • In total, 37,804 people have authored posts on Habr. Let me remind you that these statistics only cover live posts.
  • The most prolific author on Habr is alizar, with 8,774 articles.
  • The highest-rated article has 1,448 upvotes
  • The most read article has 1,660,841 views
  • The most discussed article has 2,444 comments
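
For reference, queries along these lines against the articles table built by parser.py are enough to reproduce numbers like the ones above (a sketch relying on SQLite's bare-column behaviour with MAX):

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

queries = {
    "Authors with live posts": "SELECT COUNT(DISTINCT author) FROM articles",
    "Most prolific author":    "SELECT author, COUNT(*) FROM articles "
                               "GROUP BY author ORDER BY COUNT(*) DESC LIMIT 1",
    "Highest-rated article":   "SELECT title, MAX(score) FROM articles",
    "Most read article":       "SELECT title, MAX(reading_count) FROM articles",
    "Most discussed article":  "SELECT title, MAX(comments_count) FROM articles",
}

for label, query in queries.items():
    print(label, c.execute(query).fetchone())

conn.close()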

Well, in the form of top charts:

[Top 15 authors]
[Top 15 by rating]
[Top 15 by reads]
[Top 15 by comments]

Source: www.habr.com
