All of Habr in one database

Good evening. It has been two years since the last article about parsing Habr was written, and some things have changed since then.

When I decided I wanted my own copy of Habr, I set out to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into, you can read under the cut.

TL;DR: link to the database

The first version of the parser. One thread, many problems

To start with, I decided to make a prototype script that would parse each article right as it was downloaded and put it into the database. Without thinking twice, I went with sqlite3, since it is less labor-intensive: no local server to run, you just create the file, look at it, delete it, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.exceptions.RequestException:
            # Log failed requests for a later retry
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]  # strip the "Tags:" prefix
        except AttributeError:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is classic: we take BeautifulSoup and requests, and a quick prototype is ready. Except that…

  • Pages are downloaded in a single thread.

  • If you interrupt the script, the entire database goes nowhere, since the commit happens only after all the parsing is done.
    Of course, you could commit after every insert, but then the script's execution time would grow considerably (a batched-commit compromise is sketched after this list).

  • Parsing the first 100,000 articles took me many hours.
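
As a middle ground between one giant transaction and a commit per insert, you can commit in batches. A minimal sketch, assuming the same habr table as above and a hypothetical BATCH_SIZE:

import sqlite3

BATCH_SIZE = 1000  # hypothetical: bigger batches are faster but lose more work on interrupt

def insert_all(rows):
    # rows is an iterable of (id, author, title, content, tags) tuples
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), "
              "title VARCHAR(255), content TEXT, tags TEXT)")
    for n, row in enumerate(rows, 1):
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        if n % BATCH_SIZE == 0:
            conn.commit()  # an interrupt now loses at most one batch
    conn.commit()  # flush the last partial batch
    conn.close()

This keeps most of the speed of a single large transaction while capping how much work an interrupted run throws away.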

Then I came across an article by the user cointegrated, read it, and found a few life hacks to speed this process up:

  • Using multithreading speeds up the download considerably.
  • You can fetch not the full version of Habr, but its mobile version.
    For example, if a cointegrated article weighs 378 KB in the desktop version, the mobile version is only 126 KB.

The second version. Many threads, a temporary ban from Habr

While scouring the Internet on the subject of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with multithreading.

SQLite3 does not want to work with more than one thread.
Setting check_same_thread=False fixes that particular complaint, but it is not the only error: when trying to insert into the database, errors sometimes occurred that I could not resolve.
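
One workaround I did not end up using is to keep the sqlite3 connection in a single dedicated writer thread and have the download threads hand rows over through a queue; writes are then serialized and check_same_thread can stay at its safe default. A minimal sketch (the writer/STOP names are mine, not from the original script):

import sqlite3
import queue
import threading

write_queue = queue.Queue()
STOP = object()  # sentinel telling the writer to shut down

def writer():
    # The connection lives entirely in this one thread
    conn = sqlite3.connect('habr.db')
    conn.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), "
                 "title VARCHAR(255), content TEXT, tags TEXT)")
    while True:
        row = write_queue.get()
        if row is STOP:
            break
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
    conn.commit()
    conn.close()

writer_thread = threading.Thread(target=writer)
writer_thread.start()
# Download threads call write_queue.put((i, author, title, content, tags))
# instead of touching sqlite3 themselves. When all downloads are done:
write_queue.put(STOP)
writer_thread.join()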

So I decided to abandon inserting articles straight into the database and, remembering the cointegrated solution, decided to use files, since there are no problems with multi-threaded writing to files.

Habr starts banning you for using more than three threads.
Particularly zealous attempts to get through to Habr can end in an IP ban for a couple of hours. So you have to make do with just 3 threads, but even that is already good, since the time to iterate over 100 articles drops from 26 to 12 seconds.
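
Since only 3 threads are tolerated, it also pays to back off instead of hammering the server when a 503 comes in. A minimal sketch of retries with exponential backoff (the retry count and delays are my guesses, not values Habr documents):

import time
import requests

def get_with_backoff(url, retries=5, base_delay=2.0):
    # Retry a GET on a 503 or a network error, doubling the wait each time
    r = None
    for attempt in range(retries):
        try:
            r = requests.get(url)
            if r.status_code != 503:
                return r
        except requests.exceptions.RequestException:
            pass
        time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    return r  # may still be a 503 or None; the caller logs the failure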

It is worth noting that this version is somewhat unstable, and the download periodically fails on a large number of articles.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.exceptions.RequestException:
        # Log failed requests for a later retry
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Record requests the server blocked
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Flag marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except (AttributeError, TypeError):
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to a json file
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The third version. The final one

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile version, since it is plain JSON that does not even need to be parsed. In the end, I decided to rewrite my script once more.

So, having found this link to the API, you can start parsing.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.exceptions.RequestException:
        # Log failed requests for a later retry
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields related to both the article itself and the author who wrote it.

[Screenshot: the API response for an article]

I did not dump each article's full JSON, but saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in. So far, it is only en and ru.
  • tags_string - all of the post's tags
  • reading_count
  • author
  • score - the article's rating.

Thus, using the API, I cut the script's running time down to 8 seconds per 100 urls.

Once the data we need has been downloaded, it has to be processed and inserted into the database. I had no problems with that either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Turn off write confirmation; this speeds things up considerably.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files\{}.json".format(i)
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of a readable database you can sacrifice code readability. Or can you?
                # If you think so, you can simply replace the tuple with the data argument. Your call.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, at the end, we can extract some statistics from the data (a query sketch follows the list):

  • Of the expected 490,406 posts, only about 228 thousand articles were actually downloaded. It turns out that more than half of the articles on Habr had been hidden or deleted.
  • The whole database, covering almost half a million articles, weighs 2.95 GB. In compressed form, 495 MB.
  • In total, 37,804 people have published on Habr. Keep in mind that these statistics count only live posts.
  • The most prolific author on Habr is alizar, with 8,774 articles.
  • The highest-rated article has a score of +1,448.
  • The most-read article has 1,660,841 views.
  • The most-discussed article has 2,444 comments.
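
For reference, numbers like these fall out of a few straightforward queries over the finished table; a sketch, assuming the articles schema created by parser.py:

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# How many distinct authors the live posts have
c.execute("SELECT COUNT(DISTINCT author) FROM articles")
print("authors:", c.fetchone()[0])

# Best score, most reads, most comments in one pass
c.execute("SELECT MAX(score), MAX(reading_count), MAX(comments_count) FROM articles")
print("max score / reads / comments:", c.fetchone())

# Top 15 most prolific authors (the first chart below)
c.execute("SELECT author, COUNT(*) AS posts FROM articles "
          "GROUP BY author ORDER BY posts DESC LIMIT 15")
for author, posts in c.fetchall():
    print(author, posts)

conn.close()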

Well, in the form of top charts:

[Chart: Top 15 authors]
[Chart: Top 15 by rating]
[Chart: Top 15 by reads]
[Chart: Top 15 by comments]

Source: www.habr.com
