All of Habr in one database

Good evening. Two years have passed since the last article about parsing Habr was written, and some things have changed since then.

When I decided I wanted my own copy of Habr, I set out to write a parser that would save all of the authors' content to a database. How that went and what mistakes I ran into, you can read under the cut.

TL;DR: database link

The first version of the parser. One thread, many problems

To begin with, I decided to make a prototype script in which each article would be parsed immediately after downloading and stored in the database. Without thinking twice, I used sqlite3, because it was less labor-intensive: no local server needed, it's create it, look at it, delete it, and that's that.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    # sqlite3 opens a transaction implicitly before the first INSERT;
    # a single commit happens only at the very end
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            # Append the id of the failed request and move on
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "При парсинге этой странице произошла ошибка."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything here is classic: we use Beautiful Soup and requests, and a quick prototype is ready. Except that…

  • Pages are downloaded in a single thread.

  • If you interrupt the script, the whole database goes nowhere: after all, the commit only happens after all the parsing is done.
    Of course, you can commit to the database after every insert, but then the script's execution time increases significantly (see the batched-commit sketch after this list).

  • Parsing the first 100,000 articles took me many hours.
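A possible middle ground that the first version skipped: commit in batches, so an interrupted run loses at most one batch instead of everything, without paying for a commit on every row. A minimal sketch under the same habr schema as above; insert_batched and BATCH_SIZE are illustrative names, not part of the original script.

import sqlite3

BATCH_SIZE = 100  # hypothetical value, tune to taste

def insert_batched(rows):
    # rows: an iterable of (id, author, title, content, tags) tuples
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), "
              "title VARCHAR(255), content TEXT, tags TEXT)")
    for n, row in enumerate(rows, start=1):
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        if n % BATCH_SIZE == 0:
            conn.commit()  # an interrupted run loses at most one batch
    conn.commit()  # flush the final, partial batch
    conn.close()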

Then I came across an article by the user cointegrated, read it, and found a few life hacks there to speed up the process:

  • Using multithreading speeds up the download several times over.
  • You can fetch not the full version of Habr, but its mobile version.
    For example, if an article by cointegrated weighs 378 KB in the desktop version, the mobile version is only 126 KB (see the quick check sketched below).
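You can check the difference yourself with a couple of requests. The post id below is an arbitrary example, and the URLs assume the habr.com/post and m.habr.com/post paths of that era:

import requests

post_id = 1000  # arbitrary example id

for url in ("https://habr.com/post/{}".format(post_id),
            "https://m.habr.com/post/{}".format(post_id)):
    r = requests.get(url)  # redirects are followed by default
    print("{} - {:.0f} KB".format(url, len(r.content) / 1024))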

The second version. Many threads, a temporary ban from Habr

While scouring the Internet on the topic of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems showed up right along with multithreading.

SQLite3 refuses to work with more than one thread.
This can be fixed with check_same_thread=False, but that was not the only error: when inserting into the database, errors sometimes occurred that I could not resolve.
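For reference, the fix looked roughly like this. Note that check_same_thread=False only disables the module's own safety check; it does not make the connection thread-safe, which is most likely where the unexplained insert errors came from. Serializing the writes with a lock (my addition here, not something the second version actually did) is one way it could have been made to work:

import sqlite3
import threading

# One shared connection; check_same_thread=False silences sqlite3's
# "objects created in a thread can only be used in that same thread" error.
conn = sqlite3.connect('habr.db', check_same_thread=False)
db_lock = threading.Lock()

def insert_article(row):
    # row: (id, author, title, content, tags), as in the first version
    with db_lock:  # serialize writes: the connection itself is not thread-safe
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        conn.commit()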

So I decided to abandon inserting articles straight into the database and, remembering the cointegrated solution, decided to use files instead, since there are no problems with multi-threaded writes to files.

Habr starts banning when you use more than three threads.
Particularly zealous attempts to get through to Habr can end in an IP ban for a couple of hours. So you have to make do with just 3 threads, but even that is already good, since the time to fetch 100 articles drops from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and downloads periodically fail on large batches of articles.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))  # portable path instead of a hard-coded backslash

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # A label marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Save the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The third version. The final one

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site talks to. It loads faster than the mobile version, since it is plain JSON that does not even need to be parsed. In the end, I decided to rewrite my script once again.

So, having found this API link, you can start parsing with it.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))  # portable path instead of a hard-coded backslash

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields relating both to the article itself and to the author who wrote it.
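Judging by the fields async_v2.py reads, the response is shaped roughly like this (a reconstruction with made-up values, not actual API output):

response = {
    "success": True,
    "data": {
        "article": {
            "id": 490000,            # made-up values throughout
            "is_tutorial": False,
            "time_published": "...",
            "comments_count": 0,
            "lang": "ru",
            "tags_string": "...",
            "title": "...",
            "text_html": "...",
            "reading_count": 0,
            "author": {"login": "..."},
            "voting": {"score": 0},
        }
    }
}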

API.png


I did not dump each article's JSON in full, but saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in. So far it only takes the values en and ru.
  • tags_string - all of the post's tags
  • reading_count
  • author
  • score - the article's rating.

Thus, using the API, I reduced the script's execution time to 8 seconds per 100 URLs.

Once the data we need has been downloaded, it has to be processed and inserted into the database. I had no problems with this either:

parser.py

import json
import os
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds things up several times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = os.path.join("files", "{}.json".format(i))  # portable path instead of a hard-coded backslash
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of the database's readability you can sacrifice the code's readability. Or can you?
                # If you think so, just replace the tuple with the data argument. Up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))
                f.close()

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, to finish off, let's pull some statistics out of the data (a sketch of the kind of queries involved follows the list):

  • Of the expected 490,406 downloads, only about 228 thousand articles actually came through. It turns out that more than half of the articles on Habr were hidden or deleted.
  • The entire database, consisting of almost half a million articles, weighs 2.95 GB. Compressed: 495 MB.
  • In total there are 37,804 authors on Habr. Let me remind you that these statistics only count live articles.
  • The most prolific author on Habr is alizar, with 8,774 articles.
  • The top-rated article has 1,448 upvotes.
  • The most-read article has 1,660,841 views.
  • The most-discussed article has 2,444 comments.
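As promised, here is a sketch of queries that could produce numbers like these, assuming the articles schema from parser.py; the exact queries I ran may have differed, and the bare-column-next-to-MAX trick is SQLite-specific:

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# Number of distinct authors
print(c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone())

# The most prolific author
print(c.execute("SELECT author, COUNT(*) AS n FROM articles "
                "GROUP BY author ORDER BY n DESC LIMIT 1").fetchone())

# Top-rated, most-read and most-discussed articles: in SQLite, a bare
# column next to MAX() comes from the row that holds the maximum.
for column in ("score", "reading_count", "comments_count"):
    print(c.execute("SELECT title, MAX({}) FROM articles".format(column)).fetchone())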

Well, here they are in the form of top lists:

Top 15 authors
Top 15 by rating
Top 15 by reading count
Top 15 by comments

Source: www.habr.com
