All of Habr in one database

Good afternoon. It has been 2 years since the last article about parsing Habr was written, and some things have changed since then.

When I wanted to have my own copy of Habr, I decided to write a parser that would save all of the authors' articles to a database. How it went and what errors I ran into, you can read under the cut.

TL;DR - database link

The first version of the parser. One thread, many problems

To begin with, I decided to make a prototype script in which an article would be parsed and put into the database immediately after being downloaded. Without thinking twice, I went with sqlite3, since it was less labor-intensive: no local server required, just create it, look at it, delete it, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            with open("req_errors.txt") as file:
                file.write(i)
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "При парсинге этой странице произошла ошибка."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is classic: we use Beautiful Soup and requests, and a quick prototype is ready. Except that…

  • Pages are downloaded in a single thread

  • If you interrupt the script, none of the data makes it anywhere, since the commit happens only after all the parsing is done.
    Of course, you could commit after every insert, but then the script's run time would grow significantly (a compromise with batched commits is sketched right after this list).

  • Parsing the first 100,000 articles took me many hours.
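
A middle ground between those two extremes, sketched here as a hypothetical addition (my prototype did not do this), is committing every N inserts: an interrupted run then loses at most the current batch. The batch size of 1000 is an arbitrary choice:

import sqlite3

def insert_batched(rows, batch_size=1000):
    # Commit every batch_size inserts: if the script is interrupted,
    # at most the current batch is lost instead of everything.
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    for n, row in enumerate(rows, 1):
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        if n % batch_size == 0:
            conn.commit()
    conn.commit()  # flush the last partial batch
    conn.close()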

Then I came across a useful article, after reading which I picked up a few life hacks to speed this process up:

  • Using multithreading speeds up the downloading considerably.
  • You can download not the full version of Habr but its mobile version.
    For example, where a linked article weighs 378 KB in the desktop version, the mobile version is only 126 KB (a quick way to check this for yourself is sketched after the list).
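
The size difference is easy to verify with a minimal check like the one below. The post id is an arbitrary example, and the desktop URL pattern is my assumption about what habr.com served at the time:

import requests

post_id = 285000  # an arbitrary example id

full = requests.get("https://habr.com/post/{}".format(post_id))
mobile = requests.get("https://m.habr.com/post/{}".format(post_id))

# Compare the raw HTML sizes of the desktop and mobile versions
print("desktop: {} KB".format(len(full.content) // 1024))
print("mobile:  {} KB".format(len(mobile.content) // 1024))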

Version two. Many threads, a temporary ban from Habr

When I scoured the Internet on the subject of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with the threads.

SQLite3 does not want to work with more than one thread.
This can be patched with check_same_thread=False, but that is not the only problem: when inserting into the database, errors sometimes occurred that I was never able to resolve.
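
For reference, the usual workaround looks roughly like this: one shared connection opened with check_same_thread=False plus a lock that serializes the writes. This is a sketch of the common pattern, not the code I ended up using:

import sqlite3
import threading

conn = sqlite3.connect('habr.db', check_same_thread=False)
lock = threading.Lock()

def insert_article(row):
    # sqlite3 tolerates a shared connection across threads only
    # if we guarantee that a single thread touches it at a time.
    with lock:
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        conn.commit()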

So I decided to give up on inserting articles directly into the database and, remembering the advice from that article, decided to use files instead, since there are no problems with multithreaded writing when each thread writes its own file.

Habr started banning me for using more than three threads.
Particularly zealous attempts to get through to Habr could end with an IP ban for a couple of hours. So I had to make do with just 3 threads, but even that was already good: the time to iterate over 100 articles dropped from 26 to 12 seconds.
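
One way to soften the bans, sketched here as a hypothetical addition (the scripts below just log the 503 and move on), is to back off and retry when the server starts refusing:

import time
import requests

def get_with_backoff(url, retries=5, base_delay=10):
    # On a 503 (temporary ban), sleep with a growing delay and retry
    # instead of hammering the server from all three threads at once.
    for attempt in range(retries):
        r = requests.get(url)
        if r.status_code != 503:
            return r
        time.sleep(base_delay * (attempt + 1))
    return r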

It is worth noting that this version is rather unstable, and on a large number of articles the downloads periodically fall over.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests that were blocked by the server
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Labels marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error" 
        content = "При парсинге этой странице произошла ошибка."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to a json file
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads finish, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Version three. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site calls. It loads faster than the mobile version, since it is just JSON that does not even need to be parsed. In the end, I decided to rewrite my script yet again.

So, having found this link to the API, you can start parsing it.
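
Before rewriting everything, it is worth poking the endpoint by hand to see what comes back; a minimal probe using the same URL pattern as the script below:

import requests

r = requests.get("https://m.habr.com/kek/v1/articles/1/?fl=ru%2Cen&hl=ru")
data = r.json()

# A 'success' flag wraps the payload; the article itself
# sits under data -> article.
if data['success']:
    article = data['data']['article']
    print(article['title'], article['author']['login'])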

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt") as file:
            file.write(i)
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads finish, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields relating both to the article itself and to the author who wrote it.

API.png

I did not dump the full JSON of each article, but kept only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in. So far it only contains en and ru.
  • tags_string - all of the post's tags
  • reading_count
  • author
  • score - the article's rating.

So, by using the API, I cut the script's run time to 8 seconds per 100 urls.

After downloading the data we need, it has to be processed and put into the database. This didn't give me any trouble either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Turn off write confirmation; it speeds things up many times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files\{}.json".format(i)
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of the database's readability you can sacrifice the code's readability. Or not?
                # If you think so, you can simply replace the tuple with the data argument. It's up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Stats

Well, traditionally, to wrap up, here are some statistics pulled from the data (example queries are sketched right after the list):

  • Of the 490,406 posts expected for download, only about 228 thousand articles were actually fetched. It turns out that more than half of the articles on Habr were hidden or deleted.
  • The entire database, covering almost half a million posts, weighs 2.95 GB. Compressed - 495 MB.
  • In total, 37,804 people have posted on Habr. Let me remind you that these statistics are based only on posts.
  • The top author on Habr - alizar - 8,774 articles.
  • The highest-rated article - 1,448 pluses
  • The most-read article - 1,660,841 views
  • The most-discussed article - 2,444 comments
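
The numbers above can be obtained with queries along these lines. This is my reconstruction from the table schema, not the exact queries from the original:

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# How many distinct people have posted
print(c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone())

# The most prolific author
print(c.execute("SELECT author, COUNT(*) FROM articles "
                "GROUP BY author ORDER BY COUNT(*) DESC LIMIT 1").fetchone())

# Highest rating, most views, most comments
for col in ("score", "reading_count", "comments_count"):
    print(c.execute("SELECT title, {} FROM articles "
                    "ORDER BY {} DESC LIMIT 1".format(col, col)).fetchone())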

Well, traditionally, in the form of tops:

Top 15 authors (chart)
Top 15 by rating (chart)
Top 15 by reads (chart)
Top 15 by comments (chart)

source: www.habr.com
