All of Habr in one database

Good afternoon. Two years have passed since the last article about parsing Habr was written, and some things have changed since then.

When I wanted to have my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it turned out and what errors I ran into, you can read under the cut.

TL;DR: link to the database

The first version of the parser. One thread, many problems

To start with, I decided to make a prototype script that would parse an article and put it into the database immediately after downloading it. Without thinking twice, I went with sqlite3, because it meant less hassle: no need for a local server, just create it, look at it, delete it, and that sort of thing.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            with open("req_errors.txt") as file:
                file.write(i)
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is standard: Beautiful Soup plus requests, and a quick prototype is ready. Except that…

  • Pages are downloaded in a single thread.

  • If you interrupt the script, the whole database goes nowhere, because the commit only happens after all the parsing is finished.
    Of course, you could commit after every insert, but then the script's running time grows considerably (a compromise is sketched right after this list).

  • Parsing the first 100,000 articles took me hours.
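A possible middle ground between one giant transaction and a commit after every insert is to commit in batches. Here is a minimal sketch; the batch size of 1000 and the insert_articles helper are my own choices, not something from the script above:

import sqlite3

def insert_articles(rows, batch_size=1000):
    # rows is an iterable of (id, author, title, content, tags) tuples
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")
    for n, row in enumerate(rows, start=1):
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        if n % batch_size == 0:
            conn.commit()  # an interrupted run keeps everything up to the last committed batch
    conn.commit()
    conn.close()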

Then I came across another user's article about parsing Habr, from which I picked up a few life hacks to speed the process up:

  • Multithreading speeds up downloading considerably.
  • You can fetch not the full version of Habr but its mobile version.
    For example, if an article weighs 378 KB in the desktop version, in the mobile version it is only 126 KB (a quick way to check this is sketched below).
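The saving from the mobile version is easy to check for yourself. A quick sketch, assuming the desktop URL scheme habr.com/post/<id>; the post id is arbitrary:

import requests

post_id = 282500  # any existing post id
desktop = requests.get("https://habr.com/post/{}".format(post_id))
mobile = requests.get("https://m.habr.com/post/{}".format(post_id))
print("desktop: {:.0f} KB".format(len(desktop.content) / 1024))
print("mobile:  {:.0f} KB".format(len(mobile.content) / 1024))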

Version two. Many threads, a temporary ban from Habr

While scouring the internet on the subject of multithreading in Python, I chose the simplest option, multiprocessing.dummy, and noticed that problems showed up together with multithreading.
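multiprocessing.dummy exposes the same Pool interface as multiprocessing, but backed by threads rather than processes, which is exactly what an I/O-bound download loop wants. A minimal sketch of the pattern (the fetch function here is just an illustration):

import requests
from multiprocessing.dummy import Pool as ThreadPool

def fetch(i):
    # each call runs in one of the pool's worker threads
    r = requests.get("https://m.habr.com/post/{}".format(i))
    return i, r.status_code

pool = ThreadPool(3)
results = pool.map(fetch, range(1, 101))
pool.close()
pool.join()
print(results[:5])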

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but it is not the only problem: when trying to insert into the database, errors sometimes occur that I could not resolve.

So I decided to give up on inserting articles directly into the database and, remembering the file-based approach from that article, decided to use files instead, since there are no problems with multithreaded writes to files.
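For reference, a sketch of what the check_same_thread=False route looks like, with an explicit lock to serialize the writes; this is only a sketch, not the exact code I was running:

import sqlite3
import threading

# one shared connection; check_same_thread=False allows use from several threads
conn = sqlite3.connect('habr.db', check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author TEXT, title TEXT, content TEXT, tags TEXT)")
lock = threading.Lock()

def save_article(row):
    # serialize writes coming from the worker threads
    with lock:
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        conn.commit()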

Habr starts banning you for using more than three threads.
Particularly zealous attempts to reach Habr can end with an IP ban for a couple of hours. So I had to limit myself to 3 threads, but even that is already a win: fetching 100 articles drops from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and on a large number of articles the download periodically falls over.
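One way to make such a run a little more robust (the script below does not do this) is to retry 503 responses with a pause instead of giving up straight away. A sketch, with the retry count and delay picked arbitrarily:

import time
import requests

def get_with_retry(url, retries=5, delay=10):
    # retry a few times on 503 (the temporary ban) before giving up
    for attempt in range(retries):
        r = requests.get(url)
        if r.status_code != 503:
            return r
        time.sleep(delay * (attempt + 1))  # back off a bit longer each time
    return r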

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try: r = requests.get(url)
    except:
        with open("req_errors.txt") as file:
            file.write(i)
        return 2

    # Π—Π°ΠΏΠΈΡΡŒ Π·Π°Π±Π»ΠΎΠΊΠΈΡ€ΠΎΠ²Π°Π½Π½Ρ‹Ρ… запросов Π½Π° сСрвСр
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Flag showing whether the post is a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error" 
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # ΠžΡ‚ΡΡ‡Π΅Ρ‚ Π²Ρ€Π΅ΠΌΠ΅Π½ΠΈ, запуск ΠΏΠΎΡ‚ΠΎΠΊΠΎΠ²
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads finish, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Version three. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site talks to. It loads faster than the mobile version, since it is just json that does not even need to be parsed. In the end, I decided to rewrite my script once again.

So, having found this API endpoint (https://m.habr.com/kek/v1/articles/), you can start parsing it.
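Before rewriting the whole script, it is worth poking at the endpoint once to see what comes back. A quick sketch; the post id is arbitrary, and the fields accessed are the same ones used in the code below:

import json
import requests

r = requests.get("https://m.habr.com/kek/v1/articles/282500/?fl=ru%2Cen&hl=ru")
data = json.loads(r.text)

if data['success']:
    article = data['data']['article']
    print(article['title'])
    print(article['author']['login'])
    print(article['voting']['score'])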

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt") as file:
            file.write(i)
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # ΠžΡ‚ΡΡ‡Π΅Ρ‚ Π²Ρ€Π΅ΠΌΠ΅Π½ΠΈ, запуск ΠΏΠΎΡ‚ΠΎΠΊΠΎΠ²
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads finish, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields for both the article itself and the author who wrote it.

API.png


I did not dump the full json of each article, but saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in. So far it only contains en and ru.
  • tags_string - all of the post's tags
  • reading_count
  • author
  • score - the article's rating.

Thus, by using the API, I cut the script's execution time down to 8 seconds per 100 urls.

After downloading the data we need, it has to be processed and inserted into the database. I had no problems with that either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds things up several times.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of database readability you can sacrifice code readability. Or not?
                # If you prefer, you can simply replace this tuple with the data argument. Your call.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))
                f.close()

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, at the end we can pull some statistics out of the data:

  • Of the expected 490,406 articles, only 228-odd thousand were actually downloaded. It turns out that more than half of the articles on Habr have been hidden or deleted.
  • The entire database, consisting of almost half a million articles, weighs 2.95 GB. In compressed form: 495 MB.
  • In total, 37,804 people have written for Habr. Let me remind you that these statistics cover live posts only.
  • The most prolific author on Habr is alizar, with 8,774 articles.
  • The highest-rated article has 1,448 upvotes.
  • The most read article has 1,660,841 views.
  • The most discussed article has 2,444 comments.

Well, and in the form of top-15 charts:

  • Top 15 authors
  • Top 15 by rating
  • Top 15 by reads
  • Top 15 by comments
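The numbers and tops above come straight out of SQL against the articles table that parser.py creates. A sketch of the kind of queries involved:

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# number of distinct authors
c.execute("SELECT COUNT(DISTINCT author) FROM articles")
print(c.fetchone()[0])

# top 15 authors by number of articles
c.execute("SELECT author, COUNT(*) AS cnt FROM articles GROUP BY author ORDER BY cnt DESC LIMIT 15")
print(c.fetchall())

# highest-rated, most read and most discussed articles
# (column names are hardcoded, so formatting them into the query is safe here)
for column in ("score", "reading_count", "comments_count"):
    c.execute("SELECT title, {} FROM articles ORDER BY {} DESC LIMIT 1".format(column, column))
    print(c.fetchone())

conn.close()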
