All of Habr in one database

Good afternoon. It has been 2 years since the last article about parsing Habr was written, and some things have changed.

When I wanted to have a copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and which errors I ran into, you can read under the cut.

TL;DR: database link

First version of the parser. One thread, many problems

To begin with, I decided to write a prototype script in which each article would be parsed right after download and placed in the database. Without thinking twice, I used sqlite3, because it was less labor-intensive: no local server is needed, you just create the database, look at it, delete it, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
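        except requests.RequestException:
            # Record the id of the failed request (append mode) and move on
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue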
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except Exception:
            # A missing element makes soup.find() return None, so parsing fails
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is classic: we use Beautiful Soup and requests, and a quick prototype is ready. Only...

  • Pages are downloaded in a single thread.

  • If you interrupt the script, the whole database goes nowhere, because the commit happens only after all the parsing is done.
    Of course, you can commit changes to the database after every insert, but then the script execution time grows significantly.

  • Parsing the first 100,000 articles took me 8 hours.
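
One possible middle ground between a single commit at the very end and a commit after every insert is to commit in batches. The sketch below is not from the original script: the helper name `insert_in_batches` and the `batch_size` value are assumptions for illustration, and the table only mirrors the `habr` table from one_thread.py.

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows and commit every `batch_size` rows.

    Returns the number of commits performed. (Hypothetical helper.)
    """
    cur = conn.cursor()
    commits = 0
    pending = 0
    for row in rows:
        cur.execute("INSERT INTO habr VALUES (?, ?, ?, ?, ?)", row)
        pending += 1
        if pending >= batch_size:
            conn.commit()  # at most `batch_size` rows are lost on interruption
            commits += 1
            pending = 0
    if pending:
        conn.commit()  # flush the last partial batch
        commits += 1
    return commits

# Demo on an in-memory database with made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE habr(id INT, author TEXT, title TEXT, content TEXT, tags TEXT)")
rows = [(i, "author", "title", "content", "tags") for i in range(2500)]
commits = insert_in_batches(conn, rows, batch_size=1000)
count = conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
print(commits, count)
```

With this approach an interrupted run loses at most one batch, while the number of commits stays small compared with committing every row.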

Next I found an article by the user cointegrated, read it, and picked up a few life hacks to speed up the process:

  • Using multithreading speeds up downloading several times over.
  • You can fetch the mobile version of Habr instead of the full one.
    For example, an article by cointegrated weighs 378 KB in the desktop version, but only 126 KB in the mobile version.

Second version. Many threads, a temporary ban from Habr

After scouring the Internet on the topic of multithreading in Python, I chose the simplest option, multiprocessing.dummy, and noticed that problems arrived together with the multithreading.

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but that error is not the only one: when trying to insert into the database, errors sometimes occurred that I could not resolve.
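
For reference, a minimal sketch of sharing one connection between threads. It combines `check_same_thread=False` with a `threading.Lock` so that only one thread touches the connection at a time. This is an assumption about how such errors are commonly avoided, not the approach the article ends up using; the table and the `save` function are made up for the demo.

```python
import sqlite3
import threading
from multiprocessing.dummy import Pool as ThreadPool

# One connection shared by all worker threads
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE habr(id INT, title TEXT)")
db_lock = threading.Lock()  # serializes every access to the connection

def save(i):
    # Hypothetical worker: only the database write is shown
    with db_lock:
        conn.execute("INSERT INTO habr VALUES (?, ?)", (i, "title {}".format(i)))

pool = ThreadPool(3)
pool.map(save, range(100))
pool.close()
pool.join()

with db_lock:
    conn.commit()
    saved = conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
print(saved)
```

The lock means that database writes are effectively single-threaded again; only the downloads would run in parallel.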

Therefore, I decided to give up on inserting articles straight into the database and, remembering cointegrated's solution, decided to use files, since multithreaded writing to separate files causes no problems.

Habr starts banning you for using more than three threads.
Particularly zealous attempts to get through to Habr can end in an IP ban for a couple of hours. So you have to use only 3 threads, but that is already an improvement: the time to go through 100 articles drops from 26 seconds to 12.

It is worth noting that this version is rather unstable, and on a large number of articles the download periodically falls over.
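
One way to make such a downloader more tolerant of temporary failures is to retry a request with a growing delay. The sketch below is an illustration and not part of the original scripts: `fetch_with_retry`, its parameters, and the fake fetch function are all hypothetical. The `fetch` callable is assumed to return a `(status_code, body)` pair; with requests it could wrap `requests.get`, whose `RequestException` is a subclass of `OSError`.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-503 status or attempts run out."""
    status = None
    for attempt in range(attempts):
        try:
            status, body = fetch(url)
        except OSError:
            # Network-level failure: treat it like a blocked request
            status, body = None, None
        if status is not None and status != 503:
            return status, body
        if attempt < attempts - 1:
            # Wait 1x, 2x, 4x ... base_delay before the next attempt
            sleep(base_delay * 2 ** attempt)
    return status, None

# Demo without network access: the fake server answers 503 twice, then 200
calls = []

def fake_fetch(url):
    calls.append(url)
    return (503, None) if len(calls) < 3 else (200, "ok")

delays = []
result = fetch_with_retry(fake_fetch, "https://m.habr.com/post/1", sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instant; in a real run the default `time.sleep` would be used.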

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    # A forward slash works on both Windows and POSIX and avoids the invalid "\{" escape
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        # Record the id of the failed request (append mode)
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests blocked by the server
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Label indicating that the post is a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except Exception:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP address
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Third version. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile version, since it is just json, which does not even need to be parsed. In the end, I decided to rewrite my script once more.

So, having found this link to the API, you can start parsing it.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    # A forward slash works on both Windows and POSIX and avoids the invalid "\{" escape
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        # Record the id of the failed request (append mode)
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP address
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads have finished, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The API response contains fields related both to the article itself and to the author who wrote it.

API.png

I did not dump the full json of each article, but saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang is the language the article is written in. So far it only contains en and ru.
  • tags_string - all the tags from the post
  • reading_count
  • author
  • score - the article rating.

Thus, using the API, I reduced the script execution time to 8 seconds per 100 URLs.

After we have downloaded the data we need, we have to process it and put it into the database. I had no problems with this either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds things up several times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                # The context manager closes the file even if loading fails
                with open(filename) as f:
                    data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of a more readable database you can sacrifice readability of the code. Or not?
                # If you think so, you can simply replace the tuple with the data argument. It is up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, by tradition, at the end you can extract some statistics from the data:

  • Of the expected 490,406 downloads, only about 228 thousand articles were actually downloaded. It turns out that more than half of the articles on Habr were hidden or deleted.
  • The entire database, consisting of almost half a million articles, weighs 2.95 GB. In compressed form, 495 MB.
  • In total, 37,804 people are authors on Habr. As a reminder, these statistics come only from live posts.
  • The most prolific author on Habr is alizar, with 8,774 articles.
  • Top rated article: 1,448 pluses
  • Most read article: 1,660,841 views
  • Most discussed article: 2,444 comments
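
Figures of this kind can be obtained with plain SQL over the `articles` table created by parser.py. The queries below are only a sketch of how that might look, not the author's actual queries, and the three sample rows are made up so that the example runs on an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE articles(id INTEGER, time_published TEXT, author TEXT,
    title TEXT, content TEXT, lang TEXT, comments_count INTEGER,
    reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)""")

# Made-up sample rows, for illustration only
sample = [
    (1, "2019-01-01", "alice", "A", "", "ru", 10, 500, 7, 0, ""),
    (2, "2019-01-02", "alice", "B", "", "ru", 3, 900, 15, 1, ""),
    (3, "2019-01-03", "bob", "C", "", "en", 25, 100, 2, 0, ""),
]
conn.executemany("INSERT INTO articles VALUES (?,?,?,?,?,?,?,?,?,?,?)", sample)

# Number of distinct authors
author_count = conn.execute(
    "SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]

# Top 15 authors by number of articles
top_authors = conn.execute(
    "SELECT author, COUNT(*) AS n FROM articles "
    "GROUP BY author ORDER BY n DESC LIMIT 15").fetchall()

# Best rating, most views, most comments
best_score = conn.execute("SELECT MAX(score) FROM articles").fetchone()[0]
most_read = conn.execute("SELECT MAX(reading_count) FROM articles").fetchone()[0]
most_discussed = conn.execute("SELECT MAX(comments_count) FROM articles").fetchone()[0]

print(author_count, top_authors, best_score, most_read, most_discussed)
```

Pointing `sqlite3.connect` at habr.db instead of `:memory:` and dropping the sample rows would run the same queries against the real data.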

Well, and in the form of top lists:

  • Top 15 authors
  • Top 15 by rating
  • Top 15 most read
  • Top 15 most discussed

Source: www.habr.com
