All of Habr in one database

Good afternoon. It has been two years since the last article about parsing Habr was written, and some things have changed since then.

When I decided I wanted my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into — you can read about that under the cut.

TL;DR — the database

The first version of the parser. One thread, many problems

To start with, I decided to build a prototype: a script in which an article would be parsed and put into the database immediately after being downloaded. Without thinking twice, I went with sqlite3, because it meant less work: no need to run a local server, no create-look-delete ceremony, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    # A single transaction for the whole run: nothing is persisted until the final commit
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything here is classic — we use Beautiful Soup and requests, and a quick prototype is ready. Except that…

  • The download runs in a single thread

  • If you interrupt the script, none of the data goes anywhere, because the commit happens only after all the parsing is done.
    Of course, you can commit to the database after every insert, but then the script's execution time grows significantly. (A batched middle ground is sketched after this list.)

  • Parsing the first 100 000 articles took me almost 8 hours.
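As a middle ground between one giant transaction and a commit per row, you can commit in batches. A minimal sketch against the same habr table as above (the batch size of 1000 is my arbitrary choice, not something from the original script):

import sqlite3

def insert_batched(rows, batch_size=1000):
    # rows: an iterable of (id, author, title, content, tags) tuples
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    for n, row in enumerate(rows, start=1):
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        # Commit every batch_size rows: an interrupt loses at most one batch,
        # while avoiding the fsync cost of a commit on every single insert.
        if n % batch_size == 0:
            conn.commit()
    conn.commit()
    conn.close()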

Next, I came across an article by another user, read it, and found a few life hacks for speeding this job up:

  • Using multithreading speeds up the download considerably.
  • You can fetch not the full version of Habr, but its mobile version (see the size-check sketch after this list).
    For example, if an article weighs 378 KB in the desktop version, the mobile version is already down to 126 KB.
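The size difference is easy to verify yourself. A minimal sketch, assuming a post that exists in both versions (the post id below is a hypothetical example):

import requests

post_id = 282552  # hypothetical example id

for url in ("https://habr.com/post/{}".format(post_id),
            "https://m.habr.com/post/{}".format(post_id)):
    r = requests.get(url)  # requests follows redirects by default
    # len(r.content) is the size of the response body in bytes
    print(url, len(r.content) // 1024, "KB")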

The second version. Many threads, a temporary ban from Habr

While scouring the internet on the subject of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with the extra threads.

SQLite3 does not want to work with more than one thread.
The check_same_thread=False flag fixes that, but it is not the only error: when trying to insert into the database, errors sometimes occur that I was unable to resolve.
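For reference, that flag only disables the thread-ownership check — it does not make the connection thread-safe, so writes still have to be serialized by hand. A minimal sketch of that workaround (not the route the final version takes):

import sqlite3
import threading

# Allow the same connection to be touched from several threads...
conn = sqlite3.connect('habr.db', check_same_thread=False)
lock = threading.Lock()

def insert_row(row):
    # ...but serialize the actual writes ourselves, since the
    # connection object itself is not thread-safe.
    with lock:
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        conn.commit()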

So I decided to give up on inserting articles straight into the database and, remembering the solution from the article mentioned above, decided to use files instead, since there are no problems with multi-threaded writing to files.

Habr started banning me for using more than three threads.
Particularly zealous attempts to get through to Habr could end with an IP ban for a couple of hours. So you have to limit yourself to just 3 threads, but even that is already an improvement, since the time to iterate over 100 articles dropped from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and on large batches of articles the download periodically falls over.
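One way to make such runs less fragile (my addition, not part of the original script) is to retry failed requests with a growing pause, while still staying within the 3-thread limit:

import time
import requests

def get_with_retries(url, attempts=3, delay=5):
    # Retry transient failures (connection errors, 503 responses)
    # a few times before giving up, sleeping between attempts.
    for attempt in range(attempts):
        try:
            r = requests.get(url)
            if r.status_code != 503:
                return r
        except requests.RequestException:
            pass
        time.sleep(delay * (attempt + 1))
    return None  # the caller logs the id and moves on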

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log the requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # A label showing whether the post is a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # If there are more than 3 threads,
    # Habr bans the IP for a while
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads are closed, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The third version. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site goes through. It loads faster than the mobile version, since it is plain json, which does not even need to be parsed. In the end, I decided to rewrite my script once again.

So, having found this API link, you can start parsing it.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # If there are more than 3 threads,
    # Habr bans the IP for a while
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads are closed, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields related both to the article itself and to the author who wrote it.

API.png (screenshot of the API response)
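You can poke at those fields yourself by requesting a single post and inspecting the payload. A minimal sketch against the same endpoint async_v2.py uses (the post id is just an example; the key names are taken from that script):

import requests

url = "https://m.habr.com/kek/v1/articles/282552/?fl=ru%2Cen&hl=ru"  # example id
data = requests.get(url).json()

article = data['data']['article']
print(sorted(article.keys()))  # all top-level fields of the article object
print(article['title'], article['author']['login'], article['voting']['score'])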


I did not dump the full json of every article; I saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang — the language the article is written in. So far it only takes the values en and ru.
  • tags_string — all of the post's tags
  • reading_count
  • author
  • score — the article's rating.

And so, using the API, I cut the script's execution time down to 8 seconds per 100 urls.

Once the data we need has been downloaded, it has to be processed and inserted into the database. I did not run into any problems with that either:

parser.py

import json
import os
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Turn off write confirmation; this speeds things up several times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = os.path.join("files", "{}.json".format(i))
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of the database's readability you can sacrifice the code's readability. Or can you?
                # If that bothers you, just replace the tuple with the data argument. Your call.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))
                f.close()

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, to finish off, you can pull some statistics out of the data (a query sketch follows the list):

  • Of the expected 490 406 articles, only 228 512 were downloaded. It turns out that more than half of the articles on Habré had been hidden or deleted.
  • The whole database, nearly half a million articles, weighs 2.95 GB. Compressed — 495 MB.
  • In total, 37 804 people are authors on Habré. A reminder that these statistics only cover live posts.
  • The most prolific author on Habré is alizar — 8774 articles.
  • The highest-rated article — 1448 pluses
  • The most-read article — 1 660 841 views
  • The most-discussed article — 2444 comments
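Numbers like these fall out of a few aggregate queries against the resulting database. A minimal sketch over the articles table from parser.py (the queries are mine, not from the original post):

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# How many distinct authors have live posts
print(c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone())

# Top 15 authors by article count
for row in c.execute("""SELECT author, COUNT(*) AS n FROM articles
                        GROUP BY author ORDER BY n DESC LIMIT 15"""):
    print(row)

# The best-rated, most-read and most-discussed articles
# (SQLite returns the title from the row holding the maximum)
print(c.execute("SELECT title, MAX(score) FROM articles").fetchone())
print(c.execute("SELECT title, MAX(reading_count) FROM articles").fetchone())
print(c.execute("SELECT title, MAX(comments_count) FROM articles").fetchone())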

And, traditionally, the tops:
  • Top 15 authors (chart)
  • Top 15 by rating (chart)
  • Top 15 by reads (chart)
  • Top 15 by comments (chart)

Source: www.habr.com
