All of Habr in one database

Greetings. It has been 2 years since the last article about parsing Habr was written, and some things have changed since then.

When I wanted a full copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into — read under the cut.

TLDR — link to the database

The first version of the parser. A single thread, many problems

To begin with, I decided to make a prototype script in which an article would be parsed immediately upon download and placed in the database. Without thinking twice, I went with sqlite3, because it was less labor-intensive: no local server needed — create it, look at it, delete it, that sort of thing.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything here is classic — we use BeautifulSoup and requests, and a quick prototype is ready. It's just that…

  • Pages are downloaded in a single thread

  • If you interrupt the script, the whole database goes nowhere, since the commit happens only after all the parsing is done.
    You can, of course, commit to the database after each insert, but then the script's running time grows considerably.

  • It took hours just to parse the first 100 000 articles.
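The commit trade-off from the list above can be sketched on a throwaway in-memory table (this is my own illustration, not the parser itself): one big transaction versus a commit after every insert.

```python
import sqlite3
from datetime import datetime

def insert_rows(batch_commit, n=10000):
    # In-memory database: a stand-in for habr.db.
    conn = sqlite3.connect(':memory:')
    c = conn.cursor()
    c.execute("CREATE TABLE habr(id INT, title TEXT)")
    start = datetime.now()
    for i in range(n):
        c.execute("INSERT INTO habr VALUES (?, ?)", (i, "title"))
        if not batch_commit:
            conn.commit()   # commit after every insert: crash-safe, but slow on disk
    if batch_commit:
        conn.commit()       # one commit at the end: fast, but lost if interrupted
    elapsed = (datetime.now() - start).total_seconds()
    count = c.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
    conn.close()
    return count, elapsed

rows, t_batch = insert_rows(batch_commit=True)
rows2, t_every = insert_rows(batch_commit=False)
print(rows, rows2)  # both variants insert every row; only durability and speed differ
```

On a real on-disk database the per-insert commits are dramatically slower, which is exactly why the script above commits only once.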

Next, I found an article by another user on the subject; after reading it, I picked up a few life hacks to speed up the process:

  • Multithreading speeds up downloading considerably.
  • You can fetch not the full version of Habr, but its mobile version.
    For example, if an article weighs 378 KB in the desktop version, in the mobile version it is already only 126 KB.

Version two. Many threads, a temporary ban from Habr

While scouring the internet on the topic of multithreading in python, I chose the simplest option, multiprocessing.dummy, and noticed that problems arrived together with the multithreading.

SQLite3 does not want to work with more than one thread.
check_same_thread=False fixes that, but this error is not the only one: when trying to insert into the database, errors sometimes occur that I could not resolve.
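One way to make check_same_thread=False behave is to serialise the writes with a lock; a minimal sketch of that idea (my own guard, not the original code):

```python
import sqlite3
import threading

# A single shared connection; check_same_thread=False lets worker threads use it,
# but sqlite3 still dislikes concurrent writes, so we guard them with a lock.
conn = sqlite3.connect(':memory:', check_same_thread=False)
conn.execute("CREATE TABLE habr(id INT)")
lock = threading.Lock()

def insert(i):
    with lock:  # serialise writes to avoid sporadic insertion errors
        conn.execute("INSERT INTO habr VALUES (?)", (i,))

threads = [threading.Thread(target=insert, args=(i,)) for i in range(50)]
for t in threads: t.start()
for t in threads: t.join()
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0])  # 50
```

The lock costs throughput, of course, which is part of why the file-based approach below ends up more attractive.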

Therefore, I decide to abandon inserting articles directly into the database on the fly and, remembering that article's solution, I decide to use files, because there are no problems with writing to files from many threads.
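The point about files can be shown in a few lines: when each thread writes its own file, no synchronisation is needed at all (a sketch with a temporary directory standing in for files/):

```python
import json
import os
import tempfile
from multiprocessing.dummy import Pool as ThreadPool

outdir = tempfile.mkdtemp()  # stand-in for the files/ directory

def worker(i):
    # Each thread writes its own file, so no locking is needed.
    path = os.path.join(outdir, "{}.json".format(i))
    with open(path, "w") as f:
        json.dump([i, "title {}".format(i)], f)
    return path

pool = ThreadPool(3)
paths = pool.map(worker, range(20))
pool.close()
pool.join()
print(len(paths))  # 20 files written without any synchronisation
```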

Habr starts banning you for using more than three threads.
Particularly zealous attempts to reach Habr can end with an ip ban for a couple of hours. So you have to use only 3 threads, but even that is already good, since the time to iterate over 100 articles drops from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and the downloads periodically fail on large numbers of articles.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try: r = requests.get(url)
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Flag marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article out as json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the ip
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads are done
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Version three. Final

While finalizing the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site accesses. It loads faster than the mobile version, since it is pure json, which does not even need to be parsed. In the end, I decided to rewrite my script yet again.

So, having discovered this API link, you can start parsing it.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the ip
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads are done
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

It contains fields relating both to the article itself and to the author who wrote it.

API.png


I did not dump the full json of each article, but kept only the fields I need:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang — the language the article is written in. So far there are only en and ru.
  • tags_string — all the tags from the post
  • reading_count
  • author
  • score — the article's rating.
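Extracting those fields from the API response can be sketched against a fabricated payload (the shape follows the worker above; all values here are made up for illustration):

```python
import json

# A made-up response in the shape the worker expects from the API.
raw = json.dumps({
    "success": True,
    "data": {"article": {
        "id": 1, "is_tutorial": False, "time_published": "2019-01-01T00:00:00Z",
        "comments_count": 3, "lang": "ru", "tags_string": "python, habr",
        "title": "Test", "text_html": "<p>body</p>", "reading_count": 100,
        "author": {"login": "someone"}, "voting": {"score": 7},
    }},
})

data = json.loads(raw)
if data['success']:
    a = data['data']['article']
    # The same tuple order that parser.py later unpacks from {id}.json
    row = (a['id'], a['is_tutorial'], a['time_published'], a['title'],
           a['text_html'], a['comments_count'], a['lang'], a['tags_string'],
           a['reading_count'], a['author']['login'], a['voting']['score'])
print(row)
```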

So, by using the API, I cut the script's running time down to 8 seconds per 100 urls.

After the data we need has been downloaded, it has to be processed and put into the database. I had no problems with this either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Turn off write confirmation; this speeds things up severalfold.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                with open(filename) as f:
                    data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of database readability you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, to finish off, you can extract some statistics from the data:

  • Of the expected 490 406 articles, only around 228 thousand were downloaded. It turns out that more than half of the posts on Habr had been hidden or deleted.
  • The entire database, containing nearly half a million articles, weighs 2.95 GB. Compressed — 495 MB.
  • In total, 37804 people are authors on Habr. Keep in mind that this statistic covers only published articles.
  • The most prolific author on Habr is alizar — 8774 articles.
  • The highest-rated article — 1448 pluses
  • The most read article — 1660841 views
  • The most discussed article — 2444 comments
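All of the statistics above boil down to simple SQL over the articles table; a sketch on a tiny in-memory sample (the column names follow parser.py, the data is fabricated):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE articles(id INTEGER, author TEXT, score INTEGER, "
          "reading_count INTEGER, comments_count INTEGER)")
rows = [(1, "alice", 10, 500, 3), (2, "bob", 2, 100, 1), (3, "alice", 7, 900, 20)]
c.executemany("INSERT INTO articles VALUES (?, ?, ?, ?, ?)", rows)

# Number of distinct authors
authors = c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]
# Most prolific author
top_author = c.execute("SELECT author, COUNT(*) AS n FROM articles "
                       "GROUP BY author ORDER BY n DESC LIMIT 1").fetchone()
# Highest rating, most reads, most comments
best = c.execute("SELECT MAX(score), MAX(reading_count), MAX(comments_count) "
                 "FROM articles").fetchone()
print(authors, top_author, best)  # 2 ('alice', 2) (10, 900, 20)
```

Run against the real habr.db, the same queries produce the numbers listed above.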

Well, and in the form of top charts:

Top 15 authors
Top 15 by rating
Top 15 most read
Top 15 most discussed

Source: www.habr.com
