All of Habr in one database

Good afternoon. It has been two years since the last article about parsing Habr was written, and some things have changed since then.

When I decided I wanted a copy of Habr, I set out to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into — you can read about that below the cut.

TL;DR — link to the database

First version. One thread, many problems

To start with, I decided to make a prototype script in which an article would be parsed and put into the database as soon as it was downloaded. Without thinking twice, I used sqlite3, since it takes less effort: no need to run a local server, no create-look-delete ceremony, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            with open("req_errors.txt") as file:
                file.write(i)
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "При парсинге этой странице произошла ошибка."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is classic: we use Beautiful Soup and requests, and a quick prototype is ready. It's just that…

  • Pages are downloaded in a single thread.

  • If you interrupt the script, the whole database is lost, because the commit happens only after all of the parsing.
    Of course, you could commit after every single insert, but that would noticeably increase the script's running time (a compromise is sketched right after this list).

  • Parsing the first 100,000 articles took me many hours: at roughly 26 seconds per 100 articles, that works out to more than 7 hours.
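
For reference, a middle ground between those two extremes is to commit every N inserts instead of once at the very end or after every row. This is not what the script above does; a minimal sketch, with an arbitrary batch size:

import sqlite3

def insert_batched(rows, batch_size=1000):
    # rows is an iterable of (id, author, title, content, tags) tuples
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    for n, row in enumerate(rows, 1):
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        if n % batch_size == 0:
            conn.commit()  # at most batch_size rows are lost if the script is interrupted
    conn.commit()  # flush whatever is left
    conn.close()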

Then I came across an article by the user cointegrated, read it, and found a few life hacks that speed this process up:

  • Using multithreading speeds up downloading considerably.
  • You can fetch not the full version of Habr but its mobile version.
    For example, where the cointegrated article weighs 378 KB in the desktop version, the mobile version is only 126 KB (a quick way to check this is sketched after the list).
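
The size difference is easy to check for yourself. A minimal sketch (the post id is arbitrary, and the desktop URL pattern is an assumption here; only the mobile one is used later in this article):

import requests

post_id = 1000  # any existing article id
desktop = requests.get("https://habr.com/post/{}".format(post_id))
mobile = requests.get("https://m.habr.com/post/{}".format(post_id))
print("desktop: {} KB, mobile: {} KB".format(len(desktop.content) // 1024,
                                             len(mobile.content) // 1024))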

Second version. Many threads, a temporary ban from Habr

When I scoured the Internet on the subject of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems showed up along with the threads.

SQLite3 refuses to work with more than one thread.
It can be fixed with check_same_thread=False, but that is not the only problem: when trying to insert into the database, errors sometimes occur that I never managed to solve.
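
The usual workaround (not the route taken here) is to share one connection opened with check_same_thread=False and serialize writes with a lock. A minimal sketch:

import sqlite3
import threading

conn = sqlite3.connect('habr.db', check_same_thread=False)
db_lock = threading.Lock()

def insert_article(row):
    # sqlite3 objects are not thread-safe, so guard every write with a lock
    with db_lock:
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        conn.commit()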

In the end I decided to give up inserting articles straight into the database and, remembering the cointegrated solution, to use files instead, since there are no problems with writing to files from multiple threads.

Habr starts banning you for using more than three threads.
Particularly zealous attempts to hammer Habr can end with an IP ban for a couple of hours. So I had to stick to just 3 threads, but even that is already good, since the time to go through 100 articles drops from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and on a large number of articles the downloads periodically drop off.
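
One way to soften this, which the script below does not do, is to retry a failed request a couple of times with a pause before giving up. A minimal sketch:

import time
import requests

def get_with_retries(url, retries=3, delay=5):
    # Try the request a few times; Habr answers 503 when it throttles you
    for attempt in range(retries):
        try:
            r = requests.get(url)
            if r.status_code != 503:
                return r
        except requests.RequestException:
            pass
        time.sleep(delay)
    return None  # still failing after all retries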

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try: r = requests.get(url)
    except:
        with open("req_errors.txt") as file:
            file.write(i)
        return 2

    # Log requests that were blocked by the server
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Flag marking the post as a translation or a tutorial
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error" 
        content = "При парсинге этой странице произошла ошибка."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Write the article to a json file
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time after all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Third version. Final

While debugging the second version, I discovered that Habr, it turns out, has an API that the mobile version of the site talks to. It loads faster than the mobile version, since it is just json that barely even needs parsing. In the end I decided to rewrite my script once more.

So, having found this API link, we can start parsing it.
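
Before wiring it into a script, it is easy to poke at the response by hand and see that everything needed is already there. A minimal sketch (the article id is arbitrary):

import requests

url = "https://m.habr.com/kek/v1/articles/100000/?fl=ru%2Cen&hl=ru"
data = requests.get(url).json()

if data['success']:
    article = data['data']['article']
    print(article['title'], article['author']['login'], article['voting']['score'])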

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt") as file:
            file.write(i)
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time after all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

It contains fields for both the article itself and the author who wrote it.

API.png


I did not dump the full json of every article; I saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang — the language the article is written in. So far it only contains en and ru.
  • tags_string — all of the post's tags
  • reading_count
  • author
  • score — the article's rating.

So, using the API, I got the script's running time down to 8 seconds per 100 urls.

Once the data we need has been downloaded, it has to be processed and put into the database. I had no problems with that either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Disable write confirmation; this speeds things up several times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files\{}.json".format(i)
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of database readability you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))
                f.close()

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, to finish off, we can extract a few statistics from the data (queries along these lines are sketched right after the list):

  • Of the expected 490,406 downloads, only about 228 thousand articles were actually received. It turns out that more than half of the articles on Habr have been hidden or deleted.
  • The entire database of downloaded articles weighs 2.95 GB; in compressed form — 495 MB.
  • In total there are 37,804 authors on Habr. Keep in mind that these statistics are based only on live posts.
  • The most prolific author on Habr is alizar — 8,774 articles.
  • The top-rated article has 1,448 upvotes.
  • The most-read article has 1,660,841 views.
  • The most-discussed article has 2,444 comments.
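
The original queries are not shown, but numbers like these fall out of simple aggregations over the articles table; a rough sketch:

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# Total number of authors (live posts only)
c.execute("SELECT COUNT(DISTINCT author) FROM articles")
print(c.fetchone()[0])

# Top 15 authors by number of articles
c.execute("SELECT author, COUNT(*) AS cnt FROM articles GROUP BY author ORDER BY cnt DESC LIMIT 15")
print(c.fetchall())

# Most-read article
c.execute("SELECT title, reading_count FROM articles ORDER BY reading_count DESC LIMIT 1")
print(c.fetchone())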

And in chart form:

  • Top 15 authors
  • Top 15 by rating
  • Top 15 by reads
  • Top 15 by comments

Source: www.habr.com
