All Habr in one database

Good afternoon. It has been 2 years since the last article about parsing Habr was written, and some things have changed since then.

When I wanted to have a copy of Habr, I decided to write a parser that would save all the content of the authors to the database. How it happened and what errors I encountered - you can read under the cut.

TL;DR: database link

The first version of the parser. One thread, many problems

To begin with, I decided to write a prototype script that would parse an article and put it into the database immediately after downloading it. Without thinking twice, I went with sqlite3, because it required the least effort: no local server to run, just create the file, look at it, delete it, and so on.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            # Log the ids whose request failed
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author,title,tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "При парсинге этой странице произошла ошибка."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

Everything is classic: Beautiful Soup, requests, and a quick prototype is ready. Except that…

  • Pages are downloaded in a single thread.

  • If you interrupt the script, the whole database is lost, because the commit happens only after all the parsing is done.
    Of course, you can commit after every insertion, but then the script execution time increases significantly (a batched-commit compromise is sketched right after this list).

  • Parsing the first 100,000 articles took me many hours.
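
A middle ground would be to commit in batches, so that an interruption loses at most the last batch. This is only a minimal sketch of the idea; the habr table is the one created above, and parsed_articles is a stand-in for whatever produces the rows:

import sqlite3

BATCH_SIZE = 100  # commit every 100 rows; an interruption loses at most one unfinished batch

parsed_articles = []  # stand-in for an iterable of (id, author, title, content, tags) tuples

conn = sqlite3.connect('habr.db')
c = conn.cursor()

for n, row in enumerate(parsed_articles, start=1):
    c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
    if n % BATCH_SIZE == 0:
        conn.commit()

conn.commit()  # flush the tail that did not fill a whole batch
conn.close()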

Next, I came across an article by the user cointegrated, read it, and picked up a few life hacks to speed the process up:

  • Using multithreading speeds up downloading several-fold.
  • You can fetch the mobile version of Habr instead of the full one (a quick size check is sketched below).
    For example, the cointegrated article weighs 378 KB in the desktop version and only 126 KB in the mobile one.
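
The difference is easy to check for yourself; a quick sketch (the post id here is arbitrary, and the exact numbers will differ from article to article):

import requests

post_id = 1  # arbitrary post id, just for illustration

desktop = requests.get("https://habr.com/post/{}".format(post_id))
mobile = requests.get("https://m.habr.com/post/{}".format(post_id))

# Compare the raw HTML sizes in kilobytes
print("desktop: {:.0f} KB".format(len(desktop.content) / 1024))
print("mobile:  {:.0f} KB".format(len(mobile.content) / 1024))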

Second version. Many threads, temporary ban from Habr

After scouring the Internet on the topic of multithreading in Python, I chose the simplest option, multiprocessing.dummy, and noticed that new problems arrived together with multithreading.

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but that error is not the only one: when trying to insert into the database, errors sometimes occur that I could not solve.
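
For reference, this is roughly what that workaround looks like; the flag only disables sqlite3's thread check, so the inserts still have to be serialized by hand, for example with a lock (a sketch, not the code I ended up using):

import sqlite3
import threading

# check_same_thread=False only silences the thread check;
# the connection itself is still not thread-safe, so guard it with a lock.
conn = sqlite3.connect('habr.db', check_same_thread=False)
lock = threading.Lock()

def insert_article(row):
    with lock:
        conn.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        conn.commit()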

Therefore, I decided to abandon inserting articles straight into the database and, remembering the cointegrated solution, decided to use files instead, since there are no problems with writing to files from multiple threads.

Habr starts banning you for using more than three threads.
Particularly zealous attempts to hammer Habr can end with an IP ban for a couple of hours. So only 3 threads can be used, but even that is a win: the time to iterate over 100 articles drops from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and on a large number of articles the downloads periodically fall over.
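
One way to make the downloads less fragile (I did not do this at the time) is to let requests retry with an increasing delay instead of giving up on the first 503; a sketch:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A session that retries failed requests a few times with an increasing delay
session = requests.Session()
retries = Retry(total=3, backoff_factor=2, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

r = session.get("https://m.habr.com/post/1", timeout=10)
print(r.status_code)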

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except:
        # Log the ids whose request failed
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests that the server blocked
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
            logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or was hidden
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # A label marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Save the article to a json file
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads finish, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

Third version. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile version, since it is plain json that does not even need to be parsed. In the end, I decided to rewrite my script once again.

So, having found the API link, you can start parsing it.
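
Before rewriting everything, it is easy to poke at the endpoint and look at the shape of the response (the post id here is arbitrary):

import requests

# Arbitrary post id, just to see which fields the API returns
url = "https://m.habr.com/kek/v1/articles/1/?fl=ru%2Cen&hl=ru"
data = requests.get(url).json()

if data['success']:
    article = data['data']['article']
    print(sorted(article.keys()))  # title, text_html, author, voting, tags_string, ...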

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        # Log the ids whose request failed
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Необходимы параметры min и max. Использование: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the timer and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # After all threads finish, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

The response contains fields related both to the article itself and to the author who wrote it.

I did not dump the full json of each article, but saved only the fields I needed:

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang - the language the article is written in; so far it only takes the values en and ru.
  • tags_string - all tags from the post
  • reading_count
  • author
  • score — article rating.

Thus, using the API, I reduced the script execution time to 8 seconds per 100 URLs.
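
For a rough sense of scale: at 8 seconds per 100 URLs, the full range of about 490,000 ids works out to roughly 490,000 / 100 × 8 ≈ 39,000 seconds, that is, on the order of 11 hours of downloading with 3 threads.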

Once the data we need has been downloaded, it has to be processed and written into the database. I did not have any problems with this either:

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Turn off write confirmation; this speeds things up several-fold.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, 
    lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files\{}.json".format(i)
                f = open(filename)
                data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of a more readable database you can sacrifice code readability. Or not?
                # If you think so, you can simply replace the tuple with the data argument. Up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                        title, content, lang,
                                                                                        comments_count, reading_count,
                                                                                        score, is_tutorial,
                                                                                        tags_string))
                f.close()

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

Well, traditionally, to finish, we can pull some statistics out of the data (see the query sketch after the list):

  • Of the expected 490,406 posts, only about 228 thousand articles were actually downloaded. It turns out that more than half of the posts on Habr have been hidden or deleted.
  • The entire database, covering almost half a million posts, weighs 2.95 GB. In compressed form - 495 MB.
  • In total, there are 37,804 authors on Habr. Keep in mind that these statistics only cover posts that are still live.
  • The most prolific author on Habr - alize - 8,774 articles.
  • The highest-rated article - 1,448 upvotes.
  • The most read article - 1,660,841 views.
  • The most discussed article - 2,444 comments.
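
Those numbers come straight out of the database; queries along these lines against the articles table from parser.py are enough (a sketch):

import sqlite3

conn = sqlite3.connect('habr.db')
c = conn.cursor()

# Number of distinct authors
c.execute("SELECT COUNT(DISTINCT author) FROM articles")
print(c.fetchone()[0])

# The most prolific author
c.execute("SELECT author, COUNT(*) AS cnt FROM articles GROUP BY author ORDER BY cnt DESC LIMIT 1")
print(c.fetchone())

# The highest-rated, most read and most discussed articles
for column in ("score", "reading_count", "comments_count"):
    c.execute("SELECT title, {} FROM articles ORDER BY {} DESC LIMIT 1".format(column, column))
    print(c.fetchone())

conn.close()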

Well, and in the form of top charts:

  • Top 15 authors
  • Top 15 by rating
  • Top 15 most read
  • Top 15 most discussed

Source: habr.com
