How I parsed Habr, part 1: trends

When the New Year's Olivier salad was finished and I had nothing left to do, I decided to download all the articles from Habrahabr (and related platforms) to my computer and explore them.

Several interesting stories came out of this. The first is how the format and topics of the articles evolved over the site's 12 years of existence. The dynamics of some topics, for example, are quite revealing. Details are under the cut.

Parsing process

To understand how Habr developed, I needed to crawl all of its articles and extract meta-information from them (for example, the dates). Crawling was easy, because links to all articles look like "habrahabr.ru/post/337722/" and the numbers are assigned strictly in order. Knowing that the latest post has a number slightly below 350 thousand, I simply looped over all possible document ids (Python code):

import numpy as np
from multiprocessing import Pool

# the task is network-bound, so a pool far larger than the number of
# CPU cores pays off (more on that below)
with Pool(100) as p:
    docs = p.map(download_document, np.arange(350000))

The download_document function loads the page with the given id and extracts meaningful information from its HTML structure:

import pickle

import requests
from bs4 import BeautifulSoup

def download_document(pid):
    """ Download and process a Habr document and its comments """
    # download the document
    r = requests.get('https://habrahabr.ru/post/' + str(pid) + '/')
    # parse the document
    soup = BeautifulSoup(r.text, 'html5lib')  # instead of html.parser
    doc = {}
    doc['id'] = pid
    if not soup.find("span", {"class": "post__title-text"}):
        # this happens if the article never existed or has been deleted
        doc['status'] = 'title_not_found'
    else:
        doc['status'] = 'ok'
        doc['title'] = soup.find("span", {"class": "post__title-text"}).text
        doc['text'] = soup.find("div", {"class": "post__text"}).text
        doc['time'] = soup.find("span", {"class": "post__time"}).text
        # create other fields: hubs, tags, views, comments, votes, etc.
        # ...
    # save the result to a separate file
    fname = 'files/' + str(pid) + '.pkl'
    with open(fname, 'wb') as f:
        pickle.dump(doc, f)
    return doc

In the process of parsing, I discovered several new things.

First, they say that creating more processes than the processor has cores is pointless. But in my case the limiting resource turned out to be the network, not the processor, and 100 processes finished faster than 4 or, say, 20.
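
You can check this for yourself with a crude benchmark (a sketch; it assumes the download_document function above is importable and uses a small sample of ids):

import time
from multiprocessing import Pool

import numpy as np

# on Windows and macOS this loop must live under
# an `if __name__ == '__main__':` guard
for n_procs in (4, 20, 100):
    start = time.time()
    with Pool(n_procs) as p:
        p.map(download_document, np.arange(1000))  # small sample of ids
    print(n_procs, 'processes:', round(time.time() - start, 1), 's')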

Second, some posts contained combinations of special characters, such as euphemisms like "%&#@". It turned out that html.parser, which I used at first, reacts painfully to the combination &#, treating it as the beginning of an HTML entity. I was about to resort to black magic, but a forum suggested that I could simply switch the parser.
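
The difference is easy to reproduce (a minimal sketch; the exact output depends on your BeautifulSoup and parser versions):

from bs4 import BeautifulSoup

snippet = '<p>Error message: %&#@</p>'
# html.parser treats "&#" as the start of a character reference and may
# garble the text; html5lib recovers it the way a real browser would
print(BeautifulSoup(snippet, 'html.parser').text)
print(BeautifulSoup(snippet, 'html5lib').text)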

Third, I managed to download all the publications except three. Documents 65927, 162075, and 275987 were instantly deleted by my antivirus. They are articles, respectively, about a JavaScript chain that downloads a malicious PDF, about SMS ransomware packaged as a set of browser plugins, and about the CrashSafari.com site that sends iPhones into a reboot. The antivirus found one more article later, during a full system scan: post 338586, about scripts on a pet store website that use the visitor's processor to mine cryptocurrency. So the antivirus can be considered to have done its job quite adequately.

"Live" articles turned out to be only half of the potential maximum - 166307 pieces. About the rest, Habr gives options "the page is outdated, has been deleted or did not exist at all." Well, anything can happen.

Downloading the articles was followed by cleanup work: for example, publication dates had to be converted from the format "21 December 2006 at 10:47" into a standard datetime, and views like "12,8k" into 12800. A few more curiosities surfaced at this stage. The funniest one involves vote counts and data types: some old posts suffered an integer overflow and received 65535 votes each.
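
The normalization itself is straightforward (a sketch with hypothetical helper names; the real pages carry Russian month names, which additionally require a mapping table):

from datetime import datetime

def parse_views(s):
    """Turn a views string like '12,8k' into an integer like 12800."""
    s = s.strip().lower().replace(',', '.')
    return int(float(s[:-1]) * 1000) if s.endswith('k') else int(s)

def parse_time(s):
    """Parse a date like '21 December 2006 at 10:47' into a datetime."""
    return datetime.strptime(s, '%d %B %Y at %H:%M')

print(parse_views('12,8k'))                     # 12800
print(parse_time('21 December 2006 at 10:47'))  # 2006-12-21 10:47:00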

In the end, the article texts (without images) took up 1.5 gigabytes, the comments with their meta-information another 3 gigabytes, and the meta-information about the articles about a hundred megabytes. All of this fits entirely in RAM, which was a pleasant surprise.

I started the analysis not with the texts themselves but with the meta-information: dates, tags, hubs, views, and likes. It turned out that the meta-information alone could tell a lot.

Habrahabr Development Trends

Articles have been published on the site since 2006, most intensively in 2008-2016.

How actively these articles were read at different times is not so easy to assess. Texts from 2012 and earlier collected more comments and ratings, while newer texts have more views and bookmarks. These two groups of metrics moved in step (both halving) only once, in 2015. Perhaps, amid the economic and political crisis, readers' attention shifted from IT blogs to more painful issues.

In addition to the articles themselves, I also downloaded their comments. There turned out to be 6 million of them, although 240 thousand were banned ("a UFO flew in and published this inscription here"). A useful property of comments is that they carry a timestamp: by studying when comments appear, you can roughly understand when articles are actually read.
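
With timestamps in hand, the time-of-day profile is a one-liner (a sketch, assuming the comments are collected into a pandas DataFrame with a datetime 'time' column; the file name is hypothetical):

import matplotlib.pyplot as plt
import pandas as pd

comments = pd.read_pickle('comments.pkl')  # hypothetical file name
by_hour = comments['time'].dt.hour.value_counts().sort_index()
by_hour.plot(kind='bar')  # peaks roughly between 10:00 and 20:00
plt.show()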

It turned out that most articles are both written and commented on between roughly 10:00 and 20:00, i.e. during a typical Moscow working day. This may mean that Habr is read for professional purposes, or that it is a good way to procrastinate at work. By the way, this time-of-day distribution has been stable from Habr's founding to the present day.

However, the main benefit of comment timestamps is not the time of day but the duration of an article's "active life". I calculated how the time from an article's publication to each of its comments is distributed. It turned out that nowadays the median comment (the green line) arrives in about 20 hours; that is, on the first day after publication an article receives, on average, slightly more than half of all its comments, and within two days, 75% of them. Earlier articles were read even faster: in 2010, for example, half of the comments arrived within the first 6 hours.
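
The computation behind this is simple (continuing the sketch above; 'post_time', the publication time of the parent article, is a hypothetical column name):

# time from publication to each comment, in hours
delay = (comments['time'] - comments['post_time']).dt.total_seconds() / 3600
print(delay.median())        # ~20 hours in recent years
print(delay.quantile(0.75))  # ~48 hours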

It came as a surprise to me that comments have lengthened: the average number of characters in a comment has almost doubled over the lifetime of Habr!

Votes are an even simpler form of feedback than comments. Unlike on many other resources, on Habr you can give not only pluses but also minuses. However, readers do not use the latter option very often: the current share of dislikes is about 15% of all votes cast. It used to be higher, but over time readers have become kinder.

The texts themselves have also changed over time. The typical text length, for example, has been growing steadily since the site's very launch, crises notwithstanding. In a decade, texts have become almost ten times longer!

The style of the texts has also changed, at least to a first approximation. During Habr's first years, for example, the share of code and numbers in the texts was increasing:
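
One crude way to measure this (a sketch; the exact metric behind the chart is not specified) is the fraction of digit characters in a text and the share of a page's text that sits inside <code> tags:

def digit_share(text):
    """Fraction of characters in a text that are digits."""
    return sum(ch.isdigit() for ch in text) / max(len(text), 1)

def code_share(soup):
    """Fraction of a parsed page's text that sits inside <code> tags."""
    total = len(soup.get_text())
    inside = sum(len(tag.get_text()) for tag in soup.find_all('code'))
    return inside / max(total, 1)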

Having understood the overall dynamics of the site, I decided to measure how the popularity of various topics changed. Topics can be extracted from the texts automatically, but to start with, there is no need to reinvent the wheel: I used the ready-made tags attached by the author of each article. I plotted four typical trends on the chart. The "google" topic dominated at first (perhaps mainly thanks to SEO) but has been losing weight over the years. JavaScript has been a popular topic and keeps growing slowly, while machine learning began to gain popularity rapidly only in recent years. Linux, meanwhile, has remained equally relevant throughout the decade.
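
Trends like these fall out of a simple crosstab (a sketch, assuming the articles are loaded into a DataFrame posts with one row per article, a 'year' column, and a list-valued 'tags' column; the names are assumptions):

import matplotlib.pyplot as plt
import pandas as pd

exploded = posts.explode('tags')  # one row per (article, tag) pair
per_year = pd.crosstab(exploded['year'], exploded['tags'])
# share of that year's articles carrying each tag
share = per_year.div(posts.groupby('year').size(), axis=0)
share[['google', 'javascript', 'machine learning', 'linux']].plot()
plt.show()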

Of course, I became curious which topics attract the most reader activity. I calculated the median number of views, votes, and comments for each topic (the computation is sketched after the list). Here is what came out:

  • Most viewed topics: arduino, web design, web development, digest, links, css, html, html5, nginx, algorithms.
  • The most "liked" topics: vkontakte, humor, jquery, opera, c, html, web development, html5, css, web design.
  • The most discussed topics: opera, skype, freelance, vkontakte, ubuntu, work, nokia, nginx, arduino, firefox.
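
The ranking itself is a single groupby (continuing the exploded DataFrame from the previous sketch; 'views', 'votes', and 'comments' are assumed to be numeric columns of posts):

medians = exploded.groupby('tags')[['views', 'votes', 'comments']].median()
print(medians.sort_values('views', ascending=False).head(10))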

By the way, since I am comparing topics anyway, they can also be ranked by frequency (and the results compared with a similar article from 2013).

  • For all the years of Habr's existence, the most popular tags (in descending order) are google, android, javascript, microsoft, linux, php, apple, java, python, programming, startups, development, ios, startup, social networks
  • In 2017, the most popular were javascript, python, java, android, development, linux, c++, programming, php, c#, ios, machine learning, information security, microsoft, react

Comparing these rankings, one can notice, for example, the triumphant march of Python and the decline of php, or the "sunset" of startup topics and the rise of machine learning.

Not all tags on Habr have such an obvious thematic coloring. Here, for example, are ten tags that occurred only once but simply struck me as funny: "idea is the driving force of progress", "boot from a floppy disk image", "Iowa State", "drama", "superalesh", "steam engine", "things to do on Saturday", "I have a fox in a meat grinder", "it turned out as always", "we couldn't come up with funny tags". Tags are not enough to determine the subject of such articles; that calls for topic modeling on the article texts themselves.

A more detailed analysis of article content will follow in the next post. First, I am going to build a model that predicts an article's page views from its content. Second, I want to teach a neural network to generate texts in the same style as Habr's authors. So subscribe 🙂

PS And here is the pickled dataset.

Source: habr.com
