Habrastatistics: how Habr lives without geektimes

Hey Habr.

This article is a logical continuation of the rating "The best articles of Habr for 2018". Although the year is not over yet, the rules changed in the summer, as you know, so it became interesting to see whether that affected anything.


In addition to the actual statistics, there will also be an updated rating of articles, as well as some sources for those who are interested in how it works.

For those who are interested in what happened, the details are under the cut. Those who want a more detailed analysis of the site's sections can also look at the next part.

Initial data

This rating is unofficial, and I have no insider data. As is easy to see from the browser's address bar, all articles on Habr have sequential numbering. The rest is a matter of technique: we simply read all the articles in a row in a loop (single-threaded and with pauses, so as not to load the server). The values themselves were obtained by a simple Python parser (the sources are here) and saved in a CSV file like this:

2019-08-11T22:36Z,https://habr.com/ru/post/463197/,"Blazor + MVVM = Silverlight наносит ответный удар, потому что древнее зло непобедимо",votes:11,votesplus:17,votesmin:6,bookmarks:40,views:5300,comments:73
2019-08-11T05:26Z,https://habr.com/ru/news/t/463199/,"В NASA испытали систему автономного управления одного микроспутника другим",votes:15,votesplus:15,votesmin:0,bookmarks:2,views:1700,comments:7
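The crawl described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual parser linked above: the function names are hypothetical, and the HTML extraction itself is omitted.

```python
import time


def record_to_row(timestamp, url, title, stats):
    """Flatten one parsed article into the CSV row format shown above,
    e.g. {"votes": 11, "views": 5300} -> ["votes:11", "views:5300"]."""
    return [timestamp, url, title] + ["%s:%s" % (k, v) for k, v in stats.items()]


def crawl(start_id, end_id, pause=1.0):
    """Read articles in a row by sequential id, single-threaded,
    pausing between requests so as not to load the server."""
    import requests  # third-party; assumed installed

    found = []
    for post_id in range(start_id, end_id):
        url = "https://habr.com/ru/post/%d/" % post_id
        resp = requests.get(url)
        if resp.status_code == 200:
            # Extraction of votes/views/bookmarks/comments from
            # resp.text is omitted; the linked sources do the real work.
            found.append(url)
        time.sleep(pause)  # be polite to the server
    return found
```

Deleted or hidden posts simply return a non-200 status, which is why gaps appear in the id sequence.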

Processing the data

For processing, we will use Python, Pandas, and Matplotlib. Those who are not interested in the statistics can skip this part and go straight to the articles.

First you need to load the dataset into memory and select the data for the desired year.

import pandas as pd
import datetime
import matplotlib.dates as mdates
from matplotlib.ticker import FormatStrFormatter
from pandas.plotting import register_matplotlib_converters


df = pd.read_csv("habr.csv", sep=',', encoding='utf-8', error_bad_lines=True, quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
df['datetime'] = dates
year = 2019
df = df[(df['datetime'] >= pd.Timestamp(datetime.date(year, 1, 1))) & (df['datetime'] < pd.Timestamp(datetime.date(year+1, 1, 1)))]

print(df.shape)

It turns out that this year (although it is not over yet), 12715 articles had been published at the time of writing. For comparison, 15904 were published in the whole of 2018. Either way, that is a lot: about 43 articles per day (and these are only articles with a positive rating; how many were posted in total that went negative or were deleted can only be guessed, or roughly estimated from the gaps among the identifiers).
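That rough estimate from identifier gaps can be computed directly: the span of post ids minus the number of ids actually collected gives a lower bound on the missing (deleted or negatively rated) posts. The name of the URL column in the real CSV is an assumption here.

```python
import re


def missing_count(links):
    """Lower-bound estimate of posts absent from the dataset:
    gaps in the sequential post ids extracted from the URLs."""
    ids = sorted(int(re.search(r'/(\d+)/?$', u).group(1)) for u in links)
    span = ids[-1] - ids[0] + 1  # ids the range should contain
    return span - len(ids)       # ids we never collected
```

Something like `missing_count(df['link'])` would give the estimate, assuming the URLs live in a column called `link`.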

Select the required fields from the dataset. As metrics, we will use the number of views, comments, rating values, and the number of bookmarks.

def to_float(s):
    # "votes:-5" => -5.0 (keep the minus sign: ratings can be negative)
    num = ''.join(i for i in s if i.isdigit() or i == '-')
    return float(num)

def to_int(s):
    # "bookmarks:22" => 22
    num = ''.join(i for i in s if i.isdigit())
    return int(num)

def to_date(dt):
    return dt.date()

date = df['datetime'].map(to_date, na_action=None)
views = df["views"].map(to_int, na_action=None)
bookmarks = df["bookmarks"].map(to_int, na_action=None)
votes = df["votes"].map(to_float, na_action=None)
votes_up = df["votesplus"].map(to_float, na_action=None)   # CSV field "votesplus"
votes_down = df["votesmin"].map(to_float, na_action=None)  # CSV field "votesmin"
comments = df["comments"].map(to_int, na_action=None)

df['date'] = date
df['views'] = views
df['votes'] = votes
df['bookmarks'] = bookmarks
df['up'] = votes_up
df['down'] = votes_down

Now the data has been added to the dataframe and we can use it. Let's group the data by day and take the median values.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
grouped = g.median().reset_index()
grouped['counts'] = days_count['counts']
counts_per_day = grouped['counts'].values
counts_per_day_avg = grouped['counts'].rolling(window=20).mean()
view_per_day = grouped['views'].values
view_per_day_avg = grouped['views'].rolling(window=20).mean()
votes_per_day = grouped['votes'].values
votes_per_day_avg = grouped['votes'].rolling(window=20).mean()
bookmarks_per_day = grouped['bookmarks'].values
bookmarks_per_day_avg = grouped['bookmarks'].rolling(window=20).mean()

Now the fun part: we can look at the graphs.

Let's see the number of publications on Habré in 2019.

import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (16, 8)
fig, ax = plt.subplots()

plt.bar(year_days, counts_per_day, label='Articles/day')
plt.plot(year_days, counts_per_day_avg, 'g-', label='Articles avg/day')
plt.xticks(rotation=45)
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d-%m-%Y"))  
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
plt.legend(loc='best')
plt.tight_layout()
plt.show()

The result is interesting. As you can see, the publication rate fluctuated noticeably during the year. I don't know the reason.

[Chart: number of publications per day, 2019]

For comparison, 2018 looks a little "smoother":

[Chart: number of publications per day, 2018]

In general, I did not see any drastic decrease in the number of published articles in 2019 on the chart; on the contrary, it seems to have even grown slightly since the summer.

But the next two graphs depress me a little more.

Average views per article:

[Chart: average views per article, 2019]

Average rating per article:

[Chart: average rating per article, 2019]

As you can see, the average number of views decreases slightly over the year. This can be explained by the fact that new articles have not yet been indexed by search engines and are not found as often. But the decrease in the average rating per article is harder to explain. It feels as if readers either simply do not have time to view so many articles, or do not pay attention to the ratings. From the point of view of the author reward program, this trend is very unpleasant.

By the way, this was not the case in 2018, when the graph was more or less flat.

[Chart: average rating per article, 2018]

In general, resource owners have something to think about.

But let's not talk about sad things. In general, we can say that Habr “survived” the summer changes quite successfully, and the number of articles on the site has not decreased.

Rating

Now, the rating itself. Congratulations to those who made it in. Let me remind you once again that the rating is unofficial; maybe I missed something, and if some article definitely should be here but is not, write to me and I will add it manually. For the rating I use calculated metrics, which I think turned out to be quite interesting.

Top articles by number of views

Top articles by rating-to-views ratio

Top articles by comments to views ratio

Top most controversial articles

Top articles by rating

Top articles by number of bookmarks

Top by bookmarks to views ratio

Top articles by number of comments

And finally, the Antitop: articles by number of dislikes
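The ratio-based tops above can be built from the prepared dataframe in one small helper. This is a sketch: the column names match those created earlier, while the top size and the minimum-views cutoff (to keep barely-seen posts from dominating the ratios) are arbitrary choices of mine, not taken from the original rating.

```python
import pandas as pd


def top_by_ratio(df, num, denom, n=10, min_views=1000):
    """Top-n articles by the ratio of two metric columns, skipping
    posts with too few views so the ratios are not noise-dominated."""
    d = df[df['views'] >= min_views].copy()
    d['ratio'] = d[num] / d[denom]
    return d.sort_values('ratio', ascending=False).head(n)
```

For example, `top_by_ratio(df, 'votes', 'views')` gives the rating-to-views top, `top_by_ratio(df, 'comments', 'views')` the comments-to-views top, and `top_by_ratio(df, 'bookmarks', 'views')` the bookmarks-to-views top.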

Phew. I have a few more interesting selections, but I won't bore the readers with them.

Conclusion

When constructing the rating, I paid attention to two points that seemed interesting.

First, 60% of the top consists of articles of the "geektimes" genre. Whether there will be fewer of them next year, and what Habr will look like without articles about beer, space, medicine, and so on, I don't know. Readers will definitely miss something. We'll see.

Second, the bookmarks top turned out to be of unexpectedly high quality. This is psychologically understandable: readers may not pay attention to the rating, but if an article is useful, they bookmark it. And that is exactly where the largest concentration of useful and serious articles is. I think the site owners should consider linking the number of bookmarks to the reward program if they want to grow this particular category of articles on Habr.

Something like this. Hope it was informative.

The list of articles turned out to be long, but that is probably for the best. Happy reading, everyone.

Source: habr.com
