Habrastatistics: we explore the most and least visited sections of the site
Hi, Habr.
The previous part analyzed Habr's traffic by its main parameters: the number of articles, their views, and their ratings. The popularity of the site's individual sections, however, was not covered. I became curious to look at this in more detail and to find the most and least popular hubs. I'll also take a closer look at the "geektimes effect", and at the end readers will get a new selection of the best articles based on the new ratings.
For those interested in the results, the rest is under the cut.
Let me remind you once again that these statistics and ratings are unofficial; I have no insider information. Nor can I guarantee that I haven't made a mistake or missed something. Still, I think the results are interesting. We'll start with the code; if that isn't relevant to you, feel free to skip the first sections.
Data collection
The first version of the parser recorded only the number of views, the number of comments, and the rating of each article. That is not bad, but it does not allow more complex queries. It's time to analyze the site's thematic sections, which makes some quite interesting research possible, for example seeing how the popularity of the "C++" section has changed over several years.
The article parser has been improved: it now returns the hubs an article belongs to, as well as the author's nickname and rating (plenty of interesting things can be done with these too, but that comes later). The data is saved to a CSV file that looks something like this:
2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
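A minimal sketch of reading one such row back, assuming the two wrapped lines above form a single CSV record and that every field after the title has the form key:value. The function names parse_row and to_number are hypothetical, and the sample title is shortened and translated for brevity:

```python
import csv
import io


def to_number(value: str):
    """Convert a counter to int, falling back to float, then to the raw string."""
    try:
        return int(value)
    except ValueError:
        try:
            return float(value)
        except ValueError:
            return value


def parse_row(line: str) -> dict:
    """Parse one record: date, URL, title, then "key:value" fields."""
    # csv.reader handles the quoted title, which may itself contain commas.
    fields = next(csv.reader(io.StringIO(line)))
    record = {"date": fields[0], "url": fields[1], "title": fields[2]}
    for field in fields[3:]:
        if not field:
            continue
        key, _, value = field.partition(":")
        if key == "hubs":
            record[key] = value.split("+") if value else []
        elif key == "user":
            record[key] = value  # nicknames stay strings even if numeric
        else:
            record[key] = to_number(value)
    return record


sample = (
    '2018-12-18T12:43Z,https://habr.com/ru/post/433550/,'
    '"Slack messenger, reasons for choosing it",votes:7,votesplus:8,'
    'votesmin:1,bookmarks:32,views:8300,comments:10,user:ReDisque,'
    'karma:5,subscribers:2,hubs:productpm+soft'
)
row = parse_row(sample)
print(row["views"], row["hubs"])
```

Keeping the hubs as a list makes the per-hub queries later in the article straightforward.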
Get a list of the main thematic hubs of the site.
import requests

# Str is a str subclass with a find_between helper, defined earlier

def get_as_str(link: str) -> Str:
    # Download the page; return an empty Str on any error
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*"
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)
The find_between function and the Str class extract the substring between two tags; I used them earlier. Thematic hubs are marked with "*", so they are easy to pick out. You can also uncomment the corresponding lines to get the hubs of other categories.
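The Str class itself is not shown in this part. A minimal sketch of what such a helper might look like, assuming it is a str subclass whose find_between returns the text between the first occurrence of one marker and the next occurrence of another (this is a guess at the behavior, not the author's actual implementation):

```python
class Str(str):
    """A str subclass with a helper for cutting text between two markers."""

    def find_between(self, start: str, end: str) -> "Str":
        # Locate the first occurrence of the start marker.
        begin = self.find(start)
        if begin < 0:
            return Str("")
        begin += len(start)
        # Locate the end marker that follows the start marker.
        stop = self.find(end, begin)
        if stop < 0:
            return Str("")
        return Str(self[begin:stop])


# Illustrative HTML fragment, not a verbatim copy of Habr's markup.
html = Str('<a href="https://habr.com/ru/hub/infosecurity/" class="list-snippet__title-link">')
info = html.find_between('"https://habr.com/ru/hub', 'list-snippet__title')
hub_name = info.find_between('/', '/"')
print(hub_name)
```

Returning an empty Str when a marker is missing keeps chained calls safe, which matches how get_hubs checks len(hub_name) before using the result.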
The output of the get_hubs function is a fairly impressive list, which we save as a dictionary. I quote the list in full so that you can appreciate its size.
The hubs of the other categories were saved in the same way. Now it is easy to write a function that reports whether an article belongs to geektimes or to a profile hub.
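Such a function could look roughly like this. The hub names in the two sets are placeholders for illustration, and the names GEEKTIMES_HUBS, PROFILE_HUBS, is_geektimes and is_profile are hypothetical; the real collections would be filled from the output of get_hubs:

```python
from typing import Iterable

# Placeholder hub names; the real sets are built from the get_hubs() output.
GEEKTIMES_HUBS = frozenset({"diy", "popular_science"})
PROFILE_HUBS = frozenset({"cpp", "programming"})


def is_geektimes(hubs: Iterable[str]) -> bool:
    """True if at least one of the article's hubs is a geektimes hub."""
    return not GEEKTIMES_HUBS.isdisjoint(hubs)


def is_profile(hubs: Iterable[str]) -> bool:
    """True if at least one of the article's hubs is a profile hub."""
    return not PROFILE_HUBS.isdisjoint(hubs)


# The hubs field of a CSV row is stored as "hub1+hub2+...".
article_hubs = "diy+cpp".split("+")
print(is_geektimes(article_hubs), is_profile(article_hubs))
```

Note that the two checks are independent, so a single article can satisfy both.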
Plot the number of published articles using Matplotlib:
In the graph I've separated the "geektimes" and "geektimes only" articles, because an article can belong to both kinds of section at the same time (e.g. "DIY" + "microcontrollers" + "C++"). I've labeled the site's specialized articles as "profile", although the English term "profile" may not be entirely accurate for this.
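The counting behind such a graph might be sketched as follows. Grouping by month, the label names, and the placeholder hub sets are all assumptions made for the example; the resulting counts could then be passed to Matplotlib for plotting:

```python
from collections import Counter

GEEKTIMES_HUBS = {"diy", "popular_science"}  # placeholder hub names
PROFILE_HUBS = {"cpp", "programming"}        # placeholder hub names


def count_by_month(articles):
    """articles: iterable of (iso_date, hubs) pairs.

    Returns a Counter keyed by (month, label).
    """
    counts = Counter()
    for date, hubs in articles:
        month = date[:7]  # "2018-12-18T12:43Z" -> "2018-12"
        hub_set = set(hubs)
        in_geektimes = bool(hub_set & GEEKTIMES_HUBS)
        in_profile = bool(hub_set & PROFILE_HUBS)
        counts[(month, "all")] += 1
        if in_geektimes:
            counts[(month, "geektimes")] += 1
            # "geektimes only": no profile hub attached to the article.
            if not in_profile:
                counts[(month, "geektimes only")] += 1
        if in_profile:
            counts[(month, "profile")] += 1
    return counts


articles = [
    ("2019-01-10T09:00Z", ["diy", "cpp"]),
    ("2019-01-12T10:00Z", ["popular_science"]),
    ("2019-02-01T08:00Z", ["programming"]),
]
counts = count_by_month(articles)
print(counts[("2019-01", "geektimes")], counts[("2019-01", "geektimes only")])
```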
In the previous part, we raised the question of the "geektimes effect" associated with this summer's change in the rules for paying for geektimes articles. Let's display the geektimes articles separately:
The result is interesting. The ratio of geektimes article views to the total is roughly 1:5. But while the total number of views fluctuated noticeably, the views of "entertainment" articles stayed at roughly the same level.
You can also see that the total number of views of articles in the "geektimes" section did fall after the rule change, but judging by eye, by no more than 5% of the total.
It is interesting to look at the average number of views per article:
For "entertainment" articles, it is about 40% higher than average. Perhaps this is not surprising. The dip at the beginning of April is a mystery to me: maybe it really happened, maybe it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).
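The comparison of averages can be sketched like this. The view counts are invented purely to illustrate the arithmetic (they are not taken from the collected data), and the hub names and the function average_views are hypothetical:

```python
from statistics import mean

GEEKTIMES_HUBS = {"diy", "popular_science"}  # placeholder hub names


def average_views(articles, only_geektimes=False):
    """articles: iterable of (views, hubs) pairs; returns mean views per article."""
    selected = [
        views
        for views, hubs in articles
        if not only_geektimes or GEEKTIMES_HUBS & set(hubs)
    ]
    return mean(selected) if selected else 0.0


# Invented numbers, for illustration only.
articles = [(14000, ["diy"]), (8000, ["cpp"]), (8000, ["programming"])]
overall = average_views(articles)
geektimes = average_views(articles, only_geektimes=True)
print(geektimes / overall)
```

A ratio of 1.4 here corresponds to "about 40% higher than average".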
By the way, the graph shows two more noticeable peaks in the number of article views - the New Year and May holidays.
Hubs
Let's move on to the promised analysis of hubs. Let's display the top 20 hubs by the number of views:
Surprisingly, the “Information Security” hub turned out to be the most popular in terms of views, and “Programming” and “Popular science” are also in the top 5 leaders.
At the bottom of the ranking are Gtk and Cocoa.
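A ranking like this can be computed by summing views per hub. The sketch below assumes that an article's views count in full toward every hub it belongs to, which may differ from how the original graph was built; the rows and the slugs "infosecurity" and "gtk" are illustrative:

```python
from collections import Counter


def views_per_hub(articles):
    """articles: iterable of (views, hubs) pairs; returns total views per hub."""
    totals = Counter()
    for views, hubs in articles:
        for hub in hubs:
            # Assumption: each hub of the article gets the full view count.
            totals[hub] += views
    return totals


# Illustrative rows, not real measurements.
articles = [
    (8300, ["productpm", "soft"]),
    (12000, ["infosecurity"]),
    (500, ["gtk"]),
    (4000, ["infosecurity", "soft"]),
]
totals = views_per_hub(articles)
top = totals.most_common(2)                                # most viewed hubs
bottom = sorted(totals.items(), key=lambda kv: kv[1])[:1]  # least viewed hub
print(top, bottom)
```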
I'll let you in on a secret: the top hubs can also be seen here, although the number of views is not shown there.
Rating
And finally, the promised rating. Using the hub analysis data, we can list the most popular articles in the most popular hubs for 2019.
And finally, so that no one is offended, I will give the rating for the least visited hub, gtk. Only one article was published in it during the year, so it "automatically" takes first place in the rating.
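A per-hub rating of this kind can be sketched as a filter followed by a sort. Ranking by views is an assumption here (the exact rating criterion is not spelled out), and the sample rows, titles, and the function top_articles are hypothetical:

```python
def top_articles(articles, hub, n=3):
    """Return up to n articles from the given hub, most viewed first."""
    in_hub = [a for a in articles if hub in a["hubs"]]
    return sorted(in_hub, key=lambda a: a["views"], reverse=True)[:n]


# Illustrative rows, not real articles.
articles = [
    {"title": "A", "views": 12000, "hubs": ["infosecurity"]},
    {"title": "B", "views": 30000, "hubs": ["infosecurity", "soft"]},
    {"title": "C", "views": 700, "hubs": ["gtk"]},
]
print([a["title"] for a in top_articles(articles, "infosecurity")])
print([a["title"] for a in top_articles(articles, "gtk")])
```

With a single article in the hub, that article is trivially first, which is the gtk case described above.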