Habrastatistics: analyzing readers' comments

Hello, Habr. In the previous part, the popularity of various sections of the site was analyzed, and along the way a question arose: what data can be extracted from the comments on articles? I also wanted to test one hypothesis, which I will discuss below.

The data turned out to be quite interesting, and we also managed to put together a small “mini-rating” of commenters. Continued under the cut.

Data collection

For the analysis we will use data for the current year, 2019, especially since I already have the list of articles as a csv file. All that remains is to extract the comments from each article; fortunately, they are stored right in the article page, so no additional requests are needed.

To extract comments from an article, the following code is sufficient:

import requests
import dateparser


class Str(str):
    # Minimal stand-in for the author's Str helper from the previous part
    # (assumed behavior): text between `first` and the next `last`, or "".
    def find_between(self, first, last):
        try:
            start = self.index(first) + len(first)
            return Str(self[start:self.index(last, start)])
        except ValueError:
            return Str("")


r = requests.get("https://habr.com/ru/post/467453/")
data_html = r.text
comments = data_html.split('<div class="comment" id=')

comments_list = []
for comment in comments:
    body = Str(comment).find_between('<div class="comment__message', '<div class="comment__footer"').find_between('>', '</div>')
    if len(body) < 4:
        continue

    # Drop control whitespace, then quotes, commas and tags that would break the csv
    body = body.translate(str.maketrans(dict.fromkeys("\t\n\r\v\f")))
    body = body.replace('"', "'").replace(',', " ").replace('<br>', ' ').replace('<p>', '').replace('</p>', '').replace('  ', ' ')

    user = Str(comment).find_between('data-user-login', '>').find_between('"', '"')
    date_str = Str(comment).find_between('<time class="comment__date-time comment__date-time_published', 'time>').find_between('>', '<')
    vote = Str(comment).find_between('<div class="voting-wjt', '</div>').find_between('<span', 'span>').find_between('>', '<')
    date = dateparser.parse(date_str)

    csv_data = "{},{},{},{}".format(user, date, vote, body)
    comments_list.append(csv_data)

This allows us to get a list of comments like this (nicknames removed for privacy reasons):

xxxxxxx,2019-02-06 11:50:00,0,А можно пример как именно?
xxxxxxx,2019-02-24 16:15:00,+1,Побольше читайте независимые официальные источники чтобы таких вопросов не было.
xxxxxxx,2019-02-23 20:15:00,–5,А не важно главное в итоге в плюсе оказаться

As you can see, for each comment we can get the user's name, date, rating, and the actual text. Let's see what we can get from this.
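
Before any counting, each csv line has to be turned back into typed fields. The following is only a sketch of how that could look, assuming the four-field layout shown above; the helper names `parse_vote` and `parse_line` are hypothetical and do not come from the original code. Note that the negative score in the sample is written with a typographic dash rather than an ASCII minus, so it is normalized before conversion.

```python
from datetime import datetime


def parse_vote(text):
    """Convert a vote string such as '+1', '0' or '–5' to an int."""
    # The sample output uses a typographic dash for negative scores, so
    # map the en dash (U+2013) and minus sign (U+2212) to ASCII '-'.
    cleaned = text.strip().replace("\u2013", "-").replace("\u2212", "-")
    return int(cleaned)


def parse_line(line):
    """Split one 'user,date,vote,body' line into typed fields."""
    # maxsplit=3 keeps the body in one piece; the scraper already replaced
    # commas in the body with spaces, so this is only a safeguard.
    user, date_str, vote, body = line.split(",", 3)
    date = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
    return user, date, parse_vote(vote), body


sample = "xxxxxxx,2019-02-23 20:15:00,\u20135,А не важно главное в итоге в плюсе оказаться"
user, date, vote, body = parse_line(sample)
print(user, date, vote)
```

With the fields typed this way, the date can be grouped by hour or weekday and the vote can be summed per user.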

By the way, the original idea behind collecting ratings was a little different: to see what ratings users give. Take YouTube, for example: even the most perfect video, even one that carries no subjective information at all, a purely reference or news release, still collects a certain number of dislikes. The hypothesis was that there are users who, in an almost clinical sense, dislike everything; maybe serotonin is not being produced in the brain, or something else. Maybe such a person should not be sitting on Habr but treating depression... But as it turned out, I cannot check this here, because the list of who voted is not stored with the comment or the article. Well, it is what it is, so we will work with the data available. As a result we got a “reverse” rating: you can see what ratings users _receive_. Which, in principle, is also interesting.

Processing

For starters, the traditional disclaimer. This rating, like all previous ones, is unofficial. I do not guarantee that I did not make a mistake anywhere. For those who are interested in technical details, more detailed code is given in the previous part.

So let's get started. The analysis covers comments for the current year, 2019 (which is not over yet). At the time of writing, users had written 448533 comments, and the csv file is 288 MB. Impressive.

Time of posting

Let's group the comments by hour, treating weekdays and weekends separately.

[Chart: number of comments by hour of day, weekdays vs. weekends]

Here we are interested in relative values rather than absolute ones. Taken as is, it turns out that most comments are written during working hours, from 10 to 18 😉 On the other hand, time zones are not taken into account here, so the question remains open.
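
The grouping by hour can be sketched roughly as follows. This is an illustration only, not the code behind the chart: the function names are hypothetical, the sample dates are made up, and the counts are normalized to shares because relative values are what matter here.

```python
from collections import Counter
from datetime import datetime


def hourly_share(dates):
    """Return {hour: fraction of comments written in that hour}."""
    counts = Counter(d.hour for d in dates)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {hour: counts.get(hour, 0) / total for hour in range(24)}


def split_by_day_type(dates):
    """Compute hourly shares separately for weekdays and weekends."""
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    weekdays = [d for d in dates if d.weekday() < 5]
    weekends = [d for d in dates if d.weekday() >= 5]
    return hourly_share(weekdays), hourly_share(weekends)


# A tiny made-up sample, just to show the shape of the result.
dates = [
    datetime(2019, 2, 6, 11, 50),   # Wednesday
    datetime(2019, 2, 6, 11, 5),    # Wednesday
    datetime(2019, 2, 6, 15, 0),    # Wednesday
    datetime(2019, 2, 6, 22, 30),   # Wednesday
    datetime(2019, 2, 23, 20, 15),  # Saturday
]
weekday_share, weekend_share = split_by_day_type(dates)
print(weekday_share[11], weekend_share[20])
```

The timestamps are used as scraped, so the time zone caveat above applies to this sketch as well.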

Let's see the distribution of comments during the year:

[Chart: number of comments per day over the year]

And yet it moves: the weekly cycle is clearly visible, with surges on weekdays, so we can say with reasonable confidence that people read and comment on Habr from work (but this is not certain).
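
One way to put a number on the weekly cycle is to count comments per calendar day and then average those daily counts by day of the week. The sketch below assumes a list of comment datetimes; the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


def mean_comments_per_weekday(dates):
    """Return {weekday index: mean comments per day}, Monday being 0."""
    per_day = Counter(d.date() for d in dates)
    by_weekday = defaultdict(list)
    for day, count in per_day.items():
        by_weekday[day.weekday()].append(count)
    # Days without any comments never appear in per_day, so they are not
    # averaged in; with hundreds of thousands of comments a year this
    # should rarely matter.
    return {wd: sum(v) / len(v) for wd, v in sorted(by_weekday.items())}


# Made-up sample: two Mondays (3 and 1 comments) and one Saturday (1 comment).
dates = [
    datetime(2019, 2, 4, 10, 0),
    datetime(2019, 2, 4, 12, 30),
    datetime(2019, 2, 4, 17, 45),
    datetime(2019, 2, 11, 9, 15),
    datetime(2019, 2, 9, 20, 0),
]
weekday_means = mean_comments_per_weekday(dates)
print(weekday_means)
```

If the means for indices 0 to 4 are noticeably higher than for 5 and 6, that matches the weekday surge described above.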

By the way, there was an idea to test whether the number of minuses or pluses received depends on the day of the week or the time of day, but no relationship could be found: the time a vote is cast is not saved, and it has no direct connection with the time of the comment.

Users

Of course, I do not know the exact number of users on the site. But about 25000 people left at least one comment this year.

The graph of the number of messages left by users looks quite interesting:

[Chart: number of comments left per user]

At first I did not believe it myself, but there seems to be no mistake. 5% of users leave 60% of the messages, and 10% leave 74% of all messages (of which, let me remind you, there are about 450 thousand this year). Most people just read the site, commenting very rarely or not at all (naturally, the latter did not make it into my list).
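
A concentration figure like "5% of users leave 60% of messages" can be computed by sorting users by comment count and summing the counts of the most active fraction. The sketch below shows one possible way to do it; `top_share` is a hypothetical name and the sample data is invented.

```python
from collections import Counter


def top_share(users, fraction):
    """Share of all comments written by the most active `fraction` of users.

    `users` holds one author name per comment.
    """
    counts = sorted(Counter(users).values(), reverse=True)
    if not counts:
        return 0.0
    # Always take at least one user so that small samples still work.
    k = max(1, int(len(counts) * fraction))
    return sum(counts[:k]) / sum(counts)


# Made-up sample: one very active user and nine who commented once each.
users = ["active"] * 11 + ["user{}".format(i) for i in range(9)]
print(top_share(users, 0.1))
```

Only users with at least one comment enter the calculation, which matches the remark above that silent readers are not in the list.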

Ratings

Let's move on to the last and most fun part of the statistics: ratings. For privacy reasons I will not give users' full nicknames; those who want to will, I think, recognize themselves.

By number of comments this year, the top 5 are VoXXXX (3377 comments), 0xdXXXXX (3286 comments), strXXXX (3043 comments), AmXXXX (2897 comments) and khXXXX (2748 comments).

By number of upvotes received, the top 5 are amXXXX (1395 comments, rating +3231/-309), tvXXXX (1544 comments, rating +3231/-97), WhuXXXX (921 comments, rating +2288/-13), MTXXXX (1328 comments, rating +1383/-7) and amaXXXX (736 comments, rating +1340/-16).

By absolutely positive rating (not a single negatively rated comment), the very top is occupied by Milfgard and Boomburum. As an exception, I give their nicknames in full; I think they deserve it.

The minuses are interesting too. The top by number of minuses collected this year: siXX (473 pluses, 699 minuses), khXX (1915 pluses, 573 minuses) and nicXXXXX (456 pluses, 487 minuses). As you can see, though, these users also have plenty of positively rated comments. By almost purely negative rating, however, vladXXXX (55 comments, 84 downvotes, 0 upvotes), ekoXXXX (77 comments, 92 downvotes, 1 upvote) and iMXXXX (225 comments, 205 downvotes, 12 upvotes) make it into the anti-top.
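
Per-user totals like these can be gathered in a single pass over the comments. The sketch below rests on an assumption that the text does not spell out: a user's pluses are taken to be the sum of their positively scored comments, and the minuses the sum of the negatively scored ones. The function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def user_stats(rows):
    """Aggregate (user, vote) pairs into per-user totals."""
    stats = defaultdict(lambda: {"comments": 0, "plus": 0, "minus": 0})
    for user, vote in rows:
        entry = stats[user]
        entry["comments"] += 1
        if vote > 0:
            entry["plus"] += vote
        elif vote < 0:
            entry["minus"] += -vote  # stored as a positive number
    return dict(stats)


def top_by(stats, key, n=5):
    """Return up to n user names ordered by the given field, largest first."""
    return sorted(stats, key=lambda user: stats[user][key], reverse=True)[:n]


# Made-up rows: (user, comment score).
rows = [("alice", 3), ("alice", -1), ("bob", -5), ("bob", 2), ("carol", 7)]
stats = user_stats(rows)
print(top_by(stats, "plus", n=2))
print(top_by(stats, "minus", n=1))
```

Users with no negatively scored comments at all can then be picked out with a filter on `stats[user]["minus"] == 0`.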

Conclusion

Not everything that was planned could be calculated, but I hope it was interesting.

As you can see, even a dataset with so few fields can yield interesting material for analysis. There is still plenty to dig into, from building a “word cloud” to text analysis. If any interesting results turn up, they will be published.

Source: habr.com
