Habrastatistics: analyzing reader comments

Hello, Habr. The previous part analyzed the popularity of the site's various sections, and along the way a question came up: what data can be extracted from the comments on articles? I also wanted to test one hypothesis, which I discuss below.

The data turned out to be quite interesting; we also managed to put together a small "mini-rating" of commenters. Continued under the cut.

Data collection

For the analysis we will use data for this year, 2019, especially since I already have the list of articles in CSV form. All that remains is to extract the comments from each article; luckily for us, they are stored right there on the article page, so no additional requests are needed.

To extract the comments from an article, the following code is enough:

import dateparser
import requests

# Str is a helper class that is not defined in this snippet (presumably it comes
# from the previous part); find_between(a, b) returns the text between markers a and b.
r = requests.get("https://habr.com/ru/post/467453/")
data_html = r.text
# Every comment starts with this div; the first chunk is the page header.
comments = data_html.split('<div class="comment" id=')

comments_list = []
for comment in comments:
    body = Str(comment).find_between('<div class="comment__message', '<div class="comment__footer"').find_between('>', '</div>')
    if len(body) < 4:
        continue

    # Delete whitespace control characters, then strip what would break the CSV line.
    body = body.translate(str.maketrans(dict.fromkeys("\t\n\r\v\f")))
    body = body.replace('"', "'").replace(',', " ").replace('<br>', ' ').replace('<p>', '').replace('</p>', '').replace('  ', ' ')

    user = Str(comment).find_between('data-user-login', '>').find_between('"', '"')
    date_str = Str(comment).find_between('<time class="comment__date-time comment__date-time_published', 'time>').find_between('>', '<')
    vote = Str(comment).find_between('<div class="voting-wjt', '</div>').find_between('<span', 'span>').find_between('>', '<')
    date = dateparser.parse(date_str)

    # One CSV line per comment: user, date, rating, text.
    csv_data = "{},{},{},{}".format(user, date, vote, body)
    comments_list.append(csv_data)

This gives us a list of comments like this (nicknames removed for privacy reasons):

xxxxxxx,2019-02-06 11:50:00,0,А можно пример как именно?
xxxxxxx,2019-02-24 16:15:00,+1,Побольше читайте независимые официальные источники чтобы таких вопросов не было.
xxxxxxx,2019-02-23 20:15:00,–5,А не важно главное в итоге в плюсе оказаться

As you can see, for each comment we get the user name, the date, the rating and the text itself. Let's see what we can get out of this.
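The extraction code relies on a Str helper with a find_between method that is not shown in this part. Below is a minimal sketch of what such a helper could look like. It assumes that find_between returns the text between the first occurrence of the start marker and the next occurrence of the end marker, and an empty string if either is missing. This is my reconstruction for illustration, not the original implementation.

```python
class Str(str):
    """Hypothetical stand-in for the Str helper used in the extraction code."""

    def find_between(self, start: str, end: str) -> "Str":
        """Return the text between the first `start` and the next `end`.

        Returns an empty Str if either marker is missing, so calls can be chained.
        """
        begin = self.find(start)
        if begin < 0:
            return Str("")
        begin += len(start)
        stop = self.find(end, begin)
        if stop < 0:
            return Str("")
        return Str(self[begin:stop])


# Made-up fragments shaped like the markup the extraction code looks for.
vote_html = '<span class="voting-wjt__counter">+1</span>'
vote = Str(vote_html).find_between("<span", "span>").find_between(">", "<")

user_html = '<a data-user-login="xxxxxxx">'
user = Str(user_html).find_between("data-user-login", ">").find_between('"', '"')

print(user, vote)
```

Subclassing str keeps len(), translate() and replace() available on the result, which is what the loop above needs.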

By the way, the original idea behind collecting ratings was a bit different: to see what ratings users give. Take YouTube, for example: even the best video, even one that carries no subjective content at all, just a reference piece or a news release, still collects a certain number of dislikes. The idea was that there are users who, purely clinically, dislike everything, perhaps because their brain does not produce enough serotonin or for some other reason. Perhaps such a person should not be sitting on Habr but treating depression... But as it turned out, I cannot check this here, because the list of those who gave the ratings is not stored in either the comments or the article. So we will work with the data we have. The result is a "reverse" rating: you can see what ratings users _receive_. Which, in principle, is also interesting.

Processing

To begin with, the traditional disclaimer. This rating, like all the previous ones, is unofficial. I do not guarantee that I have not made a mistake somewhere. For those interested in the technical details, more complete code is given in the previous part.

So, let's begin. Comments for this year, 2019 (which is not over yet), were taken for analysis. At the time of writing, users had written 448533 comments, and the csv file is 288 MB. Solid, impressive.

Time of writing

Let's group the comments by hour, handling weekdays and weekends separately.

[Chart: number of comments by hour of day, weekdays and weekends shown separately]

Here we are interested not in the absolute values but in the relative ones. Taken "as is", it turns out that most comments are written during working hours, from 10 to 18 😉 On the other hand, time zones are not taken into account here, so the question remains open.
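The hourly grouping can be sketched as follows. This is a minimal illustration, not the code behind the chart: it assumes the dates are stored as "YYYY-MM-DD HH:MM:SS" strings, as in the sample lines earlier, and the function name hourly_shares is made up. It returns relative shares rather than raw counts, since weekdays and weekends contain different numbers of comments.

```python
from collections import Counter
from datetime import datetime


def hourly_shares(timestamps):
    """Return (weekday_shares, weekend_shares): 24 relative frequencies each."""
    counts = {"weekday": Counter(), "weekend": Counter()}
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        # weekday() is 0 for Monday ... 6 for Sunday.
        kind = "weekend" if dt.weekday() >= 5 else "weekday"
        counts[kind][dt.hour] += 1

    def normalize(counter):
        total = sum(counter.values())
        return [counter[h] / total if total else 0.0 for h in range(24)]

    return normalize(counts["weekday"]), normalize(counts["weekend"])


# Tiny hand-made sample; real input would be the date column of the CSV.
sample = [
    "2019-02-06 11:50:00",  # Wednesday
    "2019-02-06 11:05:00",  # Wednesday
    "2019-02-06 20:15:00",  # Wednesday
    "2019-02-23 20:15:00",  # Saturday
    "2019-02-24 16:15:00",  # Sunday
]
weekday_shares, weekend_shares = hourly_shares(sample)
print(weekday_shares[11], weekend_shares[16])
```

The timestamps are used as stored, so the time zone caveat above applies to this sketch as well.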

Let's look at the distribution of comments over the year:

[Chart: number of comments per day over 2019]

And yet it does move: there is a rise on weekdays and a clear weekly periodicity, so we can say with reasonable confidence that people read and comment on Habr from work (though this is not certain).
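One simple way to check the weekly periodicity numerically is to average the daily comment counts by day of week. The sketch below is only an illustration under the same assumption about the date format; the name mean_comments_by_weekday is hypothetical, and days with no comments at all are not counted in the averages.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def mean_comments_by_weekday(timestamps):
    """Average number of comments per calendar day, grouped by day of week."""
    per_day = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date() for ts in timestamps
    )
    totals, days = Counter(), Counter()
    for day, n in per_day.items():
        totals[day.weekday()] += n  # comments seen on this day of week
        days[day.weekday()] += 1    # number of distinct dates for it
    return {DAY_NAMES[w]: totals[w] / days[w] for w in sorted(days)}


# Hand-made sample: two Mondays and one Saturday.
sample = [
    "2019-02-04 10:00:00",
    "2019-02-04 12:00:00",
    "2019-02-11 09:00:00",
    "2019-02-09 15:00:00",
]
means = mean_comments_by_weekday(sample)
print(means)
```

If the Monday to Friday averages come out clearly higher than the Saturday and Sunday ones, that matches the rise on weekdays described above.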

By the way, there was also an idea to test the hypothesis that the number of minuses or pluses received depends on the day or the time of day, but no relationship could be found: the time at which a rating was given is not stored, and there is no direct link to the time of the comment.

Users

Of course, I do not know the exact number of users of the site. But about 25000 people left at least one comment this year.

The chart of the number of messages left by users looks quite interesting:

[Chart: number of comments left per user]

At first I did not believe it myself, but there seems to be no mistake. 5% of users leave 60% of all messages, and 10% leave 74% of all messages (of which, let me remind you, there are 450 thousand this year). The majority just read the site, leaving comments very rarely or not at all (the latter, naturally, are not on my list).
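Figures like "5% of users leave 60% of messages" can be computed by sorting users by comment count and summing the head of the list. A minimal sketch, assuming a non-empty list with one user name per comment; the function name top_share and the sample data are made up for illustration.

```python
from collections import Counter


def top_share(users, fraction):
    """Share of all comments written by the most active `fraction` of commenters."""
    # Comment counts per user, most active first.
    counts = sorted(Counter(users).values(), reverse=True)
    # At least one user, so very small fractions still give a result.
    top_n = max(1, round(len(counts) * fraction))
    return sum(counts[:top_n]) / sum(counts)


# Made-up sample: 10 users, one of whom writes most of the comments.
sample_users = ["a"] * 11 + list("bcdefghij")
print(top_share(sample_users, 0.10))  # 0.55
print(top_share(sample_users, 0.50))  # 0.75
```

Only people who left at least one comment appear in the input, so the percentages are relative to commenters, not to all readers of the site.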

Ratings

Let's move on to the last and most entertaining part of the statistics: the ratings. For privacy reasons I will not give users' full nicknames; anyone who wants to will, I think, recognize themselves.

By number of comments this year, the top 5 are VoXXXX (3377 comments), 0xdXXXXX (3286 comments), strXXXX (3043 comments), AmXXXX (2897 comments) and khXXXX (2748 comments).

By number of pluses received, the top 5 are amXXXX (1395 comments, rating +3231/-309), tvXXXX (1544 comments, rating +3231/-97), WhuXXXX (921 comments, rating +2288/-13), MTXXXX (1328 comments, rating +1383/-7) and amaXXXX (736 comments, rating +1340/-16).

By absolute positive rating (not a single negatively rated comment), the top is held by Milfgard and Boomburum. As an exception, I give these nicknames in full; I think they deserve it.

The bottom of the ranking is interesting too. The top by number of minuses collected this year is held by siXX (473 pluses, 699 minuses), khXX (1915 pluses, 573 minuses) and nicXXXXX (456 pluses, 487 minuses). But as you can see, these users also have plenty of positively rated comments. By absolute negative rating, the list includes vladXXXX (55 comments, 84 minuses, 0 pluses), ekoXXXX (77 comments, 92 minuses, 1 plus) and iMXXXX (225 comments, 205 minuses, 12 pluses).
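Per-user totals like the ones above can be obtained by summing the per-comment ratings. The sketch below is an assumption about how this could be done, not the code that produced the rankings: it treats each comment's value as a net score, adds positive scores to the user's pluses and negative scores to the minuses. The names parse_vote and user_totals are hypothetical. Note that the sample data writes negative ratings with an en dash rather than an ASCII hyphen, so the parser accepts several minus characters.

```python
from collections import defaultdict

MINUS_SIGNS = "-\u2013\u2212"  # ASCII hyphen, en dash, minus sign


def parse_vote(text):
    """Convert a rating string such as '+1', '0' or '–5' to an int."""
    text = text.strip()
    if text and text[0] in MINUS_SIGNS:
        return -int(text[1:])
    return int(text.lstrip("+"))


def user_totals(rows):
    """Map user -> [comments, pluses, minuses] from (user, vote) pairs."""
    totals = defaultdict(lambda: [0, 0, 0])
    for user, vote in rows:
        score = parse_vote(vote)
        entry = totals[user]
        entry[0] += 1
        if score > 0:
            entry[1] += score
        else:
            entry[2] -= score  # score <= 0, so this adds its absolute value
    return dict(totals)


# Made-up rows in the (user, vote) shape of the CSV columns.
rows = [("u1", "+1"), ("u1", "\u20135"), ("u1", "0"), ("u2", "+3")]
totals = user_totals(rows)
by_pluses = sorted(totals, key=lambda u: totals[u][1], reverse=True)
by_minuses = sorted(totals, key=lambda u: totals[u][2], reverse=True)
print(by_pluses, by_minuses)
```

Users with zero minuses or zero pluses, as in the "absolute" lists, can then be picked out by filtering on the second or third element.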

Conclusion

I did not manage to compute everything I had planned, but I hope it was interesting.

As you can see, even data with so few fields can provide interesting material for analysis. There is still plenty to dig into, from building a "word cloud" to text analysis. If any interesting results emerge, they will be published.

source: www.habr.com
