Habrastatistics: analyzing reader comments

Hi Habr. In the previous part, the popularity of the site's different sections was analyzed, and along the way a question came up: what data can be extracted from the comments on articles? I also wanted to test one hypothesis, which I discuss below.

The data turned out to be quite interesting, and we even managed to put together a "mini-rating" of commenters. Continued under the cut.

Data collection

For the analysis we will use this year's data, 2019, especially since I already have the list of articles in CSV form. All that remains is to extract the comments from each article. Luckily for us, they are stored right in the article page, so no additional requests are needed.

To extract the comments from an article, the following code is enough:

import requests
import dateparser

class Str(str):
    # Assumed minimal version of the helper from the previous part:
    # returns the text between two markers, or an empty Str if one is missing.
    def find_between(self, first, last):
        start = self.find(first)
        if start < 0:
            return Str("")
        start += len(first)
        end = self.find(last, start)
        return Str(self[start:end]) if end >= 0 else Str("")

r = requests.get("https://habr.com/ru/post/467453/")
data_html = r.text
comments = data_html.split('<div class="comment" id=')

comments_list = []
for comment in comments:
    body = Str(comment).find_between('<div class="comment__message', '<div class="comment__footer"').find_between('>', '</div>')
    if len(body) < 4:
        continue

    # Drop control whitespace, then characters that would break a CSV row
    body = body.translate(str.maketrans(dict.fromkeys("\t\n\r\v\f")))
    body = body.replace('"', "'").replace(',', " ").replace('<br>', ' ').replace('<p>', '').replace('</p>', '').replace('  ', ' ')

    # Nickname, publication time and rating of the comment
    user = Str(comment).find_between('data-user-login', '>').find_between('"', '"')
    date_str = Str(comment).find_between('<time class="comment__date-time comment__date-time_published', 'time>').find_between('>', '<')
    vote = Str(comment).find_between('<div class="voting-wjt', '</div>').find_between('<span', 'span>').find_between('>', '<')
    date = dateparser.parse(date_str)

    csv_data = "{},{},{},{}".format(user, date, vote, body)
    comments_list.append(csv_data)

This gives us a list of comments that looks like this (nicknames removed for privacy reasons):

xxxxxxx,2019-02-06 11:50:00,0,А можно пример как именно?
xxxxxxx,2019-02-24 16:15:00,+1,Побольше читайте независимые официальные источники чтобы таких вопросов не было.
xxxxxxx,2019-02-23 20:15:00,–5,А не важно главное в итоге в плюсе оказаться

As you can see, for each comment we can get the username, the date, the rating, and the text itself. Let's see what we can get out of this.
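To work with these rows later, each one has to be split back into typed fields. The following is a minimal sketch of that step, not the code used for the article: the `Comment` type and `parse_row` function are hypothetical names, and it assumes every row has the `user,date,vote,body` layout shown above, with a date in `YYYY-MM-DD HH:MM:SS` form.

```python
from datetime import datetime
from typing import NamedTuple


class Comment(NamedTuple):
    user: str
    date: datetime
    vote: int
    body: str


def parse_row(line: str) -> Comment:
    """Split one 'user,date,vote,body' row into typed fields."""
    # Commas inside the body were replaced with spaces when the row was
    # written, so splitting on the first three commas is enough.
    user, date_str, vote_str, body = line.rstrip("\n").split(",", 3)
    date = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
    # Negative ratings may use a typographic dash (U+2013 or U+2212)
    # instead of the ASCII minus sign that int() expects.
    vote = int(vote_str.replace("\u2013", "-").replace("\u2212", "-"))
    return Comment(user, date, vote, body)


# Hypothetical row in the same layout as the sample above
sample = "xxxxxxx,2019-02-23 20:15:00,\u20135,some comment text"
print(parse_row(sample))
```

Rows whose date could not be parsed at collection time would fail here and would need to be skipped or handled separately.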

By the way, the original idea behind collecting ratings was slightly different: to see what ratings users give. Look at YouTube, for example: even the most reasonable video, even one that carries no subjective content at all and is just a reference or a news release, still collects a certain number of dislikes. The hypothesis was that there are users who, purely clinically, dislike everything; maybe serotonin is not being produced in the brain, or something like that. Maybe such a person should not be sitting on Habré but treating depression... But as it turned out, I cannot check this here, because the list of those who gave ratings is not stored in the comments or in the article. So we will work with the data we have. The result is a "reverse" rating: you can see what ratings users _receive_. Which is, in principle, also interesting.

Processing

First, the traditional disclaimer. This rating, like all the previous ones, is unofficial. I do not guarantee that I have not made a mistake somewhere. For those interested in the technical details, more detailed code is given in the previous part.

So, let's begin. Comments from this year, 2019 (which is not over yet), were taken for the analysis. At the time of writing, users had written 448533 comments, and the CSV file is 288 MB. Powerful, impressive.

Time of posting

Let's group the comments by hour, separating weekdays and weekends.

(Figure: comments by hour of day, weekdays and weekends)

Here we are interested not in the absolute values but in the relative ones. Taken "as is", it turns out that most comments are written during working hours, from 10 to 18 😉 On the other hand, time zones are not taken into account here, so the question remains open.
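The grouping described here can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code behind the chart: `hourly_shares` is a hypothetical name, the input is a list of already parsed `datetime` values, and, as noted above, no time-zone correction is applied.

```python
from collections import Counter
from datetime import datetime


def hourly_shares(dates):
    """Return two 24-item lists: the share of comments per hour
    for weekdays and for weekends."""
    weekday_counts, weekend_counts = Counter(), Counter()
    for d in dates:
        # datetime.weekday(): Monday is 0, Saturday and Sunday are 5 and 6
        target = weekend_counts if d.weekday() >= 5 else weekday_counts
        target[d.hour] += 1

    def normalize(counts):
        # Relative values: divide by the group's own total
        total = sum(counts.values())
        return [counts[h] / total if total else 0.0 for h in range(24)]

    return normalize(weekday_counts), normalize(weekend_counts)


# Hypothetical timestamps, only to show the call
dates = [
    datetime(2019, 2, 6, 11, 50),   # Wednesday
    datetime(2019, 2, 6, 11, 5),    # Wednesday
    datetime(2019, 2, 6, 16, 15),   # Wednesday
    datetime(2019, 2, 23, 20, 15),  # Saturday
]
weekdays, weekends = hourly_shares(dates)
print(weekdays[11], weekends[20])
```

Normalizing each group by its own total is what makes weekdays and weekends comparable, since there are far more weekday comments in absolute terms.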

Let's look at the distribution of comments over the year:

(Figure: number of comments per day during 2019)

And of course it is cyclical: the surges clearly fall in the middle of the week and the weekly periodicity is plain to see, so we can say with high confidence that people read and comment on Habr at work (but this is not certain).
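One simple way to check the weekly cycle numerically is to count comments per calendar day and then average those counts for each day of the week. The sketch below assumes a list of parsed `datetime` values; `comments_per_day` and `mean_by_weekday` are hypothetical helper names, not part of the original code.

```python
from collections import Counter
from datetime import datetime


def comments_per_day(dates):
    """Count comments per calendar day."""
    return Counter(d.date() for d in dates)


def mean_by_weekday(daily_counts):
    """Average the daily counts separately for each day of the week
    (0 = Monday ... 6 = Sunday)."""
    totals, days = Counter(), Counter()
    for day, count in daily_counts.items():
        totals[day.weekday()] += count
        days[day.weekday()] += 1
    return {wd: totals[wd] / days[wd] for wd in sorted(days)}


# Hypothetical timestamps, only to show the calls
dates = [
    datetime(2019, 2, 6, 11, 50),   # Wednesday
    datetime(2019, 2, 6, 12, 0),    # Wednesday
    datetime(2019, 2, 13, 9, 0),    # Wednesday
    datetime(2019, 2, 23, 20, 15),  # Saturday
]
daily = comments_per_day(dates)
print(mean_by_weekday(daily))
```

Days with no comments at all do not appear in `daily`, so they are not averaged in; with hundreds of thousands of comments per year this should rarely matter.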

By the way, there was an idea to test the hypothesis that the number of minuses or pluses received depends on the day of the week or the time of day, but no relationship could be found: the time a rating was given is not stored, and there is no direct link between it and the time of the comment.

Users

Of course, I do not know the exact number of users on the site. But about 25000 people left at least one comment this year.

The graph of the number of messages left by users looks very interesting:

(Figure: number of comments left per user)

At first I did not believe it, but there seems to be no mistake. 5% of users leave 60% of the messages, and 10% leave 74% of all messages (of which, let me remind you, there are 450 thousand this year). The majority just read the site and leave comments very rarely or not at all (the latter, naturally, are not in my list).
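Figures like these can be computed from the author column alone. A minimal sketch, assuming a plain list with one nickname per comment; `top_share` is a hypothetical name and the sample data is made up:

```python
import math
from collections import Counter


def top_share(users, fraction):
    """Share of all comments written by the most active `fraction`
    of commenters (for example 0.05 for the top 5%)."""
    counts = sorted(Counter(users).values(), reverse=True)
    if not counts:
        return 0.0
    # Always include at least one user, even for tiny fractions
    top_n = max(1, math.ceil(len(counts) * fraction))
    return sum(counts[:top_n]) / sum(counts)


# Hypothetical author column: one very active user and nine occasional ones
users = ["a"] * 11 + list("bcdefghij")
print(top_share(users, 0.10))
```

Only people who commented at least once are in the list, so the fractions are taken over commenters, not over all readers of the site.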

Rating

Let's move on to the last and most entertaining part of the statistics: the rating. For privacy reasons I will not give users' full nicknames; anyone who wants to will, I think, recognize themselves.

By number of comments this year, the top 5 are VoXXXX (3377 comments), 0xdXXXXX (3286 comments), strXXXX (3043 comments), AmXXXX (2897 comments) and khXXXX (2748 comments).

By number of pluses received, the top 5 are amXXXX (1395 comments, rating +3231/-309), tvXXXX (1544 comments, rating +3231/-97), WhuXXXX (921 comments, rating +2288/-13), MTXXXX (1328 comments, rating +1383/-7) and amaXXXX (736 comments, rating +1340/-16).

By absolutely positive rating (not a single negatively rated comment), the top spots are held by Milfgard and Boomrum. As an exception, I give their nicknames in full; I think they deserve it.

The minuses are interesting too. The top by number of minuses collected this year are siXX (473 pluses, 699 minuses), khXX (1915 pluses, 573 minuses) and nicXXXXX (456 pluses, 487 minuses). But as you can see, these users also have plenty of positively rated comments. By "absolute minus", the anti-top includes vladXXXX (55 comments, 84 minuses, 0 pluses), ekoXXXX (77 comments, 92 minuses, 1 plus) and iMXXXX (225 comments, 205 minuses, 12 pluses).
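Rankings of this kind come down to a per-user aggregation. The sketch below is one possible reading, not the author's confirmed method: it assumes that only the net rating of each comment is available, and it counts "pluses" as the sum of positive comment ratings and "minuses" as the sum of negative ones. The names `user_stats` and `top` and the sample rows are hypothetical.

```python
from collections import defaultdict


def user_stats(rows):
    """Aggregate (user, vote) pairs into per-user totals."""
    stats = defaultdict(lambda: {"comments": 0, "plus": 0, "minus": 0})
    for user, vote in rows:
        s = stats[user]
        s["comments"] += 1
        if vote > 0:
            s["plus"] += vote
        elif vote < 0:
            s["minus"] += -vote
    return dict(stats)


def top(stats, key, n=5):
    """User names sorted by the given field, largest first."""
    return sorted(stats, key=lambda u: stats[u][key], reverse=True)[:n]


# Hypothetical rows, only to show the calls
rows = [("alice", 3), ("alice", 1), ("bob", -2), ("bob", 5),
        ("carol", -4), ("carol", 0)]
stats = user_stats(rows)
print(top(stats, "plus", n=1))
print(top(stats, "minus", n=1))
# "Absolutely positive": some pluses and no negatively rated comment
print([u for u, s in stats.items() if s["minus"] == 0 and s["plus"] > 0])
```

The same `stats` dictionary covers all the lists above: sort by `comments`, by `plus`, or by `minus`, or filter on `minus == 0` for the absolutely positive group.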

Conclusion

I did not manage to calculate everything I had planned, but I hope it was interesting.

As you can see, even a dataset with a small number of fields can provide interesting material for analysis. There is still plenty to dig into, from building a "word cloud" to analyzing the text. If any interesting results turn up, they will be published.

Source: www.habr.com
