How I parsed Habr, part 1: trends

When the New Year's Olivier salad had been eaten, I had nothing left to do, so I decided to download all the articles from Habrahabr (and related platforms) to my computer and explore them.

Several interesting stories came out of it. The first is how the format and topics of articles evolved over the 12 years the site has existed. The dynamics of some topics, for example, are quite telling. Read on below the cut.

The parsing process

To understand how Habr developed, I had to go through all of its articles and extract meta-information (for example, dates) from them. The crawl was easy, because links to all articles look like "habrahabr.ru/post/337722/", and the numbers are assigned strictly in order. Knowing that the latest post has a number slightly under 350 thousand, I simply went through every possible document id in a loop (Python code):

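import numpy as np
from multiprocessing import Pool

# download_document is defined below; it saves each result to its own file
if __name__ == '__main__':  # guard needed on platforms that spawn worker processes
    with Pool(100) as p:
        docs = p.map(download_document, np.arange(350000))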

The download_document function tries to load the page with the given id and to extract meaningful information from the HTML structure.

import pickle

import requests
from bs4 import BeautifulSoup

def download_document(pid):
    """ Download and process a Habr document and its comments """
    # download the document
    r = requests.get('https://habrahabr.ru/post/' +str(pid) + '/')
    # parse the document
    soup = BeautifulSoup(r.text, 'html5lib') # instead of html.parser
    doc = {}
    doc['id'] = pid
    if not soup.find("span", {"class": "post__title-text"}):
        # this happens if the article never existed or was deleted
        doc['status'] = 'title_not_found'
    else:
        doc['status'] = 'ok'
        doc['title'] = soup.find("span", {"class": "post__title-text"}).text
        doc['text'] = soup.find("div", {"class": "post__text"}).text
        doc['time'] = soup.find("span", {"class": "post__time"}).text
        # create other fields: hubs, tags, views, comments, votes, etc.
        # ...
    # save the result to a separate file
    fname = r'files/' + str(pid) + '.pkl'
    with open(fname, 'wb') as f:
        pickle.dump(doc, f)

While parsing, I discovered several new things.

First, people say that creating more processes than the processor has cores is useless. But in my case it turned out that the limiting resource was not the processor but the network, and 100 processes finished faster than 4 or, say, 20.
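
The effect is easy to reproduce without touching the network. The sketch below is not the crawler from this post: it swaps the process pool for a thread pool and replaces the HTTP request with time.sleep. The name fake_download, the task count and the delay are made-up values chosen only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(pid):
    """Hypothetical stand-in for a network request: it only waits."""
    time.sleep(0.05)  # pretend we are waiting for the server to answer
    return pid

def timed_run(n_workers, n_tasks=40):
    """Run n_tasks fake downloads on n_workers threads, return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(fake_download, range(n_tasks)))
    assert len(results) == n_tasks
    return time.perf_counter() - start

few = timed_run(4)    # roughly "one worker per core"
many = timed_run(40)  # far more workers than cores
print(f"4 workers: {few:.2f}s, 40 workers: {many:.2f}s")
```

Because each task spends its time waiting rather than computing, the run with more workers should finish sooner, which is the same reason 100 processes beat 4 in the crawl.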

Second, some posts contained combinations of special characters, for example euphemisms like "%&#@". It turned out that html.parser, which I used at first, reacts badly to the combination &#, treating it as the start of an HTML entity. I was about to resort to black magic, but the forum suggested simply changing the parser.

Third, I managed to download all publications except three. Documents numbered 65927, 162075, and 275987 were instantly deleted by my antivirus. These are articles, respectively, about a JavaScript chain that downloads a malicious PDF, SMS ransomware in the form of a set of browser plugins, and the CrashSafari.com site that sends iPhones into a reboot. The antivirus found one more article later, during a system scan: post 338586, about scripts on a pet store website that use the visitor's processor to mine cryptocurrency. So we can consider the antivirus to be doing its job quite adequately.

"Live" articles turned out to be only half of the potential maximum: 166307 of them. For the rest, Habr offers the options "the page is outdated, has been deleted, or never existed." Well, anything can happen.

Downloading the articles was followed by technical work: for example, publication dates had to be converted from the format "21 December 2006 at 10:47" into a standard datetime, and "12,8k" views into 12800. A few more incidents surfaced at this stage. The funniest one has to do with vote counts and data types: some old posts hit an int overflow and ended up with 65535 votes each.
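
A minimal sketch of these two conversions might look like the following. It assumes the date string has exactly the shape quoted above with an English month name, and that view counts use a decimal comma with an optional "k" suffix. The function names parse_date and parse_views are hypothetical and are not taken from the original code.

```python
from datetime import datetime

# Month names are mapped by hand so the result does not depend on the locale.
MONTHS = {
    'january': 1, 'february': 2, 'march': 3, 'april': 4,
    'may': 5, 'june': 6, 'july': 7, 'august': 8,
    'september': 9, 'october': 10, 'november': 11, 'december': 12,
}

def parse_date(raw):
    """Convert a string like '21 December 2006 at 10:47' to a datetime."""
    day, month, year, _at, clock = raw.split()
    hour, minute = clock.split(':')
    return datetime(int(year), MONTHS[month.lower()], int(day),
                    int(hour), int(minute))

def parse_views(raw):
    """Convert a string like '12,8k' or '345' to an integer view count."""
    text = raw.strip().lower().replace(',', '.')
    if text.endswith('k'):
        # '12.8k' means 12.8 thousand; round to avoid float artifacts
        return int(round(float(text[:-1]) * 1000))
    return int(text)

print(parse_date('21 December 2006 at 10:47'))
print(parse_views('12,8k'))
```

Strings in any other shape will raise an exception here, which is arguably what you want during cleanup: it surfaces the odd cases instead of silently guessing.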

As a result, the article texts (without pictures) took up 1.5 gigabytes, the comments with their meta-information another 3, and the meta-information about the articles about a hundred megabytes. All of this fits entirely in RAM, which was a pleasant surprise for me.

I started the analysis not with the texts themselves but with the meta-information: dates, tags, hubs, views and likes. It turned out that it could tell a lot too.

Habrahabr development trends

Articles have been published on the site since 2006, most intensively in 2008-2016.

How actively these articles were read at different times is not so easy to assess. Texts from 2012 and earlier received more comments and ratings, while newer texts have more views and bookmarks. These metrics behaved the same way (they halved) only once, in 2015. Perhaps, amid the economic and political crisis, readers' attention shifted from IT blogs to more painful issues.

In addition to the articles themselves, I downloaded their comments. There were 6 million comments, although 240 thousand of them turned out to be banned ("a UFO flew in and published this inscription here"). A useful property of comments is that they carry a timestamp. By studying comment times, you can get a rough idea of when articles are read at all.

It turned out that most articles are both written and commented on somewhere between 10:00 and 20:00, i.e. during a typical Moscow working day. This may mean that Habr is read for professional purposes, or that it is a good way to procrastinate at work. By the way, this distribution over the time of day has been stable from the very founding of Habr to the present day.
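
A distribution like this can be computed with a simple counter over the hour of each timestamp. The sketch below is only an illustration: the sample timestamps are invented, the helper names are hypothetical, and it assumes the timestamps are already datetime objects in Moscow time.

```python
from collections import Counter
from datetime import datetime

def comments_by_hour(timestamps):
    """Return a list of 24 counts: how many comments fall into each hour of day."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

def share_between(hist, start_hour, end_hour):
    """Share of comments with start_hour <= hour < end_hour."""
    total = sum(hist)
    return sum(hist[start_hour:end_hour]) / total if total else 0.0

# Invented sample: three daytime comments and one at night
sample = [
    datetime(2017, 1, 9, 11, 15),
    datetime(2017, 1, 9, 14, 2),
    datetime(2017, 1, 9, 19, 40),
    datetime(2017, 1, 10, 2, 5),
]
hist = comments_by_hour(sample)
print(share_between(hist, 10, 20))
```

Computing the same histogram separately for each year is one way to check whether the daily pattern stays stable over time.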

However, the main benefit of comment timestamps is not the time of day but the span of an article's "active life". I calculated how the time from an article's publication to its comments is distributed. It turned out that the median comment (green line) now arrives after about 20 hours, i.e. on average slightly more than half of all comments on an article are left within the first day after publication. Within two days, 75% of all comments are in. Earlier articles were read even faster: in 2010, for example, half of the comments arrived within the first 6 hours.
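
For a single article, the calculation can be sketched as follows. This is an assumption about how such numbers could be obtained, not the code behind the chart: the timestamps are invented, comment_delays_hours is a hypothetical helper, and the quartile uses the default method of statistics.quantiles.

```python
from datetime import datetime, timedelta
from statistics import median, quantiles

def comment_delays_hours(published, comment_times):
    """Hours elapsed between publication and each comment, sorted ascending."""
    return sorted((t - published).total_seconds() / 3600 for t in comment_times)

# Invented example: one article and seven comments
published = datetime(2017, 1, 9, 10, 0)
comments = [published + timedelta(hours=h) for h in (1, 2, 4, 20, 30, 48, 100)]

delays = comment_delays_hours(published, comments)
median_delay = median(delays)       # half of the comments arrive by this time
q75_delay = quantiles(delays, n=4)[2]  # three quarters arrive by this time
print(median_delay, q75_delay)
```

Pooling the delays of all articles published in a given year and repeating the calculation would give one point per year for each curve.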

I was surprised to find that comments have gotten longer: the average number of characters in a comment has almost doubled over Habr's lifetime!

A simpler form of feedback than comments is votes. Unlike many other resources, Habr lets you give not only pluses but also minuses. However, readers do not use the latter option very often: the current share of dislikes is about 15% of all votes cast. It used to be higher, but readers have become kinder over time.

The texts themselves have changed over time. For example, the typical text length has been growing steadily since the site's launch, crises notwithstanding. In ten years, texts have become ten times longer!

The style of the texts has (to a first approximation) also changed. During the first years of Habr's existence, for example, the share of code and numbers in the texts grew.

Having figured out the overall dynamics of the site, I decided to measure how the popularity of different topics changed. Topics can be extracted from texts automatically, but to begin with you can avoid reinventing the wheel and use the ready-made tags that the authors attached to each article. I plotted four typical trends on the chart. The "Google" topic dominated at first (perhaps mainly because of SEO), but has been losing weight over the years. Javascript has been a popular topic all along and keeps growing slowly, whereas machine learning has started to gain popularity rapidly only in recent years. Linux, on the other hand, has remained equally relevant throughout the decade.

Of course, I became curious about which topics attract the most reader activity. I calculated the median number of views, votes and comments for each topic. Here is what came out:

  • Most viewed topics: arduino, web design, web development, digest, links, css, html, html5, nginx, algorithms.
  • Most "liked" topics: vkontakte, humor, jquery, opera, c, html, web development, html5, css, web design.
  • Most discussed topics: opera, skype, freelance, vkontakte, ubuntu, work, nokia, nginx, arduino, firefox.
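
Rankings like the ones above can be produced by grouping a metric by tag and taking the median. The sketch below assumes each article is a dict with a 'tags' list and numeric fields such as 'views'. That layout, the function names and the sample numbers are all invented for illustration.

```python
from collections import defaultdict
from statistics import median

def median_by_tag(articles, field):
    """Median value of `field` over all articles carrying each tag."""
    values = defaultdict(list)
    for article in articles:
        for tag in article['tags']:
            values[tag].append(article[field])
    return {tag: median(vals) for tag, vals in values.items()}

def top_tags(articles, field, n=10):
    """Tags sorted by their median `field`, highest first."""
    medians = median_by_tag(articles, field)
    return sorted(medians, key=medians.get, reverse=True)[:n]

# Invented sample data
articles = [
    {'tags': ['arduino', 'linux'], 'views': 30000},
    {'tags': ['arduino'], 'views': 10000},
    {'tags': ['linux'], 'views': 5000},
    {'tags': ['php'], 'views': 7000},
]
print(top_tags(articles, 'views'))
```

On real data it would probably make sense to skip tags that appear on only a handful of articles, since a median over two or three posts is mostly noise.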

By the way, since I am comparing topics anyway, they can also be ranked by frequency (and the results compared with a similar article from 2013).

  • Over all the years of Habr's existence, the most popular tags (in descending order) are google, android, javascript, microsoft, linux, php, apple, java, python, programming, startups, development, ios, startup, social networks.
  • In 2017, the most popular were javascript, python, java, android, development, linux, c++, programming, php, c#, ios, machine learning, information security, microsoft, react.
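
A frequency ranking of this kind takes only a few lines with collections.Counter. As before, this is a sketch under assumptions: the 'year' and 'tags' fields and the sample records are made up, and most_common_tags is a hypothetical helper.

```python
from collections import Counter

def most_common_tags(articles, year=None, n=15):
    """Most frequent tags, optionally restricted to articles from one year."""
    counter = Counter()
    for article in articles:
        if year is None or article['year'] == year:
            counter.update(article['tags'])
    return [tag for tag, _count in counter.most_common(n)]

# Invented sample data
sample = [
    {'year': 2008, 'tags': ['google', 'php']},
    {'year': 2009, 'tags': ['google', 'php']},
    {'year': 2010, 'tags': ['google']},
    {'year': 2017, 'tags': ['javascript', 'python']},
    {'year': 2017, 'tags': ['javascript']},
    {'year': 2017, 'tags': ['google', 'javascript', 'python']},
]
print(most_common_tags(sample, n=2))
print(most_common_tags(sample, year=2017, n=2))
```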

Comparing these rankings, you can notice, for example, the triumphant march of Python and the extinction of php, or the "sunset" of startup topics and the rise of machine learning.

Not all tags on Habr have such an obvious thematic coloring. For example, here are a dozen tags that occurred only once but seemed funny to me. So: "idea is the engine of progress", "boot from a floppy image", "Iowa State", "drama", "superalesh", "steam engine", "things to do on Saturday", "I have a fox in a meat grinder", "and it turned out as always", "we couldn't come up with funny tags". To determine the topic of such articles, tags are not enough: you have to run topic modeling on the article texts.

A more detailed analysis of article content will follow in the next post. First, I am going to build a model that predicts the number of page views of an article from its content. Second, I want to teach a neural network to generate texts in the same style as Habr's authors. So subscribe 🙂

P.S. And here is the extracted dataset.

Source: www.habr.com