What Can Go Wrong with Data Science? Data Collection

Today there are 100500 Data Science courses, and it has long been known that the most money in Data Science can be made from Data Science courses (why dig when you can sell shovels?). The main drawback of these courses is that they have nothing to do with real work: nobody will hand you clean, processed data in the required format. And when you leave the course and start solving a real problem, many nuances emerge.

So we are starting a series of notes, "What can go wrong with Data Science", based on real events that happened to me, my comrades, and my colleagues. We will analyze typical Data Science tasks using real examples of how they actually play out. Let's start today with the data collection task.

The first thing people stumble over when they start working with real data is actually collecting the data that matters most to us. The key message of this article:

We systematically underestimate the time, resources, and effort required to collect, clean, and prepare data.

And most importantly, we will discuss what to do to prevent this.

According to various estimates, cleaning, transformation, data processing, feature engineering, etc. take 80-90% of the time and analysis takes 10-20%, while almost all educational material focuses exclusively on analysis.

Let's look at a simple analytical problem in three versions as a typical example and see what "aggravating circumstances" arise.

As the example, we will consider similar variations of one task, collecting data and comparing communities, for:

  1. Two Reddit subreddits
  2. Two sections of Habr
  3. Two Odnoklassniki groups

The conditional approach in theory

Open the site and read the examples. If everything is clear, set aside a few hours for reading and a few hours for writing code from the examples and debugging it. Add a few hours for the collection itself. Throw in a few hours in reserve (multiply by two and add N hours).

Key point: the time estimate is based on assumptions and guesswork about how long it will take.

You should begin the time analysis by estimating the following parameters for the conditional problem described above:

  • What the size of the data is and how much of it has to be physically collected (*see below*).
  • How long it takes to collect one record and how long you have to wait before you can collect the next one.
  • Plan for code that saves state and restarts when (not if) everything fails.
  • Find out whether authorization is needed and budget the time for obtaining access via the API.
  • Estimate the number of errors as a function of data complexity, evaluated for the specific task: the structure, how many transformations, what to extract and how.
  • Account for network errors and problems with non-standard behavior of the project.
  • Assess whether the required functions are in the documentation and, if not, how a workaround can be built and how much it will cost.
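The first parameters in the list above can be turned into a rough back-of-the-envelope calculation. The sketch below is only an illustration: the function name and all of its parameters are hypothetical, and the numbers would come from a short trial run against the real source.

```python
def estimate_collection_hours(n_records, seconds_per_record, pause_seconds,
                              failure_rate=0.0, workers=1):
    """Rough wall-clock estimate for a collection run, in hours.

    All parameters are hypothetical inputs measured during a short trial
    run; failure_rate is the fraction of requests that must be repeated.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    # Each record costs the request itself plus the mandatory pause.
    per_record = seconds_per_record + pause_seconds
    # With failure rate p, a record needs 1 / (1 - p) attempts on average.
    expected_attempts = 1.0 / (1.0 - failure_rate)
    total_seconds = n_records * per_record * expected_attempts / workers
    return total_seconds / 3600.0
```

For instance, 100,000 records at 1 second per request with a 2 second pause and 5% of requests repeated comes out at roughly 88 hours on a single worker, which is already a server job rather than an evening script.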

Most importantly, to estimate the time you actually have to spend time and effort on "reconnaissance in force"; only then will your planning be adequate. So no matter how hard you are pushed to answer "how long does it take to collect the data", buy yourself some time for a preliminary analysis and explain how much the time will vary depending on the real parameters of the problem.

Now we will show specific examples in which these parameters change.

Key point: the estimate is based on an analysis of the key factors that affect the scope and complexity of the work.

Guesswork-based estimation is a good approach when the functional elements are small enough and there are few factors that can significantly affect the design of the problem. But for a number of Data Science problems these factors become extremely numerous, and such an approach becomes inadequate.

Comparison of Reddit communities

Let's start with the simplest case (as it turns out later). To be completely honest, this is an almost ideal case. Let's go through our complexity checklist:

  • There is a neat, clear, and documented API.
  • It is extremely simple and, most importantly, a token is obtained automatically.
  • There is a Python wrapper with plenty of examples.
  • There is a community that analyzes and collects data on Reddit (even YouTube videos explain how to use the Python wrapper, for example).
  • The methods we need most likely exist in the API. Moreover, the code looks compact and clean; below is an example of a function that collects the comments on a post.

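import logging

from praw import Reddit

logger = logging.getLogger(__name__)
AGENT = "community-comparison-script"  # placeholder user agent string

def get_comments(submission_id):
    # Credentials are expected in praw.ini or in environment variables.
    reddit = Reddit(check_for_updates=False, user_agent=AGENT)
    submission = reddit.submission(id=submission_id)
    # Expand "load more comments" stubs; stubs left unexpanded are returned.
    more_comments = submission.comments.replace_more()
    if more_comments:
        skipped_comments = sum(x.count for x in more_comments)
        logger.debug('Skipped %d MoreComments (%d comments)',
                     len(more_comments), skipped_comments)
    return submission.comments.list()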

Taken from this collection of handy wrapper utilities.

Even though this is the best case, a number of important real-life factors still have to be taken into account:

  • API limits - we are forced to take the data in batches (sleep between requests, etc.).
  • Collection time - for a complete analysis and comparison, you will have to set aside significant time just for the spider to walk through the subreddit.
  • The bot must run on a server - you cannot just run it on a laptop, put the laptop in your backpack, and go about your business. So I ran everything on a VPS. With the promo code habrahabr10 you can save another 10% of the cost.
  • Physical inaccessibility of some data (it is visible only to administrators or is too difficult to collect) - this must be taken into account; in principle, not all data can be collected in adequate time.
  • Network errors: networking is a pain.
  • This is living, real data - it is never clean.
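The first and fifth points above (API limits and network errors) usually end up as a small retry-and-pause layer around every request. The following is a minimal sketch under stated assumptions: `fetch` stands for whatever function retrieves one item, the function names are hypothetical, and the built-in `ConnectionError` and `TimeoutError` are placeholders for the exceptions your HTTP library actually raises.

```python
import time

def fetch_with_retries(fetch, item, max_attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call fetch(item), retrying on network-style errors with backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch(item)
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)

def collect_in_batches(fetch, items, pause=1.0, sleep=time.sleep):
    """Fetch items one by one, pausing between requests to respect limits."""
    results = []
    for item in items:
        results.append(fetch_with_retries(fetch, item, sleep=sleep))
        sleep(pause)
    return results
```

The `sleep` parameter is injectable only so that the pauses can be replaced in tests; in a real run the default `time.sleep` is used.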

Of course, these nuances have to be included in the development plan. The specific hours or days depend on development experience or on experience with similar tasks. Still, we can see that the task here is purely an engineering one and needs no extra contortions to solve: everything can be assessed, scheduled, and done quite well.

Comparison of Habr sections

Let's move on to a more interesting and less trivial case: comparing streams and/or sections of Habr.

Let's go through our complexity checklist. Here, to understand each point, you will have to dig into the task itself a little and experiment.

  • At first you think there is an API, but there isn't. Well, yes, Habr has an API, but it is simply not available to users (or maybe it does not work at all).
  • Then you just start parsing HTML - "import requests", what could go wrong?
  • How do you parse it, anyway? The simplest and most commonly used approach is to iterate over IDs. Note that it is not the most efficient one and that it has to handle different cases - here is an example of the density of real IDs among all existing ones.

    [Figure: density of real IDs among all existing IDs]
    Taken from this article.

  • Raw data wrapped in HTML on top of the web is a pain. For example, you want to collect and save the rating of an article: you tore the score out of the HTML and decided to save it as a number for further processing:

    1) int(score) throws an error: on Habr the minus in a line such as "–5" is an en dash, not a minus sign (unexpected, right?), so at some point I had to bring the parser back to life with this terrible fix.

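    try:
        # Habr renders negative scores with an en dash, which int() rejects.
        score_txt = post.find(class_="score").text.replace(u"–", "-").replace(u"+", "+")
        score = int(score_txt)
        if check_date(date):
            post_score += score
    except (AttributeError, ValueError):
        # Assumed handling: skip posts with a missing or unparsable score.
        pass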
    

    There may be no date, pluses, or minuses at all (as the check_date function above suggests, this does happen).

    2) Unescaped special characters - they will come, so you need to be prepared.

    3) The structure changes depending on the type of post.

    4) Old posts may have a **weird structure**.

  • Essentially, errors and everything that may or may not happen have to be handled, and you cannot predict for sure what will go wrong, what other shape the structure may take, or what will fall off where. You simply have to try and take into account the errors that the parser throws.
  • Then you realize that you need to parse in several threads, otherwise parsing in one thread will take 30+ hours (this is purely the run time of an already working single-threaded parser that sleeps and does not get banned). In this article, at some point this led to a scheme like this:

[Figure: parsing scheme from the cited article]
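The en dash problem described above can be isolated in a small helper instead of being patched inline. This is a sketch, not the parser from the article: the function name, the set of dash characters, and the choice to return None for unparsable input are all assumptions made for illustration.

```python
# Dash-like characters that may stand in for a minus sign in scraped text:
# en dash, em dash, Unicode minus sign, Unicode hyphen.
DASHES = "\u2013\u2014\u2212\u2010"

def parse_score(text):
    """Return the score as an int, or None if the text holds no number."""
    if text is None:
        return None
    cleaned = text.strip()
    for dash in DASHES:
        cleaned = cleaned.replace(dash, "-")
    try:
        return int(cleaned)
    except ValueError:
        return None
```

Returning None rather than raising keeps the decision about missing scores with the caller, which can count such posts and report how much of the data was affected.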

The complete complexity checklist:

  • Working with the network and parsing HTML by iterating and searching over IDs.
  • Documents with a heterogeneous structure.
  • There are many places where the code can easily fall over.
  • You have to write || (parallel) code.
  • The necessary documentation, code examples, and/or community are missing.

The estimated time for this task will be 3-5 times higher than for collecting data from Reddit.
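The "parallel code" item in the checklist can be sketched with a thread pool from the standard library. The example below is an assumption-laden illustration, not the parser from the article: `fetch_post` is a hypothetical function that downloads and parses one post by ID and returns None when no post exists under that ID.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_ids(fetch_post, ids, max_workers=8):
    """Fetch many post IDs in parallel; return (results, errors) dicts."""
    results, errors = {}, {}

    def safe_fetch(post_id):
        # Catch everything here so one bad page cannot kill the whole run.
        try:
            return post_id, fetch_post(post_id), None
        except Exception as exc:
            return post_id, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for post_id, post, exc in pool.map(safe_fetch, ids):
            if exc is not None:
                errors[post_id] = exc
            elif post is not None:  # None means "no post under this ID"
                results[post_id] = post
    return results, errors
```

Collecting the exceptions instead of re-raising them makes it possible to inspect afterwards which IDs failed and why, and to rerun only those.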

Comparison of Odnoklassniki groups

Let's move on to the most technically interesting case described here. For me it was interesting precisely because at first glance it looks quite trivial, but it turns out to be nothing of the kind as soon as you poke it with a stick.

Let's start with our complexity checklist and note that many of the items turn out to be much harder than they look at first:

  • There is an API, but it almost completely lacks the necessary functions.
  • For certain functions you have to request access by mail, so access is not granted instantly.
  • It is terribly documented (to begin with, Russian and English terms are mixed everywhere and completely inconsistently, so sometimes you simply have to guess what is wanted from you) and, moreover, its design is not suited to obtaining data, for example, the function we need.
  • The documentation requires a session but does not actually use it, and there is no way to understand all the intricacies of the API modes other than poking around and hoping something will work.
  • There are no examples and no community; the only point of support for collecting information is a small Python wrapper (without many usage examples).
  • Selenium seems to be the most workable option, since much of the necessary data is locked.
    1) That is, authorization goes through a fictitious user (with registration by hand).

    2) However, Selenium gives no guarantees of correct and repeatable operation (at least in the case of ok.ru, for sure).

    3) The Ok.ru website contains JavaScript errors and sometimes behaves strangely and inconsistently.

    4) You need to handle pagination, loading of elements, etc.

    5) API errors returned by the wrapper have to be handled awkwardly, for example, like this (a piece of experimental code):

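    import time

    import odnoklassniki.api
    from tqdm import tqdm

    def get_comments(args, context, discussions):
        pause = 1  # seconds to sleep between API calls
        all_comments = set()
        if args.extract_comments:
            # makes sense to keep track of already processed discussions
            for discussion in tqdm(discussions):
                try:
                    # get_comments_from_discussion_via_api is the author's helper (not shown)
                    comments = get_comments_from_discussion_via_api(context, discussion)
                except odnoklassniki.api.OdnoklassnikiError as e:
                    # Fall back to an empty set so a failed call never reuses stale data.
                    comments = set()
                    if "NOT_FOUND" not in str(e):
                        print(e)
                        breakpoint()  # experimental code: stop and inspect unknown errors
                all_comments |= comments
                time.sleep(pause)
        return all_comments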
    

    My favorite error was:

    OdnoklassnikiError("Error(code: 'None', description: 'HTTP error', method: 'discussions.getComments', params: …)")

    6) In the end, Selenium + API looks like the most rational option.

  • You have to save state and restart the system, and handle many errors, including inconsistent behavior of the site. These errors are quite hard to imagine in advance (unless you write parsers professionally, of course).
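Saving state and restarting, as required in the last point, can be as simple as a checkpoint file with the IDs that are already done. The sketch below is one possible shape under stated assumptions: the function names and the JSON file layout are hypothetical, and the IDs are assumed to be JSON-serializable and sortable (for example, integers).

```python
import json
import os

def load_done(path):
    """Return the set of already processed IDs, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))

def save_done(path, done):
    """Write the processed IDs via a temp file so a crash cannot corrupt them."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)  # atomic rename on the same filesystem

def run_with_checkpoint(process, ids, path):
    """Process each ID once, surviving restarts via the checkpoint file."""
    done = load_done(path)
    for item_id in ids:
        if item_id in done:
            continue  # handled in a previous run
        process(item_id)
        done.add(item_id)
        save_done(path, done)
    return done
```

Writing the checkpoint after every item keeps the sketch short; for a long run it would probably be cheaper to save every N items and accept redoing a few after a crash.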

The conditional time estimate for this task will be 3-5 times higher than for collecting data from Habr, even though in the case of Habr we use a head-on approach with HTML parsing, while in the case of OK we can work with the API in the critical places.

Conclusions

No matter how much you are pressed to estimate deadlines "on the spot" (we are planning today!) for a large data processing pipeline module, the execution time is almost impossible to estimate, even qualitatively, without analyzing the parameters of the task.

On a slightly more philosophical note, agile estimation strategies work well for engineering tasks, but they struggle with problems that are more experimental and, in a sense, "creative" and exploratory, that is, less predictable, as in the examples discussed here.

Of course, data collection is just a prime example. It is usually an incredibly simple and technically uncomplicated task, and the devil is often in the details. And it is precisely on this task that we can show the whole range of things that can go wrong and exactly how long the work can take.

If you glance at the characteristics of the task without additional experiments, Reddit and OK look similar: there is an API and a Python wrapper, but in essence the difference is huge. Judging by these parameters, parsing Habr looks more complicated than OK, but in practice it is quite the opposite, and this is exactly what simple experiments on the parameters of the problem reveal.

In my experience, the most effective approach is to roughly estimate the time you will need for the preliminary analysis itself, for simple first experiments, and for reading the documentation; these will let you give an accurate estimate for the entire job. In terms of the popular agile methodology, I ask for a ticket to be created for "estimating task parameters", on the basis of which I can assess what can be done within the "sprint" and give a more accurate estimate for each task.

Therefore, the most effective argument seems to be one that shows a "non-technical" specialist how much the time and resources will vary depending on parameters that have yet to be assessed.


Source: www.habr.com
