What can go wrong with Data Science? Data collection

Today there are 100500 Data Science courses, and it has long been known that the easiest money in Data Science is made by selling Data Science courses (why dig when you can sell shovels?). The main flaw of these courses is that they have nothing to do with real work: nobody will hand you clean, processed data in the format you need. Once you leave the course and start solving a real problem, many nuances surface.

So we are starting a series of articles, "What can go wrong with Data Science", based on real events that happened to me, my friends and my colleagues. We will go through typical Data Science tasks using real examples and show how things actually happen. Today we start with the task of collecting data.

The first thing people stumble over when they start working with real data is collecting the very data that matters to them. The key message of this article:

We systematically underestimate the time, resources and effort required to collect, clean and prepare data.

Most importantly, we will discuss what to do to prevent this.

According to various estimates, cleaning, transformation, data processing, feature engineering and so on take 80-90% of the time, and analysis takes 10-20%, while almost all educational material focuses exclusively on analysis.

Let's take a simple analytical problem in three versions as a typical example and see what the "aggravating circumstances" are.

As the example, we will consider three similar variations of one task, collecting data and comparing communities, for:

  1. Two Reddit subreddits
  2. Two sections of Habr
  3. Two Odnoklassniki groups

The naive approach in theory

Open the site and read the examples. If everything is clear, set aside a few hours for reading, and a few hours for writing code from the examples and debugging it. Add a few hours for the collection itself. Throw in a few hours of reserve (multiply by two and add N hours).

Key Point: The time estimate is based on assumptions and guesses about how long things will take.

Instead, the time analysis should start by estimating the following parameters for the problem described above:

  • How large is the data, and how much of it has to be physically collected (*see below*).
  • How long does it take to collect one record, and how long do you have to wait before you can collect the next one?
  • Budget for writing code that saves state and restarts when (not if) everything fails.
  • Find out whether authorization is needed, and budget the time for obtaining access through the API.
  • Estimate the number of errors as a function of data complexity, evaluated for the specific task: structure, how many transformations, what to extract and how.
  • Budget for network errors and for problems caused by non-standard behavior of the project.
  • Check whether the functions you need are in the documentation, and if not, how a workaround can be built and how much effort it needs.
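As a rough illustration of how the first two parameters turn into a number, here is a minimal sketch. The function name, the retry model (a failed request is simply repeated once) and the figures in the usage example are assumptions made for illustration, not measurements.

```python
def estimate_collection_hours(n_records, seconds_per_record, pause_seconds,
                              failure_rate=0.0, workers=1):
    """Rough wall-clock estimate, in hours, for collecting n_records.

    failure_rate is the assumed fraction of requests that must be repeated once.
    workers is the number of collectors running in parallel.
    """
    attempts = n_records * (1.0 + failure_rate)
    total_seconds = attempts * (seconds_per_record + pause_seconds) / workers
    return total_seconds / 3600.0


# Hypothetical numbers: 500,000 records, 0.5 s per request, 1 s pause, 5% retries.
hours = estimate_collection_hours(500_000, 0.5, 1.0, failure_rate=0.05)
print(f"single collector: about {hours:.0f} hours")
```

Even this crude arithmetic shows why the size of the data and the pause between requests have to be known before any deadline is named.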

Most importantly, to estimate the time you actually have to spend time and effort on "reconnaissance in force". Only then will your planning be adequate. So no matter how hard you are pushed to say "how long will it take to collect the data", buy yourself some time for a preliminary analysis, and argue by how much the time will vary depending on the real parameters of the problem.

Now we will show concrete examples where these parameters change.

Key Point: The estimate is based on an analysis of the key factors that determine the scope and complexity of the work.

Guess-based estimation is a good approach when the functional elements are small enough and few factors can significantly affect the design of the solution. In many Data Science problems, however, such factors are extremely numerous and this approach becomes inadequate.

Comparison of Reddit communities

Let's start with the simplest case (as it turns out later). To be completely honest, this is an almost ideal case. Let's go through our complexity checklist:

  • There is a neat, clear and documented API.
  • It is extremely simple and, most importantly, a token is obtained automatically.
  • There is a python wrapper with lots of examples.
  • There is a community that analyzes and collects data on reddit (there are even YouTube videos explaining how to use the python wrapper).
  • The methods we need most likely exist in the API. Moreover, the code looks compact and clean; below is an example of a function that collects the comments on a post.

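import logging
from praw import Reddit

logger = logging.getLogger(__name__)
AGENT = 'comment-collector'  # placeholder user agent, replace with your own

def get_comments(submission_id):
    # Client credentials are expected to come from praw.ini or environment variables.
    reddit = Reddit(check_for_updates=False, user_agent=AGENT)
    submission = reddit.submission(id=submission_id)
    # replace_more() returns the MoreComments objects that were not expanded.
    more_comments = submission.comments.replace_more()
    if more_comments:
        skipped_comments = sum(x.count for x in more_comments)
        logger.debug('Skipped %d MoreComments (%d comments)',
                     len(more_comments), skipped_comments)
    return submission.comments.list()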

Taken from this collection of convenient wrapper utilities.

Even though this is the best case, a number of important real-life factors still have to be taken into account:

  • API limits: we are forced to take the data in batches (sleeping between requests, and so on).
  • Collection time: for a complete analysis and comparison, you have to set aside significant time just for the spider to walk through the subreddit.
  • The bot has to run on a server. You can't just start it on your laptop, put the laptop in your backpack and go about your business. So I ran everything on a VPS. With the promo code habrahabr10 you can save another 10% of the cost.
  • Some data is physically inaccessible (visible only to administrators, or too hard to collect). This has to be taken into account; in principle, not all data can be collected in adequate time.
  • Network errors: networking is a pain.
  • This is live, real data. It is never clean.
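The first and the fifth points above can be sketched together: pause between requests to respect the API limits, and retry with a growing delay when the network fails. This is a minimal sketch, not the code used in the article. `fetch` is a hypothetical callable that downloads one item, and the built-in `ConnectionError` and `TimeoutError` stand in for whatever exception types a real client library raises.

```python
import time


def fetch_with_retry(fetch, item_id, pause=1.0, max_attempts=5, sleep=time.sleep):
    """Call fetch(item_id), pausing after success and backing off after errors.

    fetch is a hypothetical callable supplied by the caller.
    sleep is injectable so the function can be exercised without real waiting.
    """
    delay = pause
    for attempt in range(1, max_attempts + 1):
        try:
            result = fetch(item_id)
            sleep(pause)  # stay under the rate limit before the next request
            return result
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(delay)
            delay *= 2  # exponential backoff: wait twice as long next time
```

The pause and the number of attempts directly feed the time estimate: every retry and every sleep is collection time that a guess-based estimate tends to leave out.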

Of course, these nuances have to be included in the development plan. The specific hours or days depend on development experience or on experience with similar tasks. Still, we can see that here the task is purely an engineering one and needs no extra contortions to solve: everything can be assessed, scheduled and done.

Comparison of Habr sections

Let's move on to a more interesting and less trivial case: comparing threads and/or sections of Habr.

Let's go through our complexity checklist. Here, to understand each point, you have to dig into the task itself a little and experiment.

  • At first you think there is an API, but there isn't. Yes, yes, Habr has an API, but it is simply not accessible to users (or maybe it doesn't work at all).
  • So you just start parsing html: "import requests", what could go wrong?
  • How do you parse at all? The simplest and most common approach is to iterate over IDs. Note that it is not the most efficient one and it has to handle different cases. Here is an example of the density of real IDs among all existing ones.

    [Figure: density of real IDs among all existing ones.]
    Taken from this article.

  • Raw data wrapped in HTML on the web is a pain. For example, you want to collect and store the rating of an article: you pull the score out of the html and decide to store it as a number for further processing:

    1) int(score) throws an error: on Habr the minus in a string such as "–5" is an en dash, not a minus sign (unexpected, right?), so at some point I had to bring the parser back to life with this terrible fix.

    try:
        score_txt = post.find(class_="score").text.replace(u"–","-").replace(u"+","+")
        score = int(score_txt)
        if check_date(date):
            post_score += score
    

    There may be no date, no pluses and no minuses at all (as the check_date function above shows, this did happen).

    2) Unescaped special characters: they will come, and you have to be ready.

    3) The structure changes depending on the type of post.

    4) Old posts may have a **weird structure**.

  • In essence, you have to handle errors and everything that may or may not happen, and you cannot predict for sure what will go wrong, what other structure a page may have, or what will fall off where. You simply have to try, and take into account the errors the parser throws.
  • Then you realize that you need to parse in several threads, otherwise parsing in a single thread takes 30+ hours (this is purely the run time of an already working single-threaded parser that sleeps and does not get banned). In this article, this led at some point to a similar scheme:

[Figure: parsing scheme.]
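A minimal sketch of the general idea, not the scheme used in that article: IDs are spread across a thread pool, and failures are recorded instead of being allowed to stop the whole run. `parse_one(post_id)` is a hypothetical function that downloads and parses a single post; the number of workers is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_ids(ids, parse_one, workers=8):
    """Run parse_one(post_id) over ids in a thread pool.

    Returns (results, failures): parsed posts keyed by ID, and the exceptions
    raised for the IDs that could not be parsed.
    """
    results, failures = {}, {}

    def safe_parse(post_id):
        try:
            return post_id, parse_one(post_id), None
        except Exception as exc:  # one bad page must not stop the whole run
            return post_id, None, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for post_id, parsed, error in pool.map(safe_parse, ids):
            if error is None:
                results[post_id] = parsed
            else:
                failures[post_id] = error
    return results, failures
```

A real parser would still need the sleeps and the ban avoidance mentioned above inside `parse_one`. The collected `failures` are what tell you which page structures you have not handled yet.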

The overall complexity checklist:

  • Working with the network and html parsing, with iteration and search by ID.
  • Documents with heterogeneous structure.
  • There are many places where the code can easily fall over.
  • Parallel (||) code has to be written.
  • The necessary documentation, code examples and/or community are missing.
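One of the places where the code can easily fall over is the score conversion described earlier. A minimal defensive sketch follows. The function name is hypothetical, and the set of dash characters is an assumption about what a page might contain, not a list taken from Habr's markup.

```python
def parse_score(text):
    """Convert a scraped score string to int, or return None if there is no score."""
    if text is None:
        return None
    cleaned = text.strip()
    # Replace the en dash, em dash and Unicode minus sign with an ASCII hyphen.
    for dash in ("\u2013", "\u2014", "\u2212"):
        cleaned = cleaned.replace(dash, "-")
    try:
        return int(cleaned)  # int() already accepts a leading "+" or "-"
    except ValueError:
        return None  # empty or non-numeric text: the post has no usable score
```

Returning None instead of raising lets the caller count how many posts had no usable score, which is itself a useful measure of how dirty the data is.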

The estimated time for this task will be 3-5 times higher than for collecting data from Reddit.

Comparison of Odnoklassniki groups

Let's move on to the most technically interesting case of the three. For me it was interesting precisely because at first glance it looks quite trivial, but it turns out to be nothing of the kind as soon as you poke it with a stick.

Let's start with our complexity checklist and note that many of the items turn out to be much harder than they look at first:

  • There is an API, but it almost completely lacks the functions we need.
  • For certain functions you have to request access by mail, so access is not granted instantly.
  • It is terribly documented (to begin with, Russian and English terms are mixed everywhere, and completely inconsistently; sometimes you just have to guess what is wanted from you), and on top of that the design is not suited to obtaining data, for example the function we need.
  • The documentation requires a session, but the API does not actually use it, and there is no way to understand all the intricacies of the API modes other than poking around and hoping something works.
  • There are no examples and no community; the only point of support for collecting information is a small Python wrapper (without many usage examples).
  • Selenium looks like the most workable option, since much of the data we need is locked away.
    1) That is, authorization goes through a fictitious user (registered by hand).

    2) However, with Selenium there are no guarantees of correct and repeatable operation (at least in the case of ok.ru, for sure).

    3) The Ok.ru website contains JavaScript errors and sometimes behaves strangely and inconsistently.

    4) You need to do pagination, load elements, and so on...

    5) The API errors that the wrapper returns have to be handled awkwardly, for example like this (a piece of experimental code):

    def get_comments(args, context, discussions):
        pause = 1
        if args.extract_comments:
            all_comments = set()
            # makes sense to keep track of already processed discussions
            for discussion in tqdm(discussions):
                try:
                    comments = get_comments_from_discussion_via_api(context, discussion)
                except odnoklassniki.api.OdnoklassnikiError as e:
                    # a failed call yields no comments, so nothing stale is merged below
                    comments = set()
                    if "NOT_FOUND" not in str(e):
                        print(e)
                        bp()  # bp is defined elsewhere, presumably a debugger breakpoint
                all_comments |= comments
                time.sleep(pause)
            return all_comments
    

    My favorite error was:

    OdnoklassnikiError("Error(code: 'None', description: 'HTTP error', method: 'discussions.getComments', params: …)")

    6) In the end, Selenium + API looks like the most rational option.

  • You have to save state and restart the system, and handle many errors, including inconsistent behavior of the site. These errors are quite hard to imagine in advance (unless you write parsers professionally, of course).
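Saving state and restarting can be sketched as a checkpoint file holding the IDs that have already been processed. This is a minimal sketch under stated assumptions: the function names, the JSON format and the `process` callable are all hypothetical, and a real collector would also have to store the collected data itself.

```python
import json
import os


def load_done(path):
    """Return the set of already processed IDs, or an empty set on the first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def save_done(path, done):
    """Write the checkpoint via a temporary file so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def collect(ids, process, path):
    """Process each ID once, skipping those recorded in the checkpoint file."""
    done = load_done(path)
    for item_id in ids:
        if item_id in done:
            continue
        process(item_id)  # may raise; the checkpoint then keeps earlier progress
        done.add(item_id)
        save_done(path, done)
    return done
```

After a crash, calling `collect` again with the same path resumes from the first unprocessed ID instead of starting the whole walk over.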

The rough time estimate for this task will be 3-5 times higher than for collecting data from Habr. This is despite the fact that with Habr we use a head-on approach with HTML parsing, while with OK we can work with the API at the critical points.

Conclusions

No matter how hard you are pressed to estimate deadlines "on the spot" (we are planning today!) for a large module of a data processing pipeline, the execution time is almost never possible to estimate, even qualitatively, without analyzing the parameters of the task.

On a slightly more philosophical note, agile estimation strategies work well for engineering tasks, but they struggle with problems that are more experimental and, in a sense, "creative" and exploratory, that is, less predictable, as in the examples discussed here.

Of course, data collection is just a prime example. It usually looks like an incredibly simple and technically uncomplicated task, and the devil is often in the details. It is precisely on this task that we can show the whole range of what can go wrong and how long the work can really take.

If you glance at the characteristics of the task without further experiments, Reddit and OK look similar: there is an API and a python wrapper. In essence, though, the difference is huge. Judging by these parameters alone, parsing Habr looks more complicated than OK, but in practice it is quite the opposite, and this is exactly what simple experiments that check the parameters of the problem reveal.

In my experience, the most effective approach is to roughly estimate the time needed for the preliminary analysis itself, the first simple experiments and reading the documentation. This lets you give an accurate estimate for the whole job. In terms of the popular agile methodology, I ask for a ticket for "estimating task parameters", on the basis of which I can say what can be accomplished within the "sprint" and give a more accurate estimate for each task.

Therefore, the most effective argument seems to be one that shows a "non-technical" specialist how much the time and resources will vary depending on parameters that have yet to be assessed.


Source: www.habr.com
