What can go wrong with Data Science? Data collection

Today there are 100500 Data Science courses, and it has long been known that the most money in Data Science is made on Data Science courses (why dig when you can sell shovels?). The main drawback of these courses is that they have little to do with real work: nobody will hand you clean, processed data in the format you need. And when you leave the course and start solving a real problem, many nuances emerge.

So we are starting a series of notes, "What can go wrong with Data Science", based on real events that happened to me, my comrades and my colleagues. We will walk through typical Data Science tasks using real examples of how things actually happen. Let's start today with the data collection task.

The first thing people stumble over when they start working with real data is actually collecting the data that matters to them. The key message of this article:

We systematically underestimate the time, resources, and effort required to collect, clean, and prepare data.

And most importantly, we will discuss what to do to prevent this.

According to various estimates, cleaning, transformation, data processing, feature engineering, etc. take 80-90% of the time and analysis takes 10-20%, while almost all educational material focuses exclusively on analysis.

Let's take a simple analytical problem in three versions as a typical example and see what the "aggravating circumstances" are.

As the example, we will consider similar variations of one task, collecting data and comparing communities, for:

  1. Two Reddit subreddits
  2. Two sections of Habr
  3. Two groups on Odnoklassniki

The conditional approach in theory

Open the site and read the examples. If everything looks clear, set aside a few hours for reading and a few hours for writing code from the examples and debugging it. Add a few hours for the collection itself. Throw in a few hours in reserve (multiply by two and add N hours).

Key Point: The time estimate is based on assumptions and guesswork about how long it will take.

The time analysis has to start with estimating the following parameters for the conditional problem described above:

  • How large the data is and how much of it physically needs to be collected (*see below*).
  • How long it takes to collect one record, and how long you have to wait before collecting the next one.
  • Plan for code that saves state and restarts when (not if) everything fails.
  • Find out whether authorization is needed, and budget the time it takes to obtain access through the API.
  • Estimate the number of errors as a function of data complexity. Evaluate this for the specific task: the structure, how many transformations, what has to be extracted and how.
  • Budget for network errors and for problems with non-standard behavior of the project.
  • Check whether the functions you need are in the documentation, and if not, how a workaround can be built and how much it will cost.
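
The parameters above can be turned into a back-of-the-envelope calculation. The following is a minimal sketch, not a real planning tool: the `CollectionParams` fields, the "one extra attempt per failed request" rule, and all numbers are illustrative assumptions that are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class CollectionParams:
    """Hypothetical task parameters; all field names are illustrative."""
    n_records: int             # how many records must be collected
    seconds_per_record: float  # time to fetch and parse one record
    pause_seconds: float       # mandatory wait between two requests
    error_rate: float          # fraction of requests that fail and are retried once
    n_workers: int = 1         # parallel workers, if the site tolerates them


def estimate_hours(p: CollectionParams) -> float:
    """Rough wall-clock estimate of a full collection run, in hours."""
    per_record = p.seconds_per_record + p.pause_seconds
    # Assumption: each failed request costs exactly one extra attempt.
    attempts = p.n_records * (1 + p.error_rate)
    return attempts * per_record / p.n_workers / 3600


# Same task, two sets of assumed parameters.
optimistic = CollectionParams(100_000, 0.5, 0.5, 0.0)
realistic = CollectionParams(100_000, 0.5, 1.0, 0.1)
print(f"{estimate_hours(optimistic):.1f} h vs {estimate_hours(realistic):.1f} h")
```

Even in this toy model, a longer pause between requests and a 10% error rate move the estimate from roughly one day to roughly two, which is why the parameters have to be measured rather than guessed.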

Most importantly, to estimate the time you actually have to spend time and effort on "reconnaissance in force". Only then will your planning be adequate. So no matter how hard you are pushed to answer "how long does it take to collect the data", buy yourself some time for a preliminary analysis and explain how much the time will vary depending on the real parameters of the problem.

Now we will show specific examples in which these parameters change.

Key Point: The estimate is based on an analysis of the key factors that influence the scope and complexity of the work.

Guess-based estimation is a good approach when the functional elements are small enough and there are not many factors that can significantly affect how the problem is structured. In many Data Science problems, however, such factors become extremely numerous and this approach stops being adequate.

Comparison of Reddit communities

Let's start with the simplest case (as it turns out later). To be completely honest, this is an almost ideal case. Let's go through our complexity checklist:

  • There is a clean, clear and documented API.
  • It is extremely simple and, most importantly, the token is obtained automatically.
  • There is a Python wrapper with lots of examples.
  • There is a community that analyzes and collects data on Reddit (including, for example, YouTube videos explaining how to use the Python wrapper).
  • The methods we need most likely exist in the API. Moreover, the code looks compact and clean. Below is an example of a function that collects the comments on a post.

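import logging
from praw import Reddit

logger = logging.getLogger(__name__)
AGENT = "community-comparison-script"  # placeholder user agent, set your own

def get_comments(submission_id):
    # Credentials (client_id, client_secret) are assumed to come from praw.ini
    reddit = Reddit(check_for_updates=False, user_agent=AGENT)
    submission = reddit.submission(id=submission_id)
    # Expand "load more comments" stubs; returns the stubs left unexpanded
    more_comments = submission.comments.replace_more()
    if more_comments:
        skipped_comments = sum(x.count for x in more_comments)
        logger.debug('Skipped %d MoreComments (%d comments)',
                     len(more_comments), skipped_comments)
    return submission.comments.list()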

Taken from this collection of handy wrapper utilities.

Even though this is the best case, a number of important real-life factors still have to be taken into account:

  • API limits: we are forced to fetch the data in batches (sleeping between requests, etc.).
  • Collection time: for a complete analysis and comparison you have to set aside significant time just for the spider to walk through the subreddit.
  • The bot has to run on a server: you can't just launch it on your laptop, put the laptop in your backpack and go about your business. So I ran everything on a VPS. With the promo code habrahabr10 you can save another 10% of the cost.
  • Physical inaccessibility of some data (it is visible only to administrators or is too hard to collect). This has to be taken into account: in principle, not all data can be collected in adequate time.
  • Network errors: networking is a pain.
  • This is real, living data. It is never clean.
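
The first and fifth points above (API limits and network errors) usually end up as a pause between requests plus retries with a growing delay. The following is a minimal sketch under stated assumptions: `fetch` is a hypothetical callable standing in for whatever API call is being made, failures are assumed to surface as `ConnectionError`, and the delays are arbitrary. It is not taken from the article's code.

```python
import time


def fetch_with_backoff(fetch, item_id, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(item_id), retrying with exponential backoff on connection errors."""
    for attempt in range(max_retries):
        try:
            return fetch(item_id)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


def collect(fetch, item_ids, pause=1.0, sleep=time.sleep):
    """Fetch items one by one, pausing between requests to respect API limits."""
    results = {}
    for item_id in item_ids:
        results[item_id] = fetch_with_backoff(fetch, item_id, sleep=sleep)
        sleep(pause)
    return results


# Usage with a stand-in fetcher that fails once per id (no network involved).
_seen = set()


def flaky_fetch(item_id):
    if item_id not in _seen:
        _seen.add(item_id)
        raise ConnectionError("temporary failure")
    return {"id": item_id}


waits = []  # record the requested sleeps instead of actually sleeping
data = collect(flaky_fetch, [1, 2], pause=0.5, sleep=waits.append)
```

Passing `sleep` as a parameter is only a convenience for trying the logic without waiting; in a real run the default `time.sleep` is used, and those sleeps are exactly the time that has to be included in the estimate.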

Of course, these nuances have to be built into the development plan. The specific hours or days depend on development experience or on experience with similar tasks. Still, we can see that the task here is purely an engineering one and needs no extra contortions to solve: everything can be assessed quite well, scheduled and done.

Comparison of Habr sections

Let's move on to a more interesting and non-trivial case: comparing threads and/or sections of Habr.

Let's go through our complexity checklist. Here, to understand each point, you have to dig a little into the task itself and experiment.

  • At first you think there is an API, but there isn't. Yes, yes, Habr has an API, but it is not accessible to users (or maybe it does not work at all).
  • Then you simply start parsing HTML: "import requests", what could go wrong?
  • How do you parse it at all? The simplest and most commonly used approach is to iterate over IDs. Note that it is not the most efficient one and that it has to handle different cases. Here is an example of the density of real IDs among all existing ones.

    [Figure: density of real IDs among all existing IDs. Taken from this article.]

  • Raw data wrapped in HTML on top of the web is a pain. For example, you want to collect and store the rating of an article: you tore the score out of the HTML and decided to store it as a number for further processing:

    1) int(score) throws an error, because on Habr the minus in a line such as "–5" is an en dash, not a minus sign (unexpected, right?). So at some point I had to bring the parser back to life with this terrible fix.

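    try:
        # Habr renders negative scores with an en dash ("–5"), which int() rejects;
        # a leading "+" needs no replacement, int() accepts it as is
        score_txt = post.find(class_="score").text.replace("–", "-")
        score = int(score_txt)
        if check_date(date):
            post_score += score
    except (AttributeError, ValueError):
        # assumed handling: no score element, or text that is still not a number
        pass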
    

    There may be no date, no pluses and no minuses at all (as the check_date function above suggests, this did happen).

    2) Unescaped special characters: they will come, and you need to be prepared.

    3) The structure changes depending on the type of post.

    4) Old posts may have a **weird structure**.

  • In essence, errors and everything that may or may not happen have to be handled, and you cannot predict for sure what will go wrong, what else the structure may look like and what will fall off where. You just have to try, and take into account the errors the parser throws.
  • Then you realize that you need to parse in several threads, otherwise parsing in one thread takes 30+ hours (this is purely the execution time of an already working single-threaded parser that sleeps and does not get banned). In this article, this at some point led to a similar scheme:

[Figure: multi-threaded parsing scheme]
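
The score-parsing pitfalls listed above (a dash that is not a minus sign, a missing score) can be isolated in one small function. The following is a minimal sketch: `parse_score` is a hypothetical helper name, and treating the em dash and the Unicode minus sign as minus signs too is an assumption made here for safety, since the article only mentions the en dash.

```python
import re
from typing import Optional

# Dash-like characters that may stand in for a minus sign in scraped text.
_DASHES = "\u2013\u2014\u2212"  # en dash, em dash, Unicode minus sign


def parse_score(raw: Optional[str]) -> Optional[int]:
    """Return the score as an int, or None when it is missing or unparsable."""
    if raw is None:
        return None
    text = raw.strip()
    for dash in _DASHES:
        text = text.replace(dash, "-")
    # Accept an optional sign followed by digits, nothing else.
    if not re.fullmatch(r"[+-]?\d+", text):
        return None
    return int(text)
```

Returning `None` instead of raising keeps the decision about missing scores (skip the post, count it as zero, log it) with the caller.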

The complete complexity checklist:

  • Working with the network and parsing HTML, with iteration and lookup by ID.
  • Documents with a heterogeneous structure.
  • There are many places where the code can easily fall over.
  • Parallel (||) code has to be written.
  • The necessary documentation, code examples, and/or community are missing.
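
Iterating over IDs in parallel while surviving empty IDs and broken pages can be sketched with the standard library. This is an illustration under stated assumptions, not the scheme from the article: `fetch_post` is a hypothetical callable that downloads and parses one page, returns `None` for an ID with no post behind it, and raises when the page has an unexpected structure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_ids(fetch_post, ids, max_workers=8):
    """Fetch many post IDs in parallel; return (posts, failed_ids)."""
    posts, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_post, i): i for i in ids}
        for future in as_completed(futures):
            post_id = futures[future]
            try:
                post = future.result()
            except Exception:
                failed.append(post_id)  # keep the ID for a later retry
                continue
            if post is not None:  # None means "no post behind this ID"
                posts[post_id] = post
    return posts, sorted(failed)


# Stand-in fetcher: every third ID is empty, ID 5 has a broken structure.
def fake_fetch(post_id):
    if post_id % 3 == 0:
        return None
    if post_id == 5:
        raise ValueError("unexpected page structure")
    return {"id": post_id}


posts, failed = crawl_ids(fake_fetch, range(1, 8))
```

Collecting failed IDs instead of stopping on the first exception means one oddly structured old post does not cost the whole multi-hour run.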

The estimated time for this task will be 3-5 times higher than for collecting data from Reddit.

Comparison of Odnoklassniki groups

Let's move on to the most technically interesting case of the three. For me it was interesting precisely because at first glance it looks quite trivial, but it turns out to be nothing of the kind as soon as you poke it with a stick.

Let's start with our complexity checklist and note that many of its items turn out to be much harder than they look at first:

  • There is an API, but it has almost none of the functions we need.
  • For certain functions you have to request access by mail, that is, access is not granted instantly.
  • It is terribly documented (to begin with, Russian and English terms are mixed everywhere, and completely inconsistently, so sometimes you simply have to guess what is wanted from you) and, moreover, the design is not suited to getting data, for example, the function we need.
  • The documentation requires a session, but the session is not actually used, and there is no way to understand all the intricacies of the API modes other than poking around and hoping that something works.
  • There are no examples and no community. The only point of support when gathering information is a small Python wrapper (without many usage examples).
  • Selenium seems to be the most workable option, since much of the data we need is locked away.
    1) That is, authorization goes through a fictitious user (registered by hand).

    2) However, with Selenium there are no guarantees of correct and repeatable operation (at least in the case of ok.ru, for sure).

    3) The Ok.ru website contains JavaScript errors and sometimes behaves strangely and inconsistently.

    4) You need to handle pagination, loading of elements, etc.

    5) API errors returned by the wrapper have to be handled awkwardly, for example like this (a piece of experimental code):

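    import time
    from tqdm import tqdm
    # odnoklassniki, bp and get_comments_from_discussion_via_api come from
    # the rest of the experimental script and are not shown here

    def get_comments(args, context, discussions):
        pause = 1  # seconds to sleep between API calls
        if args.extract_comments:
            all_comments = set()
            # makes sense to keep track of already processed discussions
            for discussion in tqdm(discussions):
                try:
                    comments = get_comments_from_discussion_via_api(context, discussion)
                except odnoklassniki.api.OdnoklassnikiError as e:
                    if "NOT_FOUND" in str(e):
                        comments = set()
                    else:
                        print(e)
                        bp()  # presumably a breakpoint helper for inspecting unknown errors
                        continue  # do not merge stale comments from the previous iteration
                all_comments |= comments
                time.sleep(pause)
            return all_comments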
    

    My favorite error was:

    OdnoklassnikiError("Error(code: 'None', description: 'HTTP error', method: 'discussions.getComments', params: …)")

    6) In the end, Selenium + API looks like the most rational option.

  • The state has to be saved and the system restarted, and many errors have to be handled, including inconsistent behavior of the site. These errors are quite hard to imagine in advance (unless you write parsers professionally, of course).
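
Saving state and restarting, as the last point requires, can be as simple as a file with the IDs that are already done. The following is a minimal sketch under stated assumptions: the JSON file layout, the function names and the "save after every item" policy are illustrative choices, not the author's implementation.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the set of already processed IDs, or an empty set on first run."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def save_state(path, done):
    """Write the state atomically so a crash mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)  # atomic rename over the old state file


def run(ids, process, state_path):
    """Process IDs, skipping those finished in a previous (possibly crashed) run."""
    done = load_state(state_path)
    for item_id in ids:
        if item_id in done:
            continue
        process(item_id)
        done.add(item_id)
        save_state(state_path, done)
    return done


# Usage: the first run crashes on item 3, the second run resumes from there.
state_file = os.path.join(tempfile.mkdtemp(), "state.json")
first_run, second_run = [], []


def crashing(item_id):
    if item_id == 3:
        raise RuntimeError("site misbehaved")
    first_run.append(item_id)


try:
    run([1, 2, 3, 4], crashing, state_file)
except RuntimeError:
    pass  # simulated crash; the state for items 1 and 2 is already on disk

done = run([1, 2, 3, 4], second_run.append, state_file)
```

Writing the state after every item is slow for large runs; saving every N items is a common trade-off, at the cost of redoing at most N items after a crash.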

The conditional time estimate for this task will be 3-5 times higher than for collecting data from Habr, even though in the case of Habr we use a head-on approach with HTML parsing, while in the case of OK we can work with the API in the critical places.

Conclusions

No matter how strongly you are asked to estimate deadlines "on the spot" (we are planning today!) for a sizeable module of a data processing pipeline, the execution time can almost never be estimated, even qualitatively, without analyzing the parameters of the task.

On a slightly more philosophical note, agile estimation strategies work well for engineering tasks, but they struggle with problems that are more experimental and, in a sense, "creative" and exploratory, that is, less predictable, as in the examples discussed here.

Of course, data collection is just a prime example. It is usually an incredibly simple and technically uncomplicated task, and the devil is often in the details. It is precisely on this task that we can show the whole range of ways in which things can go wrong and how long the work can actually take.

If you glance at the characteristics of the task without additional experiments, Reddit and OK look alike: there is an API and a Python wrapper. In reality the difference is huge. Judging by the same characteristics, Habr looks more complicated than OK, but in practice it is exactly the opposite, and this is precisely what simple experiments that probe the parameters of the problem can reveal.

In my experience, the most effective approach is to roughly estimate the time you need for the preliminary analysis itself, for simple first experiments and for reading the documentation. This lets you give an accurate estimate for the work as a whole. In terms of the popular agile methodology, I ask for a ticket for "estimating the task parameters", on the basis of which I can assess what can be accomplished within the "sprint" and give a more accurate estimate for each task.

Therefore, the most effective argument seems to be one that shows a "non-technical" specialist how much the time and resources will vary depending on parameters that are yet to be assessed.


Source: www.habr.com
