What Can Go Wrong with Data Science? Data Collection

Today there are 100500 Data Science courses, and it has long been known that the most money in Data Science is made from Data Science courses (why dig when you can sell shovels?). The main drawback of these courses is that they have nothing to do with real work: nobody will hand you clean data already processed into the format you need. And when you leave the course and start solving a real problem, plenty of nuances surface.

So we are starting a series of notes, "What can go wrong with Data Science", based on real events that happened to me, my friends, and my colleagues. We will work through typical Data Science tasks using real examples: how things actually happen. Let's start today with the data collection task.

The first thing people stumble over when they start working with real data is actually collecting the data that matters most to us. The key message of this article:

We systematically underestimate the time, resources, and effort required to collect, clean, and prepare data.

And most importantly, we will discuss what can be done to prevent this.

According to various estimates, cleaning, transformation, data processing, feature engineering, and so on take 80-90% of the time, and analysis takes 10-20%, while almost all educational material focuses exclusively on analysis.

Let's take a simple analytical problem in three versions as a typical example and see what the "aggravating circumstances" are.

As the example, we will consider similar variations of one task, collecting data and comparing communities, for:

  1. Two Reddit subreddits
  2. Two Habr sections
  3. Two Odnoklassniki groups

The hypothetical approach in theory

Open the site and read the examples. If everything looks clear, set aside a few hours for reading and a few hours for writing code from the examples and debugging it. Add a few hours for the collection itself. Throw in a few hours of reserve (multiply by two and add N hours).

Key point: the time estimate rests on assumptions and guesses about how long things will take.

The time analysis should instead begin by estimating the following parameters for the hypothetical problem described above:

  • What is the size of the data, and how much of it physically has to be collected (see below).
  • How long does it take to collect one record, and how long must you wait before you can collect the next?
  • Plan for code that saves state and restarts when (not if) everything falls over.
  • Find out whether authorization is needed, and budget the time required to obtain API access.
  • Estimate the number of errors as a function of data complexity. Evaluate this for the specific task: the structure, how many transformations there are, what to extract and how.
  • Account for network errors and for non-standard behavior of the project.
  • Check whether the functions you need are in the documentation and, if not, how much effort a workaround will take.
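The first few parameters in the list above already allow a back-of-the-envelope calculation. The sketch below is only an illustration: `CollectionPlan`, its field names, and the numbers in the usage example are hypothetical and not taken from any of the projects described here.

```python
from dataclasses import dataclass


@dataclass
class CollectionPlan:
    """Hypothetical inputs for a rough collection-time estimate."""
    n_records: int              # how many records must be fetched
    seconds_per_request: float  # measured time to fetch one record
    pause_seconds: float        # mandatory wait between requests (rate limit)
    failure_rate: float = 0.0   # share of requests that fail and are retried once
    workers: int = 1            # parallel workers, if the site tolerates them


def estimate_hours(plan: CollectionPlan) -> float:
    """Return a rough wall-clock estimate in hours."""
    per_record = plan.seconds_per_request + plan.pause_seconds
    # Assumption: every failed request costs exactly one extra attempt.
    attempts = plan.n_records * (1 + plan.failure_rate)
    return attempts * per_record / plan.workers / 3600


if __name__ == "__main__":
    plan = CollectionPlan(n_records=100_000, seconds_per_request=0.5,
                          pause_seconds=1.0, failure_rate=0.05)
    print(f"{estimate_hours(plan):.1f} h")
```

With these made-up numbers the estimate comes to roughly 44 hours for a single worker. The value of such a formula is less in the number itself than in showing how strongly the result depends on inputs that can only be measured by experiment.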

The most important thing is that to estimate the time you actually have to spend time and effort on "reconnaissance in force". Only then will your planning be adequate. So no matter how hard you are pushed to say "how long will it take to collect the data", buy yourself time for a preliminary analysis and argue your case by showing how much the time will vary depending on the real parameters of the problem.

Now we will walk through concrete examples in which these parameters change.

Key point: the estimate is based on an analysis of the key factors that affect the scope and complexity of the work.

Guess-based estimation is a good approach when the functional elements are small enough and there are few factors that can significantly affect the shape of the problem. But for many Data Science problems such factors become extremely numerous, and the approach stops being adequate.

Comparing Reddit communities

Let's start with the simplest case (as it turns out later). To be completely honest, this is a nearly ideal case. Let's go through our complexity checklist:

  • There is a neat, clear, documented API.
  • It is extremely simple and, most importantly, a token is obtained automatically.
  • There is a Python wrapper with plenty of examples.
  • There is a community that analyzes and collects data on Reddit (down to YouTube videos explaining how to use the Python wrapper).
  • The methods we need are available in the API. Moreover, the code looks compact and clean; below is an example of a function that collects the comments on a post.

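import logging

from praw import Reddit

logger = logging.getLogger(__name__)
AGENT = "community-comparison-script"  # placeholder value; the original does not show it


def get_comments(submission_id):
    """Return a flat list of all comments on a submission."""
    reddit = Reddit(check_for_updates=False, user_agent=AGENT)
    submission = reddit.submission(id=submission_id)
    # replace_more() expands "load more comments" stubs and returns the ones it did not expand
    more_comments = submission.comments.replace_more()
    if more_comments:
        skipped_comments = sum(x.count for x in more_comments)
        logger.debug('Skipped %d MoreComments (%d comments)',
                     len(more_comments), skipped_comments)
    return submission.comments.list()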

Taken from a collection of handy wrapper utilities.

Even though this is the best case, several important real-life factors still have to be taken into account:

  • API limits: we are forced to fetch data in batches (sleeping between requests, and so on).
  • Collection time: for a complete analysis and comparison you have to set aside significant time just for the spider to crawl through the subreddit.
  • The bot must run on a server. You cannot just start it on your laptop, put the laptop in your backpack, and go about your business. So I ran everything on a VPS. Using the promo code habrahabr10 you can save another 10% of the cost.
  • Some data is physically inaccessible (visible only to administrators, or too hard to collect). This has to be taken into account; in principle, not all data can be collected in a reasonable time.
  • Network errors: networking is a pain.
  • This is live, real data. It is never clean.
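The first and fifth points above (API limits and network errors) are usually handled together by retrying with increasing pauses. The following is a minimal sketch, not the code used in the project: `fetch_with_retry` and its parameters are hypothetical names, and the caller supplies `fetch`, a zero-argument function that performs one request.

```python
import random
import time


def fetch_with_retry(fetch, *, attempts=5, base_delay=1.0, max_delay=60.0,
                     retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fetch() and retry on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps several workers from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` argument exists only so that the pauses can be replaced in tests. Which exceptions count as transient depends on the HTTP client in use, so `retry_on` is an assumption to adjust.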

Of course, these nuances have to be factored into the development estimate. The specific hours or days depend on development experience or on experience with similar tasks. Still, we can see that the task here is purely an engineering one and needs no extra maneuvering to solve: everything can be estimated, scheduled, and done quite well.

Comparing Habr sections

Let's move on to a more interesting and non-trivial case: comparing threads and/or sections of Habr.

Let's go through our complexity checklist. Here, to understand each point, you have to dig into the task a little and experiment.

  • At first you think there is an API, but there is not. Yes, yes, Habr has an API, but it is not available to users (or maybe it does not work at all).
  • Then you simply start parsing HTML: "import requests", what could go wrong?
  • How do you parse at all? The simplest and most commonly used approach is to iterate over IDs. Note that it is not the most efficient one and that it has to handle various cases. Here is an example of the density of real IDs among all possible ones.

    [Figure: density of real IDs among all possible IDs]
    Taken from this article.

  • Raw data wrapped in HTML on top of the web is a pain. For example, you want to collect and store an article's rating: you pull the score out of the HTML and decide to store it as a number for further processing:

    1) int(score) throws an error: on Habré the minus in a string such as "–5" is an en dash, not a minus sign (unexpected, right?), so at some point I had to bring the parser back to life with this ugly fix.

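    try:
        # Habr renders negative ratings with an en dash, which int() rejects
        score_txt = post.find(class_="score").text.replace(u"–","-").replace(u"+","+")
        score = int(score_txt)
        if check_date(date):
            post_score += score
    except (AttributeError, ValueError):
        # no score element, or text that is still not a number: skip this post
        pass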
    

    There may be no date, pluses, or minuses at all (as the check_date function above shows, this did happen).

    2) Unescaped special characters: they will show up, so be prepared.

    3) The structure changes depending on the type of post.

    4) Old posts may have a weird structure.

  • In essence, you have to handle errors and everything that may or may not happen, and you cannot predict for sure what will go wrong, what other structures will turn up, or what will break where. You just have to try and account for the errors the parser throws.
  • Then you realize that you need to parse in multiple threads, otherwise parsing in a single one takes 30+ hours (this is purely the run time of an already working single-threaded parser that sleeps between requests and does not get banned). In this article, this at some point led to a scheme like the following:

[Figure: diagram of the resulting parallel parsing scheme]
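Iterating over IDs in several threads, as described above, can be sketched with the standard library alone. This is not the scheme from the referenced article, just a minimal illustration: `crawl_ids` is a hypothetical name, and `fetch` is a caller-supplied function assumed to return the page for an ID, return None when no post exists under that ID, and raise on network trouble.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_ids(ids, fetch, max_workers=8):
    """Fetch every ID in parallel; return (found, missing, failed)."""
    found, missing, failed = {}, [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, post_id): post_id for post_id in ids}
        for future in as_completed(futures):
            post_id = futures[future]
            try:
                page = future.result()
            except Exception as exc:  # keep going; record the failure for a later retry
                failed[post_id] = repr(exc)
                continue
            if page is None:          # convention here: None means "no such post"
                missing.append(post_id)
            else:
                found[post_id] = page
    return found, sorted(missing), failed
```

Keeping missing IDs and failed IDs apart matters because the first are expected when the ID space is sparse, while the second should be retried. The number of workers is limited by what the site tolerates before banning, which again can only be found out by experiment.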

The full complexity checklist:

  • Working with the network and parsing HTML, iterating and searching by ID.
  • Documents with heterogeneous structure.
  • There are many places where the code can easily fall over.
  • Parallel (||) code has to be written.
  • The necessary documentation, code examples, and/or community are missing.
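One of the fragile spots listed above, turning a rating string into a number, can be isolated into a small helper so that the rest of the parser never sees the odd characters. The sketch below is an assumption-laden illustration: `parse_score` is a hypothetical name, and the set of dash characters is a guess at what a site might emit, not a list confirmed for Habr beyond the en dash mentioned earlier.

```python
import re

# Characters that may appear in place of an ASCII hyphen-minus:
# minus sign, en dash, em dash, figure dash, Unicode hyphen.
_DASHES = "\u2212\u2013\u2014\u2012\u2010"
_TO_ASCII = {ord(ch): "-" for ch in _DASHES}
_TO_ASCII[ord("\uff0b")] = "+"  # fullwidth plus sign


def parse_score(text):
    """Return the rating as an int, or None when the text holds no number."""
    if text is None:
        return None
    cleaned = text.translate(_TO_ASCII).replace("\u00a0", "").strip()
    match = re.fullmatch(r"[+-]?\d+", cleaned)
    return int(match.group()) if match else None


if __name__ == "__main__":
    for raw in ["\u20135", "+12", " 7 ", "\u2014", None]:
        print(repr(raw), parse_score(raw))
```

Returning None instead of raising lets the caller decide whether a missing rating means "skip the post" or "log and investigate".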

The estimated time for this task will be 3-5 times higher than for collecting data from Reddit.

Comparing Odnoklassniki groups

Let's move on to the most technically interesting case described here. For me it was interesting precisely because at first glance it looks quite trivial, but it turns out to be nothing of the sort as soon as you poke it with a stick.

Let's start with our complexity checklist and note that many of the items turn out to be much harder than they look at first:

  • There is an API, but it lacks almost all of the functions we need.
  • For certain functions you have to request access by email, so access is not granted instantly.
  • It is terribly documented (to begin with, Russian and English terms are mixed everywhere, and completely inconsistently; sometimes you just have to guess what they want from you) and, moreover, its design is poorly suited to fetching data, for example through the function we need.
  • The documentation requires a session, but the API does not actually use one, and there is no way to understand all the intricacies of the API modes other than poking around and hoping that something works.
  • There are no examples and no community; the only foothold for collecting information is a small Python wrapper (without many usage examples).
  • Selenium looks like the most workable option, since much of the data we need is locked away.
    1) That is, authorization goes through a fictitious user (registered by hand).

    2) However, with Selenium there are no guarantees of correct and repeatable operation (at least not in the case of ok.ru).

    3) The ok.ru website contains JavaScript errors and sometimes behaves strangely and inconsistently.

    4) You have to handle pagination, loading of elements, and so on.

    5) The API errors that the wrapper returns have to be handled awkwardly, for example like this (a piece of experimental code):

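    import time
    from pdb import set_trace as bp  # assumption: bp() in the original is a debugger breakpoint

    import odnoklassniki.api
    from tqdm import tqdm


    def get_comments(args, context, discussions):
        """Collect comments for every discussion, pausing between API calls."""
        pause = 1  # seconds to sleep between requests
        all_comments = set()
        if args.extract_comments:
            # makes sense to keep track of already processed discussions
            for discussion in tqdm(discussions):
                try:
                    # get_comments_from_discussion_via_api is the author's helper (not shown)
                    comments = get_comments_from_discussion_via_api(context, discussion)
                except odnoklassniki.api.OdnoklassnikiError as e:
                    if "NOT_FOUND" in str(e):
                        comments = set()
                    else:
                        print(e)
                        bp()  # stop in the debugger to inspect the unexpected error
                        continue  # do not merge a stale or undefined `comments`
                all_comments |= comments
                time.sleep(pause)
        return all_comments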
    

    My favorite error was:

    OdnoklassnikiError("Error(code: 'None', description: 'HTTP error', method: 'discussions.getComments', params: …)")

    6) In the end, Selenium + API looks like the most rational option.

  • You have to save state and restart the system, and handle many errors, including inconsistent behavior of the site. These errors are quite hard to imagine in advance (unless you write parsers professionally, of course).
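Saving state so that a restart continues where the previous run stopped can be as simple as a file with the IDs already processed. The sketch below is a minimal illustration under stated assumptions: `load_done`, `save_done`, and `collect` are hypothetical names, IDs are assumed to be JSON-serializable and sortable, and `process` is a caller-supplied function that fetches and stores one item.

```python
import json
import os


def load_done(path):
    """Return the set of IDs already processed (empty on the first run)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return set(json.load(fh))


def save_done(path, done):
    """Write the checkpoint via a temp file so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)


def collect(ids, process, path, save_every=100):
    """Process ids not yet in the checkpoint; return how many were handled."""
    done = load_done(path)
    handled = 0
    try:
        for item_id in ids:
            if item_id in done:
                continue  # finished in an earlier run
            process(item_id)
            done.add(item_id)
            handled += 1
            if handled % save_every == 0:
                save_done(path, done)
    finally:
        # Runs on success and on exceptions alike, so progress survives a crash.
        save_done(path, done)
    return handled
```

After a failure, calling `collect` again with the same checkpoint path skips everything that was already processed. A hard kill of the process can still lose up to `save_every` items of progress, which is the trade-off for not writing the file after every record.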

The rough time estimate for this task will be 3-5 times higher than for collecting data from Habr, even though in the Habr case we use a brute-force approach with HTML parsing, while in the OK case we can work with the API in the critical places.

Conclusions

No matter how hard you are pressed to estimate deadlines "on the spot" (we are planning today!) for a sizable module of a data processing pipeline, the execution time is almost never possible to estimate, even qualitatively, without analyzing the parameters of the task.

On a slightly more philosophical note, conventional estimation strategies work well for engineering tasks, but they struggle with problems that are more experimental and, in a sense, "creative" and exploratory, that is, less predictable, as in the examples of similar tasks discussed here.

Of course, data collection is just a prime example. It is usually an incredibly simple and technically uncomplicated task, and the devil is often in the details. It is precisely on this task that we can show the whole range of things that can go wrong and how long the work can actually take.

If you glance at the characteristics of the task without additional experiments, Reddit and OK look alike: there is an API and a Python wrapper, yet in reality the difference is huge. Judging by these parameters, parsing Habr looks harder than OK, but in practice it is the other way around, and this is exactly what simple experiments that analyze the parameters of the problem reveal.

In my experience, the most effective approach is to roughly estimate the time you will need for the preliminary analysis itself: the first simple experiments and reading the documentation. These let you give an accurate estimate for the whole job. In terms of the popular agile methodology, I ask for a ticket called "estimate task parameters" to be created, on the basis of which I can assess what can be done within the sprint and give a more accurate estimate for each task.

Therefore, the most effective argument seems to be one that shows a "non-technical" specialist how much the time and resources will vary depending on parameters that have yet to be assessed.


Source: www.habr.com
