What can go wrong with Data Science? Data collection

Today there are 100500 Data Science courses, and it has long been known that the most money in Data Science is made by teaching Data Science courses (why dig when you can sell shovels?). The main drawback of these courses is that they have little to do with real work: nobody will hand you clean, processed data in the format you need. And when you finish the course and start solving a real problem, a lot of nuances surface.

That is why we are starting a series of notes, "What can go wrong with Data Science", based on real events that happened to me, my friends, and my colleagues. We will examine typical Data Science tasks using real examples: how things actually happen. Let's start today with the task of data collection.

The first thing people stumble over when they start working with real data is actually collecting the data that is relevant to us. The key message of this article:

We systematically underestimate the time, resources, and effort needed to collect, clean, and prepare data.

And most importantly, we will discuss what to do to prevent this.

According to various estimates, cleaning, transformation, data processing, feature engineering, and so on take 80-90% of the time, and analysis 10-20%, while almost all educational material focuses exclusively on analysis.

Let's take a simple analytical problem in three variants as a typical example and see what the "aggravating circumstances" look like.

As the example, we will consider similar variations of the task of collecting data and comparing communities for:

  1. Two Reddit subreddits
  2. Two sections of Habr
  3. Two Odnoklassniki groups

The rough approach, in theory

Open the site and read the examples; if everything looks clear, set aside a few hours for reading, and a few hours for writing code from the examples and debugging it. Add a few hours for the collection itself. Throw in a few hours of reserve (multiply by two and add N hours).

Key point: the time estimate is based on assumptions and guesses about how long it will take.

The time analysis should start by estimating the following parameters for the problem described above:

  • What is the size of the data, and how much of it physically has to be collected (see below)?
  • How long does it take to collect one record, and how long do you have to wait before you can collect the next one?
  • Budget for writing code that saves state and restarts when (not if) everything fails.
  • Figure out whether we need authorization, and budget time for getting access to the API.
  • Budget for the number of errors as a function of data complexity, estimated for the specific task: the structure, how many transformations, what to extract and how.
  • Budget for network errors and problems with non-standard behavior of the project.
  • Check whether the required functions are in the documentation and, if not, how a workaround can be built and how much effort it needs.
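The first parameters in this checklist can be turned into a rough number. Below is a minimal sketch, not a method from the article: the function name, its parameters, and the sample figures are all assumptions chosen for illustration.

```python
def estimate_collection_hours(n_records, seconds_per_record, wait_seconds,
                              expected_restarts=0, restart_overhead_hours=0.0,
                              workers=1):
    """Rough wall-clock estimate, in hours, for collecting n_records."""
    if n_records < 0 or workers < 1:
        raise ValueError("n_records must be >= 0 and workers >= 1")
    # Each record costs its fetch time plus the pause before the next request.
    per_record = seconds_per_record + wait_seconds
    pure_hours = n_records * per_record / workers / 3600
    # Crashes are expected ("when, not if"), so budget the restarts explicitly.
    return pure_hours + expected_restarts * restart_overhead_hours


# Hypothetical numbers: 100,000 records, 0.5 s per fetch, 1 s pause between requests.
print(round(estimate_collection_hours(100_000, 0.5, 1.0), 1))  # about 41.7 hours
```

Even this toy calculation shows why the per-record time and the mandatory pause have to be measured rather than guessed: doubling either one moves the estimate by whole days.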

The most important thing is that to estimate the time you actually need to spend time and effort on "reconnaissance in force"; only then will your planning be adequate. So no matter how hard you are pushed to say "how long will data collection take", buy yourself some time for a preliminary analysis, and justify it by how much the time will vary depending on the real parameters of the problem.

Now we will show specific examples in which these parameters change.

Key point: the estimate is based on an analysis of the key factors that affect the scope and complexity of the work.

Guess-based estimation is a good approach when the functional elements are small enough and there are not many factors that can significantly affect the structure of the problem. But for many Data Science problems there are a great many such factors, and this approach becomes inadequate.

Comparing Reddit communities

Let's start with the simplest case (as it turns out later). To be completely honest, this is an almost ideal case; let's go through our complexity checklist:

  • There is a neat, clear, and documented API.
  • It is very simple and, most importantly, a token is obtained automatically.
  • There is a python wrapper, with plenty of examples.
  • There is a community that analyzes and collects data on reddit (right down to YouTube videos explaining how to use the python wrapper), for example.
  • The methods we need most likely exist in the API. Moreover, the code looks compact and clean; below is an example of a function that collects the comments on a post.

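import logging

from praw import Reddit

logger = logging.getLogger(__name__)
# AGENT (the user-agent string) is defined elsewhere in the original module.


def get_comments(submission_id):
    reddit = Reddit(check_for_updates=False, user_agent=AGENT)
    submission = reddit.submission(id=submission_id)
    # Expand "load more comments" stubs; the ones left unexpanded are returned.
    more_comments = submission.comments.replace_more()
    if more_comments:
        skipped_comments = sum(x.count for x in more_comments)
        logger.debug('Skipped %d MoreComments (%d comments)',
                     len(more_comments), skipped_comments)
    return submission.comments.list()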

Taken from this collection of handy utilities for the wrapper.

Even though this is the best case, it is still worth taking into account a number of important real-life factors:

  • API limits: we are forced to take the data in batches (sleeping between requests, and so on).
  • Collection time: for a full analysis and comparison, you have to set aside significant time just for the spider to walk through the subreddit.
  • The bot has to run on a server: you cannot just start it on your laptop, put the laptop in your backpack, and go about your business. So I ran everything on a VPS. With the promo code habrahabr10 you can save another 10% of the cost.
  • Some data is physically inaccessible (it is visible only to admins or is too hard to collect). This has to be taken into account; in principle, not all data can be collected in a reasonable time.
  • Network errors: working with the network is a pain.
  • This is live, real data: it is never clean.
Of course, these nuances have to be factored into development. The specific hours or days depend on development experience or on experience with similar tasks; nevertheless, we can see that the task here is purely an engineering one and needs no extra contortions to solve: everything can be estimated, scheduled, and done quite well.

Comparing Habr sections

Let's move on to a more interesting and non-trivial case: comparing flows and/or sections of Habr.

Let's go through our complexity checklist. Here, to understand each item, you already have to poke at the task itself a little and experiment.

  • At first you think there is an API, but there isn't. Yes, Habr does have an API, but it is not available to users (or maybe it does not work at all).
  • Then you simply start parsing html: "import requests", what could go wrong?
  • How do you parse it at all? The simplest and most commonly used approach is to iterate over IDs. Note that it is not the most efficient one, and different cases have to be handled; here, for example, is the density of real IDs among all possible ones.

    [Figure: density of real IDs among all possible IDs.]
    Taken from this article.

  • Raw data wrapped in HTML on top of the network is a pain. For example, you want to collect and save the rating of an article: you rip the score out of the html and decide to save it as a number for further processing:

    1) int(score) throws an error: on Habr the minus, as in the string "–5", is an en dash, not a minus sign (unexpected, right?), so at some point I had to bring the parser back to life with this ugly fix.

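    try:
        # Habr renders negative scores with an en dash ("–5"), which int() rejects,
        # so normalize it to an ASCII hyphen-minus first.
        score_txt = post.find(class_="score").text.replace(u"–", "-").replace(u"+", "+")
        score = int(score_txt)
        if check_date(date):
            post_score += score
    except (AttributeError, ValueError):
        # No score element, or text that still is not a number: skip this post.
        pass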
    

    The date, pluses, and minuses may be missing altogether (as the check_date function above suggests, that happened too).

    2) Unescaped special characters: they will come, so you need to be prepared.

    3) The structure changes depending on the type of post.

    4) Old posts can have a strange structure.

  • In essence, you have to handle errors and everything that may or may not happen, and you cannot predict for sure what will go wrong, what other shapes the structure can take, or what will fall off where. You just have to try and account for the errors the parser throws.
  • Then you realize that you need to parse in several threads, otherwise parsing in a single thread will take 30+ hours (this is purely the run time of an already working single-threaded parser that sleeps and does not fall under any bans). In this article, that at some point led to a scheme like this:

[Figure: scheme of the resulting multithreaded parser.]
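Iterating over IDs in several threads, as described above, could be sketched roughly as follows. This is not the scheme from the referenced article: `fetch` is a hypothetical function that downloads and parses one post and returns None for IDs that do not exist, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_ids(fetch, ids, workers=8):
    """Fetch many post IDs in parallel threads.

    Returns (posts, errors): parsed posts by ID and raised exceptions by ID.
    """
    posts, errors = {}, {}

    def safe_fetch(post_id):
        # Catch everything: one malformed page must not kill the whole run.
        try:
            return post_id, fetch(post_id), None
        except Exception as exc:
            return post_id, None, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for post_id, post, exc in pool.map(safe_fetch, ids):
            if exc is not None:
                errors[post_id] = exc      # keep the error for later inspection
            elif post is not None:
                posts[post_id] = post      # None means "no such ID", so skip it
    return posts, errors
```

Collecting the exceptions instead of raising them reflects the point made in the list: you cannot predict what will break, so the failures are worth recording and reviewing after the run.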

The complexity checklist in total:

  • Working with the network and html parsing, with iteration and brute-force search over IDs.
  • Documents with a heterogeneous structure.
  • There are many places where the code can easily crash.
  • Parallel (||) code has to be written.
  • The necessary documentation, code examples, and/or community are missing.
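One of the "places where the code can easily crash" is converting a scraped rating to a number. A defensive variant might look like the sketch below; the function name and the set of dash characters handled are assumptions, not something confirmed for Habr's markup beyond the en dash case described in this article.

```python
def parse_score(text):
    """Convert a scraped rating string to int, or return None if it is not a number."""
    if text is None:
        return None
    # Map dash-like characters to the ASCII hyphen-minus that int() understands.
    for dash in ("\u2013", "\u2014", "\u2212"):  # en dash, em dash, minus sign
        text = text.replace(dash, "-")
    try:
        return int(text.strip())
    except ValueError:
        return None  # empty or unexpected text: report "no score" instead of crashing
```

Returning None rather than raising lets the caller decide whether a missing score means "skip the post" or "log and investigate".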

The rough time estimate for this task will be 3-5 times higher than for collecting data from Reddit.

Comparing Odnoklassniki groups

Let's move on to the most technically interesting case of those described. For me it was interesting precisely because at first glance it looks quite trivial, but it turns out to be nothing of the sort as soon as you poke it with a stick.

Let's start with our complexity checklist and note that many of the items turn out to be much harder than they look at first:

  • There is an API, but it almost completely lacks the functions we need.
  • For certain functions you have to request access by mail, that is, access is not granted instantly.
  • It is terribly documented (to begin with, Russian and English terms are mixed everywhere, and completely inconsistently; sometimes you just have to guess what is wanted from you) and, moreover, its design is not suited to retrieving data, for example for the function we need.
  • The documentation requires a session, but in fact it is not used, and there is no way to understand all the intricacies of the API modes other than poking around and hoping that something works.
  • There are no examples and no community; the only point of support in collecting information is a small wrapper in Python (without many usage examples).
  • Selenium seems to be the most workable option, since much of the data we need is locked away.
    1) That is, authorization goes through a dummy user (registered by hand).

    2) However, with Selenium there are no guarantees of correct and repeatable operation (at least in the case of ok.ru, for sure).

    3) The Ok.ru website contains JavaScript errors and sometimes behaves strangely and inconsistently.

    4) You have to deal with pagination, loading of elements, and so on...

    5) API errors returned by the wrapper have to be handled with crutches, for example like this (a piece of experimental code):

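    import time

    import odnoklassniki.api
    from tqdm import tqdm


    def get_comments(args, context, discussions):
        pause = 1  # seconds to sleep between API calls
        all_comments = set()
        if not args.extract_comments:
            return all_comments
        # It makes sense to keep track of already processed discussions.
        for discussion in tqdm(discussions):
            try:
                # get_comments_from_discussion_via_api is defined elsewhere in the project.
                comments = get_comments_from_discussion_via_api(context, discussion)
            except odnoklassniki.api.OdnoklassnikiError as e:
                if "NOT_FOUND" in str(e):
                    comments = set()  # the discussion no longer exists
                else:
                    print(e)
                    breakpoint()  # stop and inspect the unexpected error by hand
                    comments = set()  # avoid reusing comments from the previous pass
            all_comments |= comments
            time.sleep(pause)
        return all_comments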
    

    My favorite error was:

    OdnoklassnikiError("Error(code: 'None', description: 'HTTP error', method: 'discussions.getComments', params: …)")

    6) In the end, Selenium + API looks like the most rational option.

  • You have to save state and restart the system, and handle a multitude of errors, including inconsistent behavior of the site. These are errors that are quite hard to imagine (unless you write parsers professionally, of course).
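Saving state and restarting can be as simple as a checkpoint file that is rewritten after every processed item. Here is a minimal sketch, assuming a hypothetical `fetch` callable whose results are JSON-serializable and a local file path for the checkpoint; it is an illustration, not the code used for ok.ru.

```python
import json
import os


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {"done": [], "results": {}}


def save_state(path, state):
    """Write the checkpoint via a temporary file so a crash cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on the same filesystem


def collect_resumable(fetch, ids, path):
    """Process ids, skipping those already recorded in the checkpoint file."""
    state = load_state(path)
    done = set(state["done"])
    for item_id in ids:
        if item_id in done:
            continue  # already collected in a previous run
        state["results"][str(item_id)] = fetch(item_id)
        state["done"].append(item_id)
        done.add(item_id)
        save_state(path, state)  # checkpoint after every item
    return state["results"]
```

If the run dies halfway, calling `collect_resumable` again with the same path continues from the first unprocessed ID. Rewriting the whole file per item is wasteful for large collections; a database or an append-only log would be the usual next step.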

The rough time estimate for this task will be 3-5 times higher than for collecting data from Habr, even though in the case of Habr we use a brute-force approach with HTML parsing, while in the case of OK we can work with the API in the critical places.
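Taken together, the two "3-5 times" multipliers compound. The short calculation below only restates the article's rough ratios with Reddit as a unit baseline; the helper name is made up for illustration.

```python
def scale_range(base, factor_low, factor_high):
    """Multiply a (low, high) estimate range by a (low, high) factor range."""
    low, high = base
    return (low * factor_low, high * factor_high)


reddit = (1, 1)                    # baseline: one "Reddit unit" of effort
habr = scale_range(reddit, 3, 5)   # 3-5 times Reddit
ok_ru = scale_range(habr, 3, 5)    # 3-5 times Habr
print(habr, ok_ru)                 # (3, 5) (9, 25)
```

In other words, under these rough ratios the Odnoklassniki task lands somewhere between 9 and 25 times the Reddit effort.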

Conclusions

No matter how hard you are pressed to estimate the deadline "on the spot" (we have planning today!) for a sizable module of a data processing pipeline, the execution time can almost never be estimated, even qualitatively, without analyzing the parameters of the task.

On a slightly more philosophical note, agile estimation strategies work well for engineering tasks, but tasks that are more experimental and, in a sense, "creative" and exploratory, that is, less predictable, run into difficulties, as in the examples discussed here.

Of course, data collection is just a vivid example: the task usually seems incredibly simple and technically uncomplicated, and it is in the details that the devil most often hides. It is precisely on this task that we can show the whole range of what can go wrong and how long the work can drag on.

If you skim the characteristics of the task without additional experiments, Reddit and OK look similar: there is an API and a python wrapper, but in reality the difference is huge. Judging by these parameters, parsing Habr looks harder than OK, yet in practice it is quite the opposite, and this is exactly what simple experiments that analyze the parameters of the task can reveal.

In my experience, the most effective approach is to roughly estimate the time you will need for the preliminary analysis itself, the first simple experiments, and reading the documentation; these are what allow you to give an accurate estimate for the whole job. In terms of the popular agile methodology, I ask for a ticket to be created for "estimating the task parameters", on the basis of which I can assess what can be done within a "sprint" and give a more accurate estimate for each task.

Therefore, the most effective argument seems to be one that shows a "non-technical" specialist how much the time and resources will vary depending on parameters that have yet to be assessed.


source: www.habr.com
