Python: a helper for finding cheap flights for those who love to travel

The author of this article, whose translation we are publishing today, says its goal is to describe the development of a Python web scraper that uses Selenium to search for airline ticket prices. Ticket searches use flexible dates (±3 days relative to the dates specified). The scraper saves the search results to an Excel file and sends the person who ran the search an email summarizing what it found. The aim of the project is to help travelers find the best deals.

If you feel lost while working through the material, take a look at this article.

What are we looking for?

You are free to use the system described here however you like. For example, I used it to search for weekend trips and for tickets to my hometown. If you are serious about finding good deals, you can run the script on a server (a simple server at 130 rubles a month is quite enough for this) and have it run once or twice a day. The search results will be emailed to you. I also recommend setting things up so that the script saves the Excel file with the search results to a Dropbox folder, which lets you view those files anywhere, at any time.

I haven't found any error fares yet, but I think it's possible

As already mentioned, the search uses "flexible dates": the script finds offers within three days of the given dates. Although a single run of the script searches for offers in only one direction, it is easy to modify so that it collects data for several routes. With its help you can even look for error fares; such finds can be very interesting.
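
To make the "flexible dates" idea concrete, here is a minimal sketch of the ±3-day window around a chosen date. The function name flexible_window and its parameters are hypothetical and not part of the scraper, which leaves this expansion to the site itself.

```python
from datetime import date, timedelta


def flexible_window(center, days=3):
    """Return every date from `center - days` to `center + days`, inclusive."""
    return [center + timedelta(days=offset) for offset in range(-days, days + 1)]


# Hypothetical helper applied to one of the sample dates used later in the article.
window = flexible_window(date(2019, 8, 21))
print(window[0], window[-1], len(window))
```

A window of ±3 days always covers seven candidate dates per leg, which is why a single search can surface noticeably cheaper options than an exact-date query.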

Why do you need another web scraper?

When I first started web scraping, I honestly wasn't particularly interested in it. I wanted to do more projects in predictive modelling, financial analysis and, possibly, sentiment analysis of texts. But it turned out that figuring out how to build a program that collects data from websites was very interesting. As I dug into the topic, I realized that web scraping is the "engine" of the internet.

You may think that's too bold a statement. But consider that Google started with a web scraper that Larry Page built using Java and Python. Google's robots have been exploring the internet ever since, trying to give its users the best answers to their questions. Web scraping has endless applications, and even if you are interested in something else in Data Science, you will need some scraping skills to get the data you want to analyze.

I found some of the techniques used here in a wonderful book on web scraping that I recently acquired. It contains many simple examples and ideas for applying what you have learned. It also has a very interesting chapter on getting past reCaptcha checks. That was news to me, since I didn't even know that there were special tools and even entire services for solving such problems.

Do you like to travel?!

The simple and rather harmless question in the title of this section will often get a positive answer, accompanied by a couple of stories from the travels of the person asked. Most of us would agree that travelling is a great way to immerse yourself in new cultural environments and broaden your horizons. However, if you ask someone whether they like searching for airline tickets, I'm sure the answer will be far less positive. This, in fact, is where Python comes to our aid.

The first task we need to solve on the way to building a system for finding airline ticket information is choosing a suitable platform to take the information from. This was not easy for me, but in the end I chose Kayak. I tried Momondo, Skyscanner, Expedia and a few others, but the bot-protection mechanisms on those sites were impenetrable. After several attempts, during which I had to deal with traffic lights, pedestrian crossings and bicycles while trying to convince the systems that I was human, I decided that Kayak suited me best, even though it too starts running checks if too many pages are loaded in a short time. I managed to get the bot to send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties do come up from time to time when working with Kayak, but if it starts pestering you with checks, you need to either deal with them manually and then restart the bot, or wait a few hours until the checks stop. If necessary, you can easily adapt the code for another platform, and if you do, you can report it in the comments.

If you are just getting started with web scraping and don't know why some websites struggle with it, then before you start your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you think if you scrape unwisely.

Getting started

Here is a general overview of what will happen in our web scraper code:

  • The required libraries are imported.
  • A Google Chrome tab is opened.
  • A function that starts the bot is called, and it is passed the cities and dates to use when searching for tickets.
  • This function takes the first search results, sorted by "best", and clicks the button to load more results.
  • Another function collects data from the entire page and returns a data frame.
  • The two previous steps are repeated using sorting by ticket price (cheapest) and by flight speed (fastest).
  • An email with a summary of ticket prices (the cheapest ticket and the average price) is sent to the user of the script, and a data frame with the information sorted by the three criteria mentioned above is saved as an Excel file.
  • All of the above steps are performed in a loop after a set time interval.

Note that every Selenium project starts with a web driver. I use Chromedriver and work with Google Chrome, but there are other options. PhantomJS and Firefox are also popular. After downloading the driver, you need to put it in a suitable folder, and that completes the preparation for using it. The first lines of our script open a new Chrome tab.

Keep in mind that in this article I am not trying to break new ground in finding great deals on airline tickets. There are much more advanced methods of searching for such offers. I just want to offer readers of this material a simple but practical way to solve this problem.

Here is the code we talked about above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

# Use your own path to chromedriver here!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

driver = webdriver.Chrome(executable_path=chromedriver_path)  # this command opens a Chrome window
sleep(2)

At the top of the code you can see the import statements for the packages used throughout our project. For instance, randint is used to make the bot "sleep" for a random number of seconds before starting a new search operation. Usually no bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with sites.
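
The randomized pause can be isolated into a small helper, sketched below. The name polite_pause and its sleeper parameter are hypothetical; the parameter exists only so the pause can be swapped for a no-op when experimenting, and the article's own code simply calls sleep(randint(...)) inline.

```python
from random import randint
from time import sleep


def polite_pause(low, high, sleeper=sleep):
    """Sleep for a random whole number of seconds in [low, high] and return it."""
    seconds = randint(low, high)
    sleeper(seconds)
    return seconds


# Demo with a stand-in sleeper so the example finishes instantly.
waited = polite_pause(2, 5, sleeper=lambda s: None)
print('would have slept for', waited, 'seconds')
```

Varying the delay makes the request pattern less regular than a fixed interval would, which is the reason the bot uses random pauses.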

Let's do a small experiment and open kayak.com in a separate window. Choose the city you're flying from, the city you want to get to, and the flight dates. When choosing dates, make sure the ±3 days range is selected. I wrote the code with what the site produces in response to such requests in mind. If, for example, you need to search for tickets for the specified dates only, you will most likely have to modify the bot's code. As I go through the code I provide appropriate explanations, but if you feel confused, let me know.

Now click the search button and look at the link in the address bar. It should look like the link I use in the example below, where the variable kayak, which stores the URL, is declared and the web driver's get method is used. After clicking the search button, the results should appear on the page.
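
As a rough illustration of what that link looks like, the sketch below assembles it from its parts, following the same pattern that the start_kayak function further down uses. The helper name build_kayak_url is hypothetical, and the URL scheme is the one shown in this article, so the site may have changed it since.

```python
def build_kayak_url(city_from, city_to, date_start, date_end):
    """Assemble a flexible-date search URL from IATA codes and YYYY-MM-DD dates."""
    return ('https://www.kayak.com/flights/{}-{}/{}-flexible/{}-flexible'
            '?sort=bestflight_a').format(city_from, city_to, date_start, date_end)


# Sample cities and dates taken from the commented-out test values in the article.
kayak = build_kayak_url('LIS', 'SIN', '2019-08-21', '2019-09-07')
print(kayak)
```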

When I used the get command more than two or three times within a few minutes, I was asked to complete a reCaptcha check. You can pass this check manually and keep experimenting until the system decides to run a new check. When I was testing the script, it seemed that the first search session always went smoothly, so if you want to experiment with the code, you will only need to pass the check manually from time to time and then let the code run, using long intervals between search sessions. And if you think about it, a person is unlikely to need ticket price information collected at 10-minute intervals between searches.

Working with the page using XPath

So, we have opened a window and loaded the site. To get prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and didn't feel the need to use CSS selectors, but it is quite possible to work that way. Navigating a page with XPath can be tricky, and even if you use the techniques I described in this article, which involved copying the corresponding identifiers from the page code, I realized that this is, in fact, not the best way to reach the elements you need. By the way, this book gives an excellent explanation of the basics of working with pages using XPath and CSS selectors. Here is what the corresponding web driver method looks like.

So, let's continue working on the bot. Let's use the program to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view the code, right-click the page element you are interested in and choose Inspect from the context menu. This command can be invoked for different page elements, whose code will be displayed and highlighted in the code viewer.

Viewing the page code

To find confirmation of my reasoning about the drawbacks of copying selectors from the code, pay attention to the following.

Here is what you get when you copy the code:

//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

To copy something like this, right-click the piece of code you are interested in and choose Copy > Copy XPath from the context menu.

Here is what I used to define the Cheapest button:

cheap_results = '//a[@data-code = "price"]'

The Copy > Copy XPath command

It is obvious that the second option looks much simpler. When it is used, the search looks for an a element whose data-code attribute equals price. With the first option, the search looks for an element whose id equals wtKI-price_aTab, and the XPath path to the element looks like /div[1]/div/div/div[1]/div/span/span. An XPath query like this will do the trick, but only once. I can tell you right now that the id will change the next time the page loads. The character sequence wtKI changes dynamically every time the page is loaded, so code that uses it will be useless after the next page reload. So take the time to understand XPath. That knowledge will serve you well.
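
The difference between the two selectors can be tried offline. The sketch below is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree, which supports a limited subset of XPath, on a made-up page fragment, and the second id prefix (Xq7b) is invented to simulate a reload.

```python
import xml.etree.ElementTree as ET

# Two "loads" of the same hypothetical page: only the generated id prefix differs.
PAGE_TEMPLATE = """
<div>
  <a id="{prefix}-price_aTab" data-code="price">Cheapest</a>
  <a id="{prefix}-duration_aTab" data-code="duration">Quickest</a>
</div>
"""

first_load = ET.fromstring(PAGE_TEMPLATE.format(prefix='wtKI'))
second_load = ET.fromstring(PAGE_TEMPLATE.format(prefix='Xq7b'))  # invented prefix

by_id = ".//*[@id='wtKI-price_aTab']"   # copied selector, tied to the generated id
by_attr = ".//a[@data-code='price']"    # hand-written selector, tied to a stable attribute

for name, page in (('first load', first_load), ('second load', second_load)):
    print(name, '| by id:', page.find(by_id) is not None,
          '| by data-code:', page.find(by_attr) is not None)
```

The copied selector stops matching as soon as the generated prefix changes, while the attribute-based one keeps working on both fragments.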

That said, copying XPath selectors can be useful when working with fairly simple sites, and if that works for you, there is nothing wrong with it.

Now let's think about what to do if you need to get all the search results as several strings inside a list. It's very simple. Each result sits inside an element with the class resultWrapper. All the results can be loaded in a loop like the one shown below.

Note that if you understand the above, you should easily understand most of the code we will be going through. As this code runs, we reach what we need (in effect, the element the result is wrapped in) using some path-specifying mechanism (XPath). This is done to get the element's text and put it into an object from which the data can be read (first flight_containers is used, then flights_list).
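
A minimal stand-alone sketch of the same idea follows. It uses the standard library's xml.etree.ElementTree on a made-up fragment instead of Selenium on the live page. The names flight_containers and flights_list follow the text above; xp_results_table, the markup and the sample values are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# A tiny made-up stand-in for the results page.
PAGE = """
<div id="results">
  <div class="resultWrapper"><span>17h 05m</span> <span>LIS - SIN</span> <span>$512</span></div>
  <div class="resultWrapper"><span>19h 40m</span> <span>LIS - SIN</span> <span>$468</span></div>
  <div class="ad">Sponsored</div>
</div>
"""

root = ET.fromstring(PAGE)
xp_results_table = ".//div[@class='resultWrapper']"  # hypothetical variable name
flight_containers = root.findall(xp_results_table)

# Join every text fragment inside a container, roughly like reading an element's text
flights_list = [' '.join(''.join(c.itertext()).split()) for c in flight_containers]
print(flights_list)
```

Each entry in flights_list is one long string per result, which is why the scraper goes on to pull the individual fields from narrower selectors.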

The first three results are displayed, and we can clearly see everything we need. However, we have more interesting ways of getting the information. We need to take the data from each element separately.

Let's get to work!

The easiest function to write is the one that loads more results, so that's where we'll start. I'd like to maximize the number of flights the program gets information about without raising suspicions in the service that lead to a check, so I click the Load more results button once each time the page is displayed. In this code, pay attention to the try block, which I added because sometimes the button doesn't load properly. If you run into this too, comment out the calls to this function in the code of the start_kayak function, which we'll look at below.

# Load more results to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these notes while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45,60))
    except Exception:
        # The button sometimes fails to load; skip it rather than stop the run
        pass

Now, after such a long analysis of this function (sometimes I can get carried away), we are ready to declare the function that will scrape the page.

I have already collected most of what is needed in the following function, called page_scrape. Sometimes the returned data for the two legs of a trip comes combined, so I use a simple method to split it, for example when I first use the variables section_a_list and section_b_list. Our function returns a data frame, flights_df, which lets us keep apart the results obtained with different sorting methods and merge them later.

def page_scrape():
    """This function takes care of the scraping part"""
    
    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2]  # this is how we split the information about the two flights
    section_b_list = sections_list[1::2]
    
    # If you run into reCaptcha, you may need to do something about it.
    # You will know something went wrong because the lists above are empty.
    # This if statement lets you exit the program or do something else;
    # here you could pause, which would let you pass the check and continue scraping.
    # I use SystemExit here because I want to test everything from the very beginning.
    if section_a_list == []:
        raise SystemExit
    
    # I use the letter A for outbound flights and B for return flights
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # Get the route (cities) and the duration
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # Get the route (cities) and the duration
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Separate the day from the weekday
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]
    
    # Get the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$','') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is a big list in which the first leg sits at the even indices and the second leg at the odd ones
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n','0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]
    
    # carrier information plus departure and arrival times for both flights
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # split the times and carriers between flights a and b
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    
    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
            'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
            'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,                           
                               'Price': prices_list})[cols]
    
    flights_df['timestamp'] = strftime("%Y%m%d-%H%M")  # time the data was collected
    return flights_df

I tried to name the variables so that the code would be understandable. Remember that variables starting with a refer to the first leg of the trip, and those starting with b to the second. Let's move on to the next function.
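
The a/b split relies on the page listing the legs in alternating order, so step-2 slicing separates them. A small sketch with made-up values (the strings are placeholders, not real scraped text):

```python
# Made-up values in the order the page yields them: out, return, out, return, ...
sections_list = ['out-1', 'ret-1', 'out-2', 'ret-2', 'out-3', 'ret-3']

section_a_list = sections_list[::2]    # even indices: first (outbound) legs
section_b_list = sections_list[1::2]   # odd indices: second (return) legs

print(section_a_list)
print(section_b_list)

# Pairing them back up recovers one (outbound, return) tuple per itinerary.
itineraries = list(zip(section_a_list, section_b_list))
```

If the two slices ever differ in length, the page did not return complete round trips, which is worth checking before building a data frame from them.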

Helper mechanisms

We now have a function that lets us load additional search results and a function for processing those results. The article could have ended here, since these two functions provide everything you need to scrape pages that you open yourself. But we haven't yet looked at some of the helper mechanisms mentioned above, for example the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will now look at.

For this function to work, it needs information about cities and dates. Using this information, it builds a link in the variable kayak, which is used to go to a page containing search results sorted by how well they match the query. After the first scraping session, we work with the prices in the table at the top of the page: we find the minimum ticket price and the average price. All of this, together with the prediction issued by the site, will be sent by email. On the page, the corresponding table should be in the upper left corner. Working with this table, by the way, can cause an error when searching with exact dates, because in that case the table is not displayed on the page.

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format -  YYYY-MM-DD"""
    
    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))
    
    # a popup sometimes appears; a try block can be used to check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')
    
#     load_more()
    
    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))
    
    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)
    
    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))
    
    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()  
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))
    
    # Save the combined frame to an Excel file whose name reflects the cities and dates
    final_df = pd.concat([df_flights_cheap, df_flights_best, df_flights_fast])
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to, 
                                                                                   date_start, date_end), index=False)
    print('saved df.....')
    
    # We can keep track of how the prediction given by the site compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading+'\n'+prediction)
    
    # sometimes the loading variable contains this string, which later causes problems when sending the email
    # if that happens, replace it with "Not sure"
    weird = '¯_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'
    
    # Placeholder addresses: replace them with your own
    username = 'YOUR_EMAIL@hotmail.com'
    password = 'YOUR PASSWORD'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\n'
           'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message').format(
               matrix_min, matrix_avg, loading + '\n' + prediction)
    message = MIMEMultipart()
    message['From'] = 'YOUR_EMAIL@hotmail.com'
    message['to'] = 'RECIPIENT_EMAIL@example.com'
    server.sendmail('YOUR_EMAIL@hotmail.com', 'RECIPIENT_EMAIL@example.com', msg)
    print('sent email.....')

I tested this script with an Outlook account (hotmail.com). I haven't checked whether it works correctly with a Gmail account, which is a very popular email system, but there are many options. If you use a Hotmail account, all you need to do to make everything work is enter your details in the code.

If you want to understand exactly what particular sections of this function's code do, you can copy them out and experiment with them. Experimenting with code is the only way to really understand it.
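
As one example of such an experiment, the price-summary step can be tried in isolation on a few hand-written strings. This is a sketch under stated assumptions: the helper name summarize_prices is hypothetical, and the handling of thousands separators and empty cells is an assumption about what the table might contain; the code above only strips the dollar sign.

```python
def summarize_prices(raw_prices):
    """Return (minimum, average) for price strings such as '$512' or '$1,204'."""
    cleaned = [p.replace('$', '').replace(',', '').strip() for p in raw_prices]
    prices = [int(p) for p in cleaned if p]   # skip empty cells
    if not prices:
        raise ValueError('no prices found; the flexible-date table may be missing')
    return min(prices), sum(prices) / len(prices)


# Made-up sample values, including an empty cell and a thousands separator.
matrix_min, matrix_avg = summarize_prices(['$512', '', '$1,204', '$468'])
print(matrix_min, matrix_avg)
```

Raising a clear error when the list is empty makes the exact-date case mentioned earlier easier to diagnose than a bare failure inside min().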

The finished system

Now that we have done everything we talked about, we can create a simple loop that calls our functions. The script asks the user for the cities and dates. When testing with constant restarts of the script, you will hardly want to enter this data by hand every time, so for the duration of testing the corresponding lines can be commented out and the lines below them, in which the data the script needs is hard-coded, uncommented instead.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))
    
    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')
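
Because the loop runs for many hours, it can be worth checking the typed-in values before the first request is made. The sketch below is one possible approach and not part of the original script; validate_search is a hypothetical name, and the three-uppercase-letter rule is an assumption based on the IATA codes the docstring of start_kayak asks for.

```python
import re
from datetime import datetime


def validate_search(city_from, city_to, date_start, date_end):
    """Raise ValueError unless the codes look like IATA codes and the dates parse in order."""
    for code in (city_from, city_to):
        if not re.fullmatch(r'[A-Z]{3}', code):
            raise ValueError('expected a three-letter IATA code, got {!r}'.format(code))
    start = datetime.strptime(date_start, '%Y-%m-%d')   # raises ValueError on a bad format
    end = datetime.strptime(date_end, '%Y-%m-%d')
    if end < start:
        raise ValueError('return date is before departure date')
    return start.date(), end.date()


# Sample values taken from the commented-out test lines above.
start, end = validate_search('LIS', 'SIN', '2019-08-21', '2019-09-07')
print(start, end)
```

Calling such a check right after the input() lines would stop the script immediately on a typo instead of after the first long wait.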

This is what a test run of the script looks like.
Test run of the script

Summary

If you've made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something else to get results from several servers at once. There is also the recurring problem of the site checking whether the user is human, but that can be solved too. In any case, you now have a base that you can extend if you wish, for example by making sure the Excel file is sent to the user as an email attachment.
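
The attachment idea could be sketched as follows, using the standard library's email.message.EmailMessage. This is only an illustration: build_report is a hypothetical helper, the addresses are placeholders, and the attachment bytes stand in for the contents of the saved Excel file. The message is built but not sent.

```python
from email.message import EmailMessage


def build_report(sender, recipient, summary, xlsx_bytes, filename):
    """Build (but do not send) an email with the summary text and an Excel attachment."""
    message = EmailMessage()
    message['Subject'] = 'Flight Scraper'
    message['From'] = sender
    message['To'] = recipient
    message.set_content(summary)
    message.add_attachment(
        xlsx_bytes,
        maintype='application',
        subtype='vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        filename=filename,
    )
    return message


# Placeholder addresses and bytes, for illustration only.
report = build_report('sender@example.com', 'recipient@example.com',
                      'Cheapest Flight: 468\nAverage Price: 728.0',
                      b'placeholder bytes, not a real workbook', 'flights.xlsx')
attachment_names = [part.get_filename() for part in report.iter_attachments()]
print(report['Subject'], attachment_names)
```

In the real script the bytes would come from reading the saved .xlsx file in binary mode, and the message could then be handed to the send_message method of the smtplib.SMTP connection in place of the sendmail call.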

Source: www.habr.com