Python: a helper for finding cheap airline tickets for those who love to travel

The author of the article, whose translation we are publishing today, says its goal is to describe the development of a web scraper in Python, using Selenium, that searches for airline ticket prices. When searching for tickets, flexible dates are used (+-3 days relative to the specified dates). The scraper saves the search results to an Excel file and sends the person who ran the search an email with a summary of what was found. The aim of the project is to help travelers find the best deals.


If you feel lost while working through the material, take a look at this article.

What are we going to look for?

Feel free to use the system described here however you like. For example, I used it to search for weekend trips and for tickets to my hometown. If you are serious about finding bargain tickets, you can run the script on a server (a simple server, at 130 rubles a month, is quite suitable for this) and make sure it runs once or twice a day. The search results will be emailed to you. In addition, I recommend setting everything up so that the script saves the Excel file with the search results to a Dropbox folder, which lets you view those files from anywhere at any time.

[Image: I have not found any error fares yet, but I think it is possible]

As already mentioned, the search uses "flexible dates": the script finds offers that fall within three days of the given dates. Although a single run of the script searches for offers on only one route, it is easy to modify so that it collects data for several routes. With its help you can even look for error fares; such finds can be very interesting.
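
To make the "flexible dates" idea concrete, here is a minimal sketch that lists the days covered by a +-3 day window around a given date. The site performs the flexible search itself; the helper name flexible_window is hypothetical and is not part of the script.

```python
from datetime import date, timedelta

def flexible_window(day, flex_days=3):
    """Return ISO dates from flex_days before to flex_days after the given day."""
    center = date.fromisoformat(day)
    return [(center + timedelta(days=offset)).isoformat()
            for offset in range(-flex_days, flex_days + 1)]

window = flexible_window('2019-08-21')
print(window[0], window[-1], len(window))
```

With the default of three days, the window covers seven dates in total, including the requested one.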

Why do you need another web scraper?

When I first started web scraping, I honestly was not particularly interested in it. I wanted to do more projects in predictive modeling, financial analysis and, perhaps, sentiment analysis of texts. But it turned out to be very interesting to figure out how to write a program that collects data from websites. As I dug into the topic, I realized that web scraping is the "engine" of the Internet.

You may think this is too bold a statement. But consider that Google started with a web scraper that Larry Page built using Java and Python. Google's robots have been crawling the Internet ever since, trying to give its users the best answers to their questions. Web scraping has endless applications, and even if you are interested in some other area of data science, you will need some scraping skills to get the data you want to analyze.

I found some of the techniques used here in a wonderful book about web scraping that I acquired recently. It contains many simple examples and ideas for putting what you learn into practice. In addition, there is a very interesting chapter on getting past reCaptcha checks. This came as news to me, since I did not know that there are special tools and even entire services for solving such problems.

Do you like to travel?!

To the simple and rather harmless question in the title of this section you will usually hear a positive answer, accompanied by a couple of stories from the travels of the person you asked. Most of us would agree that traveling is a great way to immerse yourself in a new cultural environment and broaden your horizons. However, if you ask someone whether they like searching for airline tickets, I am sure the answer will be far less positive. This is exactly where Python comes to our aid.

The first task we need to solve on the way to building a system that searches for airline ticket information is choosing a suitable platform to take the information from. Solving this was not easy for me, but in the end I chose Kayak. I tried Momondo, Skyscanner, Expedia and a few others, but the bot-protection mechanisms on those sites were impenetrable. After several attempts, during which I had to deal with traffic lights, pedestrian crossings and bicycles while trying to convince the systems that I was human, I decided that Kayak suited me best, even though checks start there too if you load too many pages in a short time. I managed to make the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties do arise with Kayak from time to time, but if it starts pestering you with checks, you either need to deal with them manually and then restart the bot, or wait a few hours until the checks stop. If necessary, you can easily adapt the code to another platform, and if you do, you can report it in the comments.

If you are just getting started with web scraping and do not know why some websites fight it, then before starting your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you think if you scrape unwisely.

Getting started

Here is a general overview of what will happen in our web scraper code:

  • Import the required libraries.
  • Open a Google Chrome tab.
  • Call a function that starts the bot, passing it the cities and dates to use in the ticket search.
  • This function takes the first search results, sorted by "best", and clicks the button that loads more results.
  • Another function collects data from the whole page and returns a data frame.
  • The two previous steps are repeated with the results sorted by ticket price (cheapest) and by flight speed (fastest).
  • The user of the script is sent an email with a summary of ticket prices (the cheapest ticket and the average price), and the data frame with the information sorted by the three criteria above is saved as an Excel file.
  • All of the above actions are performed in a loop, repeating after a set interval.

It should be noted that every Selenium project starts with a web driver. I use Chromedriver and work with Google Chrome, but there are other options; PhantomJS and Firefox are also popular. After downloading the driver, you need to put it in a suitable folder, and that completes the setup. The first lines of our script open a new Chrome tab.

Keep in mind that in this article I am not trying to open up new horizons for finding great deals on airline tickets. There are far more advanced methods of searching for such offers. I just want to offer readers of this material a simple but practical way to solve the problem.

Here is the code we talked about above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

# Use your own path to chromedriver here!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

# This command opens a Chrome window.
# (Selenium 3 style; newer Selenium releases pass the driver path via a Service object.)
driver = webdriver.Chrome(executable_path=chromedriver_path)
sleep(2)

At the beginning of the code you can see the package import commands used throughout our project. For instance, randint is used to make the bot "sleep" for a random number of seconds before starting a new search operation. As a rule, no bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with sites.
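
The random pause can be wrapped in a small helper. This is only a sketch: the name random_pause and the sleeper parameter are assumptions added for illustration (the parameter makes it possible to try the helper without actually waiting).

```python
from random import randint
from time import sleep

def random_pause(min_seconds, max_seconds, sleeper=sleep):
    """Pause for a random whole number of seconds in [min_seconds, max_seconds]."""
    seconds = randint(min_seconds, max_seconds)
    sleeper(seconds)  # time.sleep by default
    return seconds

# Demonstration that records the chosen delay instead of really sleeping
recorded = []
waited = random_pause(45, 60, sleeper=recorded.append)
print('would have slept for', waited, 'seconds')
```

In the real bot the default sleeper would be used, so random_pause(45, 60) behaves like sleep(randint(45, 60)).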

Let's do a small experiment and open kayak.com in a separate window. Choose the city we are flying from, the city we want to get to, and the flight dates. When choosing dates, make sure the +-3 days range is used. I wrote the code with the site's response to such requests in mind. If, for example, you need to search for tickets only for the exact dates, there is a high chance you will have to modify the bot's code. When discussing the code I give appropriate explanations, but if you feel confused, let me know.

Now click the search button and look at the link in the address bar. It should look like the link I use in the example below, where the variable kayak is declared to store the URL and the web driver's get method is used. After clicking the search button, results should appear on the page.
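
The shape of that link can be checked without opening a browser. The sketch below reuses the URL pattern from the start_kayak function shown later in the article; the helper name build_kayak_url is hypothetical, and the site may have changed its URL scheme since the article was written.

```python
def build_kayak_url(city_from, city_to, date_start, date_end):
    """Build a flexible-date search URL from IATA codes and YYYY-MM-DD dates."""
    return ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
            '/' + date_start + '-flexible/' + date_end +
            '-flexible?sort=bestflight_a')

url = build_kayak_url('LIS', 'SIN', '2019-08-21', '2019-09-07')
print(url)
```

Comparing the printed string with what appears in the address bar is a quick way to notice when the pattern no longer matches.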

When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha verification. You can pass this check manually and continue experimenting until the system decides to run a new check. When I tested the script, it seemed that the first search session always went smoothly, so if you want to experiment with the code, you only need to pass the check manually from time to time and leave the code running with long intervals between search sessions. And if you think about it, a person is unlikely to need ticket price information obtained at 10-minute intervals between searches.

Working with a page using XPath

So, we have opened a window and loaded the site. To get prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and did not feel the need for CSS selectors, but working that way is entirely possible. Navigating a page with XPath can be tricky, and even if you use the techniques I described in this article, which involved copying the corresponding identifiers from the page code, I came to realize that this is, in fact, not the best way to reach the required elements. By the way, this book gives an excellent description of the basics of working with pages using XPath and CSS selectors. This is what the corresponding web driver method looks like.

[Image: the corresponding web driver method]
So, let's continue working on the bot. Let's use the program's capabilities to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view the code, right-click the page element you are interested in and choose Inspect from the context menu. This command can be invoked on different page elements, whose code will be displayed and highlighted in the code viewer.

[Image: viewing the page code]

To see why I consider copying selectors from the page code a poor approach, pay attention to the following.

Here is what you get when you copy the code:

//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

To copy something like this, right-click the section of code you are interested in and choose Copy > Copy XPath from the context menu.

Here is what I used to define the Cheapest button:

cheap_results = '//a[@data-code = "price"]'

[Image: the Copy > Copy XPath command]

It is obvious that the second option looks much simpler. When used, it looks for an a element whose data-code attribute equals price. With the first option, an element is searched for whose id equals wtKI-price_aTab, and the XPath path to the element looks like /div[1]/div/div/div[1]/div/span/span. Such an XPath query will do the trick, but only once. I can say right now that the id will change the next time the page is loaded. The character sequence wtKI changes dynamically every time the page loads, so code that uses it becomes useless after the next page reload. So take the time to understand XPath. This knowledge will serve you well.
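
The difference can be reproduced offline. The sketch below uses the standard library's ElementTree (which supports a subset of XPath) on two simplified, made-up snapshots of the same fragment; the second id prefix, xQ9z, is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same page fragment: only the
# auto-generated id prefix differs between page loads.
FIRST_LOAD = '<div><a id="wtKI-price_aTab" data-code="price">Cheapest</a></div>'
SECOND_LOAD = '<div><a id="xQ9z-price_aTab" data-code="price">Cheapest</a></div>'

BRITTLE = ".//*[@id='wtKI-price_aTab']"  # relies on the generated id
ROBUST = ".//a[@data-code='price']"      # relies on a stable attribute

def count_matches(markup, xpath):
    """Count the elements in the markup that match the XPath expression."""
    return len(ET.fromstring(markup).findall(xpath))

for name, markup in (('first load', FIRST_LOAD), ('second load', SECOND_LOAD)):
    print(name, 'brittle:', count_matches(markup, BRITTLE),
          'robust:', count_matches(markup, ROBUST))
```

The id-based query stops matching as soon as the prefix changes, while the attribute-based query keeps finding the link.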

However, it should be noted that copying XPath selectors can be handy when working with fairly simple sites, and if that suits you, there is nothing wrong with it.

Now let's think about what to do if you need to get all the search results as several strings inside a list. It is very simple. Each result sits inside an object with the class resultWrapper, and all the results can be loaded in a simple loop.
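
A minimal sketch of such a loop follows, assuming a Selenium 3 style driver like the one used throughout this article. The names flight_containers and flights_list come from the text; the function name collect_result_texts and the variable xp_results_table are assumptions.

```python
def collect_result_texts(driver):
    """Return the visible text of every element with the class resultWrapper."""
    xp_results_table = '//*[@class = "resultWrapper"]'
    flight_containers = driver.find_elements_by_xpath(xp_results_table)
    flights_list = [flight.text for flight in flight_containers]
    return flights_list
```

Calling collect_result_texts(driver)[0:3] after a search page has loaded would show the first three results as plain strings.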

It should be noted that if you understand the above, you should easily understand most of the code we are going to analyze. When such code runs, we reach what we need (in fact, the element in which the result is wrapped) using a path-specifying mechanism (XPath). This is done to get the text of the element and put it into an object from which the data can be read (first flight_containers is used, then flights_list).

The first three rows are displayed, and we can clearly see everything we need. However, we have more interesting ways of obtaining the information. We need to take the data from each element separately.

Get to work!

The easiest function to write is the one that loads additional results, so that is where we will start. I would like to maximize the number of flights the program gets information about without arousing the service's suspicion, which leads to a check, so I click the Load more results button once each time the page is displayed. In this code, pay attention to the try block, which I added because sometimes the button does not load properly. If you run into this too, comment out the calls to this function in the code of the start_kayak function, which we will look at below.

# Load more results to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these notes while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45,60))
    except Exception:
        # The button sometimes fails to load; in that case just carry on
        pass

Now, after the long analysis of this function (sometimes I get carried away), we are ready to declare the function that will scrape the page.

I have already gathered most of what is needed in the following function, called page_scrape. Sometimes the returned data for the two legs of the route is combined, so I use a simple method to separate it, for example where the variables section_a_list and section_b_list are first used. Our function returns a data frame, flights_df, which lets us keep the results obtained with different sort orders separate and merge them later.

def page_scrape():
    """This function takes care of the scraping part"""
    
    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2] # this is how we separate the information about the two flights
    section_b_list = sections_list[1::2]
    
    # If you run into reCaptcha, you may need to do something about it.
    # You will know something went wrong because the lists above are empty.
    # This if statement lets you stop the program or do something else;
    # here you could pause, which would let you pass the check and continue scraping.
    # I use SystemExit here because I want to test everything from the very start.
    if section_a_list == []:
        raise SystemExit
    
    # I will use the letter A for outbound flights and B for inbound ones
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # Split the text into the route (cities) and the duration
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # Split the text into the route (cities) and the duration
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Get the day and the day of the week
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]
    
    # Get the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$','') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is one big list in which the first leg sits at even indexes and the second leg at odd ones
    # (a first character of 'n', as in 'nonstop', is turned into '0')
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n','0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]
    
    # Carrier details and departure and arrival times for both legs
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # Split the times and carriers between legs a and b:
    # even indexes belong to leg a, odd indexes to leg b
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    
    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
            'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
            'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,                           
                               'Price': prices_list})[cols]
    
    flights_df['timestamp'] = strftime("%Y%m%d-%H%M") # time the data was collected
    return flights_df
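
The slicing that page_scrape uses to separate the two legs can be tried in isolation. The sample strings below are hypothetical and only imitate the shape the function expects (duration first, then the route); the helper name split_section is an assumption.

```python
# Hypothetical texts: outbound and return legs alternate, so even indexes
# belong to leg A and odd indexes belong to leg B.
sections_list = ['13h 05m LIS - SIN', '14h 40m SIN - LIS',
                 '15h 30m LIS - SIN', '16h 10m SIN - LIS']

section_a_list = sections_list[::2]   # items 0, 2, ...
section_b_list = sections_list[1::2]  # items 1, 3, ...

def split_section(text):
    """Split one section text into (duration, cities), as page_scrape does."""
    parts = text.split()
    duration = ''.join(parts[0:2])
    cities = ''.join(parts[2:5])
    return duration, cities

print(split_section(section_a_list[0]))
print(split_section(section_b_list[0]))
```

If the site changes the layout of these texts, the fixed index ranges are the first thing to revisit.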

I tried to name the variables so that the code is easy to understand. Remember that variables starting with a belong to the first leg of the route, and those starting with b to the second. Let's move on to the next function.

Helper mechanisms

We now have a function that lets us load additional search results and a function for processing those results. This article could have ended here, since these two functions provide everything you need to scrape pages you can open yourself. But we have not yet considered some of the helper mechanisms discussed above, for example the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will now look at.

This function needs information about the cities and dates. Using this information, it builds a link in the kayak variable, which is used to go to a page containing search results sorted by best match to the query. After the first scraping session, we work with the prices in the table at the top of the page. Namely, we find the minimum ticket price and the average price. All of this, together with the prediction issued by the site, will be sent by email. On the page, the corresponding table should be in the upper left corner. Working with this table, by the way, can cause an error when searching with exact dates, since in that case the table is not displayed on the page.
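
One way to avoid that error is to compute the summary only when some prices were actually found. The sketch below is an assumption about how such a guard could look; summarize_prices is a hypothetical helper, not part of the script, and it also skips empty cells and strips thousands separators.

```python
def summarize_prices(cell_texts):
    """Return (minimum, average) of the prices, or None if nothing can be parsed."""
    prices = []
    for text in cell_texts:
        cleaned = text.replace('$', '').replace(',', '').strip()
        if cleaned.isdigit():  # skip empty or non-numeric cells
            prices.append(int(cleaned))
    if not prices:
        return None  # e.g. the table is missing for an exact-date search
    return min(prices), sum(prices) / len(prices)

print(summarize_prices(['$650', '$700', '', '$1,050']))
print(summarize_prices([]))
```

The caller can then decide what to put in the email when the result is None.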

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format -  YYYY-MM-DD"""
    
    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))
    
    # Sometimes a popup appears; a try block can be used to check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')
    
#     load_more()
    
    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))
    
    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)
    
    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))
    
    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()  
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))
    
    # Save the new data frame to an Excel file whose name reflects the cities and dates
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    final_df = pd.concat([df_flights_cheap, df_flights_best, df_flights_fast])
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to, 
                                                                                   date_start, date_end), index=False)
    print('saved df.....')
    
    # You can keep track of how the prediction issued by the site compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading+'\n'+prediction)
    
    # Sometimes the loading variable ends up holding this string, which later causes problems when sending the email
    # if that happens, replace it with "Not sure"
    weird = '¯_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'
    
    username = '[email protected]'
    password = 'YOUR PASSWORD'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\n'
           'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'
           .format(matrix_min, matrix_avg, (loading+'\n'+prediction)))
    message = MIMEMultipart()
    message['From'] = '[email protected]'
    message['to'] = '[email protected]'
    server.sendmail('[email protected]', '[email protected]', msg)
    print('sent email.....')

I tested this script using an Outlook account (hotmail.com). I have not checked whether it works correctly with a Gmail account; that mail system is very popular, but there are many possible options. If you use a Hotmail account, all you need to do to make everything work is enter your details in the code.

If you want to understand exactly what is done in particular parts of this function's code, you can copy them and experiment with them. Experimenting with code is the only way to really understand it.

The finished system

Now that we have done everything we talked about, we can write a simple loop that calls our functions. The script asks the user for the cities and dates. When testing with constant restarts of the script, you are unlikely to want to enter this data by hand every time, so for the duration of testing the corresponding lines can be commented out and the lines below them, in which the required data is hard-coded, uncommented.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))
    
    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')

This is what a test run of the script looks like.
[Image: test run of the script]
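
The prompts in the loop above ask for dates in YYYY-MM-DD format but do not check the answer. A minimal sketch of such a check is shown here; the function name is_valid_date is an assumption and is not part of the script.

```python
import re
from datetime import datetime

def is_valid_date(text):
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r'\d{4}-\d{2}-\d{2}', text):
        return False  # wrong shape, e.g. 21/08/2019 or 2019-8-21
    try:
        datetime.strptime(text, '%Y-%m-%d')
    except ValueError:
        return False  # right shape but impossible date, e.g. 2019-02-30
    return True

for candidate in ('2019-08-21', '21/08/2019', '2019-02-30'):
    print(candidate, is_valid_date(candidate))
```

A check like this could be run on date_start and date_end before the first call to start_kayak, so that a typo does not cost a whole search session.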

Summary

If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it can be integrated with Twilio so that it sends text messages instead of emails. You can use a VPN or something else to get results from several servers simultaneously. There is also the recurring problem of the site checking whether the user is human, but that problem can be solved too. In any case, you now have a base that you can extend if you wish. For example, make sure the Excel file is sent to the user as an email attachment.
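
As a starting point for the attachment idea, here is a sketch that builds such a message with the standard library's EmailMessage class. The function name build_report_email and its parameters are assumptions; the bytes would normally come from reading the saved Excel file.

```python
from email.message import EmailMessage

def build_report_email(sender, recipient, summary, excel_bytes, filename):
    """Build an email with a plain-text summary and an Excel file attached."""
    message = EmailMessage()
    message['Subject'] = 'Flight Scraper'
    message['From'] = sender
    message['To'] = recipient
    message.set_content(summary)
    message.add_attachment(
        excel_bytes,
        maintype='application',
        subtype='vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        filename=filename)
    return message
```

A message built this way can be handed to smtplib's send_message method on the already logged-in server object, instead of calling sendmail with a hand-built string.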


Source: www.habr.com
