Python: a helper for finding cheap airline tickets for people who love to travel

The author of the article, whose translation we are publishing today, says its goal is to describe the development of a Python web scraper that uses Selenium to search for airline ticket prices. Tickets are searched with flexible dates (+-3 days relative to the specified dates). The scraper saves the search results to an Excel file and sends the person who ran the search an email with a summary of what it found. The goal of the project is to help travelers find the best deals.

If you feel lost while working through the material, take a look at this article.

What are we going to look for?

You are free to use the system described here however you like. For example, I used it to look for weekend trips and for tickets to my hometown. If you are serious about finding bargain tickets, you can run the script on a server (a simple server, for 130 rubles a month, is quite enough for this) and have it run once or twice a day. The search results will be emailed to you. In addition, I recommend setting everything up so that the script saves the Excel file with the search results to a Dropbox folder, which lets you view those files from anywhere and at any time.

I have not found any mistake fares yet, but I think it is possible

As already mentioned, the search uses "flexible dates": the script finds fares within three days of the given dates. Although a single run of the script searches for offers on only one route, it is easy to modify so that it collects data for several routes. With its help you can even look for mistake fares, and such finds can be very interesting.
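To make the "flexible dates" idea concrete, here is a minimal sketch of the range of dates that a +-3 day search covers. The helper name flexible_window and its days parameter are made up for this illustration. The script itself does not compute these dates, it simply asks the site for flexible results.

```python
from datetime import date, timedelta

def flexible_window(day, days=3):
    """Return every date within `days` of `day`, inclusive (hypothetical helper)."""
    return [day + timedelta(days=offset) for offset in range(-days, days + 1)]

# A search around 2019-08-21 covers seven departure dates
window = flexible_window(date(2019, 8, 21))
print(window[0], window[-1], len(window))
```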

Why do you need another web scraper?

When I first started web scraping, I was honestly not that interested in it. I wanted to do more projects in predictive modeling, financial analysis, and perhaps sentiment analysis of texts. But it turned out to be very interesting to figure out how to build a program that collects data from websites. As I dug into the topic, I realized that web scraping is the "engine" of the Internet.

You might think this is too bold a statement. But consider that Google started with a web scraper that Larry Page built using Java and Python. Google's robots have been crawling the Internet ever since, trying to give its users the best answers to their questions. Web scraping has endless uses, and even if you are interested in something else in Data Science, you will need some scraping skills to get the data you want to analyze.

I found some of the techniques used here in a wonderful book about web scraping that I recently got hold of. It has many simple examples and ideas for putting what you have learned into practice. There is also a very interesting chapter on getting past reCaptcha checks. This was news to me, as I did not even know that there were special tools and even entire services for solving such problems.

Do you like to travel?!

The simple and rather harmless question in the title of this section usually gets a positive answer, accompanied by a couple of stories from the travels of the person asked. Most of us would agree that traveling is a great way to immerse yourself in a new cultural environment and broaden your horizons. However, if you ask someone whether they like searching for airline tickets, I am sure the answer will be far less positive. This is exactly where Python comes to the rescue.

The first task we need to solve on the way to building a system that searches for airline ticket information is choosing a suitable platform to take the information from. This was not easy for me, but in the end I chose Kayak. I tried Momondo, Skyscanner, Expedia, and a few others, but the bot-protection mechanisms on those sites were impenetrable. After several attempts, during which I had to deal with traffic lights, pedestrian crossings, and bicycles while trying to convince the systems that I was human, I decided that Kayak suited me best, even though checks start there too if too many pages are loaded in a short time. I managed to have the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Problems do come up from time to time when working with Kayak, and if it starts pestering you with checks, you either have to pass them manually and then restart the bot, or wait a few hours until the checks stop. If necessary, you can easily adapt the code for another platform, and if you do, you can report it in the comments.

If you are just getting started with web scraping and do not know why some websites fight against it, then before you start your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you think if you scrape recklessly.

Getting started

Here is an overview of what will happen in our web scraper code:

  • Import the required libraries.
  • Open a Google Chrome tab.
  • Call the function that starts the bot, passing it the cities and dates to use when searching for tickets.
  • This function takes the first search results, sorted by best, and clicks the button that loads more results.
  • Another function collects data from the whole page and returns a data frame.
  • The two previous steps are repeated using sorting by ticket price (cheapest) and by flight speed (fastest).
  • The user of the script is sent an email with a summary of ticket prices (the lowest price and the average price), and the data frame with the information sorted by the three criteria mentioned above is saved as an Excel file.
  • All of the above is performed in a loop after a set interval.

It should be noted that every Selenium project starts with a web driver. I use Chromedriver, since I work with Google Chrome, but there are other options. PhantomJS and Firefox are also popular. After downloading the driver, you need to put it in the appropriate folder, and that completes the preparation for using it. The first lines of our script open a new Chrome tab.

Keep in mind that in my story I am not trying to open up new horizons for finding great deals on airline tickets. There are far more advanced methods of searching for such offers. I just want to offer readers of this material a simple but practical way to solve this problem.

Here is the code we talked about above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

# Use your own path to chromedriver here!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

driver = webdriver.Chrome(executable_path=chromedriver_path) # This command opens a Chrome window
sleep(2)

At the start of the code you can see the package import statements used throughout our project. For example, randint is used to make the bot "sleep" for a random number of seconds before starting a new search operation. Usually no bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with sites.
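The random pause can be wrapped in a small helper, sketched below. The name polite_pause and the injectable sleeper argument are assumptions made for this illustration. They are not part of the script, which calls sleep(randint(...)) directly.

```python
from random import randint
from time import sleep

def polite_pause(low, high, sleeper=sleep):
    """Sleep for a random whole number of seconds in [low, high] and return it.

    `sleeper` is injectable so the helper can be tried without really waiting.
    """
    seconds = randint(low, high)
    sleeper(seconds)
    return seconds

# Dry run: record the requested pauses instead of actually sleeping
recorded = []
waited = [polite_pause(45, 60, sleeper=recorded.append) for _ in range(5)]
print(waited)
```

In the bot itself you would call polite_pause(45, 60) with the default sleeper, so each search is separated by a slightly different delay.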

Let's do a small experiment and open kayak.com in a separate window. Choose the city you will fly from, the city you want to get to, and the flight dates. When choosing dates, make sure the +-3 days range is used. I wrote the code based on what the site returns in response to such queries. If, for example, you need to search for tickets only for the specified dates, there is a high chance you will have to modify the bot's code. I give the relevant explanations when discussing the code, but if you feel confused, let me know.

Now click the search button and look at the link in the address bar. It should look like the link I use in the example below, where the variable kayak, which stores the URL, is declared and the web driver's get method is used. After you click the search button, the results should appear on the page.
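As a rough illustration of how such a link is put together, the sketch below composes the flexible-date URL from its parts, following the pattern used in the start_kayak function later in this article. The function name build_search_url is hypothetical, and the site may have changed its URL format since the article was written.

```python
def build_search_url(city_from, city_to, date_start, date_end):
    """Compose a flexible-date search URL from IATA codes and YYYY-MM-DD dates."""
    return ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
            '/' + date_start + '-flexible/' + date_end +
            '-flexible?sort=bestflight_a')

url = build_search_url('LIS', 'SIN', '2019-08-21', '2019-09-07')
print(url)
```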

When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha check. You can pass this check manually and carry on experimenting until the system decides to run a new check. When I tested the script, the first search session always seemed to go smoothly, so if you want to experiment with the code, you only need to pass the check manually from time to time and let the code run, using long intervals between searches. And, if you think about it, a person is unlikely to need ticket prices collected at 10-minute intervals between searches.

Working with the page using XPath

So, we have opened a window and loaded the site. To get prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and did not feel the need to use CSS selectors, but it is quite possible to work that way. Navigating a page with XPath can be tricky, and although the techniques I described in this article involved copying the corresponding identifiers from the page code, I realized that this is, in fact, not the best way to reach the elements you need. By the way, this book gives an excellent description of the basics of working with pages using XPath and CSS selectors. Here is what the corresponding web driver method looks like.

So, let's keep working on the bot. Let's use the program's ability to select the cheapest tickets. In the following picture, the XPath selector code is highlighted in red. To view the code, right-click the page element you are interested in and choose the Inspect command from the menu that appears. This command can be invoked for different page elements, whose code will be displayed and highlighted in the code viewer.

Viewing the page code

To back up my argument about the drawbacks of copying selectors from the code, consider the following.

This is what you get when you copy the code:

//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

To copy something like this, right-click the piece of code you are interested in and choose the Copy > Copy XPath command from the menu that appears.

Here is what I used to define the Cheapest button:

cheap_results = '//a[@data-code = "price"]'

The Copy > Copy XPath command

It is quite obvious that the second option looks much simpler. It looks for an element a that has the attribute data-code equal to price. With the first option, an element is looked up whose id equals wtKI-price_aTab, and the XPath path to the target element looks like /div[1]/div/div/div[1]/div/span/span. An XPath query like this will do the trick, but only once. I can say right now that the id will change the next time the page loads. The character sequence wtKI changes dynamically every time the page is loaded, so code that uses it will be useless after the next page reload. So take the time to understand XPath. This knowledge will serve you well.
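The difference between the two selectors can be tried out without a browser. The sketch below uses the standard library's ElementTree, which supports only a subset of XPath, as a stand-in for the web driver. The two markup snippets and the second id prefix (xR2p) are made up to imitate a page before and after a reload.

```python
import xml.etree.ElementTree as ET

# Two made-up snapshots of the same page fragment; only the generated id prefix differs
FIRST_LOAD = '<div><a id="wtKI-price_aTab" data-code="price">Cheapest</a></div>'
AFTER_RELOAD = '<div><a id="xR2p-price_aTab" data-code="price">Cheapest</a></div>'

BY_ID = ".//*[@id='wtKI-price_aTab']"      # copied-style selector tied to a generated id
BY_ATTRIBUTE = ".//a[@data-code='price']"  # selector tied to a stable attribute

def count_matches(markup, xpath):
    """Count the elements in `markup` matched by `xpath`."""
    return len(ET.fromstring(markup).findall(xpath))

for name, markup in (('first load', FIRST_LOAD), ('after reload', AFTER_RELOAD)):
    print(name, count_matches(markup, BY_ID), count_matches(markup, BY_ATTRIBUTE))
```

The id-based selector stops matching once the prefix changes, while the attribute-based one keeps working on both snapshots.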

That said, copying XPath selectors can be handy when working with fairly simple sites, and if that works for you, there is nothing wrong with it.

Now let's think about what to do if you need to get all the search results as several strings in a list. It is very simple. Each result sits inside an object with the class resultWrapper. All of the results can be collected with a simple loop.

It should be noted that if you understand the above, you should have no trouble understanding most of the code we are going to look at. As this code runs, we reach what we need (in fact, the element the result is wrapped in) using some form of path-specifying mechanism (XPath). This is done in order to get the element's text and put it into an object from which the data can be read (first flight_containers is used, then flights_list).
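A minimal sketch of such a loop is shown here. It runs on a made-up page fragment and uses the standard library's ElementTree as a stand-in for the live page and the web driver, so the markup and the prices are assumptions. Only the class name resultWrapper and the variable names flight_containers and flights_list come from the text above.

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for the results page
PAGE = """
<div>
  <div class="resultWrapper"><span>LIS - SIN</span> <span>$512</span></div>
  <div class="resultWrapper"><span>LIS - SIN</span> <span>$547</span></div>
  <div class="other">advert</div>
</div>
"""

root = ET.fromstring(PAGE.strip())
# First collect the wrapping elements, then read the text of each one
flight_containers = root.findall(".//div[@class='resultWrapper']")
flights_list = [' '.join(''.join(flight.itertext()).split())
                for flight in flight_containers]
print(flights_list[0:3])
```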

The first three rows are displayed, and we can clearly see everything we need. However, we have more interesting ways of getting the information. We need to take the data from each element separately.

Get to work!

The easiest function to write is the one that loads additional results, so that is where we will start. I would like to maximize the number of flights the program gets information about without raising suspicion in the service that leads to a check, so I click the Load more results button once each time the page is displayed. In this code, pay attention to the try block, which I added because sometimes the button does not load properly. If you run into this too, comment out the calls to this function in the code of the start_kayak function, which we will look at below.

# Load more results to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these notes while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45,60))
    except Exception:
        # The button sometimes fails to load; in that case just carry on without more results
        pass

Now, after this lengthy breakdown of the function (sometimes I get carried away), we are ready to declare the function that will scrape the page.

I have already gathered most of what is needed in the next function, called page_scrape. Sometimes the returned data for the two legs of the route comes combined, so I use a simple method to split it. This happens, for example, the first time the variables section_a_list and section_b_list are used. Our function returns the data frame flights_df, which lets us keep the results obtained with different sorting methods separate and combine them later.
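The splitting relies on plain list slicing, which is easy to try on its own. In the sketch below the section texts are made up, and the real strings returned by the site may look different. The helper split_section is a hypothetical name that mirrors the split() and join() calls used inside page_scrape.

```python
# Made-up section texts in page order: outbound, return, outbound, return, ...
sections_list = ['14h 20m LIS - SIN', '16h 05m SIN - LIS',
                 '17h 45m LIS - SIN', '15h 30m SIN - LIS']

section_a_list = sections_list[::2]   # items 0, 2, ... are outbound legs
section_b_list = sections_list[1::2]  # items 1, 3, ... are return legs

def split_section(text):
    """Split 'duration cities' text into (duration, cities) the way page_scrape does."""
    words = text.split()
    return ''.join(words[0:2]), ''.join(words[2:5])

a_parsed = [split_section(text) for text in section_a_list]
b_parsed = [split_section(text) for text in section_b_list]
print(a_parsed)
print(b_parsed)
```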

def page_scrape():
    """This function takes care of the scraping part"""
    
    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2] # this is how we separate the information about the two flights
    section_b_list = sections_list[1::2]
    
    # If you run into reCaptcha, you may need to do something about it.
    # You will know something went wrong because the lists above are empty.
    # This if statement lets you stop the program or do something else:
    # you could pause here, which would let you pass the check and continue scraping.
    # I use SystemExit here because I want to test everything from the very start.
    if section_a_list == []:
        raise SystemExit
    
    # I will use the letter A for outbound flights and B for return flights
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # Get the route cities and the duration
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # Get the route cities and the duration
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Get the day and the weekday
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]
    
    # Get the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$','') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is a big list in which the first leg sits at even indexes and the second leg at odd ones
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n','0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]
    
    # carrier information and departure and arrival times for both legs
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # split the time and carrier information between legs a and b
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    
    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
            'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
            'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,                           
                               'Price': prices_list})[cols]
    
    flights_df['timestamp'] = strftime("%Y%m%d-%H%M") # time the data was collected
    return flights_df

I tried to name the variables so that the code would be easy to follow. Remember that variables starting with a refer to the first leg of the route, and those starting with b to the second. Let's move on to the next function.

Helper mechanisms

We now have a function that loads additional search results and a function that processes those results. The article could have ended here, since these two functions provide everything you need to scrape pages that you open yourself. But we have not yet looked at some of the helper mechanisms mentioned above, for example the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will look at now.

For this function to work, it needs information about the cities and dates. Using this information, it builds a link in the variable kayak, which is used to go to a page containing search results sorted by best match to the query. After the first scraping session, we will work with the prices in the table at the top of the page. Namely, we will find the minimum ticket price and the average price. All this, together with the prediction issued by the site, will be sent by email. On the page, the corresponding table should be in the upper left corner. Working with this table, by the way, can cause an error when searching with exact dates, since in that case the table is not displayed on the page.
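Because the table may be missing or contain empty cells, it can help to compute the minimum and the average defensively. The sketch below is one possible way to do that and is not what start_kayak does. The function name summarize_prices and the sample cell texts are assumptions.

```python
def summarize_prices(cell_texts):
    """Return (minimum, average) of '$'-prefixed price strings, or None if none parse."""
    prices = []
    for text in cell_texts:
        cleaned = text.replace('$', '').replace(',', '').strip()
        if cleaned.isdigit():  # skip empty cells and anything that is not a plain number
            prices.append(int(cleaned))
    if not prices:
        return None
    return min(prices), sum(prices) / len(prices)

print(summarize_prices(['$512', '', '$1,204', '$498']))
print(summarize_prices([]))
```

Returning None for an empty table lets the caller decide whether to skip the email or send it without the price summary.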

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format -  YYYY-MM-DD"""
    
    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))
    
    # sometimes a popup appears; a try block can be used to check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')
    
#     load_more()
    
    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))
    
    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)
    
    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))
    
    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()  
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))
    
    # Save the combined frame to an Excel file whose name reflects the cities and dates
    final_df = pd.concat([df_flights_cheap, df_flights_best, df_flights_fast])
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to, 
                                                                                   date_start, date_end), index=False)
    print('saved df.....')
    
    # You can keep track of how the site's prediction compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading+'\n'+prediction)
    
    # sometimes the loading variable holds this string, which later causes problems when sending the email
    # if that happens, replace it with "Not sure"
    weird = '¯_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'
    
    username = '[email protected]'
    password = 'YOUR PASSWORD'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\nCheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'.format(matrix_min, matrix_avg, (loading+'\n'+prediction)))
    message = MIMEMultipart()
    message['From'] = '[email protected]'
    message['to'] = '[email protected]'
    server.sendmail('[email protected]', '[email protected]', msg)
    print('sent email.....')

I tested this script using an Outlook account (hotmail.com). I have not checked whether it works correctly with a Gmail account, which is a very popular email system, but there are many possible options. If you use a Hotmail account, then for everything to work you only need to enter your details in the code.
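The function above builds the message as a plain string. As a possible alternative, the standard library's EmailMessage class takes care of headers and of non-ASCII text such as the odd string mentioned in the code comments. The sketch below only builds the message and does not send it. The function name build_summary_email and the example.com addresses are placeholders invented for this illustration.

```python
from email.message import EmailMessage

def build_summary_email(sender, recipient, cheapest, average, recommendation):
    """Build (but do not send) the summary email as an EmailMessage."""
    message = EmailMessage()
    message['Subject'] = 'Flight Scraper'
    message['From'] = sender
    message['To'] = recipient
    message.set_content(
        'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'
        .format(cheapest, average, recommendation))
    return message

email = build_summary_email('me@example.com', 'me@example.com',
                            498, 738.0, '¯_(ツ)_/¯')
print(email['Subject'])
print(email.get_content())
```

A message built this way could then be handed to the send_message method of an smtplib.SMTP connection after logging in.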

If you want to understand exactly what is done in particular parts of the code of this function, you can copy them and experiment with them. Experimenting with code is the only way to really understand it.

The finished system

Now that we have done everything we talked about, we can create a simple loop that calls our functions. The script asks the user for the cities and dates. While testing with constant restarts of the script, you are unlikely to want to enter this data manually every time, so for the duration of testing the corresponding lines can be commented out and the lines below them, in which the data the script needs is hardcoded, uncommented.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))
    
    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')
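Since the values typed in by the user go straight into the search URL, it may be worth checking them before starting the loop. The sketch below is an optional addition under that assumption. The function name valid_search_input and its rules (three uppercase letters for a code, parseable and ordered dates) are not part of the script.

```python
from datetime import datetime

def valid_search_input(city_from, city_to, date_start, date_end):
    """Return True if the codes look like IATA codes and the dates parse and are ordered."""
    for code in (city_from, city_to):
        if not (len(code) == 3 and code.isalpha() and code.isupper()):
            return False
    try:
        start = datetime.strptime(date_start, '%Y-%m-%d')
        end = datetime.strptime(date_end, '%Y-%m-%d')
    except ValueError:
        return False
    return start <= end

print(valid_search_input('LIS', 'SIN', '2019-08-21', '2019-09-07'))
print(valid_search_input('LIS', 'SIN', '21/08/2019', '2019-09-07'))
```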

This is what a test run of the script looks like.
Test run of the script

Summary

If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something else to get results from several servers at the same time. There is also the recurring problem of the site checking whether the user is a human, but this problem can be solved too. In any case, you now have a base that you can extend if you wish. For example, you could have the Excel file sent to the user as an attachment to the email.
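The last suggestion can be sketched with the standard library's EmailMessage class. The helper name attach_report is hypothetical, and the bytes attached here are a placeholder. In practice they would be read from the Excel file the script saves.

```python
from email.message import EmailMessage

# MIME type commonly used for .xlsx files
XLSX_MIME = ('application', 'vnd.openxmlformats-officedocument.spreadsheetml.sheet')

def attach_report(message, data, filename):
    """Attach the bytes of an Excel report to an EmailMessage and return the message."""
    maintype, subtype = XLSX_MIME
    message.add_attachment(data, maintype=maintype, subtype=subtype, filename=filename)
    return message

message = EmailMessage()
message['Subject'] = 'Flight Scraper'
message.set_content('Search results attached.')
# Placeholder bytes; in practice these would be read from the saved .xlsx file
attach_report(message, b'placeholder bytes', 'flights.xlsx')
print([part.get_filename() for part in message.iter_attachments()])
```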

Source: www.habr.com
