Python: a helper for finding cheap airline tickets for people who love to travel

The author of the article, whose translation we are publishing today, says that its goal is to describe the development of a web scraper in Python, using Selenium, that searches for airline ticket prices. Tickets are searched for with flexible dates (±3 days relative to the specified dates). The scraper saves the search results to an Excel file and sends the person who started the search an email summarizing what it found. The aim of the project is to help travelers find the best deals.

If you feel lost while working through the material, take a look at this article.

What are we going to look for?

You are free to use the system described here however you like. For example, I used it to look for weekend trips and for tickets to my hometown. If you are serious about finding good-value tickets, you can run the script on a server (a simple server, at 130 rubles a month, is quite suitable for this) and have it run once or twice a day. The search results will be sent to you by email. I also recommend setting everything up so that the script saves the Excel file with the search results to a Dropbox folder, which lets you view these files from anywhere and at any time.

I have not yet come across error fares, but I believe it is possible

As already mentioned, the search uses "flexible dates": the script finds offers that fall within three days of the given dates. Although a single run of the script searches for offers on one route only, it is easy to modify so that it collects data for several routes. With its help you can even look for error fares; such finds can be very interesting.

Why do you need yet another web scraper?

When I first started web scraping, I honestly was not particularly interested in it. I wanted to do more projects in predictive modeling, financial analysis and, possibly, sentiment analysis of texts. But it turned out to be very interesting to figure out how to build a program that collects data from websites. As I delved into the topic, I realized that web scraping is the "engine" of the Internet.

You may think this is too bold a statement. But consider that Google started with a web scraper that Larry Page created using Java and Python. Google's robots have been crawling the Internet ever since, trying to give users the best answers to their questions. Web scraping has endless uses, and even if you are interested in some other area of Data Science, you will need some scraping skills to get the data you want to analyze.

I found some of the techniques used here in a wonderful book about web scraping that I acquired recently. It contains many simple examples and ideas for putting what you have learned into practice. It also has a very interesting chapter on bypassing reCaptcha checks. That came as news to me, since I did not even know that there were special tools, and even entire services, for solving such problems.

Do you like to travel?!

The simple and harmless question in the title of this section usually gets a positive answer, accompanied by a couple of travel stories from the person asked. Most of us would agree that traveling is a great way to immerse yourself in new cultures and broaden your horizons. However, if you ask someone whether they like searching for airline tickets, I am sure the answer will be far less positive. This is exactly where Python comes to our aid.

The first task on the way to building a system that searches for airline ticket information is choosing a suitable platform to take the data from. This was not easy for me, but in the end I chose the Kayak service. I tried Momondo, Skyscanner, Expedia and a few others, but the anti-robot protection on those sites was impenetrable. After several attempts, during which I had to deal with traffic lights, pedestrian crossings and bicycles while trying to convince the systems that I was human, I decided that Kayak suited me best, even though there too the checks begin if too many pages are loaded in a short time. I managed to make the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties do arise with Kayak from time to time, but if it starts pestering you with checks, you need to either pass them manually and then restart the bot, or wait a few hours until the checks stop. If necessary, you can adapt the code to another platform, and if you do, you can report it in the comments.

If you are just getting started with web scraping and do not know why some websites fight against it, then before starting your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you expect if you scrape unwisely.

Getting started

Here is a general overview of what will happen in our web scraper code:

  • Import the required libraries.
  • Open a Google Chrome tab.
  • Call the function that starts the bot, passing it the cities and dates to use when searching for tickets.
  • This function takes the first search results, sorted by best match, and clicks the button that loads more results.
  • Another function collects data from the whole page and returns a data frame.
  • The two previous steps are repeated with sorting by ticket price (cheapest) and by flight duration (fastest).
  • The user of the script is sent an email with a summary of ticket prices (the cheapest ticket and the average price), and the data frame with the information sorted by the three criteria above is saved as an Excel file.
  • All of the above is performed in a loop after a specified period of time.

It should be noted that every Selenium project starts with a web driver. I use Chromedriver and work with Google Chrome, but there are other options; PhantomJS and Firefox are also popular. After downloading the driver, you need to place it in an appropriate folder, and that completes the preparation. The first lines of our script open a new Chrome tab.

Keep in mind that in this story I am not trying to open new horizons in finding great deals on airline tickets. There are far more advanced ways of searching for such offers. I just want to offer readers a simple but practical way to solve this problem.

Here is the code we talked about above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

# Put your own path to chromedriver here!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

# This command opens a Chrome window.
# Note: the code in this article targets the Selenium 3 API; the executable_path argument and the
# find_element(s)_by_xpath helpers used further on were removed in later Selenium 4 releases.
driver = webdriver.Chrome(executable_path=chromedriver_path)
sleep(2)

At the beginning of the code you can see the package imports that are used throughout the project. For example, randint is used to make the bot "fall asleep" for a random number of seconds before starting a new search operation. Hardly any bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with sites.
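The random pause can be wrapped in a small helper. This is only a sketch: the name random_pause and its sleep_fn parameter are assumptions made for illustration and do not appear in the article's code, which calls sleep(randint(...)) directly.

```python
from random import randint
from time import sleep


def random_pause(min_seconds, max_seconds, sleep_fn=sleep):
    """Sleep for a random whole number of seconds in [min_seconds, max_seconds].

    sleep_fn is a hypothetical hook (not in the article) that lets the
    waiting be swapped out for a dry run; by default time.sleep is used.
    """
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    delay = randint(min_seconds, max_seconds)  # both bounds are inclusive
    sleep_fn(delay)
    return delay


# Dry run: picks a delay between 45 and 60 seconds without actually waiting.
delay = random_pause(45, 60, sleep_fn=lambda seconds: None)
print("would have slept for", delay, "seconds")
```

Keeping the bounds as parameters makes it easy to use short pauses between clicks and much longer ones between whole search sessions.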

Let's do a small experiment and open kayak.com in a separate window. Select the city you are flying from, the city you want to reach, and the flight dates. When choosing the dates, make sure the ±3 days range is enabled. I wrote the code with the site's response to such requests in mind. If, for example, you need to search for tickets on exact dates only, it is very likely that you will have to modify the bot's code. I give explanations as I go through the code, but if something is unclear, let me know.

Now click the search button and look at the link in the address bar. It should look like the link I use in the example below, where the variable kayak, which stores the URL, is declared and the web driver's get method is called. After you click the search button, the results should appear on the page.
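A minimal sketch of how such a link can be assembled, following the URL pattern used by the start_kayak function later in the article. The helper name build_kayak_url is hypothetical, and the cities and dates are the sample values that appear in the article's final script.

```python
def build_kayak_url(city_from, city_to, date_start, date_end):
    """Assemble a flexible-date search URL in the shape the article uses.

    city_from and city_to are IATA codes; dates are YYYY-MM-DD strings.
    """
    return (
        "https://www.kayak.com/flights/"
        f"{city_from}-{city_to}/"
        f"{date_start}-flexible/{date_end}-flexible"
        "?sort=bestflight_a"
    )


kayak = build_kayak_url("LIS", "SIN", "2019-08-21", "2019-09-07")
print(kayak)
# driver.get(kayak)  # would load the page in the Chrome window opened earlier
```

The "-flexible" suffix after each date is what corresponds to the ±3 days option chosen in the search form.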

When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha check. You can pass this check manually and keep experimenting until the system decides to run a new one. When I tested the script, the first search session always seemed to go smoothly, so if you want to experiment with the code, you only need to pass the check manually from time to time and leave the code running with long intervals between search sessions. And, if you think about it, a person is unlikely to need ticket prices collected at 10-minute intervals.

Working with the page using XPath

So, we have opened the window and loaded the site. To get prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and felt no need for CSS selectors, but it is entirely possible to work that way too. Navigating a page with XPath can be tricky, and although I once used the technique described in this article, which involves copying the relevant identifiers from the page code, I came to realize that this is not, in fact, the best way to reach the elements you need. By the way, this book gives an excellent description of the basics of working with pages using XPath and CSS selectors. The corresponding web driver method in the code that follows is find_element_by_xpath.

So, let's continue working on the bot. Let's use the program to select the cheapest tickets. In the image below, the XPath selector code is highlighted in red. To view the code, right-click the page element you are interested in and choose Inspect from the context menu. This command can be invoked on different page elements, and their code will be shown and highlighted in the code viewer.

Viewing the page code

To see why copying selectors straight from the page code has its drawbacks, pay attention to the following.

Here is what you get when you copy the code:

//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

To copy something like this, right-click the piece of code you are interested in and choose Copy > Copy XPath from the context menu.

Here is what I used to define the Cheapest button:

cheap_results = '//a[@data-code = "price"]'

The Copy > Copy XPath command

It is quite obvious that the second option looks much simpler. It searches for an element a whose data-code attribute equals price. The first option searches for an element whose id equals wtKI-price_aTab, and the XPath path to the target looks like /div[1]/div/div/div[1]/div/span/span. An XPath query like this will do the job, but only once. I can tell you right now that the id will change the next time the page is loaded. The character sequence wtKI changes dynamically on every page load, so code that relies on it becomes useless after the next reload. So take the time to understand XPath. This knowledge will serve you well.
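The difference can be reproduced without a browser. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath, on a tiny invented piece of markup; the page structure and the second prefix "Xq7c" are assumptions made up for the demonstration and are not taken from Kayak.

```python
import xml.etree.ElementTree as ET

# Invented markup: the id carries a prefix that changes on every page load.
PAGE_TEMPLATE = """
<html><body>
  <div id="{prefix}-price_aTab">
    <a data-code="price">Cheapest</a>
  </div>
  <div id="{prefix}-duration_aTab">
    <a data-code="duration">Quickest</a>
  </div>
</body></html>
"""


def find_first(page_source, xpath):
    """Return the first element matching xpath, or None if nothing matches."""
    root = ET.fromstring(page_source)
    return root.find(xpath)


first_load = PAGE_TEMPLATE.format(prefix="wtKI")
second_load = PAGE_TEMPLATE.format(prefix="Xq7c")  # hypothetical new prefix

copied_xpath = ".//*[@id='wtKI-price_aTab']/a"   # tied to the dynamic id
attribute_xpath = ".//a[@data-code='price']"      # tied to a stable attribute

print(find_first(first_load, copied_xpath) is not None)    # works on the first load
print(find_first(second_load, copied_xpath) is not None)   # breaks after a reload
print(find_first(second_load, attribute_xpath).text)       # still finds the link
```

The selector tied to the generated id matches only the page it was copied from, while the attribute-based selector keeps working as long as the data-code attribute stays the same.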

That said, copying XPath selectors can come in handy when working with fairly simple sites, and if that suits you, there is nothing wrong with it.

Now let's think about what to do if you need all the search results as a list of strings. It is very simple. Each result sits inside an object with the class resultWrapper. All the results can be loaded in a loop similar to the one shown below.

If you understand the above, you should easily understand most of the code we are going to analyze. As this code runs, we reach what we need (in fact, the element in which the result is wrapped) using some path-specifying mechanism (XPath). This is done in order to get the text of the element and put it into an object from which the data can be read (first flight_containers is used, then flights_list).
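The loop described above might look roughly like this. It is a sketch, not the author's exact code: the function name collect_result_texts is hypothetical, and it calls the generic find_elements("xpath", ...) form, an assumption chosen because that method exists in both Selenium 3 and Selenium 4 (the article's own code uses the Selenium 3 helper find_elements_by_xpath). A stand-in driver object is used so that the example runs without a browser.

```python
from types import SimpleNamespace


def collect_result_texts(driver, class_name="resultWrapper"):
    """Return the visible text of every element that has the given class."""
    xp_results_table = f'//*[@class = "{class_name}"]'
    flight_containers = driver.find_elements("xpath", xp_results_table)
    flights_list = [flight.text for flight in flight_containers]
    return flights_list


# Stand-in for a real Selenium driver, only so the sketch runs on its own.
fake_driver = SimpleNamespace(
    find_elements=lambda by, value: [
        SimpleNamespace(text="result 1 text"),
        SimpleNamespace(text="result 2 text"),
    ]
)

# Look at the first three results, as one would in an interactive session.
print(collect_result_texts(fake_driver)[0:3])
```

With a real driver, each string in flights_list holds the whole text of one result card, which is why the later code pulls individual fields out of more specific elements instead.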

The first three rows are displayed, and we can clearly see everything we need. However, we have better ways of getting the information. We need to take the data from each element separately.

Let's get to work!

The easiest function to write is the one that loads additional results, so that is where we will start. I want to maximize the number of flights the program collects information about without arousing the kind of suspicion that triggers a check, so I click the Load more results button once each time the page is displayed. In this code, pay attention to the try block, which I added because the button sometimes fails to load properly. If you run into this as well, comment out the calls to this function in the code of the start_kayak function, which we will look at below.

# Load more results to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these notes while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45,60))
    except Exception:
        # Sometimes the button does not load properly; in that case skip loading more results
        pass

Now, after a lengthy analysis of this function (I sometimes get carried away), we are ready to declare the function that will scrape the page.

I have already gathered most of what is needed in the following function, called page_scrape. Sometimes the returned data about the legs of the route comes combined, so I use a simple method to separate it, for example where the variables section_a_list and section_b_list are first used. The function returns a data frame, flights_df, which lets us keep the results obtained with different sort orders separate and merge them later.

def page_scrape():
    """This function takes care of the scraping part"""
    
    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2] # this is how we separate the information about the two flights
    section_b_list = sections_list[1::2]
    
    # If you run into a reCaptcha, you may need to do something about it.
    # You will know that something went wrong because the lists above are empty;
    # this if statement lets you exit the program or do something else.
    # You could pause here, which would let you pass the check and continue scraping.
    # I use SystemExit here because I want to test everything from the very beginning.
    if section_a_list == []:
        raise SystemExit
    
    # I will use the letter A for outbound flights and B for return flights
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # Get the time
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # Get the time
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Get the day and the weekday
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]
    
    # Get the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$','') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is one long list in which the first leg sits at even indexes and the second leg at odd indexes
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n','0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]
    
    # carrier, departure and arrival times for both legs
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        # the first line of the text holds the times, the second line the carrier
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # split the times and the carriers between legs a and b (even indexes are leg a, odd are leg b)
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    
    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
            'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
            'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,                           
                               'Price': prices_list})[cols]
    
    flights_df['timestamp'] = strftime("%Y%m%d-%H%M") # time the data was collected
    return flights_df

I tried to name the variables so that the code would be easy to understand. Remember that variables starting with a belong to the first leg of the route, and those starting with b to the second. Let's move on to the next function.
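The a/b split used throughout page_scrape relies on the page listing the two legs of each result one after the other. A small sketch of the idea, with made-up sample strings (the real text returned by the site may look different) and hypothetical helper names split_legs and parse_section:

```python
def split_legs(values):
    """Split an interleaved list into (first-leg, second-leg) halves."""
    return values[::2], values[1::2]


def parse_section(text):
    """Split one cell into (duration, cities) the same way page_scrape does."""
    words = text.split()
    duration = "".join(words[0:2])
    cities = "".join(words[2:5])
    return duration, cities


# Made-up sample in the shape the slicing implies: duration first, then cities.
sections_list = [
    "14h 35m LIS - SIN",  # result 1, leg a
    "16h 05m SIN - LIS",  # result 1, leg b
    "15h 10m LIS - SIN",  # result 2, leg a
    "17h 20m SIN - LIS",  # result 2, leg b
]

section_a_list, section_b_list = split_legs(sections_list)
print([parse_section(section) for section in section_a_list])
print([parse_section(section) for section in section_b_list])
```

Even indexes always end up in the a lists and odd indexes in the b lists, so every pair of lists built this way has to use the same slicing.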

Supporting mechanisms

We now have a function that loads additional search results and a function that processes those results. The article could end here, since these two functions provide everything you need to scrape pages that you open yourself. But we have not yet looked at some of the supporting mechanisms mentioned above, such as the code that sends emails. All of this can be found in the start_kayak function, which we will look at now.

This function needs the cities and the dates. Using them, it builds a link in the variable kayak, which is used to open a page with the search results sorted by best match. After the first scraping session, we work with the prices in the table at the top of the page: we find the minimum ticket price and the average price. All of this, together with the prediction given by the site, is sent by email. On the page, this table should be in the upper left corner. Working with this table, by the way, can cause an error when searching with exact dates, since in that case the table is not shown on the page.

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format -  YYYY-MM-DD"""
    
    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))
    
    # sometimes a popup appears; a try block can be used to check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')
    
#     load_more()
    
    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))
    
    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)
    
    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))
    
    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()  
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))
    
    # Save the combined frame to an Excel file whose name reflects the cities and dates
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    final_df = pd.concat([df_flights_cheap, df_flights_best, df_flights_fast])
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to, 
                                                                                   date_start, date_end), index=False)
    print('saved df.....')
    
    # You can track how the forecast given by the site compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading+'\n'+prediction)
    
    # sometimes the loading variable contains this string, which later causes problems when sending the email;
    # if that happens, replace it with "Not sure"
    weird = '¯\\_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'
    
    # Placeholders: put your own address, password and recipient here
    username = 'YOUR_EMAIL@hotmail.com'
    password = 'YOUR PASSWORD'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\nCheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'.format(matrix_min, matrix_avg, (loading+'\n'+prediction)))
    message = MIMEMultipart()
    message['From'] = 'YOUR_EMAIL@hotmail.com'
    message['to'] = 'RECIPIENT_EMAIL@example.com'
    server.sendmail('YOUR_EMAIL@hotmail.com', 'RECIPIENT_EMAIL@example.com', msg)
    print('sent email.....')

I tested this script with an Outlook account (hotmail.com). I have not checked whether it works correctly with a Gmail account; that mail system is very popular, but there are plenty of other options. If you use a Hotmail account, all you need to do to make everything work is enter your details in the code.

If you want to understand exactly what particular parts of this function's code do, you can copy them and experiment with them. Experimenting with code is the only way to really understand it.
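One part that is easy to try in isolation is the price summary taken from the table at the top of the page. The sketch below is an assumption-laden variant rather than the article's code: the function name summarize_prices is hypothetical, and skipping empty cells and stripping thousands separators are additions that the original function does not make.

```python
def summarize_prices(cell_texts):
    """Return (minimum, average) for price strings such as '$650'.

    Cells that are empty or not numeric are skipped. (None, None) is
    returned when no usable price is left.
    """
    prices = []
    for text in cell_texts:
        cleaned = text.replace("$", "").replace(",", "").strip()
        if cleaned.isdigit():
            prices.append(int(cleaned))
    if not prices:
        return None, None
    return min(prices), sum(prices) / len(prices)


# Sample cell texts, including an empty cell that plain int() would choke on.
matrix_min, matrix_avg = summarize_prices(["$650", "$700", "", "$1,024"])
print(matrix_min, round(matrix_avg, 2))
```

Returning (None, None) instead of raising makes it possible to send the email even when the table is missing, for example when searching with exact dates.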

The finished system

Now that we have done everything we talked about, we can create a simple loop that calls our functions. The script asks the user for the cities and dates. When you are testing and constantly restarting the script, you will hardly want to enter this data by hand every time, so for the duration of testing the corresponding lines can be commented out and the lines below them, in which the data is hardcoded, uncommented.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))
    
    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')

This is what a test run of the script looks like.
Test run of the script
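The prompts in the script accept any string, so a typo in a date only shows up once the site returns an unexpected page. A minimal sketch of a check that could run before the loop starts; the function names parse_search_date and validate_trip are hypothetical and not part of the article's code.

```python
from datetime import datetime


def parse_search_date(text):
    """Return a date parsed from a YYYY-MM-DD string, or raise ValueError."""
    return datetime.strptime(text.strip(), "%Y-%m-%d").date()


def validate_trip(date_start, date_end):
    """Parse both dates and make sure the return is not before the departure."""
    start = parse_search_date(date_start)
    end = parse_search_date(date_end)
    if end < start:
        raise ValueError("return date is earlier than departure date")
    return start, end


# The hardcoded sample dates from the script.
start, end = validate_trip("2019-08-21", "2019-09-07")
print("trip length in days:", (end - start).days)
```

Calling validate_trip(date_start, date_end) right after the input lines would stop the script with a clear error before the browser makes any request.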

Results

If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something similar to collect results from several servers at once. There is also the recurring problem of the site checking whether the user is a person, but that can be solved too. In any case, you now have a base that you can extend if you wish. For example, you could make the script send the Excel file to the user as an email attachment.
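The attachment idea can be sketched with the standard library's email.message.EmailMessage. This is an illustration under assumptions, not the author's implementation: the function name build_report_email, the example.com addresses and the placeholder bytes are all invented, and in a real run the bytes would be read from the saved .xlsx file.

```python
from email.message import EmailMessage

XLSX_SUBTYPE = "vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def build_report_email(sender, recipient, matrix_min, matrix_avg,
                       recommendation, excel_bytes=None,
                       excel_name="flights.xlsx"):
    """Build the summary email, optionally attaching the Excel report."""
    message = EmailMessage()
    message["Subject"] = "Flight Scraper"
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        f"Cheapest Flight: {matrix_min}\n"
        f"Average Price: {matrix_avg}\n\n"
        f"Recommendation: {recommendation}\n\n"
        "End of message"
    )
    if excel_bytes is not None:
        # Adding an attachment turns the message into a multipart one.
        message.add_attachment(excel_bytes, maintype="application",
                               subtype=XLSX_SUBTYPE, filename=excel_name)
    return message


# Placeholder addresses and bytes, for illustration only.
msg = build_report_email("me@example.com", "you@example.com", 650, 791.33,
                         "Not sure", excel_bytes=b"placeholder bytes")
print(msg["Subject"], msg.is_multipart())
# server.send_message(msg)  # would send it over an already logged-in smtplib.SMTP connection
```

Building the message object separately from the SMTP session also makes it possible to inspect the email without sending anything.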


Source: www.hab.com
