The author of the article we are publishing in translation today says that its goal is to describe how to build a web scraper in Python with Selenium that searches for airline ticket prices. The search uses flexible dates (±3 days around the specified dates). The scraper saves the search results to an Excel file and emails the person who ran the search a summary of what it found. The aim of the project is to help travelers find the best deals.
If you feel lost while working through the material, take a look at
What are we going to look for?
You are free to use the program described here however you like. For example, I used it to look for weekend trips and for tickets to my home town. If you are serious about finding good-value tickets, you can run the script on a server. I have not found any error fares yet, but I think it is possible.
As already mentioned, the search uses "flexible dates": the script finds offers within three days of the given dates. Although a single run of the script searches only one route, it is easy to modify so that it collects data for several routes. With its help you can even look for error fares, and such finds can be very interesting.
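To make the ±3 day window concrete, here is a minimal sketch that lists the dates such a search covers. The helper name `flexible_window` is hypothetical and is not part of the scraper itself; the site computes this window on its own.

```python
from datetime import date, timedelta


def flexible_window(day, flex_days=3):
    """Return every date within flex_days of day, earliest first."""
    center = date.fromisoformat(day)  # expects YYYY-MM-DD
    return [center + timedelta(days=offset)
            for offset in range(-flex_days, flex_days + 1)]


window = flexible_window('2019-08-21')
print(window[0], window[-1], len(window))
```

For a departure date of 2019-08-21 the window runs from 18 to 24 August, seven dates in total.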
Why do you need another web scraper?
When I first started web scraping, I honestly was not very interested in it. I wanted to do more projects in predictive modeling, financial analysis, and perhaps sentiment analysis of texts. But it turned out to be very interesting to work out how to build a program that collects data from websites. As I dug into the topic, I realized that web scraping is the "engine" of the Internet.
You may think this is too bold a statement. But consider that Google started with a web scraper that Larry Page built using Java and Python. Google's robots have been crawling the Internet, trying to give its users the best answers to their questions. Web scraping has endless applications, and even if you are interested in something else in data science, you will need some scraping skills to get the data you want to analyze.
I found some of the techniques used here in a wonderful
Do you like to travel?!
Ask the simple and harmless question in the title of this section and you will usually hear a positive answer, accompanied by a few stories from the person's trips. Most of us would agree that traveling is a great way to immerse yourself in a new cultural environment and broaden your horizons. However, if you ask someone whether they like searching for airline tickets, I am sure the answer will be far less positive. This is where Python comes to our aid.
The first task to solve on the way to a system that searches for airline ticket information is choosing a suitable platform to take the information from. This was not easy for me, but in the end I chose Kayak. I tried Momondo, Skyscanner, Expedia, and a few others, but the bot protection on those sites was impenetrable. After many attempts, during which I had to deal with traffic lights, pedestrian crossings, and bicycles while trying to convince the systems that I was human, I decided that Kayak suited me best, even though it too starts running checks if too many pages are loaded in a short time. I managed to have the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. From time to time difficulties arise with Kayak, but if it starts pestering you with checks, you either have to deal with them manually and restart the bot, or wait a few hours until the checks stop. If necessary, you can easily adapt the code to another platform, and if you do, you can report it in the comments.
If you are just getting started with web scraping and do not know why some websites fight it, then before you start your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you think if you scrape unwisely.
Getting started
Here is a general overview of what will happen in our web scraper code:
- Import the required libraries.
- Open a Google Chrome tab.
- Call a function that starts the bot, passing it the cities and dates to use when searching for tickets.
- This function takes the first search results, sorted by "best", and clicks the button that loads more results.
- Another function collects data from the whole page and returns a data frame.
- The two previous steps are repeated with the results sorted by ticket price (cheapest) and by flight duration (fastest).
- The user of the script is sent an email with a summary of the ticket prices (the cheapest ticket and the average price), and a data frame with the information sorted by the three criteria above is saved as an Excel file.
- All of the above actions are performed in a loop at set intervals.
Note that every Selenium project starts with a web driver. I use ChromeDriver with Google Chrome.
Keep in mind that I am not trying to open new horizons in finding good deals on airline tickets. There are far more advanced ways of searching for such offers. I just want to give readers of this article a simple but practical way of solving the problem.
Here is the code we talked about above.
from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart
# Use your own path to chromedriver here!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'
# This command opens a Chrome window (executable_path is the Selenium 3 API used throughout this article)
driver = webdriver.Chrome(executable_path=chromedriver_path)
sleep(2)
At the start of the code you can see the import statements for the packages used throughout the project. For example, randint is used to make the bot "sleep" for a random number of seconds before starting a new search operation. As a rule, no bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with sites.
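The random pause can be wrapped in a small helper. This is only a sketch: the name `polite_pause` and the injectable `sleeper` argument are assumptions made so the example can be tried without actually waiting; the article's own code simply calls `sleep(randint(...))` inline.

```python
from random import randint
from time import sleep


def polite_pause(min_seconds, max_seconds, sleeper=sleep):
    """Sleep for a random whole number of seconds in [min, max] and return it."""
    delay = randint(min_seconds, max_seconds)
    sleeper(delay)
    return delay


# Demonstration with a stand-in sleeper, so the example finishes instantly.
recorded = []
delay = polite_pause(45, 60, sleeper=recorded.append)
print('would have slept for', delay, 'seconds')
```

In the real bot you would call `polite_pause(45, 60)` and let it block.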
Let's do a small experiment and open kayak.com in a separate window. Choose the city you are flying from, the city you want to fly to, and the flight dates. When choosing the dates, make sure the ±3 days range is selected. I wrote the code with the site's response to such requests in mind. If, for example, you need to search for tickets on the exact dates only, you will most likely have to modify the bot's code. I give explanations as I discuss the code, but if you feel confused, let me know.
Now click the search button and look at the link in the address bar. It should resemble the link I use in the example below, where the variable kayak, which holds the URL, is declared and the web driver's get method is called. After you click the search button, the results should appear on the page.
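As an illustration of what that link looks like, the sketch below assembles it from its parts. The URL pattern is the one used in the article's start_kayak function; the helper name `build_kayak_url` is hypothetical, and the site may have changed its URL scheme since the article was written.

```python
def build_kayak_url(city_from, city_to, date_start, date_end, sort='bestflight_a'):
    """Build a flexible-date search URL from IATA codes and YYYY-MM-DD dates."""
    return ('https://www.kayak.com/flights/{}-{}/{}-flexible/{}-flexible?sort={}'
            .format(city_from, city_to, date_start, date_end, sort))


url = build_kayak_url('LIS', 'SIN', '2019-08-21', '2019-09-07')
print(url)
```

The `-flexible` suffix after each date is what requests the ±3 day range.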
When I called get more than two or three times within a few minutes, I was asked to pass a reCaptcha check. You can pass this check manually and continue experimenting until the system decides to run a new check. When I tested the script, the first search session always seemed to go smoothly, so if you want to experiment with the code, you will only need to pass the check manually from time to time and let the code run, leaving long intervals between search sessions. And if you think about it, a person is unlikely to need ticket prices collected at 10-minute intervals between searches.
Working with the page using XPath
So, we have opened a window and loaded the site. To get prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and did not feel the need for CSS selectors, but it is entirely possible to work that way. Navigating a page with XPath can be tricky, even with the techniques described here.
So, let's continue working on the bot. Let's use the program to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view the code, right-click the page element you are interested in and choose Inspect from the context menu. This command can be invoked on different page elements, whose code will be shown and highlighted in the code viewer.
Viewing the page code
To confirm my reasoning about the drawbacks of copying selectors from the code, pay attention to the following.
Here is what you get when you copy the code:
//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span
To copy something like this, right-click the section of code you are interested in and choose Copy > Copy XPath from the context menu.
Here is what I used to define the Cheapest button:
cheap_results = '//a[@data-code = "price"]'
The Copy > Copy XPath command
The second option clearly looks much simpler. It looks for an a element whose data-code attribute equals price. The first option looks for an element whose id equals wtKI-price_aTab, and the XPath path to the element looks like /div[1]/div/div/div[1]/div/span/span. An XPath query like this will do the trick, but only once. I can say right now that the id will change the next time the page loads. The character sequence wtKI changes dynamically on every page load, so code that uses it becomes useless after the next reload. So take the time to understand XPath. This knowledge will serve you well.
That said, copying XPath selectors can be useful when working with fairly simple sites, and if that works for you, there is nothing wrong with it.
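The difference between the two selectors can be tried without a browser. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and needs relative paths starting with `.//`, so the expressions are adapted from the ones above. Both markup snippets are hypothetical stand-ins for two loads of the same page, and the second id prefix is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same page: only the generated id prefix differs.
FIRST_LOAD = '<div><a id="wtKI-price_aTab" data-code="price"><span>Cheapest</span></a></div>'
RELOAD = '<div><a id="xR2p-price_aTab" data-code="price"><span>Cheapest</span></a></div>'

BRITTLE = ".//*[@id='wtKI-price_aTab']/span"  # tied to the generated id
ROBUST = ".//a[@data-code='price']/span"      # tied to a stable attribute


def find_text(markup, xpath):
    """Return the text of the first match, or None when nothing matches."""
    node = ET.fromstring(markup).find(xpath)
    return None if node is None else node.text


for name, xpath in (('brittle', BRITTLE), ('robust', ROBUST)):
    print(name, find_text(FIRST_LOAD, xpath), find_text(RELOAD, xpath))
```

The id-based selector matches the first snapshot only, while the attribute-based one keeps working after the id changes.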
Now let's think about what to do if you need to get all the search results as several strings inside a list. It is very simple. Each result sits inside an object with the class resultWrapper. All the results can be collected in a simple loop.
Note that if you understand the above, you should easily understand most of the code we will go through. As this code runs, we reach what we need (in fact, the element in which a result is wrapped) using some kind of path-specifying mechanism (XPath). This is done to get the element's text and put it into an object from which the data can be read (first flight_containers is used, then flights_list).
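A minimal sketch of such a loop follows. The variable names flight_containers and flights_list and the class resultWrapper come from the text; the exact XPath string is an assumption, as is the Selenium 3 style `find_elements_by_xpath` call, which matches the rest of the article's code. `FakeDriver` and `FakeElement` are hypothetical stand-ins so the example runs without a browser, and the result strings are made up.

```python
class FakeElement:
    """Stand-in for a Selenium WebElement; only the .text attribute is used."""
    def __init__(self, text):
        self.text = text


class FakeDriver:
    """Stand-in for the Selenium driver, returning canned elements."""
    def __init__(self, texts):
        self._texts = texts

    def find_elements_by_xpath(self, xpath):
        return [FakeElement(text) for text in self._texts]


def collect_results(driver):
    """Return the visible text of every result wrapper on the page."""
    xp_results_table = '//*[@class = "resultWrapper"]'  # assumed selector
    flight_containers = driver.find_elements_by_xpath(xp_results_table)
    flights_list = [flight.text for flight in flight_containers]
    return flights_list


demo_driver = FakeDriver(['result one', 'result two', 'result three', 'result four'])
flights_list = collect_results(demo_driver)
print(flights_list[0:3])  # the first three results
```

With a real driver, each string would hold all the text of one search result, which is why the next step extracts each field separately.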
The first three lines are displayed, and we can clearly see everything we need. However, there are better ways of getting the information: we need to take the data from each element separately.
Get to work!
The easiest function to write is the one that loads additional results, so that is where we will start. I would like to maximize the number of flights the program gets information about without raising suspicions in the service that lead to a check, so I click the Load more results button once each time the page is displayed. In this code, pay attention to the try block, which I added because sometimes the button does not load properly. If you run into this too, comment out the calls to this function in the code of the start_kayak function, which we will look at below.
# Load more results to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these notes while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45,60))
    except Exception:
        # The button sometimes fails to load; in that case skip this step
        pass
Now, after a long analysis of this function (sometimes I get carried away), we are ready to declare the function that will scrape the page.
I have already collected most of what is needed in the following function, called page_scrape. Sometimes the returned data for the two legs of the route is combined, so I use a simple method to separate it, for example when I first use the variables section_a_list and section_b_list. Our function returns the data frame flights_df, which lets us separate the results obtained with different sort orders and combine them later.
def page_scrape():
    """This function takes care of the scraping part"""

    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2] # this is how we separate the information about the two flights
    section_b_list = sections_list[1::2]

    # If you ran into a reCaptcha, you may need to do something about it.
    # You will know something went wrong because the lists above are empty.
    # This if statement lets you end the program or do something else;
    # you could pause here, pass the check, and continue scraping.
    # I use SystemExit here because I want to test everything from the start.
    if section_a_list == []:
        raise SystemExit

    # I use the letter A for outbound flights and B for return flights
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # Get the route (cities) and the duration
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # Get the route (cities) and the duration
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Get the day and the weekday
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]

    # Get the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$','') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is a big list in which the first leg is at even indexes and the second leg at odd indexes
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    # The first character is the number of stops; 'n' (as in "nonstop") becomes '0'
    stops_list = [stop.text[0].replace('n','0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]

    # Carrier, departure time, and arrival time for both flights
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # Split the times and carriers between flights a and b
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
             'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
             'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,
                               'Price': prices_list})[cols]

    flights_df['timestamp'] = strftime("%Y%m%d-%H%M") # time the data was collected
    return flights_df
I tried to name the variables so that the code is easy to follow. Remember that variables starting with a refer to the first leg of the route, and those starting with b to the second. Let's move on to the next function.
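The even/odd slicing and the split-based parsing used in page_scrape can be tried on their own. In this sketch the helper names `split_legs` and `parse_section` are hypothetical, and the sample strings are invented to match the shape the code expects (two duration tokens followed by three route tokens); the real page text may differ.

```python
def split_legs(values):
    """Even indexes belong to the outbound leg (a), odd indexes to the return leg (b)."""
    return values[::2], values[1::2]


def parse_section(text):
    """Split one 'section duration' string into (duration, cities)."""
    parts = text.split()
    duration = ''.join(parts[0:2])  # e.g. '13h' + '25m'
    cities = ''.join(parts[2:5])    # e.g. 'LIS' + '-' + 'SIN'
    return duration, cities


# Invented sample data: two results, each with an outbound and a return leg.
sections = ['13h 25m LIS - SIN', '15h 05m SIN - LIS',
            '14h 10m LIS - SIN', '16h 40m SIN - LIS']
outbound, inbound = split_legs(sections)
print([parse_section(text) for text in outbound])
print([parse_section(text) for text in inbound])
```

Because the page lists the two legs of each result one after the other, the same slicing works for dates, stops, times, and carriers.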
Helper mechanisms
We now have a function that loads additional search results and a function that processes those results. The article could have ended here, since these two functions provide everything you need to scrape pages that you open yourself. But we have not yet looked at some of the helper mechanisms discussed above, for example the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will look at now.
This function requires information about the cities and dates. Using this information, it builds the link in the variable kayak, which is used to go to a page containing the search results sorted by best match to the query. After the first scraping session, we work with the prices in the table at the top of the page: we find the lowest ticket price and the average price. All this, together with the prediction issued by the site, will be sent by email. On the page, the corresponding table should be in the top left corner. Working with this table, by the way, can cause an error when searching with exact dates, since in that case the table is not shown on the page.
def start_kayak(city_from, city_to, date_start, date_end):
    """City codes are IATA codes.
    Date format - YYYY-MM-DD"""

    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))

    # Sometimes a popup appears; a try block lets us check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')

#     load_more()

    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))

    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)

    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')

#     load_more()

    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))

    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()
    sleep(randint(60,90))
    print('loading more.....')

#     load_more()

    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))

    # Save the new frame to an Excel file whose name reflects the cities and dates
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    final_df = pd.concat([df_flights_cheap, df_flights_best, df_flights_fast])
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to,
                                                                                   date_start, date_end), index=False)
    print('saved df.....')

    # You can track how the prediction issued by the site compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading+'\n'+prediction)

    # Sometimes loading contains this string, which later causes problems when sending the email;
    # if that happens, replace it with "Not sure"
    weird = '¯_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'

    username = 'YOUR EMAIL ADDRESS'
    password = 'YOUR PASSWORD'
    recipient = 'RECIPIENT EMAIL ADDRESS'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\n'
           'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'
           .format(matrix_min, matrix_avg, (loading+'\n'+prediction)))
    message = MIMEMultipart()
    message['From'] = username
    message['to'] = recipient
    server.sendmail(username, recipient, msg)
    print('sent email.....')
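The price-table step in start_kayak converts every cell to an integer, so an empty cell or a missing table stops the run. One way to make that step more tolerant is sketched below. The helper name `summarize_prices` and the sample cell texts are hypothetical, and stripping thousands separators is an assumption about how the site formats larger prices.

```python
def summarize_prices(cell_texts):
    """Return (lowest, average) for the parseable prices, or None if there are none."""
    prices = []
    for text in cell_texts:
        cleaned = text.replace('$', '').replace(',', '').strip()
        if cleaned.isdigit():  # skips empty cells and non-numeric text
            prices.append(int(cleaned))
    if not prices:
        return None
    return min(prices), sum(prices) / len(prices)


print(summarize_prices(['$512', '', '$1,024', 'n/a']))
print(summarize_prices([]))
```

A `None` result can then be handled explicitly, for example by skipping the summary email when the table is absent, as it is for exact-date searches.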
I tested this script with an Outlook account (hotmail.com). I have not checked whether it works correctly with a Gmail account. That mail system is very popular, but there are many alternatives. If you use a Hotmail account, then for everything to work you only need to enter your details in the code.
If you want to understand exactly what particular parts of this function's code do, you can copy them and experiment with them. Experimenting with code is the only way to truly understand it.
The finished program
Now that we have done everything we talked about, we can write a simple loop that calls our functions. The script asks the user for the cities and dates. When you are testing and restarting the script over and over, you are unlikely to want to enter this data by hand every time, so for the duration of testing you can comment out the corresponding lines and uncomment the lines below them, in which the data the script needs is hard-coded.
city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')
# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'
for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))

    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')
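Since the script builds a URL directly from what the user types, it can help to check the input before starting a run that lasts several hours. The sketch below is one possible check; the helper names `valid_iata` and `valid_date` are hypothetical and not part of the original script. It only verifies the shape of the input, not that an airport with that code exists.

```python
from datetime import datetime


def valid_iata(code):
    """True for three uppercase ASCII letters, the shape of an IATA code."""
    return len(code) == 3 and code.isascii() and code.isalpha() and code.isupper()


def valid_date(text):
    """True only for a real calendar date written exactly as YYYY-MM-DD."""
    fmt = '%Y-%m-%d'
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return False
    # strptime also accepts unpadded values such as 2019-8-1, so compare round-trip
    return parsed.strftime(fmt) == text


print(valid_iata('LIS'), valid_iata('Lisbon'))
print(valid_date('2019-08-21'), valid_date('21-08-2019'), valid_date('2019-02-30'))
```

You could wrap each `input()` call in a loop that repeats the prompt until these checks pass.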
This is what a test run of the script looks like.
Test run of the script
Summary
If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something similar to get results from several servers at once. There is also the recurring problem of the site checking whether the user is human, but that problem can be solved too. In any case, you now have a base that you can extend if you wish, for example by making sure the Excel file is sent to the user as an email attachment.
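As a starting point for the attachment idea, here is a minimal sketch that builds such a message with the standard library's `email.message.EmailMessage`. The function name `build_report`, the example.com addresses, and the placeholder bytes are all hypothetical; in the real script the bytes would be read from the saved .xlsx file.

```python
from email.message import EmailMessage


def build_report(sender, recipient, summary, xlsx_bytes, filename):
    """Build an email with a plain-text summary and an Excel file attached."""
    message = EmailMessage()
    message['Subject'] = 'Flight Scraper'
    message['From'] = sender
    message['To'] = recipient
    message.set_content(summary)
    message.add_attachment(
        xlsx_bytes,
        maintype='application',
        subtype='vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        filename=filename)
    return message


# Placeholder values for illustration only.
report = build_report('sender@example.com', 'recipient@example.com',
                      'Cheapest Flight: 512\nAverage Price: 768.0',
                      b'placeholder bytes, not a real workbook', 'flights.xlsx')
attachments = list(report.iter_attachments())
print(report['Subject'], [part.get_filename() for part in attachments])
```

A message built this way can be sent through the already logged-in `smtplib.SMTP` connection with `server.send_message(report)` instead of `server.sendmail(...)`.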
Source: www.habr.com