Python: a helper for finding cheap airline tickets for those who love to travel

The author of the article we are publishing today says its goal is to describe how to build a web scraper in Python with Selenium that searches for airline ticket prices. The search uses flexible dates (+- 3 days around the dates specified). The scraper saves the search results to an Excel file and sends the person who ran the search an email with a summary of what it found. The aim of the project is to help travelers find the best deals.

If you feel lost while working through the material, take a look at this article.

What are we going to look for?

You are free to use the system described here however you like. For example, I used it to search for weekend trips and for tickets to my home town. If you are serious about finding good deals, you can run the script on a server (a simple server, at about 130 rubles a month, is quite enough for this) and have it run once or twice a day. The search results will be emailed to you. I also recommend setting things up so that the script saves the Excel file with the search results to a Dropbox folder, which lets you view those files from anywhere, at any time.

I haven't found any error fares yet, but I think it's possible

As already mentioned, the search uses "flexible dates": the script finds offers that fall within three days of the given dates. Although a single run of the script searches one route only, it is easy to modify so that it collects data for several routes. With its help you can even hunt for error fares, which can be very interesting finds.
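
One way the several-routes idea could look is sketched below. It is only an illustration: search_routes and fake_search are hypothetical names, and the sketch assumes a search function that takes (city_from, city_to, date_start, date_end), which is the signature of the start_kayak function defined later in this article.

```python
from typing import Callable, Iterable, List, Tuple

Route = Tuple[str, str]  # (origin IATA code, destination IATA code)


def search_routes(
    routes: Iterable[Route],
    date_start: str,
    date_end: str,
    search: Callable[[str, str, str, str], None],
) -> List[Route]:
    """Run `search` once per route and return the routes that failed."""
    failed = []
    for city_from, city_to in routes:
        try:
            search(city_from, city_to, date_start, date_end)
        except Exception as error:  # one bad route should not stop the rest
            print('search {}-{} failed: {}'.format(city_from, city_to, error))
            failed.append((city_from, city_to))
    return failed


if __name__ == '__main__':
    # Stand-in for the real search function, so the sketch runs on its own.
    def fake_search(city_from, city_to, date_start, date_end):
        print('searching {} -> {} ({} .. {})'.format(
            city_from, city_to, date_start, date_end))

    search_routes([('LIS', 'SIN'), ('LIS', 'BKK')],
                  '2019-08-21', '2019-09-07', fake_search)
```

With the real bot you would pass start_kayak as the search argument and probably add a long pause between routes, for the reasons discussed further on.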

Why do you need another web scraper?

When I first started web scraping, I honestly wasn't that interested in it. I wanted to do more projects in predictive modeling, financial analysis and, possibly, sentiment analysis of text. But it turned out to be very interesting to figure out how to build a program that collects data from websites. As I dug into the topic, I realized that web scraping is the "engine" of the Internet.

You may think that is too bold a statement. But consider that Google started with a web scraper that Larry Page built using Java and Python. Google's robots have been exploring the web ever since, trying to give users the best answers to their questions. Web scraping has endless applications, and even if your interests lie elsewhere in data science, you will need some scraping skills to get the data you want to analyze.

I found some of the techniques used here in a wonderful book on web scraping that I picked up recently. It contains many simple examples and ideas for putting what you learn into practice. It also has a very interesting chapter on bypassing reCaptcha checks. That was news to me: I had no idea there were special tools, and even entire services, for dealing with such problems.

Do you like to travel?!

The simple and rather harmless question in the title of this section often gets a positive answer, accompanied by a couple of stories from the travels of the person asked. Most of us would agree that travel is a great way to immerse yourself in new cultural environments and broaden your horizons. However, ask someone whether they like searching for airline tickets, and I'm sure the answer will be far less positive. This is where Python comes to the rescue.

The first task on the way to a system for finding information about airline tickets is choosing a suitable platform to take the data from. This was not easy for me, but in the end I settled on Kayak. I tried Momondo, Skyscanner, Expedia and a few others, but the bot protection on those sites was impenetrable. After several attempts in which I had to deal with traffic lights, pedestrian crossings and bicycles while trying to convince the systems that I was human, I decided that Kayak suited me best, even though it too starts running checks if too many pages are loaded in a short time. I managed to get the bot to send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties do come up from time to time when working with Kayak, but if you start getting pestered with checks, you either need to deal with them manually and then launch the bot, or wait a few hours until the checks stop. If necessary, you can adapt the code for another platform, and if you do, feel free to report it in the comments.

If you are just getting started with web scraping and don't know why some websites struggle against it, then before starting your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you think if you scrape unwisely.

Getting started

Here is a general overview of what will happen in our web scraper code:

  • Import the required libraries.
  • Open a Google Chrome tab.
  • Call the function that starts the bot, passing it the cities and dates to use when searching for tickets.
  • This function takes the first search results, sorted by "best", and clicks the button that loads more results.
  • Another function collects data from the whole page and returns a data frame.
  • The two previous steps are repeated with sorting by ticket price (cheapest) and by flight duration (fastest).
  • The user of the script is sent an email with a summary of ticket prices (the cheapest ticket and the average price), and the data frame with the information sorted by the three criteria mentioned above is saved as an Excel file.
  • All of the steps above are performed in a loop after a specified interval.

Note that every Selenium project starts with a web driver. I use Chromedriver because I work with Google Chrome, but there are other options. PhantomJS and Firefox are also popular. After downloading the driver, put it in an appropriate folder, and that completes the preparation for using it. The first lines of our script open a new Chrome tab.

Keep in mind that in this article I am not trying to open up new horizons for finding great deals on airline tickets. There are far more advanced ways of searching for such offers. I just want to give readers of this material a simple but practical way of solving the problem.

Here is the code we discussed above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

# Use your own path to chromedriver here!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

driver = webdriver.Chrome(executable_path=chromedriver_path) # This command opens a Chrome window
sleep(2)

At the start of the code you can see the import statements for the packages used throughout the project. For instance, randint is used to make the bot "sleep" for a random number of seconds before starting a new search operation. Usually no bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with pages.
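
The random "sleep" can be wrapped in a small helper, sketched here. The name polite_pause and its sleep_fn parameter are my own inventions for illustration; the article's code calls sleep(randint(...)) directly.

```python
from random import randint
from time import sleep


def polite_pause(min_seconds, max_seconds, sleep_fn=sleep):
    """Sleep for a random whole number of seconds in [min_seconds, max_seconds].

    `sleep_fn` is injectable so the helper can be tried without really waiting.
    """
    if min_seconds > max_seconds:
        raise ValueError('min_seconds must not exceed max_seconds')
    seconds = randint(min_seconds, max_seconds)
    sleep_fn(seconds)
    return seconds


if __name__ == '__main__':
    # No real waiting here: the lambda replaces time.sleep for the demo.
    waited = polite_pause(1, 3, sleep_fn=lambda s: None)
    print('would have slept for {} s'.format(waited))
```

Returning the chosen delay makes it easy to log how long the bot actually waited.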

Let's do a small experiment and open kayak.com in a separate window. Choose the city you are flying from, the city you want to fly to, and the flight dates. When choosing the dates, make sure the +-3 days range is selected. I wrote the code with the page the site returns for such requests in mind. If, for example, you need to search for tickets for exact dates only, there is a high chance you will have to adjust the bot's code. As I discuss the code I give explanations where needed, but if you feel confused, let me know.

Now click the search button and look at the link in the address bar. It should look like the link I use in the example below, where the variable kayak, which stores the URL, is declared and the web driver's get method is called. After you click the search button, the results should appear on the page.
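
For reference, the link can be assembled as in the sketch below. The URL shape is taken from the start_kayak function shown later in this article; build_kayak_url itself is a hypothetical helper, and the site may have changed its URL scheme since the article was written.

```python
def build_kayak_url(city_from, city_to, date_start, date_end, sort='bestflight_a'):
    """Assemble a flexible-date search URL in the shape used in this article.

    city_from / city_to are IATA codes; dates use the YYYY-MM-DD format.
    """
    return ('https://www.kayak.com/flights/{}-{}/{}-flexible/{}-flexible?sort={}'
            .format(city_from, city_to, date_start, date_end, sort))


if __name__ == '__main__':
    print(build_kayak_url('LIS', 'SIN', '2019-08-21', '2019-09-07'))
```

Comparing the printed string with what you see in the address bar is a quick way to check whether the pattern still holds.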

When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha check. You can pass this check manually and keep experimenting until the system decides to run a new one. When I tested the script, the first search session always seemed to go smoothly, so if you want to experiment with the code, you only need to pass the check manually from time to time and leave the code running with long intervals between search sessions. And if you think about it, a person is unlikely to need ticket price information refreshed every 10 minutes.

Working with the page using XPath

So, we have opened a window and loaded the site. To get prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and felt no need for CSS selectors, but working that way is entirely possible. Navigating a page with XPath can be tricky, and even if you use the technique I described in this article, which involves copying the relevant identifiers from the page code, I have realized that this is, in fact, not the best way to reach the elements you need. By the way, this book gives an excellent explanation of the basics of working with pages using XPath and CSS selectors. Here is what the corresponding web driver method looks like.

So, let's continue working on the bot. We will use the program to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view the code, right-click the page element you are interested in and choose Inspect from the menu that appears. This command can be invoked for different page elements, whose code will be shown and highlighted in the code viewer.

Viewing the page code

To back up my reasoning about the drawbacks of copying selectors from the code, consider the following.

This is what you get when you copy the code:

//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

To copy something like this, right-click the piece of code you are interested in and choose Copy > Copy XPath from the menu that appears.

Here is what I used to define the Cheapest button:

cheap_results = '//a[@data-code = "price"]'

The Copy > Copy XPath command

It is obvious that the second option looks much simpler. It searches for an a element whose data-code attribute equals price. The first option searches for an element whose id equals wtKI-price_aTab, and the XPath path to the target element is /div[1]/div/div/div[1]/div/span/span. An XPath query like this will do the trick, but only once. I can tell you right now that the id will change the next time the page is loaded. The character sequence wtKI changes dynamically on every page load, so code that uses it becomes useless after the next reload. So take some time to understand XPath. That knowledge will serve you well.
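
The difference between the two selectors can be tried offline with the standard library, as in the sketch below. The markup is invented for illustration and is not Kayak's real HTML, and xml.etree.ElementTree supports only a subset of XPath (Selenium hands the expression to the browser's full XPath engine), so treat this as a demonstration of the idea only.

```python
import xml.etree.ElementTree as ET

# Two made-up snapshots of the same page fragment; only the id prefix differs.
PAGE_LOAD_1 = '<div><a id="wtKI-price_aTab" data-code="price">Cheapest</a></div>'
PAGE_LOAD_2 = '<div><a id="xQ7p-price_aTab" data-code="price">Cheapest</a></div>'

BY_ID = ".//*[@id='wtKI-price_aTab']"   # selector in the copied style
BY_ATTR = ".//a[@data-code='price']"    # hand-written selector


def count_matches(markup, xpath):
    """Return how many elements in `markup` match the XPath expression."""
    return len(ET.fromstring(markup).findall(xpath))


if __name__ == '__main__':
    for name, page in (('first load', PAGE_LOAD_1), ('second load', PAGE_LOAD_2)):
        print(name,
              '| by id:', count_matches(page, BY_ID),
              '| by attribute:', count_matches(page, BY_ATTR))
```

The id-based selector stops matching as soon as the generated prefix changes, while the attribute-based one keeps working.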

Still, it should be noted that copying XPath selectors can be useful when working with fairly simple sites, and if that works for you, there is nothing wrong with it.

Now let's think about what to do if you need all the search results as a list of strings. Very simple. Each result sits inside an object with the class resultWrapper. All the results can be loaded in a loop like the one shown below.

Note that if you understand the above, you should have no trouble understanding most of the code we are going to look at. As this code runs, we reach what we need (in effect, the element the result is wrapped in) using some path-specifying mechanism (XPath). This is done to get the element's text and put it into an object from which the data can be read (first flight_containers is used, then flights_list).

The first three lines are displayed and we can clearly see everything we need. However, we have more interesting ways of getting the information. We need to take the data from each element separately.

Get to work!

The easiest function to write is the one that loads additional results, so that's where we start. I wanted to maximize the number of flights the program gets information about without arousing suspicion in the service that would lead to a check, so I click the Load more results button once each time the page is displayed. In this code, pay attention to the try block, which I added because the button sometimes doesn't load properly. If you run into this too, comment out the calls to this function in the code of the start_kayak function, which we will look at below.

# Load more results to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these notes while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45,60))
    except Exception:
        # The button is sometimes missing; skip loading more results in that case
        pass

Now, after a long analysis of that function (sometimes I get carried away), we are ready to declare the function that will scrape the page.

I have already collected most of what is needed in the following function, called page_scrape. Sometimes the returned data for the legs of the route comes combined, so I use a simple method to separate it, for example when I first use the variables section_a_list and section_b_list. Our function returns the data frame flights_df, which lets us keep the results obtained with different sorting methods separate and combine them later.

def page_scrape():
    """This function takes care of the scraping part"""
    
    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2] # this is how we separate the information about the two flights
    section_b_list = sections_list[1::2]
    
    # If you have run into reCaptcha, you may need to do something about it.
    # You can tell that something went wrong because the lists above are empty;
    # this if statement lets you stop the program or do something else.
    # You could pause here, which would let you pass the check and resume scraping.
    # I use SystemExit here because I want to test everything from the very start
    if section_a_list == []:
        raise SystemExit
    
    # I use the letter A for outbound flights and B for return flights
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # Get the cities and the duration
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # Get the cities and the duration
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Get the day and the weekday
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]
    
    # Get the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$','') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is a big list where the first leg sits at even indexes and the second leg at odd indexes
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n','0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]
    
    # carrier information and departure/arrival times for both legs
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # split the times and carriers between legs a and b
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    
    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
            'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
            'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,                           
                               'Price': prices_list})[cols]
    
    flights_df['timestamp'] = strftime("%Y%m%d-%H%M") # time the data was collected
    return flights_df

I tried to name the variables so that the code is easy to understand. Remember that variables starting with a belong to the first leg of the route, and those starting with b to the second. Let's move on to the next function.
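
The a/b split used throughout page_scrape can be tried in isolation, as sketched below. split_legs and parse_section are hypothetical helper names, and the sample strings are invented; the real page text may be formatted differently.

```python
def split_legs(values):
    """Split an alternating [out, return, out, return, ...] list into two lists."""
    return values[::2], values[1::2]


def parse_section(text):
    """Split a 'duration + cities' string into (duration, cities).

    Assumes the first two whitespace-separated tokens are the duration and the
    next three describe the route, mirroring the slices used in page_scrape.
    """
    tokens = text.split()
    return ''.join(tokens[0:2]), ''.join(tokens[2:5])


if __name__ == '__main__':
    # Invented sample strings standing in for the scraped section texts.
    sections = ['14h 05m LIS - SIN', '16h 40m SIN - LIS',
                '19h 30m LIS - SIN', '21h 10m SIN - LIS']
    outbound, inbound = split_legs(sections)
    print([parse_section(s) for s in outbound])
    print([parse_section(s) for s in inbound])
```

Even indexes hold the outbound leg and odd indexes the return leg, which is why every list in page_scrape is sliced with [::2] and [1::2].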

Supporting mechanisms

We now have a function that loads additional search results and a function that processes those results. The article could end here, since these two functions provide everything you need to scrape pages that you open yourself. But we haven't yet looked at some of the supporting mechanisms discussed above, such as the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will look at now.

This function needs information about the cities and dates. Using it, the function builds a link in the variable kayak, which is used to go to a page with the search results sorted by best match. After the first scraping session, we work with the prices in the table at the top of the page: namely, we find the minimum ticket price and the average price. All of this, together with the prediction the site provides, is sent by email. On the page, the table in question should be in the upper left corner. Working with this table, by the way, can cause an error when searching with exact dates, since in that case the table is not shown on the page.

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format -  YYYY-MM-DD"""
    
    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))
    
    # a popup sometimes appears; a try block can be used to check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')
    
#     load_more()
    
    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))
    
    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)
    
    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))
    
    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()  
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))
    
    # Save the new frame to an Excel file whose name reflects the cities and dates
    final_df = df_flights_cheap.append(df_flights_best).append(df_flights_fast)
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to, 
                                                                                   date_start, date_end), index=False)
    print('saved df.....')
    
    # You can track how the site's prediction compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading+'\n'+prediction)
    
    # sometimes the loading variable holds this string, which later causes problems when sending the email
    # if that happens, replace it with "Not sure"
    weird = '¯\\_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'
    
    username = '[email protected]'
    password = 'YOUR PASSWORD'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\n'
           'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'
           .format(matrix_min, matrix_avg, (loading+'\n'+prediction)))
    message = MIMEMultipart()
    message['From'] = '[email protected]'
    message['to'] = '[email protected]'
    server.sendmail('[email protected]', '[email protected]', msg)
    print('sent email.....')

I tested this script with an Outlook account (hotmail.com). I haven't tested whether it works correctly with a Gmail account; that mail system is very popular, but there are many possible options. If you use a Hotmail account, all you need to do to make everything work is enter your details in the code.
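
As an alternative to assembling the message by hand, the standard library's EmailMessage class can build it. This is a swapped-in technique, not what the script above does; build_summary_email is a hypothetical helper and the addresses are placeholders. EmailMessage encodes non-ASCII text itself, which may also help with the odd string mentioned in the code comments.

```python
from email.message import EmailMessage


def build_summary_email(sender, recipient, cheapest, average, advice):
    """Return an EmailMessage carrying the price summary as plain text."""
    message = EmailMessage()
    message['Subject'] = 'Flight Scraper'
    message['From'] = sender
    message['To'] = recipient
    message.set_content(
        'Cheapest Flight: {}\nAverage Price: {:.2f}\n\n'
        'Recommendation: {}\n\nEnd of message'.format(cheapest, average, advice)
    )
    return message


if __name__ == '__main__':
    # Placeholder addresses; after server.login(...) the message could be
    # sent with smtplib's server.send_message(message).
    email = build_summary_email('me@example.com', 'me@example.com',
                                512, 640.5, 'Not sure')
    print(email)
```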

If you want to understand exactly what specific parts of this function's code do, you can copy them out and experiment with them. Experimenting with code is the only way to really understand it.
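
One fragment that is easy to experiment with on its own is the price-table summary. The sketch below is a slightly more defensive variant of it, assuming whole-dollar prices; summarize_prices is a hypothetical name and the cell texts are invented. Unlike the original lines, it skips empty cells instead of failing on int('').

```python
def summarize_prices(cell_texts):
    """Return (minimum, average) of the prices found in a list of cell strings.

    Cells that are empty or contain no digits are skipped instead of raising.
    Returns (None, None) when no price could be read at all.
    """
    prices = []
    for text in cell_texts:
        # Keep digits only, which drops '$' and thousands separators.
        digits = ''.join(ch for ch in text if ch.isdigit())
        if digits:
            prices.append(int(digits))
    if not prices:
        return None, None
    return min(prices), sum(prices) / len(prices)


if __name__ == '__main__':
    # Invented cell texts standing in for what the matrix elements might hold.
    print(summarize_prices(['$512', '$1,204', '', '$640']))
```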

The finished system

Now that everything we talked about is done, we can create a simple loop that calls our functions. The script asks the user for the cities and dates. When testing with constant restarts of the script, you are unlikely to want to type this data in by hand every time, so for the duration of testing the corresponding lines can be commented out, uncommenting the ones below them, in which the data the script needs is hardcoded.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))
    
    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')
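
Since the loop takes free-form input, it may be worth checking it before the browser starts working. The sketch below is an optional addition, not part of the script above; valid_iata, valid_date and valid_trip are hypothetical helpers, and they only check the shape of the values, not whether an airport with that code exists.

```python
from datetime import datetime


def valid_iata(code):
    """True if `code` looks like a three-letter IATA airport/city code."""
    return len(code) == 3 and code.isascii() and code.isalpha()


def valid_date(text):
    """True if `text` is a real calendar date in strict YYYY-MM-DD form."""
    try:
        parsed = datetime.strptime(text, '%Y-%m-%d')
    except ValueError:
        return False
    # strptime accepts '2019-8-1'; the round trip enforces zero padding.
    return parsed.strftime('%Y-%m-%d') == text


def valid_trip(city_from, city_to, date_start, date_end):
    """Check codes and dates, and that the return is not before the departure."""
    if not (valid_iata(city_from) and valid_iata(city_to)):
        return False
    if not (valid_date(date_start) and valid_date(date_end)):
        return False
    return date_start <= date_end  # ISO dates compare correctly as strings


if __name__ == '__main__':
    print(valid_trip('LIS', 'SIN', '2019-08-21', '2019-09-07'))
    print(valid_trip('LIS', 'SIN', '2019-09-07', '2019-08-21'))
```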

This is what a test run of the script looks like.
Test run of the script

Summary

If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something similar to receive results from several servers at once. There is also the recurring problem of the site checking whether the user is human, but that can be solved too. In any case, you now have a base you can extend if you wish, for example by making sure the Excel file is sent to the user as an email attachment.
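
The attachment idea could be approached roughly as follows, using the standard library's EmailMessage. This is a sketch under assumptions: attach_workbook is a hypothetical helper, the bytes are a dummy stand-in for the real file, and the script above would need to build its message as an EmailMessage for this to apply.

```python
from email.message import EmailMessage

# MIME subtype commonly used for .xlsx workbooks.
XLSX_SUBTYPE = 'vnd.openxmlformats-officedocument.spreadsheetml.sheet'


def attach_workbook(message, data, filename):
    """Attach raw .xlsx bytes to an existing EmailMessage and return it."""
    message.add_attachment(data, maintype='application',
                           subtype=XLSX_SUBTYPE, filename=filename)
    return message


if __name__ == '__main__':
    message = EmailMessage()
    message['Subject'] = 'Flight Scraper'
    message.set_content('Results attached.')
    # Dummy bytes stand in for the file; in practice you would read it with
    # open(path, 'rb') after to_excel has written it.
    attach_workbook(message, b'not a real workbook', 'flights.xlsx')
    print([part.get_filename() for part in message.iter_attachments()])
```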

source: www.habr.com
