Python - a helper for finding cheap airline tickets for those who love to travel

The author of the article whose translation we are publishing today says that its goal is to describe building a web scraper in Python with Selenium that searches for airline ticket prices. The search uses flexible dates (+/- 3 days around the specified dates). The scraper saves the search results to an Excel file and sends the person who ran the search an email summarizing what was found. The aim of the project is to help travelers find the best deals.


If you feel lost while working through the material, take a look at this article.

What are we going to look for?

You are free to use the system described here however you like. For example, I used it to search for weekend trips and for tickets to my hometown. If you are serious about finding good-value tickets, you can run the script on a server (a simple server, at 130 rubles a month, is perfectly adequate for this) and have it run once or twice a day. The search results will be emailed to you. In addition, I recommend setting things up so that the script saves the Excel file with the search results to a Dropbox folder, which lets you view those files from anywhere and at any time.
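
One way to arrange the Dropbox idea is simply to write each result file into a locally synced folder. The sketch below only builds the file path; the folder name `Dropbox/search_backups` and the helper `backup_path` are assumptions for illustration, and the file name pattern mirrors the one the script uses later when saving to Excel.

```python
from datetime import datetime
from pathlib import Path

def backup_path(folder, city_from, city_to, date_start, date_end, now=None):
    """Build the Excel file path for one search, stamped with the run time."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d-%H%M")
    name = f"{stamp}_flights_{city_from}-{city_to}_from_{date_start}_to_{date_end}.xlsx"
    return Path(folder) / name

# Hypothetical synced folder; point this at wherever Dropbox lives on your machine.
target = backup_path("Dropbox/search_backups", "LIS", "SIN",
                     "2019-08-21", "2019-09-07",
                     now=datetime(2019, 8, 1, 9, 30))
print(target.name)
```

Passing the resulting path to the Excel export step would then leave the syncing to the Dropbox client itself.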

I haven't found any error fares yet, but I think it's possible

As already mentioned, the search uses "flexible dates": the script finds offers that fall within three days of the given dates. Although a single run of the script searches for offers on only one route, it is easy to modify it so that it collects data for several routes. With its help you can even hunt for error fares; such finds can be very interesting.
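
To make the flexible-date window concrete, here is a small sketch of which calendar days a +/- 3 day search covers. The scraper itself leaves this to the site; the helper name `flexible_window` is an assumption used only for this illustration.

```python
from datetime import date, timedelta

def flexible_window(day, flex_days=3):
    """Return every date within +/- flex_days of the given ISO date string."""
    center = date.fromisoformat(day)
    return [center + timedelta(days=offset)
            for offset in range(-flex_days, flex_days + 1)]

window = flexible_window("2019-08-21")
print(window[0], window[-1], len(window))
```

For a departure date of 2019-08-21 the window runs from 18 to 24 August, seven days in total.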

Why do you need yet another web scraper?

When I first started web scraping, I honestly wasn't particularly interested in it. I wanted to do more projects in predictive modeling, financial analysis and, possibly, sentiment analysis of text. But it turned out to be very interesting to figure out how to build a program that collects data from websites. As I dug into the topic, I realized that web scraping is the "engine" of the Internet.

You may think this is too bold a statement. But consider that Google started with a web scraper that Larry Page built using Java and Python. Google's robots have been exploring the Internet ever since, trying to give its users the best answers to their questions. Web scraping has endless applications, and even if you are interested in something else within Data Science, you will need some scraping skills to get the data you want to analyze.

I found some of the techniques used here in a wonderful book about web scraping that I recently acquired. It has many simple examples and ideas for putting what you have learned into practice. It also has a very interesting chapter on bypassing reCaptcha checks. This came as news to me, since I did not even know that there were special tools and even entire services for solving such problems.

Do you like to travel?!

The simple and rather harmless question in the title of this section often gets a positive answer, accompanied by a couple of stories from the travels of the person asked. Most of us would agree that traveling is a great way to immerse yourself in new cultural environments and broaden your horizons. However, if you ask someone whether they like searching for airline tickets, I am sure the answer will be far less positive. Fortunately, Python comes to our aid here.

The first task on the way to building a system that finds airfare information is choosing a suitable platform to take the data from. This was not easy for me, but in the end I chose Kayak. I tried Momondo, Skyscanner, Expedia and a few others, but the bot protection on those sites was impenetrable. After several attempts, during which I had to deal with traffic lights, pedestrian crossings and bicycles while trying to convince the systems that I was human, I decided that Kayak suited me best, even though checks kick in there too if too many pages are loaded in a short time. I managed to have the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties do come up from time to time when working with Kayak, but if it starts pestering you with checks, you either need to deal with them manually and then restart the bot, or wait a few hours until the checks stop. If necessary, you can easily adapt the code for another platform, and if you do, feel free to report it in the comments.

If you are just getting started with web scraping and do not know why some websites fight against it, then before starting your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you think if you scrape unwisely.

Getting started

Here is an overview of what will happen in our web scraper code:

  • Import the required libraries.
  • Open a Google Chrome tab.
  • Call a function that starts the bot, passing it the cities and dates to use in the ticket search.
  • This function takes the first search results, sorted by best, and clicks a button to load more results.
  • Another function collects data from the whole page and returns a data frame.
  • The two previous steps are repeated using sorting by ticket price (cheapest) and by flight speed (fastest).
  • The user of the script is sent an email with a summary of ticket prices (the cheapest ticket and the average price), and the data frame with the information sorted by the three criteria above is saved as an Excel file.
  • All of the above actions are performed in a loop at a specified interval.
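
The middle steps of this outline can be sketched as a loop over the three sort orders. This is only a skeleton under stated assumptions: `run_search`, `fake_switch` and `fake_scrape` are hypothetical stand-ins, and the real script uses Selenium-driven functions and pandas data frames instead of plain dictionaries.

```python
def run_search(scrape_page, switch_sort, sort_orders=("best", "cheap", "fast")):
    """Scrape the result page once per sort order and tag each row with it."""
    all_rows = []
    for order in sort_orders:
        switch_sort(order)              # e.g. click the matching sort tab
        for row in scrape_page():       # one dict per flight offer
            all_rows.append({**row, "sort": order})
    return all_rows

# Stand-ins for the real browser-driven steps, used only to show the flow.
def fake_switch(order):
    pass

def fake_scrape():
    return [{"price": 512}, {"price": 640}]

rows = run_search(fake_scrape, fake_switch)
print(len(rows), rows[0])
```

Tagging every row with its sort order is what later allows the three result sets to be kept in one table without losing track of where each row came from.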

Note that every Selenium project starts with a web driver. I use Chromedriver and work with Google Chrome, but there are other options. PhantomJS and Firefox are also popular. After downloading the driver, you need to place it in the appropriate folder, and that completes the preparation for its use. The first lines of our script open a new Chrome tab.

Keep in mind that in my story I am not trying to open up new horizons in the search for airfare deals. There are much more advanced methods of finding such offers. I just want to offer readers of this material a simple but practical way to solve this problem.

Here is the code we talked about above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

# Use your own path to chromedriver here!
# (executable_path is the Selenium 3 way of passing the driver location)
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

driver = webdriver.Chrome(executable_path=chromedriver_path) # This command opens a Chrome window
sleep(2)

At the beginning of the code you can see the package import commands that are used throughout our project. For example, randint is used to make the bot "sleep" for a random number of seconds before starting a new search operation. Usually no bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with sites.
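
The random pause can be wrapped in a tiny helper. This is a sketch rather than part of the original script: `polite_pause` and its `sleeper` parameter are assumptions, added so that the delay can be checked without actually waiting.

```python
from random import randint
from time import sleep

def polite_pause(low, high, sleeper=sleep):
    """Sleep for a random whole number of seconds between low and high (inclusive)."""
    seconds = randint(low, high)
    sleeper(seconds)
    return seconds

# Passing a no-op sleeper lets you inspect the chosen delay without waiting.
waited = polite_pause(45, 60, sleeper=lambda s: None)
print(45 <= waited <= 60)
```

In the script itself the same effect comes from calling `sleep(randint(45,60))` directly.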

Let's do a little experiment and open kayak.com in a separate window. Select the city you are flying from, the city you want to get to, and the flight dates. When choosing dates, make sure the +-3 days range is used. I wrote the code with what the site returns for such requests in mind. If, for example, you need to search for tickets only for specific dates, there is a high chance that you will have to modify the bot's code. When I talk about the code I give the relevant explanations, but if you feel confused, let me know.

Now click the search button and look at the link in the address bar. It should look similar to the link I use in the example below, where the variable kayak, which stores the URL, is declared and the web driver's get method is used. After you click the search button, results should appear on the page.
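
For comparison with what you see in the address bar, here is a sketch that assembles the same kind of link. The URL pattern is the one used later in this article's start_kayak function; the helper name `build_kayak_url` is an assumption, and the site may have changed its URL scheme since the article was written.

```python
def build_kayak_url(city_from, city_to, date_start, date_end):
    """Assemble a flexible-date search URL from IATA codes and YYYY-MM-DD dates."""
    return ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
            '/' + date_start + '-flexible/' + date_end +
            '-flexible?sort=bestflight_a')

kayak = build_kayak_url('LIS', 'SIN', '2019-08-21', '2019-09-07')
print(kayak)
```

The resulting string is what gets passed to the driver's get method.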

When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha check. You can pass this check manually and continue experimenting until the system decides to run a new check. When I tested the script, the first search session always seemed to go smoothly, so if you want to experiment with the code, you would only need to check in manually from time to time and let the code run with long intervals between search sessions. And if you think about it, a person is unlikely to need ticket price information collected at 10-minute intervals.

Working with the page using XPath

So, we have opened a window and loaded the site. To get prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and did not feel the need for CSS selectors, but it is entirely possible to work that way. Navigating a page with XPath can be tricky, and even if you use the techniques I described in this article, which involved copying the corresponding identifiers from the page code, I realized that this is, in fact, not the optimal way to reach the required elements. By the way, this book gives an excellent description of the basics of working with pages using XPath and CSS selectors. Here is what the corresponding web driver method looks like.

So, let's keep working on the bot. Let's use the program to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view the code, right-click the page element you are interested in and choose Inspect from the menu that appears. This command can be invoked for different page elements, whose code will be displayed and highlighted in the code viewer.

Viewing the page code

To find confirmation of my reasoning about the drawbacks of copying selectors from the code, pay attention to the following points.

This is what you get when you copy the code:

//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

To copy something like this, right-click the section of code you are interested in and choose Copy > Copy XPath from the menu that appears.

Here is what I used to define the Cheapest button:

cheap_results = '//a[@data-code = "price"]'

The Copy > Copy XPath command

It is quite obvious that the second option looks much simpler. When it is used, it looks for an a element whose data-code attribute equals price. With the first option, the search is for an element whose id equals wtKI-price_aTab, and the XPath path to the element looks like /div[1]/div/div/div[1]/div/span/span. An XPath query like this will do the trick, but only once. I can say right now that the id will change the next time the page loads. The character sequence wtKI changes dynamically every time the page is loaded, so code that uses it will be useless after the next page reload. So take the time to understand XPath. This knowledge will serve you well.
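
The difference between the two selectors can be tried out without a browser. The sketch below uses the limited XPath support in Python's standard xml.etree.ElementTree module on two made-up page fragments; the markup and the second id prefix `x9Qz` are assumptions invented for the illustration, not Kayak's real page code.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same fragment: the generated id prefix
# differs between page loads, while the data-code attribute stays the same.
first_load = '<div><a id="wtKI-price_aTab" data-code="price">Cheapest</a></div>'
second_load = '<div><a id="x9Qz-price_aTab" data-code="price">Cheapest</a></div>'

by_id = ".//*[@id='wtKI-price_aTab']"
by_attribute = ".//a[@data-code='price']"

for label, markup in (("first", first_load), ("second", second_load)):
    root = ET.fromstring(markup)
    found_by_id = root.find(by_id) is not None
    found_by_attribute = root.find(by_attribute) is not None
    print(label, found_by_id, found_by_attribute)
```

The id-based query only matches the first snapshot, while the attribute-based query matches both, which is the behavior described above.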

However, it should be noted that copying XPath selectors can be useful when working with fairly simple sites, and if that suits you, there is nothing wrong with it.

Now let's think about what to do if you need to get all the search results as several strings inside a list. It is very simple. Each result sits inside an object with the class resultWrapper. All the results can be loaded in a short loop over those objects.
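
A minimal sketch of such a loop, assuming the `driver` object created earlier and the Selenium 3 style `find_elements_by_xpath` call used in the rest of this article. The wrapper function `collect_result_texts` is a hypothetical name; the variable names follow the ones mentioned in the text.

```python
def collect_result_texts(driver):
    """Return the visible text of every element with the resultWrapper class."""
    xp_results_table = '//*[@class = "resultWrapper"]'
    # Selenium 3 style call, matching the rest of this article.
    flight_containers = driver.find_elements_by_xpath(xp_results_table)
    flights_list = [flight.text for flight in flight_containers]
    return flights_list
```

Calling `collect_result_texts(driver)[0:3]` on a loaded results page would show the text of the first three results.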

Note that if you understand the above, you should easily understand most of the code we are going to look at. As this code runs, we reach what we need (in fact, the element the result is wrapped in) using some path-specifying mechanism (XPath). This is done to get the element's text and place it in an object from which the data can be read (first flight_containers is used, then flights_list).

The first three results are displayed, and we can clearly see everything we need. However, we have more interesting ways of getting the information. We need to take the data from each element separately.

Let's get to work!

The easiest function to write is the one that loads more results, so that is where we will start. I would like to maximize the number of flights the program gets information about without raising the kind of suspicion in the service that leads to a check, so I click the Load more results button once each time a page is displayed. In this code, pay attention to the try block, which I added because sometimes the button does not load properly. If you run into this too, comment out the calls to this function in the code of the start_kayak function, which we will look at below.

# Load more results to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these notes while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45,60))
    except Exception:
        # The button sometimes fails to load; in that case just carry on
        pass

Now, after a long analysis of this function (sometimes I get carried away), we are ready to declare the function that will scrape the page.

I have already gathered most of what is needed in the following function, called page_scrape. Sometimes the returned data for the legs of a route comes combined, so I use a simple method to separate it, for example when I first use the variables section_a_list and section_b_list. Our function returns the data frame flights_df, which lets us keep the results obtained with different sort orders separate and merge them later.

def page_scrape():
    """This function takes care of the scraping part"""
    
    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2] # this is how we separate the information about the two flights
    section_b_list = sections_list[1::2]
    
    # If you run into reCaptcha, you may need to do something about it.
    # You will know something went wrong because the lists above are empty
    # this if statement lets you exit the program or do something else
    # you could pause here, which would let you pass the check and continue scraping
    # I use SystemExit here because I want to test everything from the start
    if section_a_list == []:
        raise SystemExit
    
    # I will use the letter A for the outbound flight and B for the return flight
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # Separate the duration from the cities
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # Separate the duration from the cities
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Get the day and the weekday
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]
    
    # Get the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$','') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is one long list in which the first leg sits at even indices and the second leg at odd indices
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n','0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]
    
    # carrier, departure and arrival times, for both legs
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # split the hours and carriers between legs a and b
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    
    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
            'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
            'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,                           
                               'Price': prices_list})[cols]
    
    flights_df['timestamp'] = strftime("%Y%m%d-%H%M") # time the data was collected
    return flights_df

I tried to name the variables so that the code is understandable. Remember that variables starting with a belong to the first leg of the route, and those starting with b to the second. Let's move on to the next function.
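
The a/b split relies on the page listing the two legs alternately. A small sketch of the idea on made-up strings follows; the sample texts and the helper `parse_section` are assumptions for illustration, and the real page text may be laid out differently.

```python
# Hypothetical page texts, alternating outbound / return as in a round-trip list.
sections_list = ['15h 40m LIS - SIN', '17h 05m SIN - LIS',
                 '18h 10m LIS - SIN', '16h 55m SIN - LIS']

section_a_list = sections_list[::2]    # even positions: first (outbound) legs
section_b_list = sections_list[1::2]   # odd positions: second (return) legs

def parse_section(text):
    """Split one section string into (duration, cities) using whitespace tokens."""
    tokens = text.split()
    return ''.join(tokens[0:2]), ''.join(tokens[2:5])

print([parse_section(s) for s in section_a_list])
print([parse_section(s) for s in section_b_list])
```

The same `[::2]` and `[1::2]` slicing is applied in page_scrape to dates, stops, times and carriers.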

Helper mechanisms

We now have a function that loads additional search results and a function that processes those results. This article could have ended here, since these two functions provide everything you need to scrape pages that you open yourself. But we have not yet looked at some of the helper mechanisms discussed above, for example the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will now look at.

For this function to work, it needs information about cities and dates. Using this information, it builds a link in the variable kayak, which is used to go to a page containing search results sorted by best match to the query. After the first scraping session, we will work with the prices in the table at the top of the page. Namely, we will find the lowest ticket price and the average price. All this, together with the prediction provided by the site, will be sent by email. On the page, the corresponding table should be in the upper left corner. Working with this table, by the way, can cause an error when searching with exact dates, since in that case the table is not displayed on the page.
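
The price summary can be tried out on plain strings. The sketch below is a slightly more defensive variant than the code in start_kayak, not the author's version: the helper `summarize_prices` is an assumption, and it returns None instead of raising when no prices are found, which is the situation described above for exact-date searches.

```python
def summarize_prices(cells):
    """Return (lowest, average) from price strings such as '$512', or None if empty."""
    prices = []
    for cell in cells:
        cleaned = cell.replace('$', '').replace(',', '').strip()
        if cleaned.isdigit():          # skip blank or non-numeric cells
            prices.append(int(cleaned))
    if not prices:
        return None                    # e.g. the price table was not rendered
    return min(prices), sum(prices) / len(prices)

print(summarize_prices(['$512', '$640', '', '$498']))
print(summarize_prices([]))
```

With the sample cells the lowest price is 498 and the average is 550.0.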

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format -  YYYY-MM-DD"""
    
    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))
    
    # sometimes a popup appears; a try block can be used to check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')
    
#     load_more()
    
    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))
    
    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)
    
    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))
    
    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()  
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))
    
    # Save the new data frame to an Excel file whose name reflects the cities and dates
    final_df = pd.concat([df_flights_cheap, df_flights_best, df_flights_fast])
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to, 
                                                                                   date_start, date_end), index=False)
    print('saved df.....')
    
    # You can keep track of how the site's prediction compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading+'\n'+prediction)
    
    # sometimes the loading variable ends up holding this string, which later causes problems when sending the email
    # if that happens, replace it with "Not sure"
    weird = '¯_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'
    
    username = 'YOUREMAIL@hotmail.com'  # placeholder: your own address
    password = 'YOUR PASSWORD'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\n'
           'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'
           ).format(matrix_min, matrix_avg, (loading+'\n'+prediction))
    # The addresses below are placeholders: use your own sender and recipient
    message = MIMEMultipart()
    message['From'] = 'YOUREMAIL@hotmail.com'
    message['to'] = 'YOUROTHEREMAIL@domain.com'
    server.sendmail('YOUREMAIL@hotmail.com', 'YOUROTHEREMAIL@domain.com', msg)
    print('sent email.....')

I tested this script using an Outlook account (hotmail.com). I have not checked whether it works correctly with a Gmail account, which is a very popular mail system, but there are plenty of possible options. If you use a Hotmail account, then for everything to work you only need to enter your details in the code.
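
If you want to experiment with the message itself before touching any mail server, the summary can be composed with the standard library's EmailMessage class. This is an alternative sketch, not the approach used in start_kayak: the helper `build_summary_email` and the example.com addresses are assumptions, and nothing is sent.

```python
from email.message import EmailMessage

def build_summary_email(sender, recipient, cheapest, average, advice):
    """Compose (but do not send) the plain-text summary message."""
    msg = EmailMessage()
    msg['Subject'] = 'Flight Scraper'
    msg['From'] = sender
    msg['To'] = recipient
    msg.set_content(
        'Cheapest Flight: {}\nAverage Price: {:.2f}\n\nRecommendation: {}\n'.format(
            cheapest, average, advice))
    return msg

# Placeholder addresses and values, used only to show the resulting message.
email = build_summary_email('me@example.com', 'you@example.com',
                            498, 550.0, 'Buy now')
print(email['Subject'])
print(email.get_content())
```

A message built this way could be handed to the `send_message` method of an smtplib.SMTP connection after logging in.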

If you want to understand exactly what individual sections of this function's code do, you can copy them out and experiment with them. Experimenting with code is the only way to truly understand it.

The finished system

Now that we have done everything we talked about, we can create a simple loop that calls our functions. The script asks the user for the cities and dates. When you are testing and constantly restarting the script, you are unlikely to want to type this data in manually every time, so for the duration of testing the corresponding lines can be commented out, and the lines below them, in which the data the script needs is hardcoded, uncommented.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))
    
    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')
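

Since the prompts ask for IATA codes and YYYY-MM-DD dates, it can help to check the input before the browser is involved. The sketch below is an optional addition, not part of the original script; `parse_search_input` is a hypothetical helper.

```python
from datetime import datetime

def parse_search_input(city_from, city_to, date_start, date_end):
    """Normalize IATA codes and check that both dates are valid YYYY-MM-DD values."""
    codes = [code.strip().upper() for code in (city_from, city_to)]
    for code in codes:
        if len(code) != 3 or not code.isalpha():
            raise ValueError('Expected a 3-letter IATA code, got {!r}'.format(code))
    # strptime raises ValueError when the text does not match the format
    start = datetime.strptime(date_start.strip(), '%Y-%m-%d').date()
    end = datetime.strptime(date_end.strip(), '%Y-%m-%d').date()
    if end < start:
        raise ValueError('Return date is before departure date')
    return codes[0], codes[1], start.isoformat(), end.isoformat()

print(parse_search_input('lis', 'SIN ', '2019-08-21', '2019-09-07'))
```

The returned values can be passed straight to start_kayak in place of the raw input strings.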

This is what a test run of the script looks like.
Test run of the script

Summary

If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something else to receive results from several servers simultaneously. There is also the recurring problem of the site checking whether the user is human, but this problem can be solved too. In any case, you now have a foundation that you can extend if you wish. For example, make the Excel file go to the user as an email attachment.
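
As a starting point for the attachment idea, here is a sketch using the standard library's EmailMessage class. The helper `attach_excel`, the file name and the placeholder bytes are assumptions; in real use the bytes would come from reading the saved workbook in binary mode, and the message would still have to be sent separately.

```python
from email.message import EmailMessage

def attach_excel(msg, data, filename):
    """Attach raw .xlsx bytes to an EmailMessage and return it."""
    msg.add_attachment(
        data,
        maintype='application',
        subtype='vnd.openxmlformats-officedocument.spreadsheetml.sheet',
        filename=filename)
    return msg

msg = EmailMessage()
msg['Subject'] = 'Flight Scraper'
msg.set_content('Results attached.')
# Placeholder bytes stand in for the contents of the saved Excel file.
attach_excel(msg, b'placeholder bytes', 'flights.xlsx')
print([part.get_filename() for part in msg.iter_attachments()])
```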


Source: www.habr.com
