Python: a helper for finding cheap airline tickets for those who love to travel

The author of this article, whose translation we are publishing today, says its goal is to describe the development of a web scraper in Python, using Selenium, that searches for airline ticket prices. Ticket searches use flexible dates (+- 3 days relative to the dates specified). The scraper saves the search results to an Excel file and sends the person who ran the search an email with a summary of what it found. The aim of the project is to help travelers find the best deals.


If you feel lost while working through the material, take a look at this article.

What are we going to look for?

You are free to use the system described here however you like. For example, I used it to search for weekend trips and for tickets to my hometown. If you are serious about finding good deals, you can run the script on a server (a simple server, at 130 rubles a month, is quite adequate for this) and have it run once or twice a day. The search results will be emailed to you. In addition, I recommend setting things up so that the script saves the Excel file with the search results to a Dropbox folder, which lets you view those files from anywhere and at any time.

I haven't found any error fares yet, but I believe it is possible

As already mentioned, the search uses "flexible dates": the script finds offers that fall within three days of the given dates. Although a single run of the script searches for offers on one route only, it is easy to modify it so that it collects data for several flight routes. It can even be used to look for error fares; such finds can be very interesting.
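To make the "flexible dates" idea concrete, here is a minimal sketch of which days fall into a +-3 day window. The function name flexible_window and its days parameter are hypothetical and are not part of the article's script, which simply asks the site for flexible results through the search URL.

```python
from datetime import date, timedelta

def flexible_window(day, days=3):
    """Return every date from `days` before `day` to `days` after it (hypothetical helper)."""
    center = date.fromisoformat(day)  # expects YYYY-MM-DD
    return [center + timedelta(days=offset) for offset in range(-days, days + 1)]

window = flexible_window('2019-08-21')
print(window[0], window[-1], len(window))
```

A window of three days on each side always covers seven candidate dates per leg, which is why a flexible search returns noticeably more offers than an exact-date one.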

Why do you need another web scraper?

When I first took up web scraping, I honestly was not particularly interested in it. I wanted to do more projects in predictive modeling, financial analysis, and perhaps sentiment analysis of texts. But it turned out to be very interesting to figure out how to build a program that collects data from websites. As I dug into the subject, I realized that web scraping is the "engine" of the internet.

You may think this is too bold a statement. But consider that Google started with a web scraper that Larry Page built using Java and Python. Google's robots have been crawling the internet ever since, trying to give users the best answers to their questions. Web scraping has endless applications, and even if you are interested in something else within Data Science, you will need some scraping skills to get the data you want to analyze.

Some of the techniques used here I found in a wonderful book on web scraping that I recently acquired. It contains many simple examples and ideas for putting what you learn into practice. It also has a very interesting chapter on bypassing reCaptcha checks. That came as news to me, since I did not know that there were special tools, and even entire services, for solving such problems.

Do you like to travel?!

The simple and harmless question in this section's title usually gets a positive answer, accompanied by a few stories from the travels of the person asked. Most of us would agree that traveling is a great way to immerse yourself in new cultures and broaden your horizons. However, if you ask someone whether they enjoy searching for airline tickets, I am sure the answer will be far less positive. This is where Python comes to the rescue.

The first task on the way to building a system that searches for airline ticket information is choosing a suitable platform to take the information from. Solving this problem was not easy for me, but in the end I chose Kayak. I tried Momondo, Skyscanner, Expedia, and a few others, but the anti-bot protection on those services was impenetrable. After several attempts, during which I had to deal with traffic lights, pedestrian crossings, and bicycles while trying to convince the systems that I was human, I decided that Kayak suited me best, even though it too starts running checks if too many pages are loaded in a short time. I managed to have the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties do come up from time to time when working with Kayak, but if it starts pestering you with checks, you either need to pass them manually and then restart the bot, or wait a few hours until the checks stop. If necessary, you can easily adapt the code for another platform, and if you do, you can report it in the comments.

If you are just getting started with web scraping and do not know why some websites fight it, then before starting your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you think if you scrape the web recklessly.

Getting started

Here is an overview of what will happen in our web scraper code:

  • Import the required libraries.
  • Open a Google Chrome tab.
  • Call the function that starts the bot, passing it the cities and dates to be used in the ticket search.
  • This function takes the first search results, sorted by best, and clicks the button to load more results.
  • Another function collects data from the whole page and returns a data frame.
  • The two previous steps are repeated using sorting by ticket price (cheapest) and by flight speed (fastest).
  • The script's user is sent an email with a summary of ticket prices (cheapest ticket and average price), and a data frame with the information sorted by the three criteria mentioned above is saved as an Excel file.
  • All of the above actions are performed in a loop at a specified interval.

It should be noted that every Selenium project starts with a web driver. I use Chromedriver and work with Google Chrome, but there are other options. PhantomJS and Firefox are also popular. After downloading the driver, you need to put it in the appropriate folder, and that completes the preparation for using it. The first lines of our script open a new Chrome tab.

Keep in mind that in this article I am not trying to open up new horizons for finding great deals on airline tickets. There are far more advanced methods of searching for such offers. I just want to offer readers a simple but practical way to solve this problem.

Here is the code we discussed above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

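# Use your own path to chromedriver here!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

# Note: this script uses the Selenium 3 API (executable_path, find_element_by_xpath);
# Selenium 4 replaced these with Service objects and find_element(By.XPATH, ...)
driver = webdriver.Chrome(executable_path=chromedriver_path) # This command opens a Chrome window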
sleep(2)

At the beginning of the code you can see the package import commands used throughout our project. For instance, randint is used to make the bot "fall asleep" for a random number of seconds before starting a new search operation. Usually no bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with websites.
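The random pause can be wrapped in a small helper. This is only a sketch: polite_pause and its sleeper parameter are hypothetical names introduced here so that the helper can be tried without actually waiting.

```python
from random import randint
from time import sleep

def polite_pause(low, high, sleeper=sleep):
    """Sleep for a random whole number of seconds between low and high (inclusive)."""
    seconds = randint(low, high)
    sleeper(seconds)
    return seconds

# Try it with a fake sleeper that only records the requested delay
recorded = []
waited = polite_pause(45, 60, sleeper=recorded.append)
print(waited, recorded)
```

Passing the sleeping function in as a parameter keeps the default behavior (a real pause) while making the delay easy to inspect during experiments.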

Let's do a little experiment and open kayak.com in a separate window. We pick the city we are flying from, the city we want to reach, and the flight dates. When choosing the dates, make sure the +-3 days range is enabled. I wrote the code with the site's response to such requests in mind. If, for example, you need to search for tickets on exact dates only, there is a good chance you will have to modify the bot's code. When I talk about the code I give the relevant explanations, but if you feel confused, let me know.

Now click the search button and look at the URL in the address bar. It should be similar to the URL I use in the example below, where the variable kayak, which stores the URL, is declared and the web driver's get method is called. After you click the search button, the results should appear on the page.
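As a rough illustration, the search URL can be assembled from the four inputs. The URL pattern below is the one used later in start_kayak; the helper name build_kayak_url and the date validation are assumptions added for this sketch.

```python
from datetime import datetime

def build_kayak_url(city_from, city_to, date_start, date_end):
    """Build a flexible-date search URL (hypothetical helper mirroring start_kayak)."""
    for value in (date_start, date_end):
        datetime.strptime(value, '%Y-%m-%d')  # raises ValueError on a malformed date
    return ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
            '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')

url = build_kayak_url('LIS', 'SIN', '2019-08-21', '2019-09-07')
print(url)
```

Validating the dates before building the URL is a design choice of this sketch: a mistyped date fails immediately instead of after the browser has already loaded a page.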

When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha check. You can pass this check manually and continue experimenting until the system decides to run a new check. When I tested the script, it seemed that the first search session always went smoothly, so if you want to experiment with the code, you will only need to pass the check manually from time to time and let the code run, using long intervals between searches. And, if you think about it, a person is unlikely to need ticket price information gathered at 10-minute intervals between searches.

Working with the page using XPath

So, we have opened a window and loaded the site. To get the prices and other information, we need to use XPath or CSS selectors. I decided to stick with XPath and did not feel the need for CSS selectors, but working that way is entirely possible. Navigating a page with XPath can be tricky, and even if you use the techniques I described in this article, which involved copying the corresponding identifiers from the page code, I realized that this is actually not the best way to reach the required elements. By the way, this book gives an excellent explanation of the basics of working with pages using XPath and CSS selectors. This is what the corresponding web driver method looks like.

So, let's continue working on the bot. Let's use the program's capabilities to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view the code, right-click the page element you are interested in and choose the Inspect command from the menu that appears. This command can be invoked on various page elements, whose code will be displayed and highlighted in the code viewer.

Viewing the page code

To confirm my reasoning about the drawbacks of copying selectors from the code, pay attention to the following.

This is what you get when you copy the code:

//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

To copy something like this, right-click the section of code you are interested in and choose the Copy > Copy XPath command from the menu that appears.

This is what I used to define the Cheapest button:

cheap_results = '//a[@data-code = "price"]'

The Copy > Copy XPath command

It is quite obvious that the second option looks much simpler. When it is used, it searches for an a element whose data-code attribute is equal to price. With the first option, the search looks for an element whose id is equal to wtKI-price_aTab, and the XPath path to the target element looks like /div[1]/div/div/div[1]/div/span/span. An XPath query like this will do the trick, but only once. I can say right now that the id will change the next time the page is loaded. The character sequence wtKI changes dynamically every time the page loads, so code that uses it will be useless after the next page reload. So take the time to understand XPath. This knowledge will serve you well.
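The difference between the two selectors can be tried offline. The sketch below uses the limited XPath support in Python's standard xml.etree.ElementTree module instead of Selenium, and the two tiny page snippets are invented for illustration (only the id prefix wtKI and the data-code attribute come from the text above; the prefix abCD is made up).

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same page: the generated id prefix differs between loads
first_load = '<div><a id="wtKI-price_aTab" data-code="price">Cheapest</a></div>'
second_load = '<div><a id="abCD-price_aTab" data-code="price">Cheapest</a></div>'

copied_xpath = ".//*[@id='wtKI-price_aTab']"   # tied to the generated id
stable_xpath = ".//a[@data-code='price']"      # tied to a meaningful attribute

for name, markup in (('first load', first_load), ('second load', second_load)):
    page = ET.fromstring(markup)
    print(name, page.find(copied_xpath) is not None, page.find(stable_xpath) is not None)
```

The copied selector matches only the snapshot it was copied from, while the attribute-based one matches both.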

However, it should be noted that copying XPath selectors can be useful when working with fairly simple sites, and if that suits you, there is nothing wrong with it.

Now let's think about what to do if you need to get all the search results as several strings inside a list. Very simple. Each result sits inside an object with the class resultWrapper. All the results can be loaded in a loop over those objects.
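A loop of that kind might look like the sketch below. The names flight_containers and flights_list and the class resultWrapper come from the text; the variable xp_results_table, the function collect_results, and the FakeDriver and FakeElement stand-ins (used here so the example runs without a browser) are assumptions. With a real Selenium 3 driver you would pass driver itself.

```python
class FakeElement:
    """Stand-in for a Selenium element: only the .text attribute is modelled."""
    def __init__(self, text):
        self.text = text

class FakeDriver:
    """Stand-in for the web driver: returns canned elements for any XPath."""
    def __init__(self, texts):
        self._elements = [FakeElement(text) for text in texts]

    def find_elements_by_xpath(self, xpath):
        return list(self._elements)

def collect_results(driver):
    """Return the text of every search result wrapper on the page."""
    xp_results_table = '//*[@class = "resultWrapper"]'
    flight_containers = driver.find_elements_by_xpath(xp_results_table)
    flights_list = [flight.text for flight in flight_containers]
    return flights_list

sample = FakeDriver(['result one', 'result two', 'result three', 'result four'])
flights_list = collect_results(sample)
print(flights_list[0:3])  # look at the first three results
```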

Note that if you understand the above, you should easily understand most of the code we are going to look at. When such code runs, we reach what we need (in fact, the element in which the result is wrapped) using some path-specifying mechanism (XPath). This is done in order to get the element's text and put it into an object from which the data can be read (first flight_containers is used, then flights_list).

The first three results are displayed, and we can clearly see everything we need. However, we have more interesting ways of obtaining information. We need to take the data from each element separately.

Get to work!

The easiest function to write is the one that loads additional results, so that is where we will start. I would like to maximize the number of flights the program gets information about without raising suspicion in the service, which leads to a check, so I click the Load more results button once each time the page is displayed. In this code, pay attention to the try block, which I added because sometimes the button does not load properly. If you run into this as well, comment out the calls to this function in the code of the start_kayak function, which we will look at below.

# Load more results to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these notes while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45,60))
    except Exception:
        # The button sometimes fails to load; in that case just skip this step
        pass

Now, after a long analysis of this function (sometimes I get carried away), we are ready to declare the function that will scrape the page.

I have already gathered most of what is needed in the following function, called page_scrape. Sometimes the returned data about the legs of the journey comes combined, so I use a simple method to separate it, for example when I first use the variables section_a_list and section_b_list. Our function returns a data frame, flights_df, which lets us keep the results obtained with different sorting methods separate and combine them later.

def page_scrape():
    """This function takes care of the scraping part"""
    
    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2] # this is how we separate the information about the two flights
    section_b_list = sections_list[1::2]
    
    # If you run into reCaptcha, you may need to do something about it.
    # You will know something went wrong because the lists above are empty.
    # This if statement lets you stop the program or do something else;
    # here you could pause, which would let you pass the check and continue scraping.
    # I use SystemExit here because I want to test everything from the very beginning.
    if section_a_list == []:
        raise SystemExit
    
    # I will use the letter A for outbound flights and B for return flights
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # Get the duration and the city names
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # Get the duration and the city names
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Separate the day from the weekday
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]
    
    # Get the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$','') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is one big list in which the first leg of the trip is at even indexes and the second leg at odd indexes
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n','0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]
    
    # carrier information plus departure and arrival times for both legs
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # split the times and carriers between legs a and b (even index = a, odd index = b)
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    
    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
            'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
            'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,                           
                               'Price': prices_list})[cols]
    
    flights_df['timestamp'] = strftime("%Y%m%d-%H%M") # time the data was collected
    return flights_df
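
The even/odd slicing and the split() calls used in page_scrape can be tried on a small sample. The strings below are made up to resemble the text of a "section duration" element; the real page text may be laid out differently.

```python
# Hypothetical element texts: outbound and return legs alternate in page order
sections_list = ['17h 25m LIS - SIN', '19h 05m SIN - LIS',
                 '21h 40m LIS - SIN', '18h 30m SIN - LIS']

section_a_list = sections_list[::2]   # even indexes: first leg of each result
section_b_list = sections_list[1::2]  # odd indexes: second leg of each result

# The first two tokens are the duration, the next three are the city pair
a_duration = [''.join(text.split()[0:2]) for text in section_a_list]
a_section_names = [''.join(text.split()[2:5]) for text in section_a_list]

print(section_b_list)
print(a_duration, a_section_names)
```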

I tried to name the variables so that the code is easy to follow. Remember that variables starting with a refer to the first leg of the trip, and those starting with b to the second. Let's move on to the next function.

Helper mechanisms

We now have a function that loads additional search results and a function that processes those results. The article could have ended here, since these two functions provide everything you need to scrape pages that you open yourself. But we have not yet looked at some of the helper mechanisms mentioned above. For example, the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will look at now.

For this function to work, it needs the cities and dates. Using this information, it builds a URL in the variable kayak, which is used to navigate to a page containing search results sorted by best match. After the first scraping session, we work with the prices in the table at the top of the page. Namely, we find the minimum ticket price and the average price. All of this, together with the prediction provided by the site, will be sent by email. On the page, the corresponding table should be in the upper-left corner. Working with this table, by the way, can cause an error when searching with exact dates, because in that case the table is not displayed on the page.
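The minimum and average calculation can be sketched in isolation. The helper summarize_prices is hypothetical, and skipping empty cells and stripping thousands separators are assumptions of this sketch, not something the script does.

```python
def summarize_prices(cell_texts):
    """Return (minimum, average) for price strings such as '$512' (hypothetical helper)."""
    prices = []
    for text in cell_texts:
        cleaned = text.replace('$', '').replace(',', '').strip()
        if cleaned:  # ignore cells with no price
            prices.append(int(cleaned))
    if not prices:
        raise ValueError('no prices found; the flexible-date table may be missing')
    return min(prices), sum(prices) / len(prices)

matrix_min, matrix_avg = summarize_prices(['$512', '$1,024', '', '$600'])
print(matrix_min, matrix_avg)
```

Raising an explicit error when no prices are found makes the exact-date case, where the table is absent, easier to recognize than a failure inside min().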

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format -  YYYY-MM-DD"""
    
    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))
    
    # sometimes a popup appears; a try block can be used to check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')
    
#     load_more()
    
    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))
    
    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)
    
    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))
    
    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()  
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))
    
    # Save the new data frame to an Excel file whose name reflects the cities and dates
    final_df = pd.concat([df_flights_cheap, df_flights_best, df_flights_fast])
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to, 
                                                                                   date_start, date_end), index=False)
    print('saved df.....')
    
    # You can keep track of how the site's prediction compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading+'\n'+prediction)
    
    # sometimes the loading variable contains this string, which later causes problems when sending the email
    # if that happens, replace it with "Not sure"
    weird = '¯_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'
    
    username = '[email protected]'
    password = 'YOUR PASSWORD'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\n'
           'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'
           .format(matrix_min, matrix_avg, (loading+'\n'+prediction)))
    message = MIMEMultipart()
    message['From'] = '[email protected]'
    message['to'] = '[email protected]'
    server.sendmail('[email protected]', '[email protected]', msg)
    print('sent email.....')

I tested this script with an Outlook account (hotmail.com). I have not checked that it works correctly with a Gmail account; that mail service is very popular, but there are many possible options. If you use a Hotmail account, then all you need to do for everything to work is enter your details into the code.
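As an alternative to formatting the message by hand, the standard library's email.message.EmailMessage can build it. This is a sketch, not the author's approach: the function build_summary_email and the example addresses are hypothetical, and nothing is sent here.

```python
from email.message import EmailMessage

def build_summary_email(sender, recipient, cheapest, average, advice):
    """Compose the price summary as an EmailMessage (hypothetical helper)."""
    message = EmailMessage()
    message['Subject'] = 'Flight Scraper'
    message['From'] = sender
    message['To'] = recipient
    message.set_content(
        'Cheapest Flight: {}\nAverage Price: {}\n\nRecommendation: {}\n\nEnd of message'
        .format(cheapest, average, advice))
    return message

summary = build_summary_email('sender@example.com', 'recipient@example.com',
                              512, 712.0, 'Not sure')
print(summary['Subject'])
print(summary.get_content())
```

A message built this way could then be handed to the send_message method of a logged-in smtplib.SMTP connection, which takes care of the headers and encoding.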

If you want to understand exactly what is happening in particular parts of this function's code, you can copy them and experiment with them. Experimenting with code is the only way to really understand it.

The finished system

Now that we have done everything we talked about, we can create a simple loop that calls our functions. The script asks the user for the cities and dates. When testing with constant restarts of the script, you are unlikely to want to enter this data manually every time, so for the duration of testing the corresponding lines can be commented out, and the lines below them, in which the data the script needs is hardcoded, can be uncommented instead.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))
    
    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')
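
The waiting logic can also be separated from the scraping so that it is easy to test. In this sketch run_periodically, its job callable and the injectable sleeper are hypothetical; the random 4 to 6 hour gap follows the interval mentioned earlier in the article rather than the fixed 4 hours used in the loop.

```python
from random import randint
from time import sleep

def run_periodically(job, iterations, low_hours=4, high_hours=6, sleeper=sleep):
    """Call job() `iterations` times, pausing low_hours to high_hours between calls."""
    for n in range(iterations):
        job()
        if n < iterations - 1:  # no need to wait after the last run
            sleeper(randint(low_hours * 3600, high_hours * 3600))

# Dry run with a fake job and a fake sleeper, so nothing actually waits
runs, pauses = [], []
run_periodically(lambda: runs.append('scrape'), 3, sleeper=pauses.append)
print(len(runs), len(pauses))
```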

This is what a test run of the script looks like.
Test run of the script

Summary

If you have made it this far, congratulations! You now have a working web scraper, although I can already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something similar to receive results from several servers at once. There is also the recurring problem of the site checking whether the user is a human, but that problem can be solved too. In any case, you now have a base that you can extend if you wish. For example, make the Excel file get sent to the user as an email attachment.


Source: www.habr.com
