Python: an assistant for finding cheap flights for those who love to travel

The author of the article, a translation of which we are publishing today, says that the goal is to describe the development of a Python web scraper that uses Selenium to search for airfare prices. The search uses flexible dates (±3 days from the specified dates). The scraper saves the search results to an Excel file and sends the person who ran it an email with a summary of what it found. The goal of this project is to help travelers find the best deals.


If you feel lost as you go through the material, take a look at this article.

What are we looking for?

You are free to use the system described here however you like. For example, I used it to find weekend tours and tickets to my hometown. If you are serious about finding good deals on tickets, you can run the script on a server (a simple server, for 130 rubles a month, is quite suitable for this) and have it run once or twice a day. The search results will be sent to you by email. In addition, I recommend setting things up so that the script saves an Excel file with the search results to a Dropbox folder, which will let you view such files from anywhere and at any time.

(Image caption: I haven't found any error fares yet, but I think it's possible)

As already mentioned, the search uses "flexible dates": the script finds offers that fall within three days of the given dates. Although the script searches offers in only one direction per run, it is easy to modify it to collect data for several flight directions. With it you can even hunt for error fares; such finds can be very interesting.

Why do you need another web scraper?

When I first got into web scraping, I honestly wasn't that interested. I wanted to do more projects in predictive modeling, financial analysis, and perhaps text sentiment analysis. But it turned out to be very interesting to figure out how to create a program that collects data from websites. As I delved into this topic, I realized that it is web scraping that is the "engine" of the Internet.

You may think that this is too bold a statement. But consider that Google started with a web scraper that Larry Page created using Java and Python. Google bots have explored and are exploring the internet in an attempt to provide their users with the best possible answers to their questions. Web scraping has an endless number of applications, and even if you are interested in something else in the field of Data Science, you will need some scraping skills to get data for analysis.

Some of the techniques used here I found in a wonderful book about web scraping that I recently acquired. It contains many simple examples and ideas for applying what you learn in practice. There is also a very interesting chapter on bypassing reCaptcha checks. This was news to me, since I did not know there were special tools and even entire services for solving such problems.

Do you like to travel?!

The simple and rather innocuous question posed in the title of this section can often be answered in the affirmative, complete with a couple of travel stories from the person asked. Most of us would agree that travel is a great way to immerse yourself in new cultures and broaden your horizons. However, ask someone whether they like searching for flight tickets, and I am sure the answer will be far less positive. This is where Python comes to the rescue.

The first task to solve on the way to building a system for searching flight ticket information is choosing a suitable platform to take the information from. This decision was not easy for me, but in the end I chose the Kayak service. I tried Momondo, Skyscanner, Expedia, and a few others, but the anti-robot mechanisms on those resources were impenetrable. After several attempts to convince the systems that I was human, during which I had to deal with traffic lights, crosswalks, and bicycles, I decided that Kayak suited me best, even though here, too, loading too many pages in a short time triggers checks. I managed to have the bot send requests to the site at intervals of 4 to 6 hours, and everything worked fine. Difficulties do arise with Kayak from time to time, but if the checks start pestering you, either deal with them manually and then start the bot, or wait a few hours and the checks should stop. If necessary, you can adapt the code for another platform; if you do, please mention it in the comments.

If you're just getting started with web scraping and don't know why some websites fight against it, then before starting your first project in this area, do yourself a favor and search Google for "web scraping etiquette". Your experiments may end sooner than you think if you scrape unwisely.

Getting started

Here is a general overview of what will happen in our web scraper code:

  • Import the required libraries.
  • Open a Google Chrome tab.
  • Call a function that starts the bot, passing it the cities and dates to be used in the ticket search.
  • This function gets the first search results, sorted by best match, and clicks a button to load more results.
  • Another function collects data from the entire page and returns a data frame.
  • The previous two steps are repeated for the sorts by ticket price (cheapest) and by flight duration (fastest).
  • An email is sent to the user of the script with a summary of ticket prices (cheapest ticket and average price), and a data frame with the details, sorted by the three indicators above, is saved as an Excel file.
  • All of the above actions are performed in a loop that repeats after a given interval.

It should be noted that every Selenium project starts with a web driver. I use chromedriver and work with Google Chrome, but there are other options: PhantomJS and Firefox are also popular. After downloading the driver, place it in the appropriate folder, and the preparation for using it is complete. The first lines of our script open a new Chrome tab.

Keep in mind that I am not trying to break new ground in finding profitable airfare deals; there are far more advanced techniques for that. I just want to offer readers of this material a simple but practical way to approach the problem.

Here is the code we talked about above.

from time import sleep, strftime
from random import randint
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import smtplib
from email.mime.multipart import MIMEMultipart

# Use your own path to chromedriver here!
chromedriver_path = 'C:/{YOUR PATH HERE}/chromedriver_win32/chromedriver.exe'

driver = webdriver.Chrome(executable_path=chromedriver_path) # This command opens a Chrome window
sleep(2)

At the beginning of the code you can see the package import commands used throughout the project. So, randint is used to make the bot "sleep" for a random number of seconds before starting a new search operation. Usually no bot can do without this. If you run the code above, a Chrome window will open, which the bot will use to work with sites.

Let's run a small experiment and open kayak.com in a separate window. We will select the city we are flying from and the city we want to get to, as well as the flight dates. When choosing the dates, make sure that the ±3 days range is used. I wrote the code with the pages the site produces in response to such requests in mind. If, for example, you need to search for tickets only for exact dates, there is a high probability that you will have to modify the bot's code. I explain the code as I go, but if you feel confused, let me know.

Now click the search button and look at the link in the address bar. It should be similar to the link I use in the example below, where the kayak variable that stores the URL is declared and the web driver's get method is used. After clicking the search button, the results should appear on the page.

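Here is a minimal sketch of that step (the same kayak variable and driver.get call appear later in the start_kayak function; city_from, city_to, date_start and date_end are assumed to hold IATA codes and dates in YYYY-MM-DD format):

# Build the flexible-dates search URL and open it in the Chrome window controlled by Selenium
kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
         '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
driver.get(kayak)
sleep(randint(8, 10))  # give the page some time to load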
When I used the get command more than two or three times within a few minutes, I was asked to pass a reCaptcha check. You can pass this check manually and continue experimenting until the system decides to arrange a new one. When I tested the script, the first search session always seemed to go smoothly, so if you want to experiment with the code, you only have to pass the check manually from time to time and leave the code running with long intervals between search sessions. And, if you think about it, a person is unlikely to need information about ticket prices collected at 10-minute intervals.

Working with the page using XPath

So, we opened the window and loaded the site. To get prices and other information, we need to use either XPath or CSS selectors. I decided to stick with XPath and did not feel the need for CSS selectors, but it is entirely possible to work that way. Navigating a page using XPath can be tricky, and even if you use the approach I described in this article, where suitable identifiers are copied from the page code, I realized that this is, in fact, not the best way to refer to the necessary elements. By the way, this book has an excellent introduction to the basics of working with pages using XPath and CSS selectors. Here is what the corresponding webdriver method looks like.

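(In the original, this step is shown as a screenshot; the call looks roughly like the sketch below, using one of the XPath strings defined later in page_scrape.)

# Find every element on the page that matches an XPath expression
xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
prices = driver.find_elements_by_xpath(xp_prices)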
So, we continue working on the bot. We will use the program to select the cheapest tickets. In the following image, the XPath selector code is highlighted in red. To view the code, right-click on the page element you are interested in and select the Inspect command from the menu that appears. This command can be called on different elements of the page; their code will be displayed and highlighted in the code view window.

(Screenshot: viewing the page code)

To see what I mean about the shortcomings of copying selectors from the code, pay attention to the following.

Here's what happens when you copy the code:

//*[@id="wtKI-price_aTab"]/div[1]/div/div/div[1]/div/span/span

In order to copy something similar, you need to right-click on the section of code you are interested in and select the Copy > Copy XPath command from the menu that appears.

Here's what I used to define the Cheapest button:

cheap_results = '//a[@data-code = "price"]'

(Screenshot: the Copy > Copy XPath command)

It is quite obvious that the second option looks much simpler. When used, it searches for an a element whose data-code attribute equals price. The first option searches for an element whose id equals wtKI-price_aTab and whose XPath path to the element looks like /div[1]/div/div/div[1]/div/span/span. Such an XPath query to the page will do its job, but only once. I can tell you right now that the id will change the next time the page is loaded. The character sequence wtKI changes dynamically with each page load, so code that relies on it becomes useless after the next page reload. So take some time to get to grips with XPath. This knowledge will serve you well.

However, it should be noted that copying XPath selectors can come in handy when working with fairly simple sites, and if that suits you, there is nothing wrong with that.
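
For reference, here is how that selector is used further on to switch to the cheapest results (the same call appears later in start_kayak):

# Click the "Cheapest" sort tab using the data-code selector described above
cheap_results = '//a[@data-code = "price"]'
driver.find_element_by_xpath(cheap_results).click()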

Now let's think about how to get all the search results, several lines each, into a list. It is very simple: each result lives inside an object with the class resultWrapper, and loading all the results can be done in a loop like the one sketched below.

It should be noted that if you understand the above, you should easily understand most of the code we will analyze. In this code we reach the element we need (the element in which each result is wrapped) using a path-specifying mechanism (XPath), take its text, and place it in an object we can read data from (first flight_containers, then flights_list).
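
(The original shows this as a screenshot; below is a minimal sketch based on the class name and the variable names mentioned above.)

# Collect every search result: each one is wrapped in an element with the class "resultWrapper"
flight_containers = driver.find_elements_by_xpath('//div[@class = "resultWrapper"]')
flights_list = [flight.text for flight in flight_containers]
print(flights_list[0:3])  # print the first few results to check what we got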

The first three results are printed, and we can clearly see everything we need. However, there are more interesting ways to get the information: we need to take the data from each element separately.

To work!

It is easiest to write a function for loading additional results, so let's start with that. I want to maximize the number of flights the program receives without arousing the service's suspicion and triggering a check, so I click the Load more results button once each time a page is displayed. In this code, pay attention to the try block, which I added because sometimes the button does not load normally. If you run into this too, comment out the calls to this function in the start_kayak function, which we will look at below.

# Load more results in order to maximize the amount of data collected
def load_more():
    try:
        more_results = '//a[@class = "moreButton"]'
        driver.find_element_by_xpath(more_results).click()
        # Printing these messages while the program runs helps me quickly see what it is doing
        print('sleeping.....')
        sleep(randint(45,60))
    except:
        pass

Now, after a long discussion of this function (I can sometimes get carried away), we are ready to declare the function that will handle scraping the page.

I have already assembled most of what is needed in the following function, called page_scrape. Sometimes the returned data about the legs of the trip comes combined, so I use a simple way to separate it: for example, the first time I use the variables section_a_list and section_b_list. The function returns a data frame, flights_df, which lets us separate the results obtained with different sorting methods and combine them later.

def page_scrape():
    """This function takes care of the scraping part"""
    
    xp_sections = '//*[@class="section duration"]'
    sections = driver.find_elements_by_xpath(xp_sections)
    sections_list = [value.text for value in sections]
    section_a_list = sections_list[::2] # this is how we separate the information about the two legs
    section_b_list = sections_list[1::2]
    
    # If you run into reCaptcha, you may need to do something about it.
    # You will know something went wrong because the lists above will be empty;
    # this if statement lets you terminate the program or do something else.
    # You could pause execution here, which would let you pass the check and continue scraping.
    # I use SystemExit here because I want to test everything from the very beginning.
    if section_a_list == []:
        raise SystemExit
    
    # I will use the letter A for outbound flights and B for return flights
    a_duration = []
    a_section_names = []
    for n in section_a_list:
        # Get the time
        a_section_names.append(''.join(n.split()[2:5]))
        a_duration.append(''.join(n.split()[0:2]))
    b_duration = []
    b_section_names = []
    for n in section_b_list:
        # Get the time
        b_section_names.append(''.join(n.split()[2:5]))
        b_duration.append(''.join(n.split()[0:2]))

    xp_dates = '//div[@class="section date"]'
    dates = driver.find_elements_by_xpath(xp_dates)
    dates_list = [value.text for value in dates]
    a_date_list = dates_list[::2]
    b_date_list = dates_list[1::2]
    # Get the day of the week
    a_day = [value.split()[0] for value in a_date_list]
    a_weekday = [value.split()[1] for value in a_date_list]
    b_day = [value.split()[0] for value in b_date_list]
    b_weekday = [value.split()[1] for value in b_date_list]
    
    # Get the prices
    xp_prices = '//a[@class="booking-link"]/span[@class="price option-text"]'
    prices = driver.find_elements_by_xpath(xp_prices)
    prices_list = [price.text.replace('$','') for price in prices if price.text != '']
    prices_list = list(map(int, prices_list))

    # stops is one big list: the first leg of the trip is at even indices and the second at odd indices
    xp_stops = '//div[@class="section stops"]/div[1]'
    stops = driver.find_elements_by_xpath(xp_stops)
    stops_list = [stop.text[0].replace('n','0') for stop in stops]
    a_stop_list = stops_list[::2]
    b_stop_list = stops_list[1::2]

    xp_stops_cities = '//div[@class="section stops"]/div[2]'
    stops_cities = driver.find_elements_by_xpath(xp_stops_cities)
    stops_cities_list = [stop.text for stop in stops_cities]
    a_stop_name_list = stops_cities_list[::2]
    b_stop_name_list = stops_cities_list[1::2]
    
    # carrier information, departure and arrival times for both legs
    xp_schedule = '//div[@class="section times"]'
    schedules = driver.find_elements_by_xpath(xp_schedule)
    hours_list = []
    carrier_list = []
    for schedule in schedules:
        hours_list.append(schedule.text.split('\n')[0])
        carrier_list.append(schedule.text.split('\n')[1])
    # split the time and carrier information between the a and b legs
    a_hours = hours_list[::2]
    a_carrier = carrier_list[::2]
    b_hours = hours_list[1::2]
    b_carrier = carrier_list[1::2]

    
    cols = (['Out Day', 'Out Time', 'Out Weekday', 'Out Airline', 'Out Cities', 'Out Duration', 'Out Stops', 'Out Stop Cities',
            'Return Day', 'Return Time', 'Return Weekday', 'Return Airline', 'Return Cities', 'Return Duration', 'Return Stops', 'Return Stop Cities',
            'Price'])

    flights_df = pd.DataFrame({'Out Day': a_day,
                               'Out Weekday': a_weekday,
                               'Out Duration': a_duration,
                               'Out Cities': a_section_names,
                               'Return Day': b_day,
                               'Return Weekday': b_weekday,
                               'Return Duration': b_duration,
                               'Return Cities': b_section_names,
                               'Out Stops': a_stop_list,
                               'Out Stop Cities': a_stop_name_list,
                               'Return Stops': b_stop_list,
                               'Return Stop Cities': b_stop_name_list,
                               'Out Time': a_hours,
                               'Out Airline': a_carrier,
                               'Return Time': b_hours,
                               'Return Airline': b_carrier,                           
                               'Price': prices_list})[cols]
    
    flights_df['timestamp'] = strftime("%Y%m%d-%H%M") # the time the data was collected
    return flights_df

I tried to name the variables so that the code is understandable. Remember that variables starting with a refer to the first leg of the journey, and those starting with b to the second. Let's move on to the next function.

Auxiliary mechanisms

We now have a function for loading additional search results and a function for processing those results. This article could have ended there, since these two functions provide everything you need to scrape pages that you can open yourself. But we have not yet considered some of the auxiliary mechanisms mentioned above, such as the code for sending emails and a few other things. All of this can be found in the start_kayak function, which we will now consider.

This function needs the cities and dates. Using this information, it builds the link in the kayak variable, which is used to open the page with search results sorted by best match to the query. After the first scraping session, we work with the prices in the table at the top of the page: we find the minimum ticket price and the average price. All of this, together with the prediction the site gives, is sent by email. The corresponding table should be in the upper left corner of the page. Working with this table may, by the way, cause an error when searching with exact dates, since in that case the table is not displayed on the page.

def start_kayak(city_from, city_to, date_start, date_end):
    """City codes - it's the IATA codes!
    Date format -  YYYY-MM-DD"""
    
    kayak = ('https://www.kayak.com/flights/' + city_from + '-' + city_to +
             '/' + date_start + '-flexible/' + date_end + '-flexible?sort=bestflight_a')
    driver.get(kayak)
    sleep(randint(8,10))
    
    # sometimes a popup appears; a try block can be used to check for it and close it
    try:
        xp_popup_close = '//button[contains(@id,"dialog-close") and contains(@class,"Button-No-Standard-Style close ")]'
        driver.find_elements_by_xpath(xp_popup_close)[5].click()
    except Exception as e:
        pass
    sleep(randint(60,95))
    print('loading more.....')
    
#     load_more()
    
    print('starting first scrape.....')
    df_flights_best = page_scrape()
    df_flights_best['sort'] = 'best'
    sleep(randint(60,80))
    
    # Take the lowest price from the table at the top of the page
    matrix = driver.find_elements_by_xpath('//*[contains(@id,"FlexMatrixCell")]')
    matrix_prices = [price.text.replace('$','') for price in matrix]
    matrix_prices = list(map(int, matrix_prices))
    matrix_min = min(matrix_prices)
    matrix_avg = sum(matrix_prices)/len(matrix_prices)
    
    print('switching to cheapest results.....')
    cheap_results = '//a[@data-code = "price"]'
    driver.find_element_by_xpath(cheap_results).click()
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting second scrape.....')
    df_flights_cheap = page_scrape()
    df_flights_cheap['sort'] = 'cheap'
    sleep(randint(60,80))
    
    print('switching to quickest results.....')
    quick_results = '//a[@data-code = "duration"]'
    driver.find_element_by_xpath(quick_results).click()  
    sleep(randint(60,90))
    print('loading more.....')
    
#     load_more()
    
    print('starting third scrape.....')
    df_flights_fast = page_scrape()
    df_flights_fast['sort'] = 'fast'
    sleep(randint(60,80))
    
    # Save the new frame to an Excel file whose name reflects the cities and dates
    final_df = df_flights_cheap.append(df_flights_best).append(df_flights_fast)
    final_df.to_excel('search_backups//{}_flights_{}-{}_from_{}_to_{}.xlsx'.format(strftime("%Y%m%d-%H%M"),
                                                                                   city_from, city_to, 
                                                                                   date_start, date_end), index=False)
    print('saved df.....')
    
    # You can keep track of how the prediction given by the site compares with reality
    xp_loading = '//div[contains(@id,"advice")]'
    loading = driver.find_element_by_xpath(xp_loading).text
    xp_prediction = '//span[@class="info-text"]'
    prediction = driver.find_element_by_xpath(xp_prediction).text
    print(loading+'\n'+prediction)
    
    # sometimes this string ends up in the loading variable and later causes problems when sending the email
    # if that happens, change it to "Not sure"
    weird = '¯\\_(ツ)_/¯'
    if loading == weird:
        loading = 'Not sure'
    
    username = '[email protected]'
    password = 'YOUR PASSWORD'

    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
    msg = ('Subject: Flight Scraper\n\n'
           'Cheapest Flight: {}\nAverage Price: {}\n\n'
           'Recommendation: {}\n\nEnd of message').format(matrix_min, matrix_avg, (loading+'\n'+prediction))
    message = MIMEMultipart()
    message['From'] = '[email protected]'
    message['to'] = '[email protected]'
    server.sendmail('[email protected]', '[email protected]', msg)
    print('sent email.....')

I tested this script using an Outlook (hotmail.com) account. I have not tested it with a Gmail account, which is a very popular email system, but there are plenty of options. If you use a Hotmail account, you just need to enter your data into the code for everything to work.

If you want to understand what exactly is done in the individual sections of the code for this function, you can copy them and play around with them. Experimenting with code is the only way to truly understand it.

Ready system

Now that everything we talked about is done, we can create a simple loop that calls our functions. The script asks the user for the cities and dates. When you are testing and constantly restarting the script, you will hardly want to enter this data manually every time, so for the duration of testing the corresponding lines can be commented out and the lines below them uncommented, where the data the script needs is hard-coded.

city_from = input('From which city? ')
city_to = input('Where to? ')
date_start = input('Search around which departure date? Please use YYYY-MM-DD format only ')
date_end = input('Return when? Please use YYYY-MM-DD format only ')

# city_from = 'LIS'
# city_to = 'SIN'
# date_start = '2019-08-21'
# date_end = '2019-09-07'

for n in range(0,5):
    start_kayak(city_from, city_to, date_start, date_end)
    print('iteration {} was complete @ {}'.format(n, strftime("%Y%m%d-%H%M")))
    
    # Wait 4 hours
    sleep(60*60*4)
    print('sleep finished.....')

Here's what a test run of the script looks like.
(Screenshot: a test run of the script)

Results

If you've made it to this point, congratulations! You now have a working web scraper, although I already see many ways to improve it. For example, it could be integrated with Twilio so that it sends text messages instead of emails. You could use a VPN or something else to get results from multiple servers at the same time. There is also the recurring problem of the site checking whether the user is a person, but that can be solved too. In any case, you now have a base that you can extend if you wish, for example, so that the Excel file is sent to the user as an attachment to the email.
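
As a sketch of that last idea (not part of the original script), the saved Excel file could be attached with the standard email package, roughly like this; attach_excel is a hypothetical helper:

# Hypothetical extension: attach the saved Excel file to the outgoing email
from email.mime.base import MIMEBase
from email import encoders

def attach_excel(message, filepath):
    # message is a MIMEMultipart instance, filepath points at the saved .xlsx file
    part = MIMEBase('application', 'octet-stream')
    with open(filepath, 'rb') as f:
        part.set_payload(f.read())
    encoders.encode_base64(part)
    part.add_header('Content-Disposition',
                    'attachment; filename="{}"'.format(filepath.split('/')[-1]))
    message.attach(part)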



Source: habr.com
