
✈️ flight-scraping

The goal of this project is to build a web application that scrapes Sydney Airport's flight listing pages and displays all of today's flights on a single HTML page.

Table of Contents

Background
Features of the completed web app
Built with
Method
How to use
Tricky bits and challenges
References

Background

Sydney Airport's flight listing page enables a user to choose between Arrivals or Departures, and between Domestic or International flights. However, there is no option to display all flights on a single page. For plane spotting purposes, it would be helpful to have a complete listing of the current day's flights in a single place.

Features of the completed web app

  • Dynamic loading screen while data is fetched
  • Results page displays complete listing of today's flights at Sydney Airport. Icons are used to differentiate between arrivals and departures, and between domestic and international flights
  • Data is refreshed every 15 minutes and synced to a PostgreSQL database
  • Web application is deployed to Heroku

Built with

  • BeautifulSoup - Python library used to scrape HTML pages
  • Selenium ChromeDriver - used to automate Chrome browser interaction from Python
  • Flask - Python framework used to build the web application
  • APScheduler - Python library used to schedule automatic cron refreshes of the flight data
  • Heroku Postgres - used to cache flight data

Method

A brief description of the steps used to create the web app:

1. Web scraping

  • Use Selenium ChromeDriver to grab HTML content from Sydney Airport's flight listing page. There are four possible combinations of domestic / international and arrivals / departures, so there are four HTML pages to grab
  • Use BeautifulSoup to parse the HTML content and pull out relevant information. For each flight, relevant info includes its origin / destination, stopover (if any), airline, flight number, status (arrived, departed, delayed, cancelled etc.), scheduled time, and estimated time
  • Store parsed data in a Pandas dataframe. Each row corresponds to a single flight today.
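The parsing step can be sketched as follows. This is a minimal example against simplified, hypothetical markup: the real page's structure, class names, and columns differ, and the HTML would come from Selenium rather than a literal string.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Simplified stand-in for one of the four fetched pages; the real
# markup and class names on the airport's site are different.
html = """
<div class="flight-row">
  <span class="airline">Qantas</span>
  <span class="flight-number">QF1</span>
  <span class="city">London</span>
  <span class="status">Departed</span>
  <span class="scheduled">16:30</span>
</div>
<div class="flight-row">
  <span class="airline">Jetstar</span>
  <span class="flight-number">JQ501</span>
  <span class="city">Melbourne</span>
  <span class="status">Delayed</span>
  <span class="scheduled">17:05</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for row in soup.find_all("div", class_="flight-row"):
    # Pull out one field per <span>; each dict becomes one DataFrame row
    rows.append({
        "airline": row.find("span", class_="airline").get_text(strip=True),
        "flight_number": row.find("span", class_="flight-number").get_text(strip=True),
        "city": row.find("span", class_="city").get_text(strip=True),
        "status": row.find("span", class_="status").get_text(strip=True),
        "scheduled": row.find("span", class_="scheduled").get_text(strip=True),
    })

flights = pd.DataFrame(rows)  # one row per flight
```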

2. Build Flask app

  • Create Flask app with two routes - one for loading screen while waiting for data to be fetched, and one to display the results table
  • Use CSS templates to style the loading and result pages.
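The two-route structure looks roughly like this. The sketch uses inline template strings to stay self-contained; the real app renders separate HTML/CSS template files, and the route names here are illustrative.

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Inline templates keep the sketch self-contained; the real app
# uses render_template() with HTML files styled by CSS.
LOADING_PAGE = "<p>Fetching today's flights...</p>"
RESULTS_PAGE = "<h1>Sydney Airport flights</h1>{{ table | safe }}"

@app.route("/")
def loading():
    # Shown while the flight data is being fetched
    return render_template_string(LOADING_PAGE)

@app.route("/results")
def results():
    # Placeholder table; the real app renders the scraped DataFrame here
    table_html = "<table><tr><td>QF1</td></tr></table>"
    return render_template_string(RESULTS_PAGE, table=table_html)
```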

3. Connect to database

  • Set up a SQLite database for local testing
  • Write a Python function to store the scraped data in the database table, and use the APScheduler library to schedule automatic refreshes of the flight data
  • Connect the /results route of the Flask app to fetch data from the database, instead of performing a fresh web scrape every time the endpoint is called.
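A sketch of the refresh function, using the standard-library sqlite3 module with a simplified three-column schema (the real app stores more columns per flight):

```python
import sqlite3

def refresh_flights(db_path, flights):
    """Replace the cached flight rows with freshly scraped ones.

    `flights` is a list of (flight_number, city, status) tuples;
    the schema here is a simplified stand-in.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS flights "
            "(flight_number TEXT, city TEXT, status TEXT)"
        )
        conn.execute("DELETE FROM flights")  # discard the stale listing
        conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", flights)
        conn.commit()
    finally:
        conn.close()

# With APScheduler, the periodic refresh would be registered
# roughly as (illustrative, not the app's exact call):
#   scheduler.add_job(refresh_flights, "interval", minutes=15, args=[...])
```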

4. Deploy to Heroku

  • Define a Procfile to declare the command to be executed to start the app. Gunicorn is used as the web server.
  • Define a requirements.txt to specify the Python package dependencies on Heroku via pip
  • Create a new Heroku app and set up the PostgreSQL database add-on
  • Deploy by connecting to GitHub repository.
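A minimal Procfile for this setup, assuming the Flask instance is named `app` inside `main.py`, would look like:

```
web: gunicorn main:app
```

Heroku reads this file at the repository root and runs the declared command in the web dyno on startup.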

How to use

Local use

  1. Uncomment the following lines of code in main.py to configure the SQLite database for local testing.

    basedir = os.path.abspath(os.path.dirname(__file__))
    app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///' + os.path.join(basedir, 'data.sqlite')
  2. Navigate to the project root directory and run the following commands:

    $ pip install -r requirements.txt
    $ python main.py
  3. Open http://127.0.0.1:5000/ in your browser. You should see the webpage rendered.

On the web

Go to https://sydneyflights.herokuapp.com/.

Tricky bits and challenges

Scraping dynamic HTML pages

  • Initially attempted to simply use the get() function from Python's requests library to grab HTML content from Sydney Airport's flight listing page
  • This did not work - close examination of the scraped content showed that not all HTML from the web page was fetched. Specifically, the scraped content was missing the HTML tables for the individual flight listings
  • This is because the airport's flight listings are not static HTML pages - rather the content is dynamically generated by JavaScript
  • As a result, webdriver.Chrome() from the Selenium library was needed to perform dynamic scraping.

Truncated content in dataframe

  • When using pandas.DataFrame.to_html to render the flights dataframe as an HTML table, some of the content (specifically URLs for the airline logos) was truncated.
  • This resulted in the airline logos not displaying correctly in the table
  • Interestingly, this issue only materialised when rendering the HTML table on a browser. There was no issue when rendering an identical table in Jupyter Notebook
  • pd.options.display.max_colwidth = 200 was used to adjust the display settings of the dataframe, and successfully fixed the issue.

Running Chrome Driver on Heroku

  • Configuring Chrome Driver to work on Heroku was challenging

  • Chrome Driver was downloaded as an .exe file - this worked fine while testing locally, but posed problems when attempting to deploy to Heroku

  • Successful solution, based on Andres Sevilla's YouTube video:
    1. Initialise instance of Chrome Driver:

    from selenium import webdriver
    import os
    
    # Point Selenium at the Chrome binary installed by the Heroku buildpack
    chrome_options = webdriver.ChromeOptions()
    chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
    chrome_options.add_argument("--headless")               # no display on a dyno
    chrome_options.add_argument("--disable-dev-shm-usage")  # dyno /dev/shm is small
    chrome_options.add_argument("--no-sandbox")
    # Selenium 3 syntax: Selenium 4 replaces executable_path / chrome_options
    # with a Service object and an options= keyword argument
    driver = webdriver.Chrome(executable_path=os.environ.get("CHROMEDRIVER_PATH"), chrome_options=chrome_options)

    2. Add the following buildpacks to the Heroku app:

    Note: Buildpacks can be easily added by going to the Settings tab of the app on Heroku

    3. Add the following config vars to the Heroku app:

    • CHROMEDRIVER_PATH = /app/.chromedriver/bin/chromedriver
    • GOOGLE_CHROME_BIN = /app/.apt/usr/bin/google-chrome

    Note: Config vars can be easily edited by going to the Settings tab of the app on Heroku

Heroku dyno sleeping

  • This app uses a free web dyno on Heroku. The app goes to sleep if the dyno receives no web traffic in a 30-minute period
  • This poses a problem for the database's scheduled updates - the data will not be updated unless the Python code is running on the server
  • To solve this, a function was added to ping the app every 20 minutes
  • This ensures the database always contains up-to-date data, and users encounter minimal load times when opening the app.

Heroku memory quotas

  • Heroku has memory restrictions (maximum 512 MB per app for a free web dyno), and exceeding this causes a "Memory quota exceeded" error
  • This can sometimes prevent successful scraping and database updates
  • Have experimented with reducing the frequency of scheduled database updates to circumvent the memory quota issue
  • It appears ChromeDriver, used for the web scraping, causes an excessive memory load, and the problem cannot be completely solved without upgrading to a paid Heroku tier.

References

Key tutorials

Chrome Driver examples & trouble-shooting

Image sources
