This is a fictional project for lesson 2 of Udacity's Data Engineering with AWS Nanodegree program, to be reviewed by Udacity. Please find detailed instructions for the project in the file `sparkify_Instructions_from_Udacity.md`.
## Table of Contents
- The (Fictional) Task In A Nutshell
- Structure of this Repository
- Usage
- Data, Model and ETL
- Recommendations for the team
## The (Fictional) Task In A Nutshell

A music streaming startup named Sparkify has grown its user base and song database and wants to move its processes and data onto AWS. Sparkify's data already resides in S3 buckets:
- in a directory of JSON logs on user activity on the app, as well as
- in a directory with JSON metadata on the songs.
- For the JSON logs, there's also a corresponding `log_json_path.json` file that helps to parse the logs.
In order to enable Sparkify to analyze the data, I've been asked to build a data warehouse in AWS Redshift. So, my task is to create the needed infrastructure and to build an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for my colleagues from the analytics team, who will continue finding insights into what songs our users are listening to.
## Structure of this Repository

Besides this `README.md`, the following scripts are relevant for the project:

- `sql_queries.py`
- `create_tables.py`
- `etl.py`
Additionally, I've built a script, `infra.py`, which uses the `boto3` library to create the AWS resources needed for the project.
Furthermore, `sparkify_exploration.ipynb` contains an extensive exploration of the data, done locally on data downloaded from the S3 buckets before building the ETL pipeline.
There's also a `dwh.cfg` file which contains the configuration for the AWS resources. Of course, it does not contain the AWS credentials needed to build the infrastructure; these are stored in a `.env` file which is not part of this repository.
Please note that, in addition to the project itself, this repository also contains notebooks for the exercises in the lesson, which are not relevant for the project.
## Usage

The environment for this project is Python 3.11 with `psycopg2` installed. For the lessons, many other packages are installed, but they are not needed for the project.

For your convenience, I've created a `requirements.txt` file which you can use to install the necessary packages for this project. Please note that this file also contains the packages used for the lessons.
```bash
pip install -r requirements.txt
```
To run the scripts, simply run them from the command line:
```bash
python create_tables.py
python etl.py
```
`create_tables.py` drops the tables, if they exist, and rebuilds them. Run this script to reset the tables before each run of the ETL script.
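For orientation, `create_tables.py` follows the usual pattern of reading the cluster connection details from `dwh.cfg` and replaying the drop and create statements from `sql_queries.py`. The sketch below shows that pattern; the query-list names and config keys are assumptions and may differ from the actual code:

```python
# Minimal sketch of the create_tables.py pattern (names of the query lists in
# sql_queries.py and the dwh.cfg keys are assumptions, not the actual code).
import configparser
import psycopg2
from sql_queries import drop_table_queries, create_table_queries

def main():
    # Connection details are read from dwh.cfg (here assumed to live in a
    # [CLUSTER] section).
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    conn = psycopg2.connect(
        host=config.get("CLUSTER", "HOST"),
        dbname=config.get("CLUSTER", "DB_NAME"),
        user=config.get("CLUSTER", "DB_USER"),
        password=config.get("CLUSTER", "DB_PASSWORD"),
        port=config.get("CLUSTER", "DB_PORT"),
    )
    cur = conn.cursor()

    # Drop any existing tables first, then create the staging, fact, and
    # dimension tables, so every ETL run starts from a clean state.
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()

if __name__ == "__main__":
    main()
```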
## Data, Model and ETL

A typical entry in a JSON log file looks like this:
```json
{
  "artist": "Sound 5",
  "auth": "Logged In",
  "firstName": "Jacob",
  "gender": "M",
  "itemInSession": 2,
  "lastName": "Klein",
  "length": 451.21261,
  "level": "paid",
  "location": "Tampa-St. Petersburg-Clearwater, FL",
  "method": "PUT",
  "page": "NextSong",
  "registration": 1540558108796.0,
  "sessionId": 518,
  "song": "Latin Static",
  "status": 200,
  "ts": 1542462343796,
  "userAgent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/7.0.6 Safari/537.78.2\"",
  "userId": "73"
}
```
As we are aiming to build a data warehouse for querying song plays, it is important to note that not all log entries are relevant for the data warehouse. Entries where `auth` is not `Logged In` or where `length` is 0 do not contain the relevant information, so we need to filter them out before loading the data into the final tables.
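For illustration, such a filter could look like the following predicate, applied when the staged log entries are loaded into the final tables (the exact condition used in `sql_queries.py` may differ):

```python
# Hypothetical filter predicate on the log staging table, kept as a Python
# string in the style of sql_queries.py: keep only entries from logged-in users
# with an actual playback length.
relevant_events_filter_sketch = """
    WHERE auth = 'Logged In'
      AND length > 0
"""
```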
We are also provided with a JSONPaths file, `log_json_path.json`, which maps the fields in the log files to table columns. It is used when copying the log files into Redshift.
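For illustration, the staging `COPY` for the log data typically looks roughly like the sketch below when such a JSONPaths file is used. The table name, bucket paths, IAM role, and region are placeholders, not the project's actual values:

```python
# Hypothetical COPY statement for loading the log files into a staging table,
# kept as a Python string in the style of sql_queries.py. The placeholders in
# angle brackets would come from dwh.cfg.
staging_events_copy_sketch = """
    COPY staging_events
    FROM 's3://<log-data-location>'
    IAM_ROLE '<redshift-iam-role-arn>'
    FORMAT AS JSON 's3://<path-to>/log_json_path.json'
    REGION '<aws-region>';
"""
```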
The schema for staging the `log_data` files in Redshift is:
| Column Name | Data Type |
|---|---|
| artist | VARCHAR(200) |
| auth | VARCHAR(50) |
| firstName | VARCHAR(50) |
| gender | CHAR(1) |
| itemInSession | INTEGER |
| lastName | VARCHAR(50) |
| length | FLOAT |
| level | CHAR(4) |
| location | VARCHAR(200) |
| method | VARCHAR(10) |
| page | VARCHAR(50) |
| registration | FLOAT |
| sessionId | INTEGER |
| song | VARCHAR(200) |
| status | INTEGER |
| ts | BIGINT |
| userAgent | VARCHAR(200) |
| userId | INTEGER |
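This column list translates almost one-to-one into the DDL of the staging table. A sketch, assuming the staging table is called `staging_events` (the actual name in `sql_queries.py` may differ):

```python
# DDL sketch derived from the column list above; kept as a Python string in the
# style of sql_queries.py.
staging_events_table_create_sketch = """
    CREATE TABLE IF NOT EXISTS staging_events (
        artist        VARCHAR(200),
        auth          VARCHAR(50),
        firstName     VARCHAR(50),
        gender        CHAR(1),
        itemInSession INTEGER,
        lastName      VARCHAR(50),
        length        FLOAT,
        level         CHAR(4),
        location      VARCHAR(200),
        method        VARCHAR(10),
        page          VARCHAR(50),
        registration  FLOAT,
        sessionId     INTEGER,
        song          VARCHAR(200),
        status        INTEGER,
        ts            BIGINT,
        userAgent     VARCHAR(200),
        userId        INTEGER
    );
"""
```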
A typical JSON song file looks like this:
```json
{
  "artist_id": "ARLYGIM1187FB4376E",
  "artist_latitude": null,
  "artist_location": "",
  "artist_longitude": null,
  "artist_name": "Joe Higgs",
  "duration": 162.82077,
  "num_songs": 1,
  "song_id": "SOQOOPI12A8C13B040",
  "title": "Wake up And Live",
  "year": 1975
}
```
It is important to note here that the `song_data` files do not match well with the `log_data` files. This may be due to different sources, naming conventions, misspellings, etc. So, we need to be careful when joining the data and concentrate on the data from the `log_data` files.
The schema for staging the `song_data` files in Redshift is:
| Column Name | Data Type |
|---|---|
| artist_id | VARCHAR(50) |
| artist_latitude | FLOAT |
| artist_location | VARCHAR(200) |
| artist_longitude | FLOAT |
| artist_name | VARCHAR(200) |
| duration | FLOAT |
| num_songs | INTEGER |
| song_id | VARCHAR(50) |
| title | VARCHAR(200) |
| year | INTEGER |
Please note that `year` is sometimes 0, which is not a valid year, so we need to convert these values to `NULL` before loading the data into the final tables.
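A minimal sketch of that cleanup, assuming the song staging table is called `staging_songs` (the real name may differ): `NULLIF` maps the placeholder year 0 to `NULL` when reading from the staging table.

```python
# Hypothetical SELECT from the song staging table, kept as a Python string in
# the style of sql_queries.py; NULLIF(year, 0) turns the invalid year 0 into NULL.
clean_song_years_sketch = """
    SELECT title,
           artist_name,
           duration,
           NULLIF(year, 0) AS year
    FROM   staging_songs;
"""
```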
The database schema is a star schema with one fact table and four dimension tables.
The fact table is `songplays`, and the dimension tables are `users`, `songs`, `artists`, and `time`:
Please note the following:

- The `songplays` table
  - has a composite unique primary key consisting of `session_id` and `songplay_id`, the latter built on `itemInSession`,
  - has foreign keys to the `time` table (`start_time`), the `users` table (`user_id`), the `songs` table (`song_id`), and the `artists` table (`artist_id`),
  - is evenly distributed across the nodes of the cluster, as the later queries are not clear yet, and
  - is sorted by `start_time`.
  - All columns are not nullable.
- The `time` table
  - has a unique primary key (`start_time`),
  - is evenly distributed across the nodes (using the `DISTSTYLE EVEN` strategy), assuming that the number of times at which songs are played grows over time and would sooner or later not fit on a single node, and
  - is sorted by `start_time` to enable faster joins with the `songplays` table.
  - All columns are not nullable.
- The `users` table
  - has a unique primary key (`user_id`),
  - is available on all nodes of the cluster (using the `DISTSTYLE ALL` strategy), assuming that the number of users is small enough to fit on all nodes, and to enable fast joins with the `songplays` table, and
  - is sorted by `last_name`, `first_name`, `gender`, and `level` (see the DDL sketch after this list).
  - All columns are not nullable.
  - The `level` column here represents the latest known subscription level of the user. This is different from the `songplays` table, where the `level` column represents the subscription level at the time of the song play.
- The `artists` table
  - has a unique primary key (`artist_id`),
  - is available on all nodes of the cluster (using the `DISTSTYLE ALL` strategy), assuming that the number of artists is small enough to fit on all nodes, and to enable fast joins with the `songplays` table, and
  - is sorted by `name`.
  - The columns `location`, `latitude`, and `longitude` are nullable. They are gathered from the song files, which are not always complete and only partly match the artist information from the log files.
- The `songs` table
  - has a unique primary key (`song_id`),
  - is available on all nodes of the cluster (using the `DISTSTYLE ALL` strategy), assuming that the number of songs is small enough to fit on all nodes, and to enable fast joins with the `songplays` table, and
  - is sorted by `title`, `artist_id`, and `year`.
  - It is also linked to the `artists` table via the `artist_id` column, which is a deviation from the original star schema principle, but helps to combine these specific data.
  - The columns `duration` and `year` are nullable. They are gathered from the song files, which are not always complete and only partly match the song information from the log files.
- Don't confuse the `artist_id` and the `song_id` from the `song_data` files with the `artist_id` and the `song_id` in this schema. The `song_data` files do not match the entries in the `log_data` files well regarding names and titles. Therefore, the `songs` and `artists` tables use their own keys, and as a consequence, so does the `songplays` table.
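As an illustration of the distribution and sort key choices above, here is a DDL sketch for the `users` table. Column types and the exact statement are assumptions; the actual DDL lives in `sql_queries.py` and may differ:

```python
# Hypothetical CREATE TABLE for the users dimension, kept as a Python string in
# the style of sql_queries.py: DISTSTYLE ALL copies the small table to every
# node, and the compound sort key matches the sort order described above.
users_table_create_sketch = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INTEGER     NOT NULL PRIMARY KEY,
        first_name VARCHAR(50) NOT NULL,
        last_name  VARCHAR(50) NOT NULL,
        gender     CHAR(1)     NOT NULL,
        level      CHAR(4)     NOT NULL
    )
    DISTSTYLE ALL
    COMPOUND SORTKEY (last_name, first_name, gender, level);
"""
```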
## Recommendations for the team

- The data in the `log_data` files regarding artist names and song titles does not match the data from the `song_data` files. Also, the `song_data` files seem to have some issues that look like duplicates and wrong entries. To build a good data source for analytics, the corresponding processes should be reviewed and improved.
- The usage in terms of common queries and the future size of the different tables is not clear yet. Therefore, the distribution strategy for the tables is not optimized yet. This should be reviewed once the usage and the size of the data are clearer.
- Right now, using a Redshift cluster is not really necessary. The data is small enough to be processed on a single node or even in a more traditional SQL database. However, if the data grows, the cluster can easily be scaled up. Therefore, it is recommended to use a Redshift cluster from the beginning, but this should be reviewed once the future growth of Sparkify is clearer.
