A music streaming company, Sparkify, has decided it is time to introduce more automation and monitoring to the data they collect. This case study walks through the choices Sparkify can make to model and engineer that data.
- Created a relational database using PostgreSQL.
- Developed a star schema with optimized definitions of fact and dimension tables, and normalized the tables.
- Built an ETL pipeline that loads and transforms the data to answer questions about which songs users listen to (see the sketch below).
Technologies used: Python, PostgreSQL, Star Schema, ETL pipelines, Normalization
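A minimal sketch of the kind of table definition and insert this ETL performs, assuming a local `sparkifydb` database and a `songplays` fact table; the connection string and column names are illustrative, not the exact project code.

```python
# Sketch of the PostgreSQL star-schema ETL step (illustrative names/credentials).
import psycopg2

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,   -- surrogate key for the fact table
    start_time  TIMESTAMP NOT NULL,   -- references the time dimension
    user_id     INT NOT NULL,         -- references the users dimension
    level       VARCHAR,
    song_id     VARCHAR,              -- references the songs dimension
    artist_id   VARCHAR,              -- references the artists dimension
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id,
                       artist_id, session_id, location, user_agent)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s);
"""

def load_songplay(conn, record):
    """Insert one songplay fact row; `record` is a tuple matching the INSERT columns."""
    with conn.cursor() as cur:
        cur.execute(songplay_table_insert, record)
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    with conn.cursor() as cur:
        cur.execute(songplay_table_create)
    conn.commit()
```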
- Created a NoSQL database using Apache Cassandra.
- Developed denormalized tables optimized for a specific set of queries and business needs (see the sketch below).
Technologies used: Python, Apache Cassandra, Denormalization
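A sketch of the query-first, denormalized modeling Cassandra calls for, assuming the typical "songs played during a session" query; the keyspace, table, and column names are illustrative, not the exact project code.

```python
# Sketch of a denormalized Cassandra table designed around one known query.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("sparkify")

# The table is modeled for exactly one query:
#   SELECT artist, song_title, length FROM session_songs
#   WHERE session_id = ? AND item_in_session = ?
# so (session_id, item_in_session) forms the primary key.
session.execute("""
    CREATE TABLE IF NOT EXISTS session_songs (
        session_id int,
        item_in_session int,
        artist text,
        song_title text,
        length float,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

row = session.execute(
    "SELECT artist, song_title, length FROM session_songs "
    "WHERE session_id = %s AND item_in_session = %s",
    (338, 4),
).one()
```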
- Created a Redshift cluster, IAM roles, and security groups.
- Developed an ETL pipeline that copies data from S3 buckets into staging tables and processes it into a star schema (see the sketch below).
- Optimized the star schema for the specific queries required by the data analytics team.
Technologies used: Python, Amazon Redshift, SQL, PostgreSQL
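A sketch of the staging-and-transform flow on Redshift, assuming staging tables named `staging_events` and `staging_songs`; the S3 paths, IAM role ARN, and cluster endpoint are placeholders, not the project's actual values.

```python
# Sketch: COPY raw JSON from S3 into staging, then insert into the fact table.
import psycopg2

COPY_STAGING_EVENTS = """
    COPY staging_events
    FROM 's3://<your-bucket>/log_data'
    IAM_ROLE '<your-redshift-iam-role-arn>'
    FORMAT AS JSON 's3://<your-bucket>/log_json_path.json'
    REGION 'us-west-2';
"""

INSERT_SONGPLAYS = """
    INSERT INTO songplays (start_time, user_id, level, song_id,
                           artist_id, session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.user_id, e.level, s.song_id, s.artist_id,
           e.session_id, e.location, e.user_agent
    FROM staging_events e
    LEFT JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""

conn = psycopg2.connect(
    "host=<cluster-endpoint> dbname=dev user=awsuser password=<password> port=5439"
)
with conn.cursor() as cur:
    cur.execute(COPY_STAGING_EVENTS)   # bulk-load raw events into staging
    cur.execute(INSERT_SONGPLAYS)      # transform staging rows into the fact table
conn.commit()
```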
- Scaled up the existing ETL pipeline by moving the data warehouse to a data lake.
- Created an EMR Hadoop cluster.
- Extended the ETL pipeline to copy datasets from S3 buckets, process them with Spark, and write the results back to S3 using efficient partitioning and Parquet formatting (see the sketch below).
- Fast-tracked the data lake buildout using serverless AWS Lambda and cataloged tables with an AWS Glue crawler.
Technologies used: Spark, S3, EMR, Parquet.
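A sketch of the Spark step of the data-lake ETL, showing a partitioned Parquet write back to S3; the bucket names and column layout are assumptions based on the project description, not the exact project code.

```python
# Sketch: read raw JSON from S3, build a dimension table, write partitioned Parquet.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .getOrCreate()
)

# Read raw JSON song data from the input bucket (path is a placeholder).
songs = spark.read.json("s3a://<input-bucket>/song_data/*/*/*/*.json")

# Keep the songs-dimension columns and drop duplicate song_ids.
songs_table = (
    songs.select("song_id", "title", "artist_id", "year", "duration")
         .dropDuplicates(["song_id"])
)

# Write back to S3 as Parquet, partitioned by year and artist so queries
# that filter on those columns only scan the relevant files.
(
    songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://<output-bucket>/songs/")
)
```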
- Automated ETL pipelines using Apache Airflow, Python, and Amazon Redshift.
- Wrote custom operators to perform tasks such as staging data, filling the data warehouse, and validating results through data quality checks (see the sketch below).
- Transformed data from various sources into a star schema optimized for the analytics team's use cases.
Technologies used: Apache Airflow, S3, Amazon Redshift, Python.
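A sketch of a custom data-quality operator and a minimal DAG wiring, assuming Airflow 2 with the Postgres provider and an Airflow connection id `redshift`; the operator, DAG, and table names are illustrative, not the exact project code.

```python
# Sketch: custom Airflow operator that fails the task if a checked table is empty.
from datetime import datetime

from airflow import DAG
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Run row-count checks against a list of tables in Redshift."""

    def __init__(self, redshift_conn_id, tables, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or records[0][0] < 1:
                raise ValueError(f"Data quality check failed: {table} is empty")
            self.log.info("Data quality check passed: %s has %s rows",
                          table, records[0][0])


with DAG(
    dag_id="sparkify_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    run_quality_checks = DataQualityOperator(
        task_id="run_quality_checks",
        redshift_conn_id="redshift",
        tables=["songplays", "users", "songs", "artists", "time"],
    )
```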
Projects and resources developed in Udacity's Data Engineering Nanodegree (DEND).