Data-Engineering-First-Project: Data warehouse tech stack with PostgreSQL, DBT, Airflow

Project Description

This is a data engineering project created as part of the 10 Academy training. In this repository, we build a scalable data warehouse to store and process the vehicle trajectory data collected by the pNEUMA experiment. The aim is to build a pipeline that lets the city traffic department use the data to improve traffic flow and to support various undisclosed downstream projects.

Data

pNEUMA is an open, large-scale dataset of naturalistic trajectories of half a million vehicles, collected by a swarm of drones and static roadside cameras in the congested downtown area of Athens, Greece.

  • The vehicle trajectory data is extracted by analyzing the traffic footage.
  • Each file, covering a single (area, date, time) combination, holds roughly 87 MB of data; a sketch of how one such file can be flattened for loading is shown below.
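
To give a concrete picture of the preprocessing step, the sketch below flattens one raw file into one-row-per-trajectory-point records before loading. It is a minimal sketch that assumes the published pNEUMA layout (per-track metadata followed by repeating latitude/longitude/speed/acceleration/time groups) rather than this repository's actual preprocessing scripts; adjust the column handling if your raw files differ.

    # Minimal sketch: flatten one raw pNEUMA file into long format.
    # Assumed row layout (semicolon-separated):
    #   track_id; type; traveled_d; avg_speed; lat; lon; speed; lon_acc; lat_acc; time; lat; lon; ...
    import pandas as pd

    TRACK_COLS = ["track_id", "type", "traveled_d", "avg_speed"]
    POINT_COLS = ["lat", "lon", "speed", "lon_acc", "lat_acc", "time"]

    def read_pneuma_file(path: str) -> pd.DataFrame:
        """Return one row per recorded point for every vehicle track."""
        records = []
        with open(path) as f:
            for line in f:
                if line.lower().startswith("track_id"):
                    continue  # skip a header line if one is present
                values = [v.strip() for v in line.strip().split(";") if v.strip()]
                track = dict(zip(TRACK_COLS, values[:4]))
                points = values[4:]
                step = len(POINT_COLS)
                # walk the repeating 6-value trajectory-point groups
                for i in range(0, len(points) - step + 1, step):
                    records.append({**track, **dict(zip(POINT_COLS, points[i:i + step]))})
        df = pd.DataFrame.from_records(records)
        numeric_cols = [c for c in df.columns if c != "type"]
        df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
        return df

    # Example (placeholder file name):
    # df = read_pneuma_file("data/sample_trajectories.csv")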

Project Goals

  • Create a scalable data warehouse (DWH) to host and manage vehicle trajectory data.
  • Take future needs into account: organize the data so that downstream projects can query it efficiently.
  • Follow an ELT approach using DBT (Data Build Tool). ELT is more suitable here than ETL because transformations can be defined on an as-needed basis by analytics engineers.

Tech Stack

We have chosen a robust tech stack to implement the data warehouse and support the ELT process efficiently. Its main components are:

  1. PostgreSQL Database: PostgreSQL will serve as the backbone of our data warehouse. It is a well-established relational database management system that can handle large-scale data efficiently and reliably.

  2. DBT (Data Build Tool): DBT will be the heart of our ELT framework. It enables us to transform the raw data into structured, queryable data models. The powerful templating and SQL-based workflow of DBT simplify the data transformation process and ensure maintainable code.

  3. Airflow: Apache Airflow will be used to orchestrate the ELT workflows. It provides a powerful and flexible platform to schedule, monitor, and manage the data pipelines. With Airflow, we can automate the entire data processing pipeline and ensure smooth execution of tasks; a minimal DAG sketch is shown after the pipeline diagram below.

[Diagram: data pipeline overview]
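
To make the orchestration concrete, here is a minimal sketch of what a DAG tying the stack together could look like: load the raw trajectory data into PostgreSQL, then let DBT build and test the transformed models. The dag_id, schedule, paths, and the load_raw_data placeholder are illustrative assumptions; the actual workflows live in the /dags directory.

    # Minimal sketch of an ELT orchestration DAG (Airflow 2.x style).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator

    def load_raw_data():
        # placeholder: read the raw trajectory files and load them into PostgreSQL
        pass

    with DAG(
        dag_id="traffic_elt_pipeline",  # placeholder name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        load = PythonOperator(task_id="load_raw_data", python_callable=load_raw_data)

        transform = BashOperator(
            task_id="dbt_run",
            bash_command="cd /path/to/Traffic-Data-Warehousing/dbt && dbt run",
        )

        test = BashOperator(
            task_id="dbt_test",
            bash_command="cd /path/to/Traffic-Data-Warehousing/dbt && dbt test",
        )

        load >> transform >> test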

Repository Structure

The repository is structured in the following manner:

  • /data: This directory contains sample data used for testing and development purposes.

  • /database: This directory holds the scripts needed to set up the database and create the tables (a loading sketch in the same spirit follows this list).

  • /dbt: This directory houses all the DBT-related files and configurations, including models and schema definitions.

  • /dags: This directory contains the Airflow DAG (Directed Acyclic Graph) files. These DAGs define the workflows of our data pipeline.

  • /scripts: This directory holds helper scripts that are needed for data preprocessing.

  • /screenshots: This directory stores screenshots of the data pipeline, including DAG runs from the Airflow UI and model documentation from the dbt UI.
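
For illustration, the sketch below shows the kind of helper that /database and /scripts are there for: creating a raw trajectories table in PostgreSQL and bulk-loading a preprocessed CSV with COPY. The connection string, table name, and column list are assumptions made for this sketch, not the schema actually used in the repository.

    # Minimal sketch: create a raw table and bulk-load a preprocessed CSV.
    import psycopg2

    CREATE_TABLE_SQL = """
    CREATE TABLE IF NOT EXISTS raw_trajectories (
        track_id   BIGINT,
        type       TEXT,
        traveled_d DOUBLE PRECISION,
        avg_speed  DOUBLE PRECISION,
        lat        DOUBLE PRECISION,
        lon        DOUBLE PRECISION,
        speed      DOUBLE PRECISION,
        lon_acc    DOUBLE PRECISION,
        lat_acc    DOUBLE PRECISION,
        time_sec   DOUBLE PRECISION
    );
    """

    def load_csv(csv_path: str, dsn: str = "dbname=<database_name> user=<db_user>") -> None:
        """Create the table if needed, then COPY the CSV into it."""
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(CREATE_TABLE_SQL)
            with open(csv_path) as f:
                cur.copy_expert(
                    "COPY raw_trajectories FROM STDIN WITH (FORMAT csv, HEADER true)", f
                )
        conn.close()  # the connection context manager commits but does not close

    # Example (placeholder path): load_csv("data/preprocessed_sample.csv")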

Getting Started

To get started with the project, follow these steps:

  1. Clone the repository to your local development environment.

    git clone https://github.com/emtinanseo/Traffic-Data-Warehousing.git
  2. Set up PostgreSQL and ensure it is accessible from your environment.

  3. Set AIRFLOW_HOME to a local folder

    export AIRFLOW_HOME=/absolute/path/to/folder/airflow
  4. Install the required dependencies from the requirements.txt file, including Airflow and DBT.

    cd Traffic-Data-Warehousing
    pip install -r requirements.txt
  5. DBT Configuration: Configure DBT to connect to your PostgreSQL database by updating the necessary configurations in the dbt/profiles.yml file.
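
     As a reference point, a minimal profiles.yml sketch for a PostgreSQL target is shown below; the profile name must match the one set in dbt/dbt_project.yml, and the host, credentials, database name, and schema are placeholders for your own setup.

    # dbt/profiles.yml (sketch; names and credentials are placeholders)
    traffic_dwh:
      target: dev
      outputs:
        dev:
          type: postgres
          host: localhost
          port: 5432
          user: <db_user>
          pass: <db_password>
          dbname: <database_name>
          schema: public
          threads: 4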

  6. Airflow Configuration: in the AIRFLOW_HOME folder set in step 3, open the airflow.cfg file and change the value of dags_folder to point to this repository's dags directory:

    # airflow.cfg
    dags_folder = /path/to/Traffic-Data-Warehousing/dags
    
  7. Run the Airflow pipelines

    • Initialize the Airflow metadata database
      airflow db init
    • Create an admin user (you will be prompted to set a password)
      airflow users create --username admin --firstname <firstname> --lastname <lastname> --role Admin --email <email>
    • Run the scheduler
      airflow scheduler
    • [optional] Open the Airflow web UI
      airflow webserver --port 8080
      Then open the browser to http://localhost:8080/
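    • [optional] Enable and trigger a pipeline DAG from the CLI (Airflow 2.x commands; replace <dag_id> with a DAG name from the /dags folder)
      airflow dags unpause <dag_id>
      airflow dags trigger <dag_id>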

Contributing

We welcome contributions to enhance the data warehouse tech stack or improve the project's documentation. If you find any bugs or have suggestions for new features, please feel free to open an issue or submit a pull request.

Authors

👤 Emtinan Osman Email, GitHub, LinkedIn