CO2 Emission ETL

This project is a near real-time ETL pipeline that performs the necessary transformations on raw data delivered as CSV files. The end goal is to calculate total CO2 emissions at the brand, car, and driver level. Many different tools and frameworks can be used to build ETL pipelines; this project uses Python, Docker, PostgreSQL, and Airflow.
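
As a rough illustration of the end goal, the sketch below shows the kind of aggregation the pipeline produces. The file name (car_logs.csv) and column names (brand, car_id, driver_id, co2_emission) are placeholders for illustration only, not the actual schema used in this repository.

import pandas as pd

# Placeholder input: one row per trip, with the CO2 emission already computed
# by the ETL. File and column names here are illustrative only.
logs = pd.read_csv("car_logs.csv")

# Total emissions aggregated at brand, car, and driver level.
by_brand = logs.groupby("brand")["co2_emission"].sum()
by_car = logs.groupby("car_id")["co2_emission"].sum()
by_driver = logs.groupby("driver_id")["co2_emission"].sum()

print(by_brand, by_car, by_driver, sep="\n\n")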

Setup

  1. Make sure you have Docker installed on your machine.
  2. Change directory at the command line into the emission-etl folder:
     cd emission-etl
  3. Now run:
     docker-compose up
  4. Three services will be initialized: Jupyter Lab (for testing), a Postgres database, and a pgAdmin web server for running queries on our tables.

    • We can access Jupyter Lab at http://localhost:8080/

    • Let's look at our database inside the pgAdmin web server:

      Email: [email protected]
      Password: password
      
      • Then follow the images for the next steps.

      (screenshots: step-1, step-2, step-3)

      Note: The database planetly will be initialized automatically at startup. If tables already exist inside the database, please DROP them before running the ETL (one way to do this from Jupyter Lab is sketched after this list).

      • If you want, you can start testing inside Jupyter Lab. All the required libraries are listed in requirements.txt; inside the first cell, execute:
      !pip install -r requirements.txt

      and then you can start your testing.

  5. Now, let's start the Airflow services.

    • First, we have to initialize Airflow so that the Airflow database and the Airflow user get created. This takes around a minute:
      docker-compose -f airflow-docker-compose.yml up airflow-init
    • Once we have the Airflow database and the Airflow user, we can start the Airflow services:
    docker-compose -f airflow-docker-compose.yml up
    • If you don't want to see the logs in the terminal and would rather run the services in the background, use the command below instead of the one above:
    docker-compose -f airflow-docker-compose.yml up -d
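
As a quick sanity check, and to clear out leftover tables as mentioned in the note above, something like the following can be run from Jupyter Lab. This is only a sketch: the connection parameters and table names are placeholders and need to be adjusted to match the values configured in docker-compose.yml and the actual schema.

import psycopg2  # installed via requirements.txt in the first cell

# Placeholder connection settings; replace with the values from docker-compose.yml.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="planetly",
    user="postgres",
    password="postgres",
)

with conn, conn.cursor() as cur:
    # Drop any leftover tables before re-running the ETL (table names are illustrative).
    cur.execute("DROP TABLE IF EXISTS car_logs, drivers, vehicle_consumptions CASCADE;")

conn.close()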

Notes

  1. The incoming data is stored in data/tablename_incoming_data/. Whenever the ETL runs, the processed data is moved into the used-data folder with a timestamp attached to it as a suffix (see the sketch after this list).
  2. Right now the pipeline is built on the assumption that the drivers data and the vehicle_consumptions data are static.
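
The archiving step described in note 1 can be pictured roughly as follows. The exact folder name (data/used_data) and timestamp suffix format are assumptions for illustration, not taken from the pipeline code.

import shutil
from datetime import datetime
from pathlib import Path

def archive_incoming_file(csv_path: str, used_dir: str = "data/used_data") -> Path:
    """Move a processed CSV into the used-data folder, appending a timestamp suffix."""
    src = Path(csv_path)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # suffix format is an assumption
    dest = Path(used_dir) / f"{src.stem}_{timestamp}{src.suffix}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest

# Example: archive_incoming_file("data/car_logs_incoming_data/car_logs.csv")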

Improvements to do

  1. Currently working on the use case where the drivers data and vehicle_consumptions data are not static.
  2. Add logic for null handling.
  3. Add logic for handling duplication when a driver updates an existing log (one possible approach is sketched below).
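
For improvement 3, one possible approach (not implemented in this repository) is an upsert keyed on a log identifier, so that an updated log replaces the earlier row instead of creating a duplicate. The table and column names below are purely illustrative.

import psycopg2

UPSERT_SQL = """
    INSERT INTO car_logs (log_id, driver_id, car_id, co2_emission)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (log_id)
    DO UPDATE SET co2_emission = EXCLUDED.co2_emission;
"""

def upsert_log(conn, log_id, driver_id, car_id, co2_emission):
    # Requires a unique constraint on log_id so an updated log overwrites the old row.
    with conn, conn.cursor() as cur:
        cur.execute(UPSERT_SQL, (log_id, driver_id, car_id, co2_emission))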
