This project is a Near Real-Time ETL pipeline that will do the necessary transformations of the raw data which is in the form of CSV files, the end goal here is to be able to calculate the total emission on a brand level, car level, and driver level. There are a lot of different tools and frameworks that are used to build ETL pipelines. In this project, the ETL is built using Python, Docker, PostgreSQL, and Airflow.
- Make sure you have Docker installed on your machine.
- Change directories at the command line to be inside the
emission-etl
folder
cd emission-etl
- Now run,
docker-compose up
-
Now three services will get initialized, Jupyter Lab(for testing), Postgres Database, and Pgadmin webserver to run queries on our tables.
-
We can access Jupyter Lab at http://localhost:8080/
-
Let's look at our database inside the pgadmin webserver
- Go to http://localhost:5556/ and log in with the following credentials.
email id: [email protected] password: password
- Then follow the images for the next steps.
Note: The database planetly will get initialized automatically at startup. If there are tables already existing inside the database then please DROP the tables before running the etl
- If you want you can start doing the testing inside the Jupyter Lab, all the required libraries to import are present in the requirement.txt, inside the first cell execute.
!pip install -r requirements.txt
and then you can start your testing.
-
-
Now, let's start our Airflow service.
- First, we will have to initialize the Airflow so that the Airflow database and Airflow USER gets created. It will take around 1 minute.
docker-compose -f airflow-docker-compose.yml up airflow-init
- Once we have the Airflow database and the Airflow USER, we can start the Airflow services.
docker-compose -f airflow-docker-compose.yml up
- If you don't want to see the logs on cmd and you want to run it in the background just run the below command instead of the command mentioned above.
docker-compose -f airflow-docker-compose.yml up -d
- Now we can access our Airflow webserver at http://localhost:8080
- Now we can run the emission-dag from the webserver.
- You can check the source code in this file emission-dag
- First, we will have to initialize the Airflow so that the Airflow database and Airflow USER gets created. It will take around 1 minute.
- The incoming data is stored in the data/tablename_incoming_data/ whenever the ETL will run the data will move inside the used data folder with a timestamp attached to it as a suffix.
- Right now while building the pipeline it is assumed that the drivers data and the vehicle_consumptions data are static.
- Currently working on the use case where the drivers data and vehicle_consumptions data are not static.
- To add the logic for the null handling.
- To create a logic for --> if driver is updating his log then how to handle the duplication scenario.