
🌦️ METAR Data Engineering and Machine Learning Project 🛫

Technologies · About the project · Conceptual architecture · Phase 1 · Phase 2 · Phase 3 - Final Stage · Data source · Looker report · Setup


Technologies

Python · Docker · Terraform · Google Cloud · Pandas · Shell Script · Jupyter Notebook


About the project

An educational project to build an end-to-end pipeline for near real-time and batch processing of data, which is then used for visualisation 👀 and a machine learning model 🧠.

The project is designed to produce an analytical summary of how METAR weather reports for airports in European countries have varied over the years.

Read more about METAR here ➡️ METAR

In addition, the aim is to build a web application, using the Streamlit library and machine learning algorithms, that predicts the trend of change in upcoming METAR reports.


Conceptual architecture

[Conceptual architecture diagram]


The project is divided into 3 phases according to the attached diagrams:

👉 Phase 1

• Retrieval of archive data from the source and initial transformation.
• Transfer of the data to the Data Lake (Google Cloud Storage) and then to the Data Warehouse.
• Transformations using PySpark on a Dataproc cluster.
• Visualisation of the aggregated data on an interactive dashboard in Looker.

[Phase 1 diagram]


👉 Phase 2

• Preparing the environment for near-real-time data retrieval.
• Transformations of archived and live data using PySpark, and preparation of the data for the machine learning model.
• Training and tuning of the model.

[Phase 2 diagram]


👉 Phase 3 - Final stage 🥳

• Collection of analytical reports for historical data.
• Preparation of a web dashboard able to display the prediction of the nearest METAR report for a given airport and the likely trend of change.

[Phase 3 diagram]


Data source

💿 IOWA STATE UNIVERSITY ASOS-AWOS-METAR Data


📊 Looker report

The report generated in Looker provides averages of METAR data, broken down by temperature, winds, directions, and weather phenomena, with accompanying charts. The data was scraped via URL and stored in raw form in Cloud Storage. PySpark and Dataproc were then used to prepare SQL tables with aggregation functions, which were saved in BigQuery. The Looker report directly utilizes these tables from BigQuery.
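The aggregation behind these tables is conceptually similar to the sketch below (a minimal PySpark sketch, not the repository's code; the column names station, valid, tmpf and sknt follow the IOWA ASOS export format and are assumptions here):

    # Illustrative only: monthly averages per station, similar in spirit to the
    # aggregations behind the Looker report.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("metar-aggregations-sketch").getOrCreate()

    # Raw files scraped from the source and stored in Cloud Storage
    df = spark.read.option("header", True).csv("gs://batch-metar-bucket-2/data/ES__ASOS/*/*")

    monthly = (
        df.withColumn("valid", F.to_timestamp("valid"))
          .withColumn("year", F.year("valid"))
          .withColumn("month", F.month("valid"))
          .groupBy("station", "year", "month")
          .agg(
              F.avg(F.col("tmpf").cast("double")).alias("avg_temp_f"),
              F.avg(F.col("sknt").cast("double")).alias("avg_wind_kt"),
          )
    )

    monthly.show(10)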

Additionally, it's possible to prepare a similar report for other networks. Below is an example for PL__ASOS.

Check: PL__ASOS

For more information, please refer to the "Setup" section.

[Example Looker report]


🛠️ Setup

  1. Make sure you have Spark, PySpark, Google Cloud Platform SDK, Prefect and Terraform installed and configured.

  2. Clone the repo

    $ git clone https://github.com/MarieeCzy/METAR-Data-Engineering-and-Machine-Learning-Project.git
  3. Create a new python virtual environment.

    $ python -m venv venv
  4. Activate the new virtual environment using source (Unix systems) or .\venv\Scripts\activate (Windows systems).

    $ source venv/bin/activate
  5. Install packages from requirements.txt using pip. Make sure the requirements.txt file is in your current working directory.

    $ pip install -r requirements.txt
  6. Create a new project on GCP, set it as the default and authorize:

    $ gcloud config set project <your_project_name>
    $ gcloud auth login
  7. Configure variables for Terraform:

    7.1. In terraform.tfvars, replace the project name with the name of the project you created on Google Cloud Platform:

    project = <your_project_name>

    Go to the terraform directory:

    $ cd terraform/

    Initialize, plan and apply the cloud resource creation:

    $ terraform init
    $ terraform plan
    $ terraform apply
  8. Configure the data upload; go to: ~/prefect_orchestration/deployments/flows/config.json

    8.1. Complete the variables:

    • network - select one network, e.g. FR__ASOS,

    • start_year, start_month, start_day - enter the start date; make sure the numbers are not zero-padded (e.g. 7, not 07),

    • batch_bucket_name - enter the name of the created Google Cloud Storage bucket.
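    A hypothetical example of a completed config.json (the key names come from the list above; the values and the exact file layout are illustrative only):

    {
        "network": "FR__ASOS",
        "start_year": 2015,
        "start_month": 1,
        "start_day": 1,
        "batch_bucket_name": "batch-metar-bucket-2"
    }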

  9. Set up Prefect, the task orchestration tool:

    9.1. Generate a new key for the storage service account:

    In the Google Cloud console, go to IAM & Admin > Service Accounts, click on "storage-service-acc", go to KEYS and click ADD KEY > Create new key, choosing the JSON format.

    Save it in a safe place; do not share it on GitHub or any other public place.

    To avoid changing the code in gcp_credentials_blocks.py, create a .secrets directory: ~/METAR-Data-Engineering-and-Machine-Learning-Project/.secrets and put the downloaded key in it under the name gcp_credentials_key.json.

    9.2. Run Prefect server

    $ prefect orion start

    Go to: http://127.0.0.1:4200

    9.3. In ~/prefect_orchestration/prefect_blocks, run the commands below in the console to create the Credentials and GCS Bucket blocks:

    $ python gcp_credentials_blocks.py
    $ python gcs_buckets_blocks.py 
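    For orientation, block-creation scripts of this kind typically look like the sketch below (a minimal sketch using the prefect-gcp package, not the repository's actual code; the block names and bucket name are illustrative and must match whatever the deployment code expects):

    # Minimal sketch of creating Prefect GCP blocks (illustrative only).
    from prefect_gcp import GcpCredentials
    from prefect_gcp.cloud_storage import GcsBucket

    # Credentials block built from the service-account key saved in .secrets/
    credentials = GcpCredentials(service_account_file=".secrets/gcp_credentials_key.json")
    credentials.save("metar-gcp-credentials", overwrite=True)

    # GCS bucket block pointing at the batch data bucket
    bucket = GcsBucket(
        bucket="batch-metar-bucket-2",
        gcp_credentials=GcpCredentials.load("metar-gcp-credentials"),
    )
    bucket.save("metar-gcs-bucket", overwrite=True)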

    9.4. Configure the Prefect deployment:

    $ python prefect_orchestration/deployments/deployments_config.py

    9.5. Run a Prefect agent to pick up deployments from the "default" queue:

    $ prefect agent start -q "default"
  10. Start deployment stage 1 (S1): downloading data and uploading it to the Google Cloud Storage bucket.

    Go to ~/prefect_orchestration/deployments and run on the command line:

    $ python deployments_run.py --stage="S1"
    

    ☝️ You can observe the running deployment flow in the Prefect UI:

    [Prefect UI screenshot]

    After the deployment is complete, you will find the data in the GCS bucket.
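    Conceptually, the S1 deployment runs a flow along the lines of the sketch below (illustrative only; the repository's real flow, task names and parameters differ, and the block name must match the one created in step 9.3):

    # Sketch of a "download and upload to GCS" Prefect flow (not the project's actual code).
    import requests
    from prefect import flow, task
    from prefect_gcp.cloud_storage import GcsBucket

    @task(retries=3)
    def download_csv(url: str, local_path: str) -> str:
        # Fetch one batch of METAR data from the source URL
        response = requests.get(url, timeout=120)
        response.raise_for_status()
        with open(local_path, "wb") as f:
            f.write(response.content)
        return local_path

    @task
    def upload_to_gcs(local_path: str, remote_path: str) -> None:
        # Upload the downloaded file to the bucket block created earlier
        bucket = GcsBucket.load("metar-gcs-bucket")
        bucket.upload_from_path(from_path=local_path, to_path=remote_path)

    @flow
    def metar_batch_flow(url: str, local_path: str, remote_path: str) -> None:
        downloaded = download_csv(url, local_path)
        upload_to_gcs(downloaded, remote_path)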

  11. Configure and run stage 2 (S2): data transformation using PySpark and loading into BigQuery on a Dataproc cluster.

    11.1. Go to ~/prefect_orchestration/deployments, open gcloud_submit_job.sh and check that the given paths and names are correct.

    As long as you haven't changed any names or settings other than those listed in this guide, everything should be fine.

    $ gcloud dataproc jobs submit pyspark \
    --cluster=metar-cluster \
    --region=europe-west2 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    --files=gs://code-metar-bucket-2/code/sql_queries_config.yaml \
    gs://code-metar-bucket-2/code/pyspark_sql.py \
    -- \
        --input=gs://batch-metar-bucket-2/data/ES__ASOS/*/* \
        --bq_output=reports.ES__ASOS \
        --temp_bucket=dataproc-staging-bucket-metar-bucket-2

    11.2. Upload pyspark_sql.py and the config file sql_queries_config.yaml to the code bucket (a sketch of the script's expected structure is shown at the end of this section).

    In ~/prefect_orchestration/deployments/flows:

    $ gsutil cp pyspark_sql.py gs://code-metar-bucket-2/code/pyspark_sql.py
    $ gsutil cp sql_queries_config.yaml gs://code-metar-bucket-2/code/sql_queries_config.yaml

    11.3. Run deployment stage S2 (GCS -> BigQuery) on the Dataproc cluster:

    $ python deployments_run.py --stage="S2"

    If the Job was successful, you can go to BigQuery, where the generated data is located. Now you can copy my Looker report and replace the data sources, or prepare your own. 😎
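    For context, a PySpark job driven by the flags in gcloud_submit_job.sh has roughly the following shape (a sketch only, assuming CSV input; the real pyspark_sql.py builds its aggregations from sql_queries_config.yaml):

    # Sketch of a Dataproc PySpark job matching the --input / --bq_output /
    # --temp_bucket flags used in gcloud_submit_job.sh (illustrative only).
    import argparse
    from pyspark.sql import SparkSession

    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)        # e.g. gs://batch-metar-bucket-2/data/ES__ASOS/*/*
    parser.add_argument("--bq_output", required=True)    # e.g. reports.ES__ASOS
    parser.add_argument("--temp_bucket", required=True)  # staging bucket for the BigQuery connector
    args = parser.parse_args()

    spark = SparkSession.builder.appName("metar-pyspark-sql-sketch").getOrCreate()
    spark.conf.set("temporaryGcsBucket", args.temp_bucket)

    df = spark.read.option("header", True).csv(args.input)
    df.createOrReplaceTempView("metar")

    # The real job builds its aggregation queries from sql_queries_config.yaml;
    # this is just a placeholder query.
    result = spark.sql("SELECT station, COUNT(*) AS n_reports FROM metar GROUP BY station")

    # Write to BigQuery via the spark-bigquery connector supplied with --jars
    result.write.format("bigquery").option("table", args.bq_output).mode("overwrite").save()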
