
Mimodast

A minimal modern data stack with working data pipelines in a single Docker container.
Explore the docs »

Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact

About The Project

A minimal modern data stack with working data pipelines in a single Docker container. Useful for exploring the tools involved:

  • Meltano for extracting data from source and loading it into the database
  • DuckDB as the database
  • dbt for transformation, testing and documentation
  • Airflow for scheduling and orchestration
  • Superset for dashboards and reporting

Explore the functionality of the tools by using the examples as-is; modify and expand on the examples for further exploration.

This is a convenient starting point for exploration. The project is not a showcase of all or even the best functionality that each tool has to offer.

(back to top)

Getting Started

Prerequisites

Have Docker Desktop installed.

(back to top)

Installation

  1. Create a container from the published docker image:
    docker create -p8093:8088 -p8085:8080 -p8094:8089 --name mimodast ghcr.io/ejoosterop/mimodast
    
  2. Optionally, copy a .env file containing the API key into the container, as explained below (required only for the European Gas Inventory dataset; this can safely be done at a later stage):
    docker cp .env mimodast:/project/mimodast/.env
    
  3. Start the container (using the Docker Desktop UI or the command line; see the sketch after this list).
  4. For starters:
    • Open the Docker container terminal and peruse the meltano.yml file and other files/folders at /project/mimodast/.
    • Navigate to localhost:8085 to see the Airflow orchestrator (including scheduler) interface. Use admin/admin as username/password.
    • Navigate to localhost:8093 to see the Superset dashboard. Use admin/admin as username/password.
    • Navigate to localhost:8094 to see the data pipeline documentation (from dbt).
    • NOTE: allow some time (~1 minute) for the container to start all processes. On first startup, wait for the first run of the USGS pipeline to complete before reviewing Superset.
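If you prefer the command line over the Docker Desktop UI, the container can be started as follows (a minimal sketch, using the container name from step 1):

```sh
# Start the container created in step 1 and confirm it is running.
docker start mimodast
docker ps --filter "name=mimodast"
```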

(back to top)

API key

The image contains ELT pipelines for two datasets. The USGS Earthquake dataset can be used right out of the box and is sufficient to explore the core functionality.

For the GIE Gas Inventory dataset an API key is required. Create a free GIE account (available immediately) to obtain the key.

This key needs to be available as an environment variable (ENV_GIE_XKEY) in the Docker container in order to run the GIE pipelines; it is referenced in the meltano.yml configuration file. One way to accomplish this is to create a .env file in the /project/mimodast/ folder containing:

ENV_GIE_XKEY="YOUR-API-KEY"
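For example, create the file on the host and copy it into the container (this repeats installation step 2):

```sh
# Create the .env file on the host and copy it into the mimodast container.
echo 'ENV_GIE_XKEY="YOUR-API-KEY"' > .env
docker cp .env mimodast:/project/mimodast/.env
```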

(back to top)

Usage

Two basic examples are used to illustrate an extensive data pipeline encompassing the following components:

  • obtaining data from source (Meltano)
  • capturing the data in a database (DuckDB)
  • transformation (dbt)
  • scheduling and orchestration (Airflow)
  • testing (dbt)
  • documentation (dbt)
  • reporting (Superset)

The two pipelines involve USGS Earthquake data and European Gas Inventory data.

Below we highlight the core configuration for these components. For (much) more information, refer to the respective documentation of the tools.

(back to top)

Definition and Configuration

The data pipelines are fully defined in a set of files. This includes the source definitions, schedules, dependencies, transformation logic, tests and documentation. The reporting/dashboards in Superset are defined within Superset, but can be exported from there.

These files are all found in the /project/mimodast/ folder in the Docker container. It is best practice to capture this folder in a version control tool. Git is included in the Docker image.
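For example, from the container terminal (a sketch; skip git init if the folder is already a Git repository):

```sh
# Put the project folder under version control inside the container.
cd /project/mimodast
git init
git add .
git commit -m "Snapshot of the pipeline configuration"
```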

Some of the core files include:

  • /project/mimodast/meltano.yml - this includes configurations for data source specification, destination database, schedule and more.
  • /project/mimodast/orchestrate/dags/gie_dag.py - Python code defining how to orchestrate a data pipeline in Airflow. Note that the GIE data uses this manually created file, whereas the USGS data orchestration relies purely on logic defined in meltano.yml.
  • /project/mimodast/transform/ - this folder contains the transformation logic (under models/). It also includes configuration for dbt, tests, documentation and various other items.

(back to top)

USGS Earthquake Data Pipeline

Obtaining the data from source

The meltano.yml file (found at /project/mimodast/meltano.yml in the Docker container) specifies how to obtain the data from source. Configuring the connection to the data source is centered around this tap section:

  - name: stg_usgs
    inherit_from: tap-rest-api-msdk
    config:
      api_url: https://earthquake.usgs.gov/fdsnws

In the example, this tap is inherited from a publicly available tap for a REST API:

plugins:
  extractors:
  - name: tap-rest-api-msdk
    variant: widen
    pip_url: tap-rest-api-msdk

pip_url specifies the location of the source code for the tap plug-in; in this case the actual location is derived by Meltano from the tap name. Note that the tap for the other data pipeline (Gas Inventory data) inherits from the same tap, but is configured differently.

meltano.yml contains the full configuration of the tap. For example, the following element selects only earthquakes above a minimum magnitude:

          minmagnitude: 6

Many more items are configured for this tap.
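The full, resolved configuration of the tap can be inspected from the container terminal (a sketch, using the plugin name stg_usgs from meltano.yml):

```sh
# List all settings of the tap, including inherited and default values.
cd /project/mimodast
meltano config stg_usgs list
```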

(back to top)

Capturing data in a database

The following section in meltano.yml configures the database where the data will be stored:

  - name: target-duckdb
    pip_url: target-duckdb~=0.4
    config:
      add_metadata_columns: true
      default_target_schema: main
      filepath: $DB_LOCATION
      data_flattening_max_level: 10

A DuckDB database is used in this example. This database contains the data captured from source and is accessed for reporting. The database file is located at /project/data/dev/data.duckdb; this location is specified in meltano.yml.

The image contains the DuckDB command line interface. Use the command:

/project/duckdb_cli/duckdb /project/data/dev/data.duckdb

to start the CLI and browse the database using SQL (the CLI has a help function and is documented online).
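For a quick, non-interactive look, SQL can also be piped into the CLI. A sketch (the table name main.rpt_usgs_events is an assumption, based on the dbt model described in the Transformation section):

```sh
# List the tables and peek at the USGS reporting table.
echo "SHOW TABLES; SELECT * FROM main.rpt_usgs_events LIMIT 5;" | \
  /project/duckdb_cli/duckdb /project/data/dev/data.duckdb
```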

Note that the database may be unavailable if another process (pipeline, reporting) is already accessing it, resulting in an error message.

(back to top)

Transformation

Transformation is performed by dbt and specified by a set of SQL files. In the container these are located at: /project/mimodast/transform/models/usgs_rpt/.

For this example pipeline the transformation is fully captured in rpt_usgs_events.sql. In this case the 'transformation' simply copies some source attributes into a reporting table.
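The transformation step can also be run on its own from the container terminal. A sketch, using the dbt-duckdb:usgs command that the job in the Scheduling section below refers to:

```sh
# Run only the dbt transformation for the USGS pipeline.
cd /project/mimodast
meltano invoke dbt-duckdb:usgs
```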

(back to top)

Scheduling

The following section of meltano.yml configures the pipeline's run schedule:

schedules:
- name: USGS-Earthquake
  interval: 35 */1 * * *
  job: usgs-to-duckdb-rpt
  start_date: 2023-01-01 15:40:21.295936

The scheduling interval uses the cron format. In this case the pipeline is scheduled to run at 35 minutes past every hour.

This section refers to a job, which is configured in the same file. The job consists of a tap to obtain data from source, a loader specifying where the data needs to be stored, and a transformation component:

jobs:
- name: usgs-to-duckdb-rpt
  tasks:
  - stg_usgs target-duckdb-usgs dbt-duckdb:usgs

In this Docker container Airflow is used as the scheduler/orchestrator. Meltano ensures that the specified schedule is set in Airflow. Note that the schedule can also be set directly in Airflow, as is the case for the GIE Gas Inventory pipeline.

To monitor job progress, access the Airflow UI at localhost:8085, with admin/admin as user/password. There are three pipelines: one for USGS and two for GIE.
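A pipeline run can also be triggered manually from the container terminal, without waiting for the schedule. A sketch (assuming the installed Meltano version accepts a job name in meltano run):

```sh
# Run the full USGS job (extract, load and transform) once.
cd /project/mimodast
meltano run usgs-to-duckdb-rpt
```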

(back to top)

Testing

Data tests are specified in two locations:

  • SQL files in the /project/mimodast/transform/models/tests/ folder. (In this specific case just a basic test that the magnitude of an earthquake is not unrealistically high.) Tests are performed by dbt.
  • yml files in the /project/mimodast/transform/models/usgs_rpt/ folder. For example the following tests that the id field is not null:
      - name: id
        description: Unique ID of earthquake or related event assigned by USGS.
        tests:
          - not_null

Tests can be initiated manually from the command line in the Docker container:

cd /project/mimodast/
meltano invoke dbt-duckdb:test

(back to top)

Reporting

Superset is used to visualize the data and build dashboards. The Superset UI can be accessed at localhost:8093, with admin/admin as user/password.

Initially the database in mimodast is empty, so make sure to run the data pipeline(s) before reviewing the dashboards.

Due to an outstanding issue in mimodast, individual charts in a dashboard frequently do not show. For now:

  • Refresh each chart using the option in the top-right corner of the chart.
  • If needed, wait for other processes accessing the database (such as the data pipelines) to complete.

(back to top)

Documentation

dbt is used for documentation of data and processes. Documentation is maintained in .yml files and overview.md in the /project/mimodast/transform/models/ folder.

This documentation can be consulted at localhost:8094.
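After changing models or .yml files, the documentation site can be regenerated from the container terminal. A sketch, assuming the dbt-duckdb plugin in this project defines the standard dbt docs command (check meltano.yml for the command names actually defined):

```sh
# Regenerate the dbt documentation.
cd /project/mimodast
meltano invoke dbt-duckdb:docs-generate
```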

(back to top)

Gas Inventory Pipeline

The Gas Inventory pipeline is set up in a similar way to the USGS Earthquake pipeline.

The following differences are noteworthy:

  1. The pipeline requires an account to be set up. This is free, quick and available to everyone; refer to the API key section above for instructions.
    • Note that for this reason the jobs/DAGs for this pipeline are initially paused in Airflow. Once the API key is obtained and set up in the Docker container, use the Airflow UI to unpause the jobs (or use the CLI; see the sketch after this list).
  2. The meltano.yml file is configured to obtain this key from an environment variable as follows:
      headers:
        x-key: $ENV_GIE_XKEY
  3. Schedule/orchestration is not configured using meltano.yml but instead with two manually coded Airflow DAGs. The Python file containing the code for these can be found at /project/mimodast/orchestrate/dags/gie_dag.py.
    • The backfill DAG captures historical data from source. To specify the date range, two Airflow variables are used. These values can be changed using the Airflow UI.
    • It takes some time (<1 minute) for a new date range to be reflected in the DAG.
    • Note that using Airflow variables in a DAG in this way is not a best-practice design, but it is used here for simplicity.
  4. Meltano's inline data mapping functionality is used to create hashed ID fields when obtaining data from source, using the following configuration in meltano.yml:
      stream_maps:
        stg_gie_storage:
          key_hash: md5(config['hash_seed'] + (gasDayStart + code))
  5. Additional test types are included. For example (from rpt_gie_storage.yml):
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - sso_eic
            - gasdaystart
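Unpausing the DAGs and changing the backfill date range can also be done from the container terminal with the Airflow CLI. A sketch in which the DAG id and variable names are hypothetical placeholders (check the Airflow UI for the real names):

```sh
# Unpause the GIE DAGs once the API key is in place (DAG id is a placeholder).
airflow dags unpause gie_backfill_dag_placeholder

# Adjust the backfill date range (variable names are placeholders).
airflow variables set gie_backfill_start_placeholder 2023-01-01
airflow variables set gie_backfill_end_placeholder 2023-01-31
```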

(back to top)

Roadmap

See the open issues for a full list of proposed features and known issues.

(back to top)

Contributing

Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information. The tools and the sample data are subject to their own respective licenses.

(back to top)

Contact

Project Link: https://github.com/EJOOSTEROP/mimodast

(back to top)