Dott Data Modelling Task

Introduction

Two datasets were provided in the form of CSV files and the task is to model the data to help business users answer questions like the below:

How many vehicles are deployed in each city/ country/ as a whole for a given day/ week?
How many vehicles are lost in each city/ country/ as a whole for a given day/week?
What is the 14-day running lost percentage (total loss vehicles in the D-21 and D-7 divided by the average deployed vehicles in the same period) in each city/country as a whole for a given day?
How many vehicles with battery level are less than 20% in each city ….?

Dataset

Telemetry

This dataset was provided in a csv file named data_modelling_test_tbl_telemetry.csv. It contains records of periodic gps signals received by Dott's vehicles; it also contains data on the vehicle's battery level and whether or not the vehicle is being charged at that particular moment. Typically, a vehicle will have several records in an hour (roughly every 15 minutes).

State

This dataset was provided in a csv file named data_modelling_test_tbl_states.csv. It contains information about the state of vehicles and what city/country they are being operated out of. There are three possible states a vehicle can have; is_deployed, is_broken, is_in_warehouse. Typically, a vehicle will have several records in an hour (roughly every 15 minutes).

Tools used

HDFS
HIVE
PRESTO

The steps on how to load the CSV files attached for this test using the tools above is given in loading_csv_steps.md

APPROACH TO THE TASK

After studying the data and the business questions, I made the below decisions on the data modelling:

Telemetry

I decided to model this as a snapshot fact table that shows the status of every vehicle at the end of every day. The field that determines the date is the time_gps field. The reason for this is because a lot of the business questions seemed to focus on activities that happened in a given day/week. As a result, it appears that the business will get the most value from having the data modelled on at least a daily grain. The model is incremental and builds on records from the previous day (Day minus 1). Every vehicle can only have one record for a given day. This model can help business users know the status of every vehicle at the end of a day, as well as how many days have passed since a vehicle last received a gps signal.

State

I decided to model this as a Type 2 Slowly Changing Dimension. This dataset contains data received every 15 minutes and shows the current state of a vehicle as defined by three states; is_deployed, is_broken, is_in_warehouse. However, based on the kind of questions the business is looking to ask of the data, most of this data is redundant since this is essentially a snapshot of vehicle state, roughly every 15 minutes. What will be of value to the business will be to show state changes only. A state change in this model is defined as when any of the three vehicle states changes from true to false and vice-versa. Modeled as a Type 2 Slowly Changing Dimension, this model will close a record and open a new one when there's a state change in one of the three states; it will also track which of the states had a change so that the business can use this to answer key business questions.

PROS AND CONS OF MODELLING APPROACH

Pros

Snapshot fact tables are very effective in helping the business track KPIs over a period of time. In our case, we will be able to see the status of vehicles as at the end of every day, and we can track this overtime to create insights.
Type 2 slowly changing dimensions will allow us to accurately keep historical information. In our case, we will be able to keep track of historical state changes of vehicles.

Cons

For both the snapshot fact table and the type 2 slowly changing dimension, the modelling approach can cause the size of the tables to grow fast, and overtime, performance and storage could be a concern. Although the performance concern can be mitigated on by applying proper partitioning/indexing.

Model structure

Snapshot Fact Table

Slowly Changing Dimension (Type 2)

Data Flow

Telemetry

State

SCHEMA AND OBJECTS

Schemas

staging
dott_vehicle

Tables

staging.tbl_telemetry
staging.tbl_states
dott_vehicle.daily_vehicle_snapshot
dott_vehicle.dim_vehicle_state

SQL QUERIES (PRESTO)

daily_vehicle_snapshot_ddl.sql : This is to create the table for dott_vehicle.daily_vehicle_snapshot
dim_vehicle_state_ddl.sql : This is to create the table for dott_vehicle.dim_vehicle_state
dim_vehicle_state_incremental_load.sql : This is the query to load dott_vehicle.dim_vehicle_state incrementally
vehicle_snapshot_table_initial_load.sql : This is the query to carry out the initial load (the very first day) for dott_vehicle.daily_vehicle_snapshot
vehicle_snapshot_table_incremental_load.sql : This is the query to load dott_vehicle.daily_vehicle_snapshot subsequently, after the first day has been loaded

Variables used in query

{var_d1} : Current Day minus 1
{var_d2} : Current Day minus 2

DATA TESTS

There are five data tests that were written for this task; two for dott_vehicle.dim_vehicle_state and three for dott_vehicle.daily_vehicle_snapshot. They can all be found in the data_tests.sql file.

Sample queries (Presto)

How many vehicles are deployed in each city/ country/ as a whole for a given day
SELECT
city_name
, country_name
, count(1) as deployed_vehicles
FROM
dott_vehicle.dim_vehicle_state
WHERE ((date_format(date_parse(substr(start_date, 1, 19), '%Y-%m-%d %H:%i:%s'), '%Y%m%d') = YYYYMMDD) AND (new_deploy_ind = 'Y'))
GROUP BY city_name, country_name ;

NB: I defined a deploy as an is_deploy state change from "false" to "true"

How many vehicles are lost in each city/ country/ as a whole for a given day?
SELECT
b.city_name
, b.country_name
, count(1) AS lost_vehicle_count
FROM
(dott_vehicle.daily_vehicle_snapshot a
LEFT JOIN (
SELECT
vehicle_id
,city_name
,country_name
FROM
dott_vehicle.dim_vehicle_state
WHERE (end_date = 'N/A')
) b ON (a.vehicle_id = b.vehicle_id))
WHERE ((a.date_key = YYYYMMDD) AND (a.lost_vehicle = 1))
GROUP BY b.city_name, b.country_name ;

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dott Data Modelling Task

Introduction

Dataset

Telemetry

State

Tools used

APPROACH TO THE TASK

Telemetry

State

PROS AND CONS OF MODELLING APPROACH

Model structure

Snapshot Fact Table

Slowly Changing Dimension (Type 2)

Data Flow

Telemetry

State

SCHEMA AND OBJECTS

Schemas

Tables

SQL QUERIES (PRESTO)

Variables used in query

DATA TESTS

Sample queries (Presto)

About

Releases

Packages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
README.md		README.md
daily_vehicle_snapshot_ddl.sql		daily_vehicle_snapshot_ddl.sql
data_tests.sql		data_tests.sql
dim_vehicle_state_ddl.sql		dim_vehicle_state_ddl.sql
dim_vehicle_state_incremental_load.sql		dim_vehicle_state_incremental_load.sql
loading_csv_steps.md		loading_csv_steps.md
vehicle_snapshot_table_incremental_load.sql		vehicle_snapshot_table_incremental_load.sql
vehicle_snapshot_table_initial_load.sql		vehicle_snapshot_table_initial_load.sql

jolfad/dott-data-modelling

Folders and files

Latest commit

History

Repository files navigation

Dott Data Modelling Task

Introduction

Dataset

Telemetry

State

Tools used

APPROACH TO THE TASK

Telemetry

State

PROS AND CONS OF MODELLING APPROACH

Model structure

Snapshot Fact Table

Slowly Changing Dimension (Type 2)

Data Flow

Telemetry

State

SCHEMA AND OBJECTS

Schemas

Tables

SQL QUERIES (PRESTO)

Variables used in query

DATA TESTS

Sample queries (Presto)

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages