Two datasets were provided in the form of CSV files and the task is to model the data to help business users answer questions like the below:
- How many vehicles are deployed in each city/ country/ as a whole for a given day/ week?
- How many vehicles are lost in each city/ country/ as a whole for a given day/week?
- What is the 14-day running lost percentage (total loss vehicles in the D-21 and D-7 divided by the average deployed vehicles in the same period) in each city/country as a whole for a given day?
- How many vehicles with battery level are less than 20% in each city ….?
This dataset was provided in a csv file named data_modelling_test_tbl_telemetry.csv. It contains records of periodic gps signals received by Dott's vehicles; it also contains data on the vehicle's battery level and whether or not the vehicle is being charged at that particular moment. Typically, a vehicle will have several records in an hour (roughly every 15 minutes).
This dataset was provided in a csv file named data_modelling_test_tbl_states.csv. It contains information about the state of vehicles and what city/country they are being operated out of. There are three possible states a vehicle can have; is_deployed, is_broken, is_in_warehouse. Typically, a vehicle will have several records in an hour (roughly every 15 minutes).
- HDFS
- HIVE
- PRESTO
The steps on how to load the CSV files attached for this test using the tools above is given in loading_csv_steps.md
After studying the data and the business questions, I made the below decisions on the data modelling:
I decided to model this as a snapshot fact table that shows the status of every vehicle at the end of every day. The field that determines the date is the time_gps field. The reason for this is because a lot of the business questions seemed to focus on activities that happened in a given day/week. As a result, it appears that the business will get the most value from having the data modelled on at least a daily grain. The model is incremental and builds on records from the previous day (Day minus 1). Every vehicle can only have one record for a given day. This model can help business users know the status of every vehicle at the end of a day, as well as how many days have passed since a vehicle last received a gps signal.
I decided to model this as a Type 2 Slowly Changing Dimension. This dataset contains data received every 15 minutes and shows the current state of a vehicle as defined by three states; is_deployed, is_broken, is_in_warehouse. However, based on the kind of questions the business is looking to ask of the data, most of this data is redundant since this is essentially a snapshot of vehicle state, roughly every 15 minutes. What will be of value to the business will be to show state changes only. A state change in this model is defined as when any of the three vehicle states changes from true to false and vice-versa. Modeled as a Type 2 Slowly Changing Dimension, this model will close a record and open a new one when there's a state change in one of the three states; it will also track which of the states had a change so that the business can use this to answer key business questions.
Pros
- Snapshot fact tables are very effective in helping the business track KPIs over a period of time. In our case, we will be able to see the status of vehicles as at the end of every day, and we can track this overtime to create insights.
- Type 2 slowly changing dimensions will allow us to accurately keep historical information. In our case, we will be able to keep track of historical state changes of vehicles.
Cons
- For both the snapshot fact table and the type 2 slowly changing dimension, the modelling approach can cause the size of the tables to grow fast, and overtime, performance and storage could be a concern. Although the performance concern can be mitigated on by applying proper partitioning/indexing.
- staging
- dott_vehicle
- staging.tbl_telemetry
- staging.tbl_states
- dott_vehicle.daily_vehicle_snapshot
- dott_vehicle.dim_vehicle_state
- daily_vehicle_snapshot_ddl.sql : This is to create the table for dott_vehicle.daily_vehicle_snapshot
- dim_vehicle_state_ddl.sql : This is to create the table for dott_vehicle.dim_vehicle_state
- dim_vehicle_state_incremental_load.sql : This is the query to load dott_vehicle.dim_vehicle_state incrementally
- vehicle_snapshot_table_initial_load.sql : This is the query to carry out the initial load (the very first day) for dott_vehicle.daily_vehicle_snapshot
- vehicle_snapshot_table_incremental_load.sql : This is the query to load dott_vehicle.daily_vehicle_snapshot subsequently, after the first day has been loaded
- {var_d1} : Current Day minus 1
- {var_d2} : Current Day minus 2
There are five data tests that were written for this task; two for dott_vehicle.dim_vehicle_state and three for dott_vehicle.daily_vehicle_snapshot. They can all be found in the data_tests.sql file.
- How many vehicles are deployed in each city/ country/ as a whole for a given day
SELECT
city_name
, country_name
, count(1) as deployed_vehicles
FROM
dott_vehicle.dim_vehicle_state
WHERE ((date_format(date_parse(substr(start_date, 1, 19), '%Y-%m-%d %H:%i:%s'), '%Y%m%d') = YYYYMMDD) AND (new_deploy_ind = 'Y'))
GROUP BY city_name, country_name ;
NB: I defined a deploy as an is_deploy state change from "false" to "true"
- How many vehicles are lost in each city/ country/ as a whole for a given day?
SELECT
b.city_name
, b.country_name
, count(1) AS lost_vehicle_count
FROM
(dott_vehicle.daily_vehicle_snapshot a
LEFT JOIN (
SELECT
vehicle_id
,city_name
,country_name
FROM
dott_vehicle.dim_vehicle_state
WHERE (end_date = 'N/A')
) b ON (a.vehicle_id = b.vehicle_id))
WHERE ((a.date_key = YYYYMMDD) AND (a.lost_vehicle = 1))
GROUP BY b.city_name, b.country_name ;