Data Quality Assessment Tool

The scripts included in this repository can be used as a way to audit the quality of a dataset through 6 metrics. Currently the focus is on data generated by IoT sensors, specifically used in the context of assessing various aspects of smart cities, such as AQM, ITMS, and Bangalore Ambulance Data.

A PDF report is generated as the output along with a JSON file. These reports contain the result of the evaluation procedure described below:

For each of the metrics that are used to quantify the quality of a dataset, the tool aims to provide a score between 0 and 1, where 1 is the highest possible score, indicating a 100% score. Currently, the tool is able to assess six parameters, namely:

Regularity of Inter-Arrival Time
Outliers of Inter-Arrival Time
Absence of Duplicate Values
Adherence to Attribute Format
Absence of Unknown Attributes
Adherence to Mandatory Attributes

A note to remember is that each dataset has a linked JSON Schema that defines the attributes that can be present in the dataset, whether there are any required attributes, and what units and datatypes these attributes need to have. Additionally, each dataset must have a config file assoociated with it to generate the output reports. The config files are included in this repository.

Regularity of Inter-Arrival Time

The regularity metric of the inter-arrival time is an evaluation of the uniformity of time intervals between receipt of consecutive data packets transmitted by IoT sensors.

Outliers of Inter-Arrival Time

The outlier metric of the inter-arrival time is an evaluation of the number of IAT values that show a significant deviation from the expected behaviour.

Duplicate Presence

This metric serves to check two columns that are input by the user for any duplicate values in the dataset. A value is considered to be a duplicate if both columns contain the exact same values for any data packet.

Attribute Format Adherence

This metric assesses the level of adherence of the data to its expected format as defined in the data schema. It is quantified by taking the ratio of packets adhering to the expected schemas to the total number of data packets.

Absence of Unknown Attributes

This metric checks whether there are any additional attributes present in the dataset apart from the list of required attributes.

Adherence to Mandatory Attributes

This metric checks whether all the required attributes defined in the schema are present in the dataset.

Some additional information about the dataset is also provided in the report. A more detailed description and evaluation criteria for all these metrics is provided in the output PDF report.

Generating Reports

The first step is to ensure that the IUDX SDK is installed on your computer using the following command:

pip install git+https://github.com/datakaveri/iudx-python-sdk

Once inside the directory where the repo was cloned, run:

pip install .

Running the tool

Clone the repo from:

git clone https://github.com/datakaveri/data-quality-assessment.git

Required libraries and packages

Once in the scripts folder, run the following command to install the package and library dependencies:

pip install -r requirements.txt

Present in the config folder is a config file in JSON format with the name of the dataset prepended to it. This file requires one to input the name of the datafile as well as select the attributes that one would like to check for duplicates. Ensure that the parameters folderName, dataFileNameJSON, schemaFileName are correctly filled according to the dataset to be analyzed. Additional parameters can also be changed, such as inputFields for 'duplicateDetection' as well as the value of 'alpha' - used to identify outliers. Ensure that all these parameters are appropriately set prior to running the main script.

Present in the data folder in the repository is a sample dataset of ITMS data from Surat, as well as a sample dataset of AQM data from Pune. Inside the schemas folder are the corresponding schemas for these datasets. In order to generate the report, simply run the following command:

python3 DQReportGenerator.py

and enter the name of the config file when prompted.

Ensure that the datasets are in JSON format and are located in the data folder.

The output report file will be generated in a JSON format and and a .pdf format and will be saved in the outputReports folder. These files will be prepended by the name of the dataset as taken from the config file. The plots and visualizations required for generating the PDF report will be stored in the plots folder and will be populated as the script is running. These will be overwritten everytime the script is run and do not need to be stored locally long-term.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Quality Assessment Tool

Regularity of Inter-Arrival Time

Outliers of Inter-Arrival Time

Duplicate Presence

Attribute Format Adherence

Absence of Unknown Attributes

Adherence to Mandatory Attributes

Generating Reports

Running the tool

Required libraries and packages

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 183 Commits
config		config
data		data
outputReports		outputReports
plots		plots
schemas		schemas
scripts		scripts
README.md		README.md

Vidish-Bijalwan/data-quality-assessment

Folders and files

Latest commit

History

Repository files navigation

Data Quality Assessment Tool

Regularity of Inter-Arrival Time

Outliers of Inter-Arrival Time

Duplicate Presence

Attribute Format Adherence

Absence of Unknown Attributes

Adherence to Mandatory Attributes

Generating Reports

Running the tool

Required libraries and packages

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages