
Bigdata profiler

This is a tool to profile your incoming data, check whether it adheres to a registered schema, and run custom data quality checks. At the end of every run, a human-readable report is generated automatically and can be sent to stakeholders.

Features

  • Config-driven data profiling and schema validation
  • Automatic report generation after every run
  • Integration with the Datadog monitoring system
  • Extensible and highly customizable
  • Very little boilerplate code
  • Support for versioned schema validation

Data formats currently supported

  • CSV
  • JSON
  • Parquet

The tool can easily be extended to any format that Apache Spark supports for reads.
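
Under the hood the input is loaded through Spark's generic reader, so adding a format mostly amounts to passing a different format string. A minimal PySpark sketch of the idea (the load_dataset helper is illustrative, not the tool's actual API):

from pyspark.sql import SparkSession

def load_dataset(spark, data_format, input_data_location):
    # Spark's generic reader dispatches on the format string, so
    # "csv", "json" and "parquet" (and any other format Spark can
    # read) all go through the same call.
    return spark.read.format(data_format).load(input_data_location)

spark = SparkSession.builder.appName("cust-profile-data-validation").getOrCreate()
df = load_dataset(spark, "json", "s3a://bucket/prefix/generated.json")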

SQL support for custom data quality checks

Both ANSI SQL and HiveQL are supported. The list of all supported SQL functions can be found here.
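
Each custom check pairs a SQL query over a view of the input data (registered as dataset, as in the run example below) with a result threshold and a comparison operator. A minimal sketch of how such a check might be evaluated, assuming PySpark; run_check and the operator table are illustrative, not the tool's actual internals:

import operator
from pyspark.sql import DataFrame, SparkSession

# Mapping from the config's operator strings to Python comparisons.
OPERATORS = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

def run_check(spark: SparkSession, df: DataFrame,
              query: str, threshold, op: str) -> bool:
    # Expose the data under the name the custom queries expect.
    df.createOrReplaceTempView("dataset")
    # Every row the query returns must satisfy the comparison; a
    # single-row aggregate (e.g. a count) is just the simplest case.
    rows = spark.sql(query).collect()
    return all(OPERATORS[op](row[0], threshold) for row in rows)

With the example configuration shown under Run Instructions, customQ1 passes only when _id contains no duplicates, and customQ2 only when every phone value is 17 characters long.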

Run Instructions

All one has to do is execute the Python script papermill_notebook_runner.py. The script takes the following arguments, in this order:

  • Path to the notebook to be run.
  • Path to the output notebook.
  • JSON configuration that will drive the notebook.
python papermill_notebook_runner.py data-validator.ipynb output/data-validator.ipynb '{
  "dataFormat": "json",
  "inputDataLocation": "s3a://bucket/prefix/generated.json",
  "appName": "cust-profile-data-validation",
  "schemaRepoUrl": "http://schemarepohostaddress",
  "scheRepoSubjectName": "cust-profile",
  "schemaVersionId": "0",
  "customQ1": "select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset",
  "customQ1ResultThreshold": 0,
  "customQ1Operator": "=",
  "customQ2": "select CAST(length(phone) as Long) from dataset",
  "customQ2ResultThreshold": 17,
  "customQ2Operator": "=",
  "customQ3": "select CAST(count(distinct gender) as Long) from dataset",
  "customQ3ResultThreshold": 3,
  "customQ3Operator": "<="
}'
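
The wrapper script is presumably a thin layer over papermill's Python API. If you would rather call papermill directly, a minimal sketch using papermill.execute_notebook with the same three arguments:

import json
import sys

import papermill as pm

# argv: input notebook, output notebook, JSON configuration string
input_nb, output_nb, config_json = sys.argv[1:4]

pm.execute_notebook(
    input_nb,
    output_nb,
    parameters=json.loads(config_json),
)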

Install Instructions

There are several pieces involved.

  • First, install Jupyter Notebook. Install instructions here.
  • Next, install sparkmagic. Install instructions here.
  • Configure sparkmagic with your own Apache Livy endpoints. The config file should look like this.
  • Install papermill from source after adding the sparkmagic kernels. Clone the papermill project from here.
  • Update the translators file to add the sparkmagic kernels at the very end of the file, as shown below:
papermill_translators.register("sparkkernel", ScalaTranslator)
papermill_translators.register("pysparkkernel", PythonTranslator)
papermill_translators.register("sparkrkernel", RTranslator)
  • Finally, install the schema repo. Install instructions here.
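
To sanity-check that the kernels were picked up, you can look one up through papermill's translator registry. A small sketch, assuming a papermill version whose registry exposes find_translator (it raises for unknown kernels):

from papermill.translators import papermill_translators

# Should resolve to PythonTranslator after the registration above;
# raises a papermill exception for unregistered kernels.
translator = papermill_translators.find_translator("pysparkkernel", "python")
print(translator)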

More details

Find more details in this guide.

That should be it. Enjoy profiling!
