
azure-data-manager

Introduction

Organizations that produce large volumes of data are increasingly investing in better ways to discover, analyze and extract key insights from that data. These organizations regularly face challenges in the shape of data ponds and often struggle to make the right dataset available to their primary users, i.e. data analysts and data scientists.

This project aims to provide a template for defining, ingesting, transforming, analyzing and showcasing data using the Azure Data platform. We've leveraged the Azure Cosmos DB SQL API as the storage layer for the data catalog. Azure Blob Storage serves as the de facto store for all semi-structured data (e.g. JSON, CSV and Parquet files). Azure Data Factory (ADF) v2 performs the orchestration duties, with Azure Databricks providing the compute for all transformations.

The front-end interface is an ASP.NET Core web app which reads catalog definitions and creates ADF entities using the ADF .NET Core SDK. Data lineage is visualized with vis.js networks.
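
For orientation, the following is a minimal sketch of how an ADF pipeline entity can be provisioned from .NET Core with the Microsoft.Azure.Management.DataFactory SDK, authenticating as the Azure AD application described in the Security section below. The resource group, factory, dataset and pipeline names are placeholders, not the ones the web app actually generates.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest;

class AdfProvisioningSketch
{
    static void Main()
    {
        // Placeholder values: supply your own tenant, app registration and resources.
        string tenantId = "<tenant-id>";
        string appId = "<aad-app-id>";
        string appKey = "<aad-app-secret>";
        string subscriptionId = "<subscription-id>";
        string resourceGroup = "<resource-group>";
        string factoryName = "<data-factory-name>";

        // Authenticate as the Azure AD application (see Security below).
        var context = new AuthenticationContext("https://login.microsoftonline.com/" + tenantId);
        var token = context
            .AcquireTokenAsync("https://management.azure.com/", new ClientCredential(appId, appKey))
            .Result;
        var client = new DataFactoryManagementClient(new TokenCredentials(token.AccessToken))
        {
            SubscriptionId = subscriptionId
        };

        // A single copy activity standing in for the catalog-driven pipelines
        // the web app generates; the dataset names are placeholders.
        var pipeline = new PipelineResource
        {
            Activities = new List<Activity>
            {
                new CopyActivity
                {
                    Name = "CopySensorMetadata",
                    Source = new SqlSource(),
                    Sink = new BlobSink(),
                    Inputs = new List<DatasetReference>
                    {
                        new DatasetReference { ReferenceName = "SensorMetadataTable" }
                    },
                    Outputs = new List<DatasetReference>
                    {
                        new DatasetReference { ReferenceName = "SensorMetadataBlob" }
                    }
                }
            }
        };

        client.Pipelines.CreateOrUpdate(resourceGroup, factoryName, "IngestSensorMetadata", pipeline);
        Console.WriteLine("Pipeline provisioned.");
    }
}
```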

Architecture

architecture.png

Getting Started

In this solution our catalog definition consists of two data sources:

a) Time series JSON files from IoT sensors

b) A SQL Database table containing sensor metadata

Our data pipeline simply extracts the metadata from the SQL Database into tabular form, joins it with the time series data and finally publishes the result to a REST endpoint.
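
To make this concrete, a catalog definition for these two data sources could look roughly like the C# model below. This is only an illustrative, assumed shape; the actual schema stored in Cosmos DB is defined by the project and every field name here is an assumption.

```csharp
using System.Collections.Generic;

// Hypothetical shape of a catalog entry; the real schema lives in Cosmos DB
// and is defined by this project - every field name below is an assumption.
public class DataSourceDefinition
{
    public string Name { get; set; }          // e.g. "sensor-timeseries" or "sensor-metadata"
    public string Kind { get; set; }          // e.g. "BlobJson" (dynamic) or "SqlTable" (static)
    public string LinkedService { get; set; } // ADF linked service holding the connection details
    public string Location { get; set; }      // blob folder path or SQL table name
    public bool IsDynamic { get; set; }       // dynamic sources get an event trigger
}

public class PipelineDefinition
{
    public string Name { get; set; }
    public List<string> Inputs { get; set; }  // data sources to extract and join
    public string Notebook { get; set; }      // Databricks notebook performing the transform
    public string PublishTo { get; set; }     // REST endpoint receiving the joined result
}
```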

ADF pipeline view (image: architecture.png)

Web app data lineage view (image: lineage.png)

The pipeline can be triggered manually through the web app's REST API; for dynamic data sources (i.e. the time series files), an event trigger is created automatically.

Event triggers unfortunately have performance limitations, hence creating more than 100 dynamic data sources is currently not supported.
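
Below is a hedged sketch of how such an event trigger could be created with the ADF .NET SDK, using a BlobEventsTrigger that fires on blob creation. The trigger, pipeline, container and storage account names are placeholders, and the snippet assumes the authenticated DataFactoryManagementClient from the earlier sketch.

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;

public static class TriggerSetupSketch
{
    // Assumes an authenticated DataFactoryManagementClient, as in the earlier sketch.
    public static void CreateTimeSeriesTrigger(
        DataFactoryManagementClient client, string resourceGroup, string factoryName)
    {
        var trigger = new BlobEventsTrigger
        {
            // Fire whenever a new time series JSON file lands in the container.
            Events = new List<string> { "Microsoft.Storage.BlobCreated" },
            BlobPathBeginsWith = "/timeseries/blobs/",
            BlobPathEndsWith = ".json",
            // Resource ID of the storage account being watched (placeholder).
            Scope = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/" +
                    "Microsoft.Storage/storageAccounts/<account>",
            Pipelines = new List<TriggerPipelineReference>
            {
                new TriggerPipelineReference
                {
                    PipelineReference = new PipelineReference { ReferenceName = "IngestSensorMetadata" }
                }
            }
        };

        client.Triggers.CreateOrUpdate(
            resourceGroup, factoryName, "TimeSeriesArrived", new TriggerResource(trigger));

        // Triggers are created in a stopped state and must be started explicitly.
        client.Triggers.Start(resourceGroup, factoryName, "TimeSeriesArrived");
    }
}
```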

Configuration

For configuring the ASP.NET Core web app, please follow this document.

Security

To authenticate the web app, we create an Azure AD application. This application also has to be assigned the Contributor role on the resource group so that ADF and its entities can be provisioned.

In order to access the credentials of our data sources, ADF relies on Azure Key Vault. When an ADF resource is provisioned in Azure, a Service Identity is automatically generated. That Service Identity has to be granted the Get permission in the Key Vault access policies.
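
As a sketch of this pattern with the ADF .NET SDK: Key Vault is registered as a linked service, and the data source's connection string is declared as an AzureKeyVaultSecretReference so that ADF's Service Identity resolves the secret at runtime. The vault URL, secret name and linked service names below are placeholders.

```csharp
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;

public static class KeyVaultLinkedServiceSketch
{
    public static void Create(
        DataFactoryManagementClient client, string resourceGroup, string factoryName)
    {
        // Register the Key Vault itself as a linked service (vault URL is a placeholder).
        var keyVault = new LinkedServiceResource(new AzureKeyVaultLinkedService
        {
            BaseUrl = "https://<vault-name>.vault.azure.net/"
        });
        client.LinkedServices.CreateOrUpdate(resourceGroup, factoryName, "AzureKeyVault", keyVault);

        // Point the SQL linked service at a Key Vault secret; ADF's Service Identity
        // resolves the connection string at runtime, so no credential is stored in ADF.
        var sqlDatabase = new LinkedServiceResource(new AzureSqlDatabaseLinkedService
        {
            ConnectionString = new AzureKeyVaultSecretReference
            {
                Store = new LinkedServiceReference { ReferenceName = "AzureKeyVault" },
                SecretName = "SensorSqlConnectionString"
            }
        });
        client.LinkedServices.CreateOrUpdate(resourceGroup, factoryName, "SensorSqlDatabase", sqlDatabase);
    }
}
```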

Deployment

Please follow this document for deployment to Azure.

Testing APIs locally

Run in Postman

TODO

  • Leverage ADF Data Flow for common transforms.
  • Use Azure Data Lake Store Gen2 as the underlying storage layer.

Team

Matthieu Lefebvre

Sofiane Yahiaoui

Igor Pagliai

Engin Polat

Christopher Harrison

Syed Hassaan Ahmed