
azure-data-manager

Introduction

Organizations that produce large volumes of data are increasingly investing in better ways to discover, analyze and extract key insights from that data. These organizations regularly face challenges in the shape of data ponds and often struggle to make the right dataset available to their primary users, i.e. data analysts and data scientists.

This project aims to provide a template for defining, ingesting, transforming, analyzing and showcasing data using the Azure Data platform. We've leveraged the Azure Cosmos DB SQL API as the storage layer for the data catalog. Azure Blob Storage serves as the de facto store for all semi-structured data (e.g. JSON, CSV and Parquet files). Azure Data Factory (ADF) v2 performs the orchestration duties, with Azure Databricks providing the compute for all transformations.

The front-end interface is an ASP.NET Core web app which reads catalog definitions and creates ADF entities using the ADF .NET Core SDK. Data lineage is visualized with vis.js networks.
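
For orientation, the following is a minimal sketch of how an ADF pipeline entity can be provisioned from .NET Core with the Microsoft.Azure.Management.DataFactory SDK, authenticating as the Azure AD application described in the Security section below. The resource group, factory, dataset and pipeline names are placeholders, not the ones the web app actually generates.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest;

class AdfProvisioningSketch
{
    static void Main()
    {
        // Placeholder values: supply your own tenant, app registration and resources.
        string tenantId = "<tenant-id>";
        string appId = "<aad-app-id>";
        string appKey = "<aad-app-secret>";
        string subscriptionId = "<subscription-id>";
        string resourceGroup = "<resource-group>";
        string factoryName = "<data-factory-name>";

        // Authenticate as the Azure AD application (see Security below).
        var context = new AuthenticationContext("https://login.microsoftonline.com/" + tenantId);
        var token = context
            .AcquireTokenAsync("https://management.azure.com/", new ClientCredential(appId, appKey))
            .Result;
        var client = new DataFactoryManagementClient(new TokenCredentials(token.AccessToken))
        {
            SubscriptionId = subscriptionId
        };

        // A single copy activity standing in for the catalog-driven pipelines
        // the web app generates; the dataset names are placeholders.
        var pipeline = new PipelineResource
        {
            Activities = new List<Activity>
            {
                new CopyActivity
                {
                    Name = "CopySensorMetadata",
                    Source = new SqlSource(),
                    Sink = new BlobSink(),
                    Inputs = new List<DatasetReference>
                    {
                        new DatasetReference { ReferenceName = "SensorMetadataTable" }
                    },
                    Outputs = new List<DatasetReference>
                    {
                        new DatasetReference { ReferenceName = "SensorMetadataBlob" }
                    }
                }
            }
        };

        client.Pipelines.CreateOrUpdate(resourceGroup, factoryName, "IngestSensorMetadata", pipeline);
        Console.WriteLine("Pipeline provisioned.");
    }
}
```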

Architecture

architecture.png

Getting Started

In this solution our catalog definition consists of two data sources:

a) Time series JSON files from IoT sensors

b) A SQL Database table containing sensor metadata

Our data pipeline simply extracts the metadata from the SQL Database into tabular form, joins it with the time series data and finally publishes the result to a REST endpoint.
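
To make this concrete, a catalog definition for these two data sources could look roughly like the C# model below. This is only an illustrative, assumed shape; the actual schema stored in Cosmos DB is defined by the project and every field name here is an assumption.

```csharp
using System.Collections.Generic;

// Hypothetical shape of a catalog entry; the real schema lives in Cosmos DB
// and is defined by this project - every field name below is an assumption.
public class DataSourceDefinition
{
    public string Name { get; set; }          // e.g. "sensor-timeseries" or "sensor-metadata"
    public string Kind { get; set; }          // e.g. "BlobJson" (dynamic) or "SqlTable" (static)
    public string LinkedService { get; set; } // ADF linked service holding the connection details
    public string Location { get; set; }      // blob folder path or SQL table name
    public bool IsDynamic { get; set; }       // dynamic sources get an event trigger
}

public class PipelineDefinition
{
    public string Name { get; set; }
    public List<string> Inputs { get; set; }  // data sources to extract and join
    public string Notebook { get; set; }      // Databricks notebook performing the transform
    public string PublishTo { get; set; }     // REST endpoint receiving the joined result
}
```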

ADF pipeline view (image: architecture.png)

Web app data lineage view (image: lineage.png)

The pipeline can be triggered manually through the web app's REST API; for dynamic data sources (i.e. the time series files), an event trigger is created automatically.

Event triggers unfortunately have performance limitations, hence creating more than 100 dynamic data sources is currently not supported.
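
Below is a hedged sketch of how such an event trigger could be created with the ADF .NET SDK, using a BlobEventsTrigger that fires on blob creation. The trigger, pipeline, container and storage account names are placeholders, and the snippet assumes the authenticated DataFactoryManagementClient from the earlier sketch.

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;

public static class TriggerSetupSketch
{
    // Assumes an authenticated DataFactoryManagementClient, as in the earlier sketch.
    public static void CreateTimeSeriesTrigger(
        DataFactoryManagementClient client, string resourceGroup, string factoryName)
    {
        var trigger = new BlobEventsTrigger
        {
            // Fire whenever a new time series JSON file lands in the container.
            Events = new List<string> { "Microsoft.Storage.BlobCreated" },
            BlobPathBeginsWith = "/timeseries/blobs/",
            BlobPathEndsWith = ".json",
            // Resource ID of the storage account being watched (placeholder).
            Scope = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/" +
                    "Microsoft.Storage/storageAccounts/<account>",
            Pipelines = new List<TriggerPipelineReference>
            {
                new TriggerPipelineReference
                {
                    PipelineReference = new PipelineReference { ReferenceName = "IngestSensorMetadata" }
                }
            }
        };

        client.Triggers.CreateOrUpdate(
            resourceGroup, factoryName, "TimeSeriesArrived", new TriggerResource(trigger));

        // Triggers are created in a stopped state and must be started explicitly.
        client.Triggers.Start(resourceGroup, factoryName, "TimeSeriesArrived");
    }
}
```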

Configuration

For configuring the ASP.NET Core web app, please follow this document.

Security

To authenticate the web app, we create an Azure AD application. This application also has to be assigned the Contributor role on the resource group so that ADF and its entities can be provisioned.

In order to access the credentials of our data sources, ADF relies on Azure Key Vault. When an ADF resource is provisioned in Azure, a Service Identity is automatically generated. That Service Identity has to be granted the Get permission in the Key Vault access policies.
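
As a sketch of this pattern with the ADF .NET SDK: Key Vault is registered as a linked service, and the data source's connection string is declared as an AzureKeyVaultSecretReference so that ADF's Service Identity resolves the secret at runtime. The vault URL, secret name and linked service names below are placeholders.

```csharp
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Azure.Management.DataFactory.Models;

public static class KeyVaultLinkedServiceSketch
{
    public static void Create(
        DataFactoryManagementClient client, string resourceGroup, string factoryName)
    {
        // Register the Key Vault itself as a linked service (vault URL is a placeholder).
        var keyVault = new LinkedServiceResource(new AzureKeyVaultLinkedService
        {
            BaseUrl = "https://<vault-name>.vault.azure.net/"
        });
        client.LinkedServices.CreateOrUpdate(resourceGroup, factoryName, "AzureKeyVault", keyVault);

        // Point the SQL linked service at a Key Vault secret; ADF's Service Identity
        // resolves the connection string at runtime, so no credential is stored in ADF.
        var sqlDatabase = new LinkedServiceResource(new AzureSqlDatabaseLinkedService
        {
            ConnectionString = new AzureKeyVaultSecretReference
            {
                Store = new LinkedServiceReference { ReferenceName = "AzureKeyVault" },
                SecretName = "SensorSqlConnectionString"
            }
        });
        client.LinkedServices.CreateOrUpdate(resourceGroup, factoryName, "SensorSqlDatabase", sqlDatabase);
    }
}
```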

Deployment

Please follow this document for deployment to Azure.

Testing APIs locally

Run in Postman

TODO

  • Leverage ADF Data Flow for common transforms.
  • Use Azure Data Lake Store Gen2 as the underlying storage layer.

Team

Matthieu Lefebvre

Sofiane Yahiaoui

Igor Pagliai

Engin Polat

Christopher Harrison

Syed Hassaan Ahmed