
Containerized application for declaring initial classification state of Purview assets using Glossary Terms


mdrakiburrahman/purview-asset-ingestor


Azure Purview - Containerized app for declaring custom sensitivity labels to assets as glossary terms

A containerized Python Flask app that exposes an API for interacting with Azure Purview to implement business logic.


Overview

Currently, three top-level functionalities are implemented:

  1. Create a list of glossary terms to track Custom/organization specific Sensitivity Labels (using a minified JSON)

    💡 Today, Purview only offers Automatic Labelling via Microsoft 365 Sensitivity Labels - this method using Glossary Terms offers a workaround for organizations not leveraging M365 labels, and allows us to programmatically query Purview's REST API to interrogate the asset for declared state (as declared by Data Teams).

  2. Create an entire asset chain for an Azure SQL Database, and apply glossary terms to serve as Custom Data Classifications (using a minified JSON)

    💡 The core value add here is that the Asset Columns will have the declared state available at time of provisioning, which allows us to monitor for classification drift using the methods demonstrated here. This is not possible without having the Asset present with the Custom labels within Purview before the first scan runs, i.e. without this capability, we are not able to track the initial state.

  3. Trigger Scan to establish end-to-end asset relationships and have Purview apply Classifications

Pre-reqs

  • Azure SQL DB Data Source has been registered with Purview (one-time activity)

  • A Scan has been created on the Data Source, but not run (one-time activity). This could also have been done with an API call if required.

  • We start with no Assets in this particular demo, but other assets can exist (assuming no conflict).

  • We start with no Glossary Terms in this particular demo, but other Terms can exist (assuming no conflict).

Run container on Docker Desktop

Clone this repo - then to run the container locally on Docker Desktop, run:

# Build container from Dockerfile
docker build -t purview-asset-ingestor .

# Start container by injecting environment variables
# (PowerShell syntax; in bash, replace each trailing ` with \)
docker run `
  -e "PURVIEW_NAME=<your--purview--account>" `
  -e "AZURE_CLIENT_ID=<your--client--id>" `
  -e "AZURE_CLIENT_SECRET=<your--client--secret>" `
  -e "AZURE_TENANT_ID=<your--azure--tenant--id>" `
  -p 5000:5000 `
  --rm -it purview-asset-ingestor
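
The four variables injected above are what the container needs to reach Purview. As a minimal sketch (not the app's actual startup code, which is not shown here), a fail-fast check for them could look like the following. The function name `missing_settings` is hypothetical.

```python
import os

# Variable names taken from the docker run command above.
REQUIRED_VARS = (
    "PURVIEW_NAME",
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
    "AZURE_TENANT_ID",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; passing a dict makes
    the check easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Illustration with an explicit dict: only PURVIEW_NAME is provided.
print(missing_settings({"PURVIEW_NAME": "my-purview"}))
```

Running a check like this before starting the web server surfaces a forgotten `-e` flag immediately, rather than on the first API call.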


The container can then be called via Postman at http://127.0.0.1:5000 as a GET request.
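
For readers without Postman, the same GET request can be sketched with Python's standard library. The helper names `build_get` and `call` are hypothetical, and the shape of the response body is not documented here, so the sketch simply returns it as text.

```python
import urllib.request

# Address from the port mapping (-p 5000:5000) used above.
BASE_URL = "http://127.0.0.1:5000"


def build_get(path="/"):
    """Build (but do not send) a GET request against the local container."""
    return urllib.request.Request(BASE_URL + path, method="GET")


def call(request, timeout=10):
    """Send a prepared request and return (status code, body text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


request = build_get()
print(request.get_method(), request.full_url)
# With the container running: status, body = call(request)
```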

Run container on Kubernetes

Use the deployment.yaml file to create a Kubernetes deployment:

# Create namespace, deployment and external service
kubectl create namespace purview
kubectl apply -f "secret-sample.yaml"
kubectl apply -f "deployment.yaml"
kubectl expose deployment purview-asset-ingestor --type=LoadBalancer --name=purview-asset-ingestor-service -n purview

# Tail logs (targeting the deployment avoids hard-coding the generated pod name)
kubectl logs deployment/purview-asset-ingestor -n purview --follow

Demonstration

Step 1: Create a list of glossary terms to track Custom/organization specific Classification Labels (using a minified JSON)

The following minified JSON payload represents our Organization's Custom Classification Labels:

[
  {
    "longDescription": "Passwords, access code, security questions or similar.",
    "name": "Contoso_IC_Restricted"
  },
  {
    "longDescription": "Sensitive Personal Info, Material Business Information",
    "name": "Contoso_IC_Sensitive"
  },
  {
    "longDescription": "Financial, Personal, Business, Product, Project or Proprietary Information.",
    "name": "Contoso_IC_Confidential"
  },
  {
    "longDescription": "Internal phone directory, employee IDs, HR Policies, Client info not combined with PII",
    "name": "Contoso_IC_Internal"
  },
  {
    "longDescription": "Published public information that can be found on the internet.",
    "name": "Contoso_IC_Public"
  }
]

We perform a POST request to http://127.0.0.1:5000/api/glossary/terms using Postman with the above JSON in the Body.
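
As an alternative to Postman, the POST can be sketched in Python. This is a minimal sketch under assumptions: the helper name `build_post` is hypothetical, the `Content-Type` header is assumed to be what the app expects, and the response format is not documented here. Only two of the five terms are shown for brevity.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"


def build_post(path, payload):
    """Build (but do not send) a POST request with a minified JSON body."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Two of the terms from the payload above.
terms = [
    {
        "longDescription": "Passwords, access code, security questions or similar.",
        "name": "Contoso_IC_Restricted",
    },
    {
        "longDescription": "Published public information that can be found on the internet.",
        "name": "Contoso_IC_Public",
    },
]

request = build_post("/api/glossary/terms", terms)
print(request.get_method(), request.full_url)
# With the container running: urllib.request.urlopen(request)
```

The same helper would serve the other two endpoints in this walkthrough (`/api/assets` and `/api/scan`) by changing the path and payload.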

And we see the Glossary Terms get created within Purview.

Step 2: Create an entire asset chain for an Azure SQL Database, and apply glossary terms to serve as Custom Data Classifications (using a minified JSON)

The following minified JSON payload represents the Azure SQL Database we are looking to onboard, containing the application-specific data schema and declared classifications:

{
  "serverName": "aemigration",
  "collectionId": "aia-purview-new",
  "databaseName": "contosoHR_AE",
  "schemaName": "dbo",
  "table": {
    "name": "Employees",
    "columns": [
      {
        "name": "Salary",
        "data_type": "varbinary",
        "classification": "Contoso_IC_Confidential"
      },
      {
        "name": "EmployeeID",
        "data_type": "int",
        "classification": "Contoso_IC_Internal"
      },
      {
        "name": "LastName",
        "data_type": "nvarchar",
        "classification": "Contoso_IC_Confidential"
      },
      {
        "name": "FirstName",
        "data_type": "nvarchar",
        "classification": "Contoso_IC_Confidential"
      },
      {
        "name": "SSN",
        "data_type": "varbinary",
        "classification": "Contoso_IC_Sensitive"
      }
    ]
  }
}

We perform a POST request to http://127.0.0.1:5000/api/assets using Postman with the above JSON in the Body.

And we see the Assets get created within Purview (including the Columns and classifications).
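
Because each column's `classification` must match a glossary term created in Step 1, it can help to check the payload before posting it. The following is a minimal client-side sketch, not something the app is known to do; the function name `undeclared_classifications` is hypothetical, and the misspelled label in the example is deliberate.

```python
# Term names from the Step 1 payload.
TERM_NAMES = [
    "Contoso_IC_Restricted",
    "Contoso_IC_Sensitive",
    "Contoso_IC_Confidential",
    "Contoso_IC_Internal",
    "Contoso_IC_Public",
]


def undeclared_classifications(asset, term_names):
    """Return (column, classification) pairs that match no known term."""
    known = set(term_names)
    return [
        (column["name"], column["classification"])
        for column in asset["table"]["columns"]
        if column["classification"] not in known
    ]


# Abbreviated asset payload; the second label is misspelled on purpose.
asset = {
    "serverName": "aemigration",
    "databaseName": "contosoHR_AE",
    "schemaName": "dbo",
    "table": {
        "name": "Employees",
        "columns": [
            {"name": "Salary", "data_type": "varbinary",
             "classification": "Contoso_IC_Confidential"},
            {"name": "SSN", "data_type": "varbinary",
             "classification": "Contoso_IC_Sensitiv"},
        ],
    },
}

print(undeclared_classifications(asset, TERM_NAMES))
```

An empty list means every declared classification maps to an existing term.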

Step 3: Trigger Scan to establish end-to-end asset relationships and have Purview apply Classifications

The following JSON payload asks Purview to run a scan against the Data Source we already established in pre-reqs:

{
  "dataSourceName" : "contosoHR",
  "scanName" : "Scan-AE"
}

We perform a POST request to http://127.0.0.1:5000/api/scan using Postman with the above JSON in the Body.

And we see the scan begin on the asset.

Step 4: Observe Assets with Custom Sensitivity labels (i.e. glossary terms) applied per column

Once the Scan is completed, we see the Assets have the Glossary Terms applied on the search facet and the Term layer.

And the Asset is labelled at the column level.

As desired.
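
With the declared state in Purview from the start, classification drift can be detected by comparing the declared label of each column with the label observed later. The sketch below shows only that comparison, under assumptions: the function name `classification_drift` is hypothetical, the `observed` values are invented for illustration, and retrieving the observed state from Purview's REST API is not shown.

```python
def classification_drift(declared, observed):
    """Map column name -> (declared, observed) wherever the two differ.

    A column missing from either side is reported with None in its place.
    """
    drift = {}
    for column in sorted(set(declared) | set(observed)):
        before = declared.get(column)
        after = observed.get(column)
        if before != after:
            drift[column] = (before, after)
    return drift


# Declared state, taken from the Step 2 payload.
declared = {
    "Salary": "Contoso_IC_Confidential",
    "EmployeeID": "Contoso_IC_Internal",
    "SSN": "Contoso_IC_Sensitive",
}

# Hypothetical later state: SSN was relabelled and EmployeeID lost its term.
observed = {
    "Salary": "Contoso_IC_Confidential",
    "SSN": "Contoso_IC_Internal",
}

for column, (before, after) in classification_drift(declared, observed).items():
    print(f"{column}: declared={before}, observed={after}")
```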

