
This is a sample repo for setting up Azure Cosmos DB (Gremlin API), Azure Cognitive Search and Synapse Spark to ingest and visualize data as a graph.


page_type: sample
languages: python
products: cosmosdb, azure-synapse, azure-cognitive-search, container-instance-service, container-registry

Cosmos DB Gremlin Graph Demo

Problem statement

Bank transactions have traditionally been stored in transactional databases and analysed using SQL queries; to increase scale, they are now analysed in distributed systems using Apache Spark. While SQL is great for analysing this data, finding relationships between transactions and accounts can be challenging. In this scenario we want to visualize two levels of customer relationships: if A sends to B and B sends to C, then when we look at transactions made by A we also want to identify C, and vice versa.

Overview


Graphs help solve complex problems by utilizing the power of relationships between objects. Some of these problems can be modeled as SQL statements, but the Gremlin API provides a more concise way to express and search relationships. In this solution we use Azure Cosmos DB Gremlin API to store the transaction data, with customer account IDs as vertices, transactions as edges, and transaction amounts as properties of the edges. Since running fan-out queries on Cosmos DB is not ideal, we leverage Azure Cognitive Search to index the data in Cosmos DB and use the search API to perform full scan/search queries. Additionally, Azure Cognitive Search gives us the flexibility to search for an account as either sender or receiver. This provides a solution that can scale to any number of transactions while keeping the RU requirement for Cosmos DB queries low.
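As a concrete illustration of this model, the following minimal sketch (not part of this repo; the endpoint, database/graph names and key are placeholders) submits a traversal over account vertices and transaction edges to a Cosmos DB Gremlin endpoint. Cosmos DB's Gremlin support accepts textual queries, which is why the traversal is sent as a string via gremlinpython's client.submit().

```python
# Minimal sketch: querying the account graph with gremlinpython.
from gremlin_python.driver import client, serializer

# Placeholder endpoint and credentials -- substitute your own values.
gremlin_client = client.Client(
    "wss://<cosmos-account>.gremlin.cosmos.azure.com:443/",
    "g",
    username="/dbs/<database>/colls/<graph>",
    password="<primary-key>",
    message_serializer=serializer.GraphSONSerializersV2d0(),
)

# Accounts are vertices keyed by accountId; transactions are edges whose
# 'amount' property carries the transaction amount.
query = "g.V().has('accountId', 'A').outE('transaction').values('amount')"
amounts = gremlin_client.submit(query).all().result()
print(amounts)
gremlin_client.close()
```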

Features:


  1. Synapse Spark is used to bulk load data into the graph through the Cosmos DB SQL API. NOTE: Cosmos DB Gremlin expects certain JSON fields on the vertex and edge documents (see the sketch after this list). Since Cosmos DB billing is charged per hour, adjust the RUs accordingly to minimize cost: with a 4-node Spark cluster and Cosmos DB throughput of 20,000 RU/s (single region), both edges (9 million records) and vertices (6 million records) can be ingested in about an hour.
  2. All search fan-out queries are done using the Azure Cognitive Search API; the Cosmos DB indexer can be scheduled at regular intervals to keep the index up to date.
  3. To keep RU consumption low, the Gremlin query is constructed to include an account list. For example, when you search for account xyz, every account that sent to or received from xyz is collected as vertices_list, and a Gremlin query that retrieves two levels of transactions is executed: g.V().has('accountId',within({vertices_list})).optional(both().both()).bothE().as('e').inV().as('v').select('e', 'v'). You can customize this query for your use case (a helper that builds it is sketched below).
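The note in item 1 and the query in item 3 can be made concrete with a short sketch. The document shapes below reflect the commonly described internal JSON format Cosmos DB Gremlin uses when vertices and edges are written through the SQL API (fields such as _isEdge, _vertexId and _sink); they are illustrative assumptions, so verify them against the repo's notebook. The query builder simply reproduces the two-level traversal shown above.

```python
# Sketch only: assumed document shapes for SQL-API bulk loading into a
# Cosmos DB Gremlin graph, plus a builder for the two-level fan-out query.
import uuid

def vertex_doc(account_id: str) -> dict:
    # Vertex properties are stored as arrays of {id, _value} objects.
    # The graph's partition key field must also be present on the document.
    return {
        "id": account_id,
        "label": "account",
        "accountId": [{"id": str(uuid.uuid4()), "_value": account_id}],
    }

def edge_doc(src: str, dst: str, amount: float) -> dict:
    # Edges are flat documents flagged with _isEdge, pointing from the
    # source vertex (_vertexId) to the target vertex (_sink).
    return {
        "id": str(uuid.uuid4()),
        "label": "transaction",
        "_isEdge": True,
        "_vertexId": src,
        "_vertexLabel": "account",
        "_sink": dst,
        "_sinkLabel": "account",
        "amount": amount,  # scalar edge property
    }

def two_level_query(vertices_list: list[str]) -> str:
    # Reproduces the query from feature 3 for a list of account ids.
    ids = ",".join(f"'{v}'" for v in vertices_list)
    return (
        f"g.V().has('accountId', within({ids}))"
        ".optional(both().both()).bothE().as('e').inV().as('v')"
        ".select('e', 'v')"
    )
```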

Prerequisites


Deploying this demo requires the following:

  1. Azure subscription-level role assignments for both Contributor and User Access Administrator.
  2. Azure Service Principal with a client ID and secret - How to create Service Principal.
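For orientation, here is a hedged sketch (assumed usage, not code from this repo) of how the service principal's client ID and secret are typically consumed from Python via azure-identity; the tenant, client and secret values are placeholders.

```python
# Sketch: building a credential from the service principal created above.
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",          # the service principal's app id
    client_secret="<client-secret>",
)
# Azure SDK clients (storage, search, etc.) accept this credential object.
```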

Getting started


Step.1 Deploy infrastructure

There are three deployment options for this demo:

  1. Option 1:

    1. Click on the link below to deploy the template.

    Deploy to Azure

  2. Option 2:

    1. Open a browser to https://shell.azure.com. Azure Cloud Shell is an interactive, authenticated, browser-accessible shell for managing Azure resources; it lets you choose the shell experience that best suits the way you work, either Bash or PowerShell
    2. Select the Cloud Shell icon on the Azure portal
    3. Select Bash
    4. git clone this repo and cd into the infra directory
    5. Update the settings.sh file with the required values (use the code command in the Bash shell to open the file in VS Code)
    6. Run ./infra-deployment.sh to deploy the infrastructure

    The above deployment should create a container instance running a sample dashboard

  3. Option 3:

    1. Use GitHub Actions to deploy services. Go to github_action_infra_deployment to see how to deploy services.

Step.2 Post install access setup

  1. Add your client IP to allow access to the Synapse workspace. Navigate to resource group -> Synapse workspace -> Networking -> click "Add client IP" and Save

  2. Add yourself as a user to Synapse workspace. Navigate to Synapse workspace -> manage -> Access control -> Add -> scope "workspace" -> role "Synapse Administrator" -> select user "[email protected]" -> Apply

  3. Add yourself as a user to Synapse Apache Spark administrator. Navigate to Synapse workspace -> manage -> Access control -> Add -> scope "workspace" -> role "Synapse Apache Spark administrator" -> select user "[email protected]" -> Apply

  4. Create a data container. Navigate to the storage account, create a container e.g. "data", and upload the CSV file into this container

  5. Assign read/write access to the storage account. Navigate to Synapse workspace -> select "Data" section -> select and expand "Linked" storage -> select the primary storage account and container e.g. "data" -> right click on container "data" and click "Manage access" -> Add -> search and select user "[email protected]" -> assign read and write -> click Apply

Step.3 Load data

  1. Upload the CSV file PS_20174392719_1491204439457_log.csv into the Synapse default storage account. Data source: Kaggle Fraud Transaction Detection. (NOTE: you need to use git-lfs to download the CSV file locally.) If you would rather script the upload, see the sketch below.
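A minimal upload sketch with azure-storage-blob follows; the storage account name, credential, and container name ("data", matching Step 2) are placeholders/assumptions rather than values from this repo.

```python
# Sketch: uploading the Kaggle CSV into the "data" container.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential="<account-key>",  # or a token credential
)
blob = service.get_blob_client(
    container="data", blob="PS_20174392719_1491204439457_log.csv"
)
with open("PS_20174392719_1491204439457_log.csv", "rb") as f:
    blob.upload_blob(f, overwrite=True)
```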

Step.4 Data ingestion using PySpark

  1. Import the notebook "Load_Bank_transact_data.ipynb"
  2. Update linkedService, cosmosEndpoint, cosmosMasterKey, cosmosDatabaseName and cosmosContainerName in the notebook
  3. Run the notebook and monitor the progress of the data load from the Cosmos DB Insights view (NOTE: Cosmos DB billing is per hour, so adjust your RUs accordingly to minimize cost); a sketch of the kind of write the notebook performs follows below
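For reference, this is roughly the shape of the bulk write the notebook performs, sketched with the Azure Cosmos DB Spark 3 OLTP connector available in Synapse (format "cosmos.oltp"). The path, names and the omission of vertex/edge shaping are simplifications; the notebook itself remains the authoritative version.

```python
# Sketch (runs inside a Synapse notebook, where `spark` is predefined):
# read the CSV from the default storage account and bulk write to Cosmos DB.
df = (
    spark.read.option("header", "true")
    .csv("abfss://data@<storage-account>.dfs.core.windows.net/"
         "PS_20174392719_1491204439457_log.csv")
)

(
    df.write.format("cosmos.oltp")
    .option("spark.cosmos.accountEndpoint", "https://<cosmos-account>.documents.azure.com:443/")
    .option("spark.cosmos.accountKey", "<cosmos-master-key>")
    .option("spark.cosmos.database", "<database-name>")
    .option("spark.cosmos.container", "<container-name>")
    .option("spark.cosmos.write.strategy", "ItemOverwrite")
    .mode("Append")
    .save()
)
```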

Step.5 Sample dashboard app

A sample Python web app is deployed as part of the infra deployment. Navigate to the public URL from the container instance and start exploring the data (see the dashboard screenshot in the repo).
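Under the hood, a lookup like the dashboard's combines the Azure Cognitive Search fan-out with the Gremlin query from the Features section. A minimal sketch of the search side follows; the index name and the sourceAccount/destAccount field names are assumptions, not taken from this repo.

```python
# Sketch: find every account that sent to or received from account "xyz",
# collecting them as the vertices_list used by the two-level Gremlin query.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="<cosmos-index>",  # kept fresh by the Cosmos DB indexer
    credential=AzureKeyCredential("<query-key>"),
)

results = search_client.search(
    search_text="xyz",
    search_fields=["sourceAccount", "destAccount"],  # assumed field names
)

vertices_list = set()
for doc in results:
    vertices_list.update([doc["sourceAccount"], doc["destAccount"]])
# vertices_list now feeds g.V().has('accountId', within(...)) as in Features.
```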

Limitations

  • User authentication is not yet implemented for the dashboard app
