
Building a data pipeline using Azure Durable Functions

Introduction

Features

Project structure explanation

Data source configuration

Function configurations

Database design

Project flow and functions explanation

Deployment

QA and Monitoring

Next Steps

Github Project Board

Introduction

github-repo-stats is a data pipeline built using Azure Durable Functions. It posts GitHub repo stats to Cosmos DB and publishes an event to Azure Event Grid to start any downstream processing.

Example of the repo stats obtained:

{
    "repo": "octokit/octokit.rb",
    "isArchived": false,
    "isTemplate": false,
    "repoUpdatedAt": "2021-01-19T20:44:18Z",
    "openIssues": 46,
    "closedIssues": 455,
    "totalIssues": 501,
    "openPRs": 13,
    "closedPRs": 150,
    "mergedPRs": 598,
    "totalPRs": 501,
    "stars": 3378
}
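
For context, here is a minimal sketch of the kind of GitHub GraphQL call that can return these fields. It is illustrative only, not the project's exact query; the aliases simply mirror the keys in the sample output above, and Github_Token is the setting described later in this README.

import os
import requests

# Illustrative query; aliases mirror the keys in the sample stats document.
QUERY = """
query {
  repository(owner: "octokit", name: "octokit.rb") {
    isArchived
    isTemplate
    updatedAt
    openIssues: issues(states: OPEN) { totalCount }
    closedIssues: issues(states: CLOSED) { totalCount }
    openPRs: pullRequests(states: OPEN) { totalCount }
    closedPRs: pullRequests(states: CLOSED) { totalCount }
    mergedPRs: pullRequests(states: MERGED) { totalCount }
    stargazerCount
  }
}
"""

response = requests.post(
    "https://api.github.com/graphql",
    json={"query": QUERY},
    headers={"Authorization": f"bearer {os.environ['Github_Token']}"},
)
print(response.json()["data"]["repository"])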

The pipeline harnesses the power of Durable Functions patterns: function chaining, and fan-out/fan-in to run tasks in parallel and pass data from one step to the next.

Tech stack used:

  • Azure Durable Functions using Python 3.8
  • Azure Cosmos DB
  • Azure Event Grid
  • Azure Application Insights
  • Azure Log Analytics
  • SendGrid
  • Github Actions

Features:

  • Easily configure pulling stats for an entire org or for individual repos through a simple JSON file
  • Get stats from multiple orgs at the same time, or from a combination of orgs and individual repos
  • Exclude repos that don't interest you from full orgs
  • Throttle requests to the GitHub GraphQL API using configurable values
  • Retry failed requests to the GitHub API after a cool-down
  • Takes Cosmos DB throughput into consideration to prevent requests from being dropped
  • Throttle Cosmos DB requests based on the RUs configured
  • Switch between Cosmos DB serverless and provisioned modes to take full advantage of Cosmos DB's infinite scaling
  • Send a notification at run start and a completion report at the end using SendGrid
  • Publish an event to Event Grid to start any downstream process
  • Run on a schedule using an Azure Functions timer trigger, or on demand using an HTTP trigger
  • Run completely on a local machine for development and testing using Azure emulators
  • Use GitHub Actions for deploying to the Function App

Project structure explanation

The code for all the functions in a specific function app is located in a root project folder that contains a host configuration file and one or more subfolders.

Each subfolder contains the code for a separate function. The folder structure is shown below:

github-repo-stats-azure-function
 ┣ .github
 ┃ ┗ workflows
 ┃ ┃ ┗ main_githubrepostats.yml
 ┣ AppendIndividualRepos
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ CreateGraphqlQuery
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ CreateRunId
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ Data
 ┃ ┗ sources.json
 ┣ DurableFunctionsHttpStart
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ ExecuteGraphqlQuery
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ GetReposForOrg
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ GetReposFromSource
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ GetRepoStatsOrchestrator
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ Helpers
 ┃ ┣ CosmosDBClient.py
 ┃ ┣ EventGridClient.py
 ┃ ┗ SendEmails.py
 ┣ OrchestratorTimeTrigger
 ┃ ┣ function.json
 ┃ ┣ sample.dat
 ┃ ┗ __init__.py
 ┣ ParseCosmosDBResults
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ ParseGraphqlQueryResult
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ PublishRunInfoToEventGrid
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ SendEmailNotifications
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ UpdateRunInfoWithStatus
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ UploadQueryResultsToCosmosDB
 ┃ ┣ function.json
 ┃ ┗ __init__.py
 ┣ .funcignore
 ┣ host.json
 ┣ local.settings.json
 ┣ proxies.json
 ┗ requirements.txt

In the above file tree, apart from the folders that hold individual functions, there are some configuration files at the root level.

  1. .funcignore : Contains files that need to be ignored while deploying the app to Azure.
  2. host.json : Metadata file that contains global configuration affecting all functions. It contains details about the version, log level, Application Insights settings, and the Azure Functions extension bundle version.
  3. local.settings.json : Stores app settings and connection strings. Any settings included in this file can be accessed using the Python os module (see the sketch after this list). These settings are only used when running locally and need to be exported to your Function App configuration. This file is ignored during deployment to prevent any configuration mismatch.
  4. proxies.json : Azure Functions Proxies is a toolkit that allows you to modify the requests and responses of your APIs. You can expose multiple function apps built as a microservice architecture through a single unified endpoint. (Not used in this project)
  5. requirements.txt : Lists project package dependencies, which get installed when publishing to Azure
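
For example, settings from local.settings.json (locally) or from the Function App configuration (in Azure) surface as environment variables. A minimal sketch of reading them:

import os

# Setting names are the ones documented under "Function configurations" below;
# the fallback values here are hypothetical defaults, not the project's.
github_token = os.environ["Github_Token"]
batch_size = int(os.environ.get("NumberOfReposToQueryPerCall", "50"))
serverless = os.environ.get("CosmosDB_ServerlessMode", "false") == "true"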

Data source configuration

The repos to pull stats for are configured in the sources.json file located in the Data folder:

{
    "fullOrgs": [
        {
            "orgName": "microsoft",
            "exclude": "FluidExamples,maro"
        },
        {
            "orgName": "Esri",
            "exclude": ""
        }
    ],
    "individualRepos": ["octokit/octokit.rb", "Azure/azure-cli"]
}

The fullOrgs array contains the names of orgs for which you want stats on all repos in the org. Optionally, you can provide a CSV list of repos to exclude for that org.

The individualRepos array contains the individual repos that you want stats for.
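
A minimal sketch of how such a file could be flattened into a repo list. This is illustrative only; the project's GetReposFromSource and GetReposForOrg functions do the real work, including calling the GitHub API to enumerate each org's repos:

import json

# Load the data source configuration described above.
with open("Data/sources.json") as f:
    sources = json.load(f)

# Full orgs: every repo in the org, minus the CSV exclude list.
for org in sources["fullOrgs"]:
    excluded = {name for name in org["exclude"].split(",") if name}
    print(f"org {org['orgName']}: all repos except {sorted(excluded) or 'none'}")

# Individual repos are appended to the final list as-is.
repos = list(sources["individualRepos"])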

Function configurations

The local.settings.json file has the following settings that are configurable as required.

  • Github_Token : GitHub API key that has access to pull repo stats

  • NumberOfReposToQueryPerCall : This setting is used to batch the Github GraphQL query call.

    E.g., if this is set to 65, one GraphQL call will batch 65 repos and get data for all of them in a single call. Don't set this number too high, as the GitHub API will time out.

  • CosmosDB_Endpoint : Cosmos DB account endpoint. You can use the local emulator during development

  • CosmosDB_PrimaryKey: Cosmos DB account key

  • CosmosDB_DBName : Cosmos DB database id

  • CosmosDB_DataContainerName : Cosmos DB container which holds stats data

  • CosmosDB_RunInfoContainerName : Cosmos DB container which holds run info data

  • CosmosDB_ServerlessMode : Specifies whether Cosmos DB is in serverless mode. If set to "true", create-item operations will not be throttled and items will be created in parallel

  • CosmosDB_ProvisionedThroughput : Throughput allowed for this app. Used for throttling requests

  • CosmosDB_RU_NeededForEachWrite : Throughput needed for each 1KB write in Cosmos DB

  • SendEmailNotifications : Specifies whether to send email notifications about run start, run end, and error conditions

  • SendGrid_API_Key : SendGrid account API key, used for sending emails

  • SendGrid_VerifiedFromSenderEmail : Verified email in SendGrid account, used for sending emails

  • SendGrid_ToEmail : Email for receiving run notifications

  • PublishToEventGrid : Specifies whether to publish events to Azure Event Grid after run completion to start any other processes

  • EventGridEndpoint : Event Grid topic endpoint

  • EventGridKey : Event Grid topic key

Database design

This app uses two Cosmos DB containers, which are configurable (in local.settings.json during development, and in the Function App configuration in Azure):

  1. CosmosDB_DataContainerName : Cosmos DB container which holds stats data

  2. CosmosDB_RunInfoContainerName : Cosmos DB container which holds run information. This container is updated with the runId before the run starts and with the run details after completion. A sample run info document:

    {
        "id": "1611127298",
        "date": "20210119",
        "totalReceived": 1,
        "totalProcessed": 1,
        "totalCreatedCount": 1,
        "totalFailedCount": 0,
        "createdList": "octokit/octokit.rb",
        "failedList": ""
    }
    

The id created in the run info container is appended to the repo name and used as the id in the stats container. A sample data container document:

{
    "id": "octokit.octokit.rb.1611127298",
    "repo": "octokit/octokit.rb",
    "isArchived": false,
    "isTemplate": false,
    "repoUpdatedAt": "2021-01-19T20:44:18Z",
    "openIssues": 46,
    "closedIssues": 455,
    "totalIssues": 501,
    "openPRs": 13,
    "closedPRs": 150,
    "mergedPRs": 598,
    "totalPRs": 501,
    "stars": 3378
}

repo is used as the partition key for the data container.

After the run completes, the run id is published to Event Grid. Using the run id, we can get the repos processed in this run from the run info container and do a point read by forming the id for each repo, which is a combination of the repo name (with owner) and the run id.
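
A minimal sketch of such a point read using the azure-cosmos Python SDK, assuming the id format shown in the sample document above:

import os
from azure.cosmos import CosmosClient

client = CosmosClient(os.environ["CosmosDB_Endpoint"], os.environ["CosmosDB_PrimaryKey"])
container = (client
             .get_database_client(os.environ["CosmosDB_DBName"])
             .get_container_client(os.environ["CosmosDB_DataContainerName"]))

repo, run_id = "octokit/octokit.rb", "1611127298"
item_id = f"{repo.replace('/', '.')}.{run_id}"  # "octokit.octokit.rb.1611127298"
doc = container.read_item(item=item_id, partition_key=repo)  # repo is the partition key
print(doc["stars"])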

Project flow and functions explanation

As noted in the function configurations, the NumberOfReposToQueryPerCall property controls the number of repos to batch in a single GraphQL query.

The CosmosDB_ServerlessMode property controls the rate at which items are created in Cosmos DB. If this property is set to "true" items are created in Cosmos DB in parallel without taking throughput into consideration. If this property is set to "false" the CosmosDB_ProvisionedThroughput and CosmosDB_RU_NeededForEachWrite are used for throttling the rate at which items are created in Cosmos DB.
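
A minimal sketch of the throttling arithmetic in provisioned mode (illustrative only; the fallback values are hypothetical):

import os

# RU/s budget for this app and the RU cost of each ~1KB write; the fallback
# values here are hypothetical defaults, not the project's.
throughput = int(os.environ.get("CosmosDB_ProvisionedThroughput", "400"))
ru_per_write = int(os.environ.get("CosmosDB_RU_NeededForEachWrite", "10"))

# Only throughput / ru_per_write writes fit into one second,
# e.g. 400 / 10 = 40 create-item calls per second.
writes_per_second = throughput // ru_per_write
print(f"Throttle uploads to about {writes_per_second} create-item calls per second")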

Sequence diagram for processing 195 repos

  • GetRepoStatsOrchestrator : Orchestrates all the other function calls; triggered by a timer function every 6 hours

  • CreateRunId : Creates an id for the current run in the run info container.

  • SendEmailNotifications : Sends a notification about the run start

  • GetReposFromSource : Parses the sources.json file for processing

  • GetReposForOrg : Gets the repos for the orgs listed in the fullOrgs property in sources.json, and filters out the repos in the exclude property from the list obtained. This function is executed serially for each org; the GitHub API doesn't allow calling the same org in parallel

  • AppendIndividualRepos : Appends the individual repos in the sources.json file to the list of repos obtained for the orgs. The final list of repos to pull stats for is formed in this step.

  • CreateGraphqlQuery : Creates batches of GraphQL queries to execute, using NumberOfReposToQueryPerCall as the batch size. This function is executed in parallel on the list of repos to pull data for; the parallel task count is TotalNumberOfRepos/BatchSize (see the fan-out/fan-in sketch after this list)

  • ExecuteGraphqlQuery : Executes the batches of GraphQL queries created in the previous step and returns the results. This function is executed serially, one batch after another, and retries a query if it fails. A 60-second cool-down period is applied when an error occurs.

  • ParseGraphqlQueryResult : Parses the results of the ExecuteGraphqlQuery function. This function is executed in parallel on the results obtained; the parallel task count is TotalNumberOfRepos/BatchSize

  • UploadQueryResultsToCosmosDB : Uploads the parsed query results to Cosmos DB. This function is executed in parallel or serially depending on the Cosmos DB configuration.

  • ParseCosmosDBResults : Parses the Cosmos DB create-item operation results to create a report on the run status: the number of items processed, the number of successful creates, the number of failures, etc.

  • UpdateRunInfoWithStatus : Updates the run info container with the current run status. This helps downstream processes do point reads on the data container.

  • PublishRunInfoToEventGrid : Publishes an event to Azure Event Grid with the run completion status and run details. This helps start any downstream processes like analytics and dashboard creation

  • SendEmailNotifications : Sends a notification about run completion and a report on the current run
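
A minimal fan-out/fan-in sketch in the style of the Python Durable Functions programming model this project uses. It is simplified: the real GetRepoStatsOrchestrator chains more steps (run id creation, notifications, reporting) and handles errors:

import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Chaining: each yield waits for the previous activity to finish.
    batches = yield context.call_activity("GetReposFromSource", None)

    # Fan out: one CreateGraphqlQuery task per batch of repos, then fan in
    # with task_all to wait for every batch. (Batching by
    # NumberOfReposToQueryPerCall is simplified away here.)
    query_tasks = [context.call_activity("CreateGraphqlQuery", batch)
                   for batch in batches]
    queries = yield context.task_all(query_tasks)

    # Chaining again: execute the queries serially, then upload the results.
    results = yield context.call_activity("ExecuteGraphqlQuery", queries)
    yield context.call_activity("UploadQueryResultsToCosmosDB", results)
    return "run complete"

main = df.Orchestrator.create(orchestrator_function)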

Gantt chart for processing 195 repos

Deployment

GitHub Actions are used on the main branch to deploy code to Azure Functions on push events.

QA and Monitoring

  • Tested with 6,000 repos scheduled every 6 hours, with no issues.
  • Exceptions are handled to retry failed GraphQL queries
  • Exceptions are handled on Cosmos DB create- and replace-item operations to ensure data quality
  • A report on the run status is sent after the run is completed
  • Notifications are sent on function failures, and logging is implemented in the functions to record any errors
  • Log Analytics dashboards and alerts are set up for the Function App and Cosmos DB
  • Notifications are received on deployment failures from GitHub Actions.

Next Steps

  • Develop an analytics pipeline that starts after data is ingested into Cosmos DB
  • Develop visualization dashboards
  • Develop a pipeline to archive data and free up space in Cosmos DB after analytics are done, so that the free tier limit is not crossed
  • Develop a unit test suite and data quality tests

Github Project Board
