Initial draft cosmos data extractor #116

Open. Wants to merge 1 commit into base: main.
1 change: 1 addition & 0 deletions .gitignore
@@ -9,6 +9,7 @@
*.user
*.userosscache
*.sln.docstates
*.DS_Store

# ignore appsettings
**/appsettings.development.json
52 changes: 52 additions & 0 deletions ModelTraining/CosmosData/README.md
@@ -0,0 +1,52 @@
# Extracting additional training data from CosmosDB

An item in the metadata store of our CosmosDB database looks like this:

```
{
    "id": "1ed0f937-3f63-4f6a-a680-1b4b4ef24fb9",
    "modelId": "AudioSet",
    "audioUri": "https://livemlaudiospecstorage.blob.core.windows.net/audiowavs/rpi_orcasound_lab_2020_09_27_21_15_03_PDT.wav",
    "imageUri": "https://livemlaudiospecstorage.blob.core.windows.net/spectrogramspng/rpi_orcasound_lab_2020_09_27_21_15_03_PDT.png",
    "reviewed": true,
    "timestamp": "2020-09-28T04:15:03.495901Z",
    "whaleFoundConfidence": 80.18461538461537,
    "location": {
        "id": "rpi_orcasound_lab",
        "name": "Haro Strait",
        "longitude": -123.2166658,
        "latitude": 48.5499978
    },
    "source_guid": "rpi_orcasound_lab",
    "predictions": [
        {
            "id": 0,
            "startTime": 2.5,
            "duration": 2.5,
            "confidence": 0.914
        },
        {
            "id": 1,
            "startTime": 7.5,
            "duration": 2.5,
            "confidence": 0.624
        },
        {
            "id": 2,
            "startTime": 12.5,
            "duration": 2.5,
            "confidence": 0.869
        },
        {
            "id": 3,
            "startTime": 15,
            "duration": 2.5,
            "confidence": 0.918
        }
    ]
}
```

The attached .NET application joins each item with the entries of its own `predictions` property (a self-join in the Cosmos DB SQL query). The query result is a JSON array with one element per item/prediction pair, where each element combines the relevant item metadata with the fields of a single prediction.
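To make the flattening concrete, here is a minimal in-memory sketch of the same idea. It uses `System.Text.Json` instead of Cosmos DB, the `Flatten` helper is hypothetical (it is not part of this PR), and the sample item is trimmed to a few fields and two predictions, so treat it as an illustration of the output shape rather than the tool's implementation.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Nodes;

// Trimmed copy of the sample item above: fewer fields, two predictions.
string item = @"{
  ""id"": ""1ed0f937-3f63-4f6a-a680-1b4b4ef24fb9"",
  ""source_guid"": ""rpi_orcasound_lab"",
  ""location"": { ""id"": ""rpi_orcasound_lab"", ""name"": ""Haro Strait"" },
  ""predictions"": [
    { ""id"": 0, ""startTime"": 2.5, ""duration"": 2.5, ""confidence"": 0.914 },
    { ""id"": 1, ""startTime"": 7.5, ""duration"": 2.5, ""confidence"": 0.624 }
  ]
}";

JsonArray rows = Flatten(item);
Console.WriteLine(rows.ToJsonString(new JsonSerializerOptions { WriteIndented = true }));

// Hypothetical helper (not part of the PR): one flat object per (item, prediction) pair.
static JsonArray Flatten(string json)
{
    JsonNode doc = JsonNode.Parse(json);
    var flat = new JsonArray();
    foreach (JsonNode p in doc["predictions"].AsArray())
    {
        JsonNode row = new JsonObject
        {
            // Item-level metadata is repeated on every row.
            ["id"] = doc["id"].GetValue<string>(),
            ["source_guid"] = doc["source_guid"].GetValue<string>(),
            ["location_name"] = doc["location"]["name"].GetValue<string>(),
            // Prediction-level fields differ from row to row.
            ["prediction_id"] = p["id"].GetValue<int>(),
            ["prediction_startTime"] = p["startTime"].GetValue<double>(),
            ["prediction_confidence"] = p["confidence"].GetValue<double>(),
        };
        flat.Add(row);
    }
    return flat;
}
```

An item with N predictions yields N rows, so the sample item with four predictions would contribute four elements to the result.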

The steps required to use this .NET application are detailed in this [link](https://learn.microsoft.com/en-us/training/paths/connect-to-azure-cosmos-db-sql-api-sdk/). They involve installing the `Microsoft.Azure.Cosmos` package from `nuget.org`, connecting to our Cosmos DB account, and executing the SQL query specified in `script.cs`. To build and run the application, run `dotnet run`.
9 changes: 9 additions & 0 deletions ModelTraining/CosmosData/app.csproj
@@ -0,0 +1,9 @@
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.Azure.Cosmos" Version="3.22.1" />
  </ItemGroup>
</Project>
19 changes: 19 additions & 0 deletions ModelTraining/CosmosData/prediction.cs
@@ -0,0 +1,19 @@
public class Prediction
{
    public string id { get; set; }
    public string modelId { get; set; }
    public string audioUri { get; set; }
    public string imageUri { get; set; }
    public bool reviewed { get; set; }
    public string timestamp { get; set; }
    public double whaleFoundConfidence { get; set; }
    public string location_id { get; set; }
    public string location_lat { get; set; }
    public string location_name { get; set; }
    public string location_long { get; set; }
    public string source_guid { get; set; }
    public string prediction_id { get; set; }
    public string prediction_startTime { get; set; }
    public string prediction_duration { get; set; }
    public string prediction_confidence { get; set; }
}
35 changes: 35 additions & 0 deletions ModelTraining/CosmosData/script.cs
@@ -0,0 +1,35 @@
using System;
using Microsoft.Azure.Cosmos;

string endpoint = "https://aifororcasmetadatastore.documents.azure.com:443/";
string key = "[INSERT PRIMARY KEY HERE]";

CosmosClient client = new CosmosClient(endpoint, key);
Review comment (Member):

Is it possible to use connection string instead of endpoint and key? If so, we don't need to hardcode endpoint.

Would also be nice to expose as flag (see below comment).
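One possible shape for this suggestion, as a sketch only: read a connection string from an environment variable so that neither the endpoint nor the key lives in `script.cs`. The variable name `COSMOS_CONNECTION_STRING` and the `ReadConnectionString` helper are assumptions made for illustration. The snippet deliberately avoids the Cosmos package so it runs on its own; the comment shows where the SDK's connection-string constructor would be used.

```csharp
using System;

// Hypothetical variable name; any name the team agrees on would work.
string connectionString = ReadConnectionString("COSMOS_CONNECTION_STRING");

if (connectionString == null)
{
    Console.Error.WriteLine("Set COSMOS_CONNECTION_STRING before running the extractor.");
}
else
{
    // With the Microsoft.Azure.Cosmos package referenced, script.cs could then use:
    //     CosmosClient client = new CosmosClient(connectionString);
    Console.WriteLine("Connection string loaded from the environment.");
}

// Hypothetical helper: returns the trimmed value, or null when unset or blank.
static string ReadConnectionString(string variableName)
{
    string value = Environment.GetEnvironmentVariable(variableName);
    return string.IsNullOrWhiteSpace(value) ? null : value.Trim();
}
```

Keeping the secret in the environment also means nothing sensitive needs to be excluded from the commit.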

AccountProperties account = await client.ReadAccountAsync();

// Sanity check
Console.WriteLine($"Account Name:\t{account.Id}");

// Get the database and container (they are created only if they do not already exist)
Database database = await client.CreateDatabaseIfNotExistsAsync("predictions");
Container container = await database.CreateContainerIfNotExistsAsync("metadata", "/source_guid");

string sql = "SELECT m.id, m.modelId, m.audioUri, m.imageUri, m.reviewed, m.timestamp, m.whaleFoundConfidence, m.location.id AS location_id, m.location.name AS location_name, m.location.longitude AS location_long, m.location.latitude AS location_lat, m.source_guid, p.id AS prediction_id, p.startTime AS prediction_startTime, p.duration AS prediction_duration, p.confidence AS prediction_confidence FROM metadata m JOIN p IN m.predictions";
Review comment (Member):

Can we add line breaks? :) See http://net-informations.com/q/faq/multilines.html for example.
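As a sketch of what line breaks could look like here, the same query can be written as a C# verbatim string literal (`@"..."`), which may span several lines. The query text is unchanged from `script.cs`; only the layout differs, and the extra whitespace should not affect how the query is parsed.

```csharp
using System;

// Same query as in script.cs, laid out over several lines with a verbatim string.
string sql = @"
    SELECT
        m.id, m.modelId, m.audioUri, m.imageUri, m.reviewed, m.timestamp,
        m.whaleFoundConfidence,
        m.location.id AS location_id,
        m.location.name AS location_name,
        m.location.longitude AS location_long,
        m.location.latitude AS location_lat,
        m.source_guid,
        p.id AS prediction_id,
        p.startTime AS prediction_startTime,
        p.duration AS prediction_duration,
        p.confidence AS prediction_confidence
    FROM metadata m
    JOIN p IN m.predictions";

Console.WriteLine(sql);
```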

Review comment (Member):

Also have a dumb question: what is "p"?
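On the question of `p`: in Cosmos DB SQL, `JOIN p IN m.predictions` declares `p` as an alias that iterates over the elements of each item's `predictions` array, so `p.id` and `p.confidence` refer to a single prediction. The nested loop below is a rough in-memory analogy; the data and the `SelfJoin` helper are hypothetical and do not use the Cosmos SDK.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for stored items: an id plus the ids of its predictions.
var metadata = new List<(string Id, int[] Predictions)>
{
    ("item-a", new[] { 0, 1, 2 }),
    ("item-b", new[] { 0 }),
};

foreach (string row in SelfJoin(metadata))
{
    Console.WriteLine(row);
}

// Rough analogy for: SELECT m.id, p.id FROM metadata m JOIN p IN m.predictions
static List<string> SelfJoin(List<(string Id, int[] Predictions)> source)
{
    var result = new List<string>();
    foreach (var m in source)              // m: one item in the container
    {
        foreach (int p in m.Predictions)   // p: one element of that item's predictions array
        {
            result.Add($"{m.Id} / prediction {p}");
        }
    }
    return result;
}
```

This prints four rows: three for `item-a` and one for `item-b`.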

QueryDefinition query = new (sql);

QueryRequestOptions options = new ();
options.MaxItemCount = 50;
Review comment (Member):

Is it possible to have this set from an external flag? This would be useful if we distribute the tool as a binary.
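A minimal sketch of how the page size could be taken from the command line, assuming a hypothetical `--max-item-count` flag and keeping the current value of 50 as the fallback. The `ParseMaxItemCount` helper is illustrative only; a real tool might prefer a dedicated argument-parsing library.

```csharp
using System;

// 'args' is provided implicitly to top-level statements.
int maxItemCount = ParseMaxItemCount(args, 50);
Console.WriteLine($"MaxItemCount = {maxItemCount}");
// In script.cs this value would replace the literal: options.MaxItemCount = maxItemCount;

// Hypothetical helper: looks for "--max-item-count <positive integer>" and
// returns the fallback when the flag is absent or its value is not usable.
static int ParseMaxItemCount(string[] argv, int fallback)
{
    for (int i = 0; i < argv.Length - 1; i++)
    {
        if (argv[i] == "--max-item-count"
            && int.TryParse(argv[i + 1], out int value)
            && value > 0)
        {
            return value;
        }
    }
    return fallback;
}
```

With this in place, an invocation such as `dotnet run -- --max-item-count 25` would request pages of up to 25 rows.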

Review comment (Collaborator):

If the developer is using Visual Studio, they can put the URL and key in their User Secrets so that they do not end up in the checked-in code. Otherwise, they will need to put them into an appsettings.Development.json file and make sure that file is not checked in as part of the pull request.

Review comment (Member):

I thought it was a console application, hence my suggestion. User secrets/appsettings.Development.json are generally used for web apps.


FeedIterator<Prediction> iterator = container.GetItemQueryIterator<Prediction>(query, requestOptions: options);

while (iterator.HasMoreResults)
{
    // Fetch one page of results (at most MaxItemCount rows)
    FeedResponse<Prediction> predictions = await iterator.ReadNextAsync();
    foreach (Prediction pred in predictions)
    {
        // Print each prediction column once (the duration column was previously printed twice)
        Console.WriteLine($"[{pred.prediction_id}]\t[{pred.prediction_startTime,40}]\t[{pred.prediction_duration,10}]\t[{pred.prediction_confidence,40}]\t");
    }
    Console.WriteLine("Press any key for next page of results");
    Console.ReadKey();
}