Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.Net: Add PostgresVectorStore Memory connector. #9324

Open
wants to merge 57 commits into
base: main
Choose a base branch
from

Conversation

lossyrob
Copy link
Contributor

@lossyrob lossyrob commented Oct 18, 2024

This PR adds a PostgresVectorStore and related classes to Microsoft.SemanticKernel.Connectors.Postgres.

Motivation and Context

As part of the move to having memory connectors implement the new Microsoft.Extensions.VectorData.IVectorStore architecture (see https://github.com/microsoft/semantic-kernel/blob/main/docs/decisions/0050-updated-vector-store-design.md), each memory connector needs to be updated with the new architecture. This PR tackles updating the existing Microsoft.SemanticKernel.Connectors.Postgres package to include this implementation. This will supercede the PostgresMemoryStore implementation.

Some high level comments about design:

  • PostgresVectorStore and PostgresVectorStoreRecordCollection get injected with an IPostgresVectorStoreDbClient. This abstracts the database communication and allows for unit tests to mock database interactions.
  • The PostgresVectorStoreDbClient gets passed in a NpgsqlDataSource from the user, which is used to manage connections to the database. The responsibility of connection pool lifecycle management is on the user.
  • The IPostgresVectorStoreDbClient is designed to accept and produce the storage model, which in this case is a Dictionary<string, object?> . This is the intermediate type that is mapped to by the IVectorStoreRecordMapper.
  • The PostgresVectorStoreDbClient also takes a IPostgresVectorStoreCollectionSqlBuilder, which generates SQL command information for interacting with the database. This abstracts the SQL queries related to each task, and allows for future expansion. This is particularly targeted at creating a AzureDBForPostgre vector store that will enable alternate vector implementations like DiskANN, while leveraging the same database client as the Postgres connector.
  •  The integration tests for the vector store utilize Docker.Net to bring up a pgvector/pgvector docker container, which test are run against.

Contribution Checklist

Work in progress, some methods are not implemented yet.
@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel memory labels Oct 18, 2024
Rob Emanuele added 2 commits October 28, 2024 14:55
e.g. Change type of property 'CollectionInts' from 'System.Collections.Generic.ICollection<int>?' to 'System.Collections.Generic.HashSet<int>?' for improved performance
@lossyrob
Copy link
Contributor Author

@westey-m thanks for all the detailed feedback. I think I've addressed all the comments, and this is good for another review.

@@ -64,4 +74,15 @@ internal static class PostgresConstants
/// <summary>The name of the column that returns distance value in the database.</summary>
/// <remarks>It is used in the similarity search query. Must not conflict with model property.</remarks>
public const string DistanceColumnName = "sk_pg_distance";

/// <summary>The default index kind.</summary>
public const string DefaultIndexKind = IndexKind.Hnsw;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's also OK to use Flat here as the default. When I was suggesting using Hnsw earlier, I didn't realise that postgres can also work without, and that's a totally acceptable default.

/// <param name="options">Optional options to further configure the <see cref="IVectorStore"/>.</param>
/// <param name="serviceId">An optional service id to use as the service key.</param>
/// <returns>The service collection.</returns>
public static IServiceCollection AddPostgresVectorStore(this IServiceCollection services, string connectionString, PostgresVectorStoreOptions? options = default, string? serviceId = default)
public static IServiceCollection AddPostgresVectorStore(this IServiceCollection services, NpgsqlDataSource dataSource, PostgresVectorStoreOptions? options = default, string? serviceId = default)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that this overload is that useful. Where someone builds their DI container, they will not typically be able to create a NpgsqlDataSource in a way that will get it destroyed in the right timeframe. E.g. if they use a using clause it'll just get destroyed immediately before the app really starts. If they register it with DI to allow it to handle it's lifecycle, they will use a different overload.

FYI: We typically follow a philosophy of avoiding adding extra overloads or helpers unless we have a good justification, simply because it means more maintenance and more chances of breaking changes in future. So less is more 😄

/// <param name="options">Optional options to further configure the <see cref="IVectorStoreRecordCollection{TKey, TRecord}"/>.</param>
/// <param name="serviceId">An optional service id to use as the service key.</param>
/// <returns>Service collection.</returns>
public static IServiceCollection AddPostgresVectorStoreRecordCollection<TKey, TRecord>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, not sure we need this one.

{
yield return collection;
}
return this._postgresClient.GetTablesAsync(cancellationToken);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to also wrap any errors here in VectorStoreOperationException

public void MapFromDataToStorageModelWithNumericKeyReturnsValidStorageModel()
{
// Arrange
var definition = GetRecordDefinition<string>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is also using a string key instead of numeric as the test name suggests.

var result = mapper.MapFromStorageToDataModel(storageModel, new() { IncludeVectors = includeVectors });

// Assert
Assert.Equal("key", result.Key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, looks like this should still be changed to be numeric?

this._serviceCollection.AddSingleton<NpgsqlDataSource>(dataSource);

// Act
this._serviceCollection.AddPostgresVectorStoreRecordCollection<string, TestRecord>("testcollection");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numeric Key?

new VectorStoreRecordDataProperty("code", typeof(int)),
new VectorStoreRecordDataProperty("rating", typeof(float?)),
new VectorStoreRecordDataProperty("description", typeof(string)),
new VectorStoreRecordDataProperty("parking_is_included", typeof(bool)),
Copy link
Contributor

@westey-m westey-m Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth having a record definition where some of the properties have storage names that are different to the main property name, to make sure these are used correctly in the command. Same with the other tests here that have to build names into the command.

{
return new VectorStoreRecordDefinition
{
Properties = new List<VectorStoreRecordProperty>
Copy link
Contributor

@westey-m westey-m Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add one or two enumerables to the property list as well to verify that they word as expected.

{
Properties = new List<VectorStoreRecordProperty>
{
new VectorStoreRecordKeyProperty("HotelId", typeof(ulong)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using this anywhere? From the constants list I don't believe we support ulong, so I would expect using this to fail.

[InlineData(typeof(long), 7L)]
[InlineData(typeof(string), "key1")]
[InlineData(typeof(Guid), null)]
public async Task ItCanGetAndDeleteRecordAsync(Type idType, object? key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also possible to make test methods generic, e.g. public async Task ItCanGetAndDeleteRecordAsync<TKey>(Type idType, TKey? key), which means you can use TKey in the method and you don't have to use dynamic.

namespace SemanticKernel.IntegrationTests.Connectors.Memory.Postgres;

[Collection("PostgresVectorStoreCollection")]
public sealed class PostgresVectorStoreRecordCollectionTests(PostgresVectorStoreFixture fixture)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a test that reads some data that wasn't upserted by the Collection. E.g. if you create a table and upsert some records using SQL, and then read them using the Collection. The reason being that we want to ensure that the data in the DB is actually what we want it to be. E.g. if we do two complimentary things wrong, on upsert and read, so it looks fine when using the Collection, but the data isn't stored in the way we need it to in the DB, a test where we didn't write the data via the Collection should catch this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation kernel Issues or pull requests impacting the core kernel memory .NET Issue or Pull requests regarding .NET code
Projects
Status: Community PRs
Development

Successfully merging this pull request may close these issues.

5 participants