[dev:main merge] #308

Merged
dluc merged 4 commits into main from dev
Feb 13, 2024

@dluc (Collaborator) commented Feb 13, 2024

Motivation and Context (Why the change? What's the scenario?)

The current memory storage doesn't capture the position of a partition, making it hard or impossible to sort memories by their position in the original file. Position is also useful for retrieving memories "adjacent" to a given one in the document.

Similarly, memory records do not capture the "section" from which they are extracted, e.g. the page number, worksheet number, slide number, segment number, scene number etc.

Separately, the config file schema needs to be synchronized across projects (service, examples, tests).

High level description (Approach, Design)

  • Add Partition Number to memory records. Add example showing how to retrieve adjacent records.
  • Update text extractors to return Section Number. Save the information in Content Storage. The information is not stored with memory records yet. See changes from "DocToText" to "ExtractContent".
  • Config files schema updates
  • Update Redis tests
  • Add ElasticSearch config section
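
The first bullet mentions retrieving adjacent records by partition number. As a rough, language-agnostic sketch of that idea (this is not the example added in the PR and not the Kernel Memory API; the `Record` class and `adjacent_records` helper are hypothetical names used only for illustration), a search hit can be expanded into its neighbours like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a memory record; not the Kernel Memory type."""
    document_id: str
    partition_number: int
    text: str


def adjacent_records(records, hit, window=1):
    """Return the hit plus up to `window` partitions before and after it,
    restricted to the same document and sorted by partition number."""
    low = hit.partition_number - window
    high = hit.partition_number + window
    neighbours = [
        r for r in records
        if r.document_id == hit.document_id and low <= r.partition_number <= high
    ]
    return sorted(neighbours, key=lambda r: r.partition_number)


# Records arrive in arbitrary order, as they might from a vector search.
records = [
    Record("doc1", 2, "third chunk"),
    Record("doc1", 0, "first chunk"),
    Record("doc2", 1, "other document"),
    Record("doc1", 1, "second chunk"),
    Record("doc1", 3, "fourth chunk"),
]
hit = records[3]  # pretend a search matched partition 1 of doc1
context = adjacent_records(records, hit)
print([r.partition_number for r in context])  # prints [0, 1, 2]
```

Without a stored partition number, the only ordering available would be the relevance order returned by the search, which is why the new property is needed for this kind of context expansion.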

dluc and others added 3 commits February 9, 2024 20:56
New SectionNumber and PartitionNumber properties added to MemoryRecord.
Add example showing how to leverage partition numbers to load adjacent records.
Add MemoryRecord extension methods to decouple code from constants.
Log warnings in handlers when files are missing, e.g. a handler has been removed from the default pipeline.

Breaking changes:
* changes to ABSTRACTIONS assembly
* new tags added to records
* new props added to pipeline status file

The terminology used is "section number", which can be applied to
multiple data formats.

Formats:
* PowerPoint: page number => slide number
* Excel: page number => worksheet number
* PDF: section number is a reliable page number
* Word: section number is an approximate page number. The value is not
reliable because OOXML doesn't capture how a document is rendered, and
Word documents can be paginated differently across different editors.

Page numbers are not stored yet; the text chunker needs more work
to connect chunks to page numbers.
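
The per-format meaning of "section number" described in this commit note can be summarised as a small lookup table. This is only an illustrative sketch: the `SECTION_MEANING` table and `describe_section` helper are hypothetical and do not exist in the codebase, and reliability is recorded only where the note states it.

```python
# Hypothetical summary of the commit note above; names are illustrative only.
# Each entry: (what the section number means, reliability as stated in the note).
# None means the note does not say whether the value is reliable.
SECTION_MEANING = {
    "powerpoint": ("slide number", None),
    "excel": ("worksheet number", None),
    "pdf": ("page number", True),
    "word": ("approximate page number", False),
}


def describe_section(file_format, section_number):
    """Render a section number using the format-specific terminology."""
    meaning, reliable = SECTION_MEANING[file_format.lower()]
    suffix = " (not reliable)" if reliable is False else ""
    return f"{meaning} {section_number}{suffix}"


print(describe_section("PDF", 12))   # prints "page number 12"
print(describe_section("Word", 3))   # prints "approximate page number 3 (not reliable)"
```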
… fixes (#309)

## Motivation and Context (Why the change? What's the scenario?)

Some functional test configs used a different schema, making it harder
to copy settings from the service to the examples and tests.
The Redis config needs an update given the new tag for partition number.
Redis functional tests that check error messages were not aligned with
the latest code.
ElasticSearch config settings were missing from the service appsettings.json.

## High level description (Approach, Design)

* Add "KernelMemory" config prefix where missing.
* Update Redis config and Redis settings.
* Add ElasticSearch default settings.
@dluc dluc merged commit c3b67f8 into main Feb 13, 2024
4 checks passed
@dluc dluc deleted the dev branch February 13, 2024 02:37