Context / Scenario
I am setting different values for MaxTokensPerParagraph and OverlappingTokens to find the optimal chunking strategy for answering a set of test questions about a document.
What happened?
When I leave MaxTokensPerParagraph as is (1000) and increase only OverlappingTokens (via appsettings) by +100 between test sessions, the resulting chunk/paragraph size keeps decreasing: the chunks become smaller with every increase.
I finally ended up with MaxTokensPerParagraph: 1000 and OverlappingTokens: 800, which produced paragraphs/chunks of only around 200 tokens, as counted by the OpenAI Tokenizer.
What I expected was the resulting chunk size to be either:
MaxTokensPerParagraph + OverlappingTokens, or
MaxTokensPerParagraph (with the OverlappingTokens counted as part of the MaxTokensPerParagraph limit)
I tested with a single 46-page document, re-ingested with the exact same call to ImportDocumentAsync() between test sessions (upserting/replacing previous chunks for the same document id). For each test I left MaxTokensPerParagraph unchanged, increased OverlappingTokens, saved appsettings, restarted the service, and re-ingested the same document.
For now, the issue can be worked around by increasing both MaxTokensPerParagraph and OverlappingTokens by the same amount, if you want the resulting chunk size to stay roughly equal to the specified MaxTokensPerParagraph.
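For reference, here is a minimal sketch (not taken from the issue) of how the two settings and the ingestion call fit together when embedding KM as a serverless instance; it assumes the WithCustomTextPartitioningOptions builder extension is available and uses placeholder values for the API key, file name and document id. When running KM as a service, the same two values are set in the service's appsettings instead, as described above.

```csharp
using System;
using Microsoft.KernelMemory;
using Microsoft.KernelMemory.Configuration;

// Minimal sketch, not the reporter's actual code: configure partitioning in code rather than
// via appsettings, assuming the serverless setup. Key, file name and document id are placeholders.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .WithCustomTextPartitioningOptions(new TextPartitioningOptions
    {
        MaxTokensPerParagraph = 1000, // kept constant across test sessions
        OverlappingTokens = 800       // raised by +100 per test; chunks shrank to roughly 200 tokens
    })
    .Build<MemoryServerless>();

// Re-importing with the same document id replaces the chunks previously stored for that document
await memory.ImportDocumentAsync("my-46-page-document.pdf", documentId: "doc001");
```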
Importance
a fix would make my life easier
Platform, Language, Versions
Windows 10, C#, Kernel Memory 0.27.240205.2
Relevant log output
No response
Spent some time inspecting the chunker in KM, which we took from SK. The initial version I wrote in 2022 didn't have the overlapping tokens logic, which was introduced later with PR microsoft/semantic-kernel#1206 by @MonsterCoder. Later the code received considerable changes in
I'm considering a complete rewrite, taking the opportunity to simplify the public methods and clarify the behavior:
Remove the split-by-sentence step; I don't think this is needed externally, it's just an internal implementation detail
Remove the "header" feature; it's not needed in KM and it's not the best approach to tagging content
When overlap is enabled, define whether the overlap is an exact number of tokens that may break sentences, or a best effort that keeps sentences intact. E.g. one might choose a 10-token overlap using sentences, in which case the exact number depends on the length of the sentences carried over from the previous partition (see the sketch after this list).
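To make the exact-token variant concrete, here is an illustrative sketch (not KM's or SK's actual code; the helper name and signature are made up): each chunk holds up to maxTokens tokens and advances by maxTokens - overlapTokens, so with maxTokens = 1000 and overlapTokens = 800 every chunk would still be 1000 tokens long, but only 200 of them would be new text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class OverlapSketch
{
    // Exact-token overlap: chunk boundaries may break sentences, but the overlap size is precise.
    // A sentence-preserving variant would instead carry over whole trailing sentences until
    // roughly overlapTokens tokens have been copied, making the effective overlap approximate.
    public static IEnumerable<IReadOnlyList<int>> Split(IReadOnlyList<int> tokens, int maxTokens, int overlapTokens)
    {
        if (overlapTokens >= maxTokens)
        {
            throw new ArgumentException("Overlap must be smaller than the max chunk size");
        }

        int step = maxTokens - overlapTokens; // net new tokens contributed by each chunk
        for (int start = 0; start < tokens.Count; start += step)
        {
            int length = Math.Min(maxTokens, tokens.Count - start);
            yield return tokens.Skip(start).Take(length).ToList();
            if (start + length >= tokens.Count) { yield break; } // last chunk reached the end of the input
        }
    }
}
```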