Reduce the maximum word proximity from 8 to 4 #3820

Draft
@loiclec loiclec commented Jun 7, 2023

This is an experiment to evaluate the impact of storing fewer word pairs. I am not 100% sure that the implementation is fully correct, but I think it is good enough to run some initial experiments.

It reduces the size of the indexed smol-wiki-articles-3_4.csv dataset from 3.49GB to 2.19GB, a reduction of roughly 37%. In combination with #3819, the index size drops from 4.19GB to 2.18GB. This means that the index size would be almost halved between v1.2 and v1.3.

For movies.json, we go from 212MB to 116MB.
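
The relative savings can be recomputed from the rounded sizes quoted above. This is only a small arithmetic sketch; `reduction_percent` is a hypothetical helper, and because the inputs are rounded, the results may differ slightly from figures computed on exact byte counts.

```python
def reduction_percent(before: float, after: float) -> float:
    """Return the relative size reduction, in percent, going from `before` to `after`."""
    return (before - after) / before * 100.0


# Sizes as quoted in this description (GB for the wiki dataset, MB for movies.json).
wiki_alone = reduction_percent(3.49, 2.19)     # this change on its own
wiki_combined = reduction_percent(4.19, 2.18)  # together with #3819
movies = reduction_percent(212, 116)

print(f"wiki (this PR): {wiki_alone:.1f}%")
print(f"wiki (with #3819): {wiki_combined:.1f}%")
print(f"movies.json: {movies:.1f}%")
```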

While I haven't launched any benchmarks yet, indexing also feels significantly faster to me.

Search latency should also benefit a lot from it. However, I don't know what the impact on relevancy will be.
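
To give a rough intuition for why a smaller maximum proximity means fewer word pairs, here is a minimal sketch. It is not the actual indexing code: `count_word_pairs` is a hypothetical function, and it assumes that a pair is recorded for every two word positions at most `max_proximity` apart, which may not match the exact boundary rule used by the engine.

```python
def count_word_pairs(num_words: int, max_proximity: int) -> int:
    """Count position pairs (i, j) in one document with 0 < j - i <= max_proximity.

    Assumption: one pair per (earlier position, later position) within the window.
    """
    total = 0
    for i in range(num_words):
        # Index of the last word that is still close enough to pair with word i.
        last = min(num_words - 1, i + max_proximity)
        total += last - i
    return total


doc_len = 1000  # hypothetical document length, in words
pairs_8 = count_word_pairs(doc_len, 8)
pairs_4 = count_word_pairs(doc_len, 4)
print(pairs_8, pairs_4, pairs_4 / pairs_8)
```

Under this simplified model, the number of pairs per document grows roughly linearly with the maximum proximity, so halving it roughly halves the pairs. The on-disk saving need not follow the same ratio, since an index stores more than word pairs and identical pairs can be shared across documents.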
