Better implementation of JsonVectorCollection than the "fake words" approach #1890

lintool · 2022-05-29T21:47:15Z

To index JsonVectorCollection sparse vectors, we currently use the "fake words" trick: each term is simply duplicated X times, where X is its score. DelimitedTermFrequencyTokenFilter might be a better solution: https://lucene.apache.org/core/8_11_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilter.html
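
To make the difference concrete, here is a minimal sketch (not Anserini code) of the two text encodings for one sparse vector. It assumes the weights are already quantized to positive integers; the function names `fake_words` and `delimited_term_freqs` are hypothetical. The `|` delimiter matches the default of DelimitedTermFrequencyTokenFilter, which reads the part after the delimiter as the term frequency.

```python
from typing import Dict


def fake_words(vector: Dict[str, int]) -> str:
    """Current trick: repeat each term `weight` times."""
    return " ".join(
        " ".join([term] * weight)
        for term, weight in vector.items()
        if weight > 0
    )


def delimited_term_freqs(vector: Dict[str, int], delimiter: str = "|") -> str:
    """Proposed encoding: one `term|weight` token per term."""
    return " ".join(
        f"{term}{delimiter}{weight}"
        for term, weight in vector.items()
        if weight > 0
    )


# Hypothetical example vector with integer (quantized) weights.
vector = {"lucene": 3, "index": 1, "sparse": 2}

print(fake_words(vector))            # lucene lucene lucene index sparse sparse
print(delimited_term_freqs(vector))  # lucene|3 index|1 sparse|2
```

The fake-words string grows with the sum of the weights, while the delimited form has one token per distinct term, so the analyzer would process far fewer tokens for the same term frequencies. The tokenizer in front of the filter must not split on the delimiter.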

tteofili · 2022-05-30T06:38:31Z

I ran an experiment on this a while ago; for some reason, the behavior is not exactly the same as the fake-words (fw) one, as I had expected it to be. I can dig into this.

JMMackenzie · 2022-06-01T23:44:55Z

Possibly different due to #1843?

JMMackenzie mentioned this issue Mar 22, 2023

Linking to Anserini "FakeWords" Issue thongnt99/learned-sparse-retrieval#4

Open

thongnt99 mentioned this issue Mar 23, 2023

Faster indexing for learned sparse retrieval #2080

Open
