Unique terms not available in IndexReaderUtils #2052

djoerd · 2023-01-23T10:22:00Z

I wanted to know the number of unique terms in my index, but IndexReaderUtils reports -1.

Steps:
IndexCollection -collection TrecCollection -input /home/hiemstra/Data/robust04/ -index lucene-index.robust04.pos+docvectors -threads 16 -storePositions -storeDocvectors
IndexReaderUtils -stats -index lucene-index.robust04.pos+docvectors/

Results:
Index statistics
----------------
documents: 528030
documents (non-empty): 528030
unique terms: -1
total terms: 174540872

It turns out that "Terms.size(): (...) may be unavailable (returns -1) for some Terms implementations such as MultiTerms, where it cannot be efficiently computed."

I have already solved this myself and will add a pull request.

lintool · 2023-01-30T21:44:27Z

To get an accurate count of the vocab size, you have to use the -optimize flag, which merges all the index segments down into a single one.

djoerd · 2023-01-31T14:43:24Z

Thanks a lot (also for doing this on SIGIR deadline day!). That solves my problem. The -optimize flag is costly for a large index, though, so the PR may still be helpful. I checked, and it gives exactly the same number of unique terms for my (non-optimized) Robust04 index: 923436 unique terms. In other words, the Lucene term iterator seems to work correctly across multiple segments.
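For readers wondering why a multi-segment index cannot simply report the count: each segment keeps its own sorted term dictionary, and the same term can occur in several segments, so adding up the per-segment sizes overcounts. The sketch below is a toy model in plain Java, not Lucene's API and not the code from the PR. The class UniqueTermCounter, its methods, and the use of sorted string lists as stand-ins for segment term dictionaries are all assumptions made for illustration. It shows the general idea of counting by walking a merged, sorted view of the terms, which is roughly what iterating over a merged term enumeration amounts to.

```java
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Toy model (hypothetical names): each "segment" is a sorted, duplicate-free list of terms.
public class UniqueTermCounter {

  /** Adds up the per-segment sizes; overcounts terms that occur in more than one segment. */
  static long sumOfSegmentSizes(List<List<String>> segments) {
    long total = 0;
    for (List<String> segment : segments) {
      total += segment.size();
    }
    return total;
  }

  /** Counts distinct terms with a k-way merge over the sorted per-segment lists. */
  static long countUniqueTerms(List<List<String>> segments) {
    // The heap always exposes the cursor whose current term is smallest.
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
    for (List<String> segment : segments) {
      Iterator<String> it = segment.iterator();
      if (it.hasNext()) {
        heap.add(new Cursor(it));
      }
    }
    long unique = 0;
    String previous = null;
    while (!heap.isEmpty()) {
      Cursor cursor = heap.poll();
      // Terms arrive in sorted order, so duplicates are always adjacent.
      if (!cursor.current.equals(previous)) {
        unique++;
        previous = cursor.current;
      }
      if (cursor.advance()) {
        heap.add(cursor);
      }
    }
    return unique;
  }

  /** Position within one segment's sorted term list. */
  private static final class Cursor {
    final Iterator<String> it;
    String current;

    Cursor(Iterator<String> it) {
      this.it = it;
      this.current = it.next();
    }

    boolean advance() {
      if (it.hasNext()) {
        current = it.next();
        return true;
      }
      return false;
    }
  }

  public static void main(String[] args) {
    List<List<String>> segments = List.of(
        List.of("index", "lucene", "term"),
        List.of("lucene", "segment", "term"));
    System.out.println(sumOfSegmentSizes(segments)); // 6 (overcounts "lucene" and "term")
    System.out.println(countUniqueTerms(segments));  // 4
  }
}
```

The merge visits every term once per segment it appears in, so the cost grows with the total size of all term dictionaries. That is cheaper than rewriting the index with -optimize, but it is not a constant-time lookup.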

BTW, I forgot to remove the "terms" declaration from the original getIndexStats() method (I should have installed Eclipse right away).

djoerd mentioned this issue Jan 23, 2023

counts unique terms if not available in Lucene #2053

Closed
