
Unique terms not available in IndexReaderUtils #2052

Open
djoerd opened this issue Jan 23, 2023 · 2 comments

djoerd commented Jan 23, 2023

I wanted to know the number of unique terms in my index, but IndexReaderUtils reports -1.

Steps:
IndexCollection -collection TrecCollection -input /home/hiemstra/Data/robust04/ -index lucene-index.robust04.pos+docvectors -threads 16 -storePositions -storeDocvectors
IndexReaderUtils -stats -index lucene-index.robust04.pos+docvectors/

Results:
Index statistics
----------------
documents: 528030
documents (non-empty): 528030
unique terms: -1
total terms: 174540872

It turns out that this is documented Lucene behaviour: "Terms.size(): (...) may be unavailable (returns -1) for some Terms implementations such as MultiTerms where it cannot be efficiently computed."

I already solved this myself: I will add a pull request.

Member

lintool commented Jan 30, 2023

To get an accurate count of the vocab size, you have to use the -optimize flag, which merges all the index segments down into a single one.

Author

djoerd commented Jan 31, 2023

Thanks a lot (also for doing this on SIGIR deadline day!): That solves my problem. The optimize flag is costly for a large index though, so the PR may still be helpful. I checked and it gives the exact same number of unique terms for my (non-optimized) Robust04 index: 923436 unique terms, i.e., the Lucene term iterator seems to work correctly on multiple segments.

BTW, I forgot to remove the "terms" declaration from the original getIndexStats() method (should have installed Eclipse directly).
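
For readers wondering why the count is unavailable on a multi-segment index but can still be recovered by iterating, here is a minimal sketch in plain Java. It does not use Lucene and is not the code from the pull request: the class SegmentTermCounter and its methods are hypothetical names, and each "segment" is simply modelled as a sorted, duplicate-free list of terms. The point it illustrates is that per-segment sizes cannot be summed, because the same term may occur in several segments, whereas a merged iteration over all segments counts each term once. In Lucene the corresponding walk would presumably go through Terms.iterator() and TermsEnum.next().

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model of a multi-segment index; every name here is illustrative, not Anserini or Lucene API. */
final class SegmentTermCounter {

    /** One cursor per segment, ordered by the term it currently points at. */
    private static final class Cursor implements Comparable<Cursor> {
        final List<String> terms;
        int pos = 0;

        Cursor(List<String> terms) { this.terms = terms; }

        String current() { return terms.get(pos); }

        @Override
        public int compareTo(Cursor other) { return current().compareTo(other.current()); }
    }

    /** Naive estimate: sum of per-segment sizes. Over-counts terms shared by several segments. */
    static long sumOfSegmentSizes(List<List<String>> segments) {
        long total = 0;
        for (List<String> segment : segments) total += segment.size();
        return total;
    }

    /** Exact count: k-way merge over the sorted segments, counting each distinct term once. */
    static long countUniqueTerms(List<List<String>> segments) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (List<String> segment : segments) {
            if (!segment.isEmpty()) queue.add(new Cursor(segment));
        }
        long unique = 0;
        String previous = null;
        while (!queue.isEmpty()) {
            Cursor cursor = queue.poll();          // cursor holding the smallest current term
            String term = cursor.current();
            if (!term.equals(previous)) {          // first time this term is seen in merged order
                unique++;
                previous = term;
            }
            cursor.pos++;                          // advance and re-queue if the segment has more terms
            if (cursor.pos < cursor.terms.size()) queue.add(cursor);
        }
        return unique;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("index", "lucene", "term"),
            Arrays.asList("lucene", "segment", "term"));
        System.out.println("sum of segment sizes: " + sumOfSegmentSizes(segments));
        System.out.println("unique terms: " + countUniqueTerms(segments));
    }
}
```

With the two toy segments in main, the summed sizes give 6 while the merged iteration gives 4, since "lucene" and "term" appear in both segments. The merge has to visit every term, which is consistent with the size not being "efficiently computed" up front; merging everything into one segment with -optimize removes the overlap instead.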
