Unique terms not available in IndexReaderUtils #2052
Comments
To get an accurate count of the vocab size, you have to use the |
Thanks a lot (also for doing this on SIGIR deadline day!). That solves my problem. The optimize flag is costly for a large index, though, so the PR may still be helpful. I checked, and it gives exactly the same number of unique terms for my (non-optimized) Robust04 index: 923436 unique terms, i.e., the Lucene term iterator seems to work correctly across multiple segments. BTW, I forgot to remove the "terms" declaration from the original getIndexStats() method (should have installed Eclipse directly).
I want to know the number of unique terms in my index and got: -1
Steps:
IndexCollection -collection TrecCollection -input /home/hiemstra/Data/robust04/ -index lucene-index.robust04.pos+docvectors -threads 16 -storePositions -storeDocvectors
IndexReaderUtils -stats -index lucene-index.robust04.pos+docvectors/
Results:
Index statistics
----------------
documents: 528030
documents (non-empty): 528030
unique terms: -1
total terms: 174540872
Turns out that: "Terms.size(): (...) may be unavailable (returns -1) for some Terms implementations such as MultiTerms where it cannot be efficiently computed."
I already solved this myself: I will add a pull request.
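For anyone who lands here with the same -1: below is a minimal, self-contained sketch of the general idea, which is to trust size() when it is available and otherwise count by walking the term iterator. It is not the code from the pull request and does not call Lucene. TermSource, multiSegment and countUniqueTerms are hypothetical names made up for illustration, and the merged in-memory view only imitates a multi-segment reader that reports its size as -1.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class UniqueTermCount {

    /** Hypothetical stand-in for a term dictionary; this is NOT Lucene's Terms class. */
    public interface TermSource {
        /** Number of unique terms, or -1 if it cannot be computed cheaply. */
        long size();

        /** Iterates over the unique terms in sorted order. */
        Iterator<String> iterator();
    }

    /** Uses size() when it is available, otherwise counts by iterating over all terms. */
    public static long countUniqueTerms(TermSource terms) {
        long size = terms.size();
        if (size >= 0) {
            return size;
        }
        long count = 0;
        Iterator<String> it = terms.iterator();
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    /** Assumed multi-segment view: merges the segments' terms and reports size as -1. */
    public static TermSource multiSegment(final List<List<String>> segments) {
        return new TermSource() {
            @Override
            public long size() {
                return -1; // mimics a merged view where the count is not precomputed
            }

            @Override
            public Iterator<String> iterator() {
                // A sorted set removes terms that occur in more than one segment.
                TreeSet<String> merged = new TreeSet<>();
                for (List<String> segment : segments) {
                    merged.addAll(segment);
                }
                return merged.iterator();
            }
        };
    }

    public static void main(String[] args) {
        TermSource terms = multiSegment(Arrays.asList(
                Arrays.asList("index", "lucene", "term"),
                Arrays.asList("lucene", "segment")));
        System.out.println("unique terms: " + countUniqueTerms(terms)); // prints "unique terms: 4"
    }
}
```

The fallback costs one full pass over the term dictionary, so it is slower than reading a stored count, but it avoids having to optimize the index first.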