-
-
Notifications
You must be signed in to change notification settings - Fork 292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update bm25 algorithm with term matching and field normalization #190
base: master
Are you sure you want to change the base?
Conversation
First of all, thanks for this PR. It surely took a lot of effort to create it. Lets first focus a little bit on the BM25 algorithm. You changed some of the weights and introduced the document length variable again. In earlier versions we already had the document length variable, but it turned out better by ignoring it. In theory it's good to have it, but the praxis showed something different. There was a blog post from sphinx search that covered the reason why ignoring it is a good choice. Maybe you could point to some elasticsearch resources to have a further look on it |
Wrote these changes because we have an internal project that uses tntsearch and our result ranking was weirdly off in some cases. Wasn't sure why but after reading the implementation I realized the BM25 ranking implementation was different than what I expected. I have not read the Sphinx blog post you're referring to, but I can imagine why document length can be ignored in some cases. In my case it produces better results. To be honest document length improved the result ranking a bit but what improved result ranking even more was field normalization. This really depends on the data set you're using but having matches rank higher by fields like category or title which are usually much smaller than the document body is very important to us. The weights I changed because their defaults elastic uses seem a bit more reasonable based on the documents I read so for general purpose usage. |
Do you think we could come up with a compromise by creating another class that would extend the current one and have those changes in it. So the end user would be able to choose and see what works best for him |
I think we can leave this as is and if anybody needs to change the default behavior they can be pointed to this PR. These changes also introduce a fair bit of complexity and a performance penalty for somebody looking to just search a list of cities or some other smaller objects. |
You're changes are good and could be useful to a lot of people, I'll try to find a way to include them into the package without introducing breaking changes. Cannot think of a compact name for the class but something like The indexer could stay as is only with an option that would run the 2 functions and if someone uses the |
There's another question on how others use tntsearch ? Also a better way would be to use composition and/or feature flagging. This PR introduces 3 major changes and if you agree we could create 3 config flags which change both the indexer and the search algorithm as needed:
|
This fixes the way BM25 implementation works and introduces field normalization all inspired by Elasticsearch.
This breaks fuzzy matching but if this looks like a change in the right direction I'm open to completing this change.