
Update BM25 algorithm with term matching and field normalization #190

Open
wants to merge 1 commit into master

Conversation

agologan commented Jul 6, 2019

This fixes the way the BM25 implementation works and introduces field normalization, both inspired by Elasticsearch.
It breaks fuzzy matching, but if this looks like a change in the right direction I'm open to completing it.

nticaric (Contributor) commented Jul 7, 2019

First of all, thanks for this PR. It clearly took a lot of effort to create. Let's first focus a bit on the BM25 algorithm. You changed some of the weights and reintroduced the document length variable. Earlier versions already had the document length variable, but ignoring it turned out to give better results. In theory it's good to have, but practice showed otherwise. There was a blog post from Sphinx Search that covered why ignoring it is a good choice. Maybe you could point to some Elasticsearch resources so we can look into it further.

agologan (Author) commented Jul 7, 2019

I wrote these changes because we have an internal project that uses TNTSearch, and our result ranking was oddly off in some cases.

I wasn't sure why, but after reading the implementation I realized the BM25 ranking was different from what I expected.
I read a few papers and checked whether current projects still use BM25 and whether they had changed anything about it.
After reading https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables I implemented a few changes, which ended up producing quite noticeable improvements.
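
For reference, the per-term score that post describes has roughly this shape (a minimal sketch, not this PR's code; the function name is mine, and k1 = 1.2, b = 0.75 are Elasticsearch's documented defaults):

```php
<?php

// Minimal sketch of the BM25 term score described in the Elasticsearch post.
// Setting $b = 0 removes document-length normalization entirely, which is
// roughly what ignoring document length amounts to.
function bm25TermScore(
    float $tf,        // term frequency in the document
    int $docCount,    // total documents in the index
    int $docFreq,     // documents containing the term
    float $docLen,    // length of this document, in terms
    float $avgDocLen, // average document length across the index
    float $k1 = 1.2,
    float $b = 0.75
): float {
    $idf  = log(1 + ($docCount - $docFreq + 0.5) / ($docFreq + 0.5));
    $norm = $k1 * (1 - $b + $b * ($docLen / $avgDocLen));

    return $idf * ($tf * ($k1 + 1)) / ($tf + $norm);
}
```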

I haven't read the Sphinx blog post you're referring to, but I can imagine why document length can be ignored in some cases. In my case, keeping it produces better results.

To be honest, document length improved the result ranking a bit, but what improved it even more was field normalization. This really depends on the data set, but having matches in fields like category or title, which are usually much shorter than the document body, rank higher is very important to us.
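
To illustrate the effect (an illustrative sketch only, not this PR's implementation): if each field is scored against its own length, the same term frequency counts for more in a short field such as title or category than in a long body.

```php
<?php

// Illustrative only: score a term per field, using that field's own length
// and average length, so one hit in a 5-term title outweighs one hit in a
// 500-term body. Builds on the bm25TermScore() sketch above; $fields maps
// field name => ['tf' => ..., 'len' => ..., 'avgLen' => ...].
function fieldNormalizedScore(array $fields, int $docCount, int $docFreq): float
{
    $score = 0.0;

    foreach ($fields as $name => $stats) {
        if ($stats['tf'] > 0) {
            $score += bm25TermScore(
                $stats['tf'],
                $docCount,
                $docFreq,
                $stats['len'],
                $stats['avgLen']
            );
        }
    }

    return $score;
}
```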

I changed the weights because the defaults Elastic uses seemed more reasonable for general-purpose usage, based on the documents I read.

nticaric (Contributor) commented Jul 7, 2019

Do you think we could reach a compromise by creating another class that extends the current one and contains those changes? That way the end user could choose and see what works best for them.

agologan (Author) commented Jul 7, 2019

I think we can leave this as is, and anybody who needs to change the default behavior can be pointed to this PR.
This PR exists because I found these changes useful, but they may not necessarily be the best defaults for this project.
For example, I found it strange that boolean search doesn't do ranking by default, and that ranked results don't provide an option to match all terms, but that's just me and my use case.

These changes also introduce a fair bit of complexity and a performance penalty for somebody who just wants to search a list of cities or other small objects.
For engineers looking for a simple document-search solution they can be beneficial, but anyone who understands the changes I'm proposing here may be better off using a full-fledged solution like Solr, Sphinx, or Elasticsearch.

nticaric (Contributor) commented Jul 7, 2019

Your changes are good and could be useful to a lot of people; I'll try to find a way to include them in the package without introducing breaking changes.
My current thinking is to write a class that extends the TNTSearch class and includes those changes.

I can't think of a compact name for the class, but something like TNTSearchWithFieldWeighting. I'm open to suggestions.

The indexer could stay as is, with only an option that runs the two functions saveHitList and saveDocInfo. This wouldn't affect performance for the default behavior.

And if someone uses the TNTSearchWithFieldWeighting class on an index without the required tables and data, we could simply throw an exception telling them to reindex with the option enabled.
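
Roughly, the extension could look something like this (a sketch only; hasFieldWeightingTables() is a placeholder for whatever check the real index needs, and selectIndex() is assumed as the hook point):

```php
<?php

use TeamTNT\TNTSearch\TNTSearch;

// Rough sketch of the proposed extension, not working code. The check in
// hasFieldWeightingTables() is a placeholder for whatever tables or columns
// saveHitList() and saveDocInfo() actually create during indexing.
class TNTSearchWithFieldWeighting extends TNTSearch
{
    public function selectIndex($indexName)
    {
        parent::selectIndex($indexName);

        if (!$this->hasFieldWeightingTables()) {
            throw new \RuntimeException(
                "Index '{$indexName}' has no field-weighting data. "
                . "Reindex with the field-weighting option enabled."
            );
        }
    }

    protected function hasFieldWeightingTables(): bool
    {
        // Placeholder: inspect the index for the extra data written by
        // saveHitList() and saveDocInfo().
        return true;
    }
}
```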

agologan (Author) commented Jul 7, 2019

There's another question: how do others use TNTSearch?
Our engineer picked it because he was familiar with Laravel, and this solution, via the Scout driver, looked like the easiest to integrate into the project he was starting.
If used directly, extending via a subclass should be easy enough, but if used through Scout it probably needs some flag in the config.

A better way, though, would be to use composition and/or feature flagging.
Instead of subclassing, adding a flag to the config and then branching into functions would keep all the code in the same context, encouraging reuse and refactoring.

This PR introduces three major changes, and if you agree we could create three config flags that change both the indexer and the search algorithm as needed (see the sketch after this list):

  • document length weights
  • field length weights
  • match all terms for ranked results
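
Something along these lines is what I have in mind (the flag names are placeholders, not existing TNTSearch or Scout options):

```php
<?php

// Placeholder flag names, not an existing TNTSearch/Scout option set.
// One code path branches on the flags instead of splitting into subclasses.
$config = [
    'bm25' => [
        'document_length' => true,  // use document length in the score
        'field_length'    => true,  // normalize per field (title, category, ...)
        'match_all_terms' => false, // drop ranked results missing any query term
    ],
];

// Example of branching on one flag when combining per-term scores.
function combineTermScores(array $termScores, array $bm25Config): float
{
    if ($bm25Config['match_all_terms'] && in_array(0.0, $termScores, true)) {
        return 0.0; // at least one query term is missing from this document
    }

    return array_sum($termScores);
}
```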

detain added a commit to detain/tntsearch that referenced this pull request Dec 12, 2019