Document boosting #4189

curquiza · 2023-11-02T14:37:22Z

Related product team resources: PRD (internal only)
Related spec: WIP

Motivation

Add promoting and boosting features to cater to e-commerce use cases and close the gap with competitors.

Usage

Refer to: https://www.notion.so/meilisearch/Document-boosting-API-usage-20aae06bc85e41dba828a90331a69f2c

TODO

Define scope of implementation
Release a prototype
If prototype validated, merge changes into main
Update the spec

Impacted teams

@meilisearch/docs-team @meilisearch/integration-team

curquiza · 2023-12-12T18:04:37Z

Sorry, this is not possible for v1.6.0; other priorities came first.
I am removing it from the milestone.

sandstrom · 2024-02-22T20:40:28Z

Is this something that will be similar to 'Function scoring' in ElasticSearch[1]? If not, I think you should consider it.

It's a powerful primitive that solves many types of problems in search:

"Advanced Rank/Sort"
"Scored Filter"
"Field Weighting"
"Promoting documents"

To give some concrete use-cases of the above:

Rank nearby restaurants/hotels/etc higher
Rank old records (or new records) higher
Reddit-style tradeoff, recency vs. many votes
Leniency in price-range queries (show some things outside the range, but only when they are close; taper heavily for those far outside)

It's one of those foundational features that will solve a lot of "long-tail problems".

When you are planning the roadmap, sometimes it's better to build one slightly larger feature that solves 10+ "smaller requests", than building 10 smaller features to solve individual problems.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html

cc @curquiza @Kerollmops

Kerollmops · 2024-02-27T10:06:26Z

Hello @sandstrom,

Is this something that will be similar to 'Function scoring' in ElasticSearch[1]? If not, I think you should consider it.

It is not the same. Meilisearch isn't a score-based search engine but rather a bucket-sort-based one. It doesn't rely on a global score by document to determine the order of the document but rather on buckets spilling into other smaller buckets. This algorithm is helpful when each ranking rule has a different level of importance.
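For illustration, here is a minimal Ruby sketch of the bucket-sort idea described above. It is a toy model under stated assumptions, not Meilisearch's actual implementation: the rule names (`typos`, `proximity`), the `bucket_sort` helper, and the sample documents are all hypothetical.

```ruby
# Toy bucket-sort ranking: each rule splits every existing bucket into
# smaller ordered buckets, so later rules only break ties left by earlier
# ones. Hypothetical rules and documents, not Meilisearch internals.

def bucket_sort(docs, rules)
  buckets = [docs]
  rules.each do |rule|
    buckets = buckets.flat_map do |bucket|
      # Group documents by the rule's value, then order groups (lower is better).
      bucket.group_by(&rule).sort_by { |key, _| key }.map { |_, group| group }
    end
  end
  buckets.flatten(1)
end

DOCS = [
  { id: 1, typos: 1, proximity: 1 },
  { id: 2, typos: 0, proximity: 5 },
  { id: 3, typos: 0, proximity: 2 },
  { id: 4, typos: 1, proximity: 0 },
].freeze

RULES = [
  ->(doc) { doc[:typos] },     # most important rule: fewer typos first
  ->(doc) { doc[:proximity] }, # only breaks ties inside a typo bucket
].freeze

ranked_ids = bucket_sort(DOCS, RULES).map { |doc| doc[:id] }
puts ranked_ids.inspect
```

In this sketch, document 4 has the best proximity value but still ranks below documents 3 and 2, because the first rule has already placed it in a later bucket. No single global score is ever computed.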

However, thanks to the Score Details feature, Meilisearch can output a global ranking score as a simple number. Our Hybrid Search system uses it to rank documents using the semantic similarity score (a number) and the keyword search score.

To give some concrete use-cases of the above [..]

It is already possible to "rank nearby restaurants/hotels/etc higher" with the Geo Sort feature. You can also already sort documents by ascending or descending creation date with the Search-side Sort or Ranking Rule Sort features.

Regarding the "Reddit-style tradeoff, recency vs. many votes", I agree that it is currently impossible without updating the documents regularly. However, even with the Elasticsearch Function Scoring feature, the document score depends on an external $now timestamp parameter to compute the final score. The formula would be something like this: doc.upvotes / (($now - doc.date) * magicScaling) 🤔
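As a rough sketch, the formula above could be written in Ruby as follows. The `reddit_score` helper, its keyword names, and the default scaling of 20.0 are assumptions made for illustration; the point is only that `now` has to be supplied from outside the document.

```ruby
# Reddit-style score from the formula in the comment above:
#   upvotes / ((now - date) * magic_scaling)
# Helper name, keyword names, and the default scaling are hypothetical.

def reddit_score(upvotes:, date:, now:, magic_scaling: 20.0)
  age = now - date
  raise ArgumentError, "age must be positive" if age <= 0

  upvotes / (age * magic_scaling)
end

NOW = 1_709_139_095 # Unix timestamp, passed in explicitly as the external $now

fresh = reddit_score(upvotes: 50, date: NOW - 3_600, now: NOW)        # 1 hour old
stale = reddit_score(upvotes: 500, date: NOW - 86_400 * 30, now: NOW) # 30 days old
puts fresh > stale
```

With these sample values, the one-hour-old document with 50 upvotes scores higher than the 30-day-old document with 500 upvotes, and both scores keep shrinking as `now` advances.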

About the "leniency in price-range queries", I understand the use case of showing documents that are outside a filtered range but not too far from it. However, wouldn't it be possible to widen the range slightly? This way, it shows more documents, and you decide on the distance gap. There is currently no way to do conditional filtering, and this feature will evolve that way. We first want to be able to apply filtering based on the query content: add specific documents in certain conditions...
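A minimal sketch of the "widen the range" workaround, building a filter expression on the client side. The `widened_range_filter` helper, the `price` attribute, the 10% default tolerance, and the clamp at zero (which assumes non-negative prices) are all hypothetical; the attribute would also need to be declared filterable on the index.

```ruby
# Build a filter expression for a numeric range widened by a tolerance.
# Helper name, attribute name, and default tolerance are hypothetical.
# Integer arithmetic keeps the generated bounds free of float noise.

def widened_range_filter(attribute, min, max, tolerance_percent: 10)
  margin = (max - min) * tolerance_percent / 100
  low = [min - margin, 0].max # assumes the value can never be negative
  high = max + margin
  "#{attribute} >= #{low} AND #{attribute} <= #{high}"
end

puts widened_range_filter("price", 100, 200)
```

Unlike a decay function, this treats every document inside the widened range the same way: it does not taper results that sit further from the original bounds.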

When you are planning the roadmap, sometimes it's better to build one slightly larger feature that solves 10+ "smaller requests", than building 10 smaller features to solve individual problems.

Thank you for your insights and have a great day 🍥

sandstrom · 2024-02-27T10:31:20Z

Thanks @Kerollmops for an extensive answer!

To be more specific, the main thing we are missing is some kind of decay functionality (in our case based on time, but could be a numerical value for others).

We currently use Elastic Search Decay Function Scores. See query below.

Having to run a scheduled job every day to calculate e.g. (($now - doc.date) * magicScaling), iterate over all docs (hundreds of thousands?), and store it as a property on each doc seems like a needlessly complicated step compared to having that computed at query time.

There also isn't an atomic update-all document REST endpoint available, afaik.

But maybe there is an inherent restriction in the bucket vs. score-based engine, that makes this very difficult or impossible?

// if you are curious, this is the ElasticSearch query we are using today

query_body = {
  'query' => {
    'bool' => {
      'must' => [{
        'function_score' => {
          'query' => {
            'multi_match' => {
              'query' => query,
              'type' => 'cross_fields',
              'fields' => ['title', 'reference', 'user'],
              'operator' => 'and',
            },
          },
          'functions' => [{
            # reduce score for older reports with exponential decay
            'exp' => {
              'submitted_at' => {
                'origin' => Time.now,
                'offset' => '30d', # no score reduction (i.e. modifier = 1.0) within 30 days
                'scale' => '730d', # 2 year falloff until `decay` value
                'decay' => 0.5, # modifier at `offset`+`scale`, i.e. 2y+30d old => modifier = 0.5
              },
            },
          }],
          'score_mode' => 'multiply', # multiply score modifier values for final score
        },
      }],
      'filter' => {
        'bool' => {
          'must' => [
            { 'term' => { 'company_id' => company.id } }, # scope to company
            { 'bool' => { 'must_not' => { 'exists' => { 'field' => 'archived_at' } } } }, # exclude archived
          ],
        },
      },
    },
  },
}
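To show what the `exp` function in the query above computes, here is a small standalone Ruby sketch of the exponential decay modifier as described in the Elasticsearch function_score documentation. The `exp_decay` helper is hypothetical, and ages are expressed in days for simplicity; the defaults mirror the `offset`, `scale`, and `decay` values from the query.

```ruby
# Exponential decay modifier:
#   modifier = exp(rate * max(0, |age| - offset))
#   rate     = ln(decay) / scale
# `exp_decay` is a hypothetical helper; all durations are in days.

def exp_decay(age_days, offset: 30.0, scale: 730.0, decay: 0.5)
  rate = Math.log(decay) / scale
  Math.exp(rate * [0.0, age_days.abs - offset].max)
end

[10, 30, 760, 1490].each do |age|
  puts format("%5d days old => modifier %.3f", age, exp_decay(age))
end
```

Documents up to 30 days old keep a modifier of 1.0, a document aged `offset + scale` (760 days) gets 0.5, and one aged `offset + 2 * scale` (1490 days) gets 0.25.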

Kerollmops · 2024-02-28T16:57:08Z

To be more specific, the main thing we are missing is some kind of decay functionality (in our case based on time, but could be a numerical value for others).
[..]
There also isn't an atomic update-all document REST endpoint available, afaik.

Indeed, there is no /update-all route or decay functionality in Meilisearch right now. Would you say that an /indexes/{name}/update route with a body that looks like the following would help you use Meilisearch and decay the document scores?

{
  "filter": "category = gaming", // only update this subset
  // inserts the new score and scoreUpdatedAt fields with the given computation in the filtered documents
  "formulas": {
    // Upvotes and timestamp are fields already in the documents
    "score": "upvotes / ((1709139095 - timestamp) * 20.0)",
    "scoreUpdatedAt": "1709139095"
  }
}
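Since the route above is only a proposal, the following Ruby sketch shows the equivalent work done on the client side by a scheduled job: select the filtered subset, compute the two fields, and build partial documents keyed by the primary key. The `score_updates` helper, the `id` primary key, and the sample documents are assumptions for illustration; the resulting array could then be sent through Meilisearch's existing add-or-update documents route.

```ruby
# Client-side equivalent of the proposed request body: filter, compute
# `score` and `scoreUpdatedAt`, emit partial documents.
# Helper name, primary key, and sample data are hypothetical.

NOW_TS = 1_709_139_095

def score_updates(docs, category:, now:, scaling: 20.0)
  docs
    .select { |doc| doc["category"] == category } # "filter": "category = gaming"
    .map do |doc|
      {
        "id" => doc["id"],
        "score" => doc["upvotes"] / ((now - doc["timestamp"]) * scaling),
        "scoreUpdatedAt" => now,
      }
    end
end

SAMPLE_DOCS = [
  { "id" => 1, "category" => "gaming", "upvotes" => 200, "timestamp" => NOW_TS - 100 },
  { "id" => 2, "category" => "music",  "upvotes" => 900, "timestamp" => NOW_TS - 50 },
].freeze

puts score_updates(SAMPLE_DOCS, category: "gaming", now: NOW_TS).inspect
```

Only the document in the `gaming` category receives new fields; the other document is left untouched, which matches the "only update this subset" intent of the proposed `filter` key.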

But maybe there is an inherent restriction in the bucket vs. score-based engine, that makes this very difficult or impossible?

No, I don't think so. You can define the right ranking rules to sort your documents and change the numeric field value accordingly. Nothing is inherently impossible or hard to do.

Have a great day 🍡

sandstrom · 2024-02-28T17:07:00Z

@Kerollmops Yes, it would!

As long as it isn't too expensive to run (we'd be running it on all docs in an index with ~1M rows and growing), this would do the job!

There is a slight win in having scores computed on-the-fly, in the sense that we'll only need to update a query to change the behavior (developer + ops ergonomics). But in the grander scheme of things, that's a small thing. We could easily set up a scheduled job that does this daily or weekly.

As an aside, I've followed MeiliSearch since ~2001 -- wrote about this back then. It's a great project and we would love to use Meili! This has been the only remaining blocker for a while now, that keeps us from dropping ElasticSearch.

pepijn-vanvlaanderen · 2024-03-12T13:43:30Z

Is there an eta on the implementation of this feature?

macraig · 2024-03-13T21:00:51Z

Is there an eta on the implementation of this feature?

Hi @pepijn-vanvlaanderen, we had to deprioritize the feature in favor of more urgent work, so we don't have an ETA yet. You're welcome to add your use case and vote to the Function Scoring discussion if it aligns with your needs, or you can start a new discussion.

Tadaz · 2024-03-22T17:20:22Z

This is an important feature for our organization as well. Without it, our search results become less and less relevant. Our website is built on Laravel and we love that it is compatible with Meilisearch. Meilisearch lets us store our search engine's database locally; because of our privacy policy, we can't store it in the cloud.

Anyway, we have so-called publication documents and for our search results to stay relevant we need to introduce time decay on the following parameters:

Publication date (older publications are less relevant)
Authors' resignation/retirement dates (the longer an author is not with our company anymore, the less relevant their publications are to our clients)
Average daily readership (most read publications should be more relevant)

As per @sandstrom's comments above, document promotion would be very useful as well. Sometimes some publications are more important than others and we would like to promote them.

With the current filtering and sorting mechanism, this is simply not achievable. If we manage to get newer results, they are less relevant query-wise; if we manage to get more relevant results, most of them are old.

Sadly, if this functionality is not something that will be prioritized in the near future, we will need to think of other alternatives.

Thank you for your time reading this.

curquiza · 2024-03-25T17:35:41Z

Thank you @Tadaz for your feedback, I informed the product team 👌

Kerollmops · 2024-05-09T22:24:59Z

Hey @sandstrom and @Tadaz 👋

I just released a first prototype of a way to edit documents by using a Rhai function. You can read more on the Public Usage page. Please tell me what you think. The documentation could be improved, but it's only a first prototype.

Have a nice day 🍾

sandstrom · 2024-05-10T05:58:05Z

Sounds interesting!

However the docs link shows a login screen. Still locked?

Kerollmops · 2024-05-10T07:38:08Z

Sorry @sandstrom, fixed it 🤭

sandstrom · 2024-05-10T15:43:19Z

@Kerollmops Looks great!

Using Rhai seems like a very good idea.

I've only read the docs (not tried it), but this should do the trick.

I'm away on a short weekend trip, so I cannot easily put together a quick test on my computer right now, but I'll try to get someone on the team to evaluate this approach and hopefully switch over to Meili!

curquiza added this to the v1.6.0 milestone Nov 2, 2023

curquiza changed the title ~~Document boostin~~ Document boosting Nov 2, 2023

Kerollmops linked a pull request Nov 8, 2023 that will close this issue

Introduce document boosting with the new boostingFilter search parameter #4199

Draft

curquiza closed this as completed Dec 12, 2023

curquiza removed this from the v1.6.0 milestone Dec 12, 2023

curquiza reopened this Dec 12, 2023
