Document boosting #4189

Open
4 tasks
curquiza opened this issue Nov 2, 2023 · 14 comments · May be fixed by #4199
Labels
enhancement New feature or improvement impacts docs This issue involves changes in the Meilisearch's documentation impacts integrations This issue involves changes in the Meilisearch's integrations missing usage in PRD Description of the feature usage is missing in the PRD

Comments

@curquiza
Member

curquiza commented Nov 2, 2023

Related product team resources: PRD (internal only)
Related spec: WIP

Motivation

Add Promoting and boosting features to cater for e-commerce use cases and close the competitor gap.

Usage

Refer to: https://www.notion.so/meilisearch/Document-boosting-API-usage-20aae06bc85e41dba828a90331a69f2c

TODO

  • Define scope of implementation
  • Release a prototype
  • If prototype validated, merge changes into main
  • Update the spec

Impacted teams

@meilisearch/docs-team @meilisearch/integration-team

@curquiza curquiza added enhancement New feature or improvement impacts docs This issue involves changes in the Meilisearch's documentation impacts integrations This issue involves changes in the Meilisearch's integrations missing usage in PRD Description of the feature usage is missing in the PRD labels Nov 2, 2023
@curquiza curquiza added this to the v1.6.0 milestone Nov 2, 2023
@curquiza curquiza changed the title from "Document boostin" to "Document boosting" Nov 2, 2023
@curquiza
Member Author

Sorry, this is not possible for v1.6.0; other priorities took precedence.
I am removing it from the milestone.

@curquiza curquiza removed this from the v1.6.0 milestone Dec 12, 2023
@curquiza curquiza reopened this Dec 12, 2023
@sandstrom
Contributor

Is this something that will be similar to 'Function scoring' in ElasticSearch[1]? If not, I think you should consider it.

It's a powerful primitive that solves many types of problems in search:

  • "Advanced Rank/Sort"
  • "Scored Filter"
  • "Field Weighting"
  • "Promoting documents"

To give some concrete use-cases of the above:

  • Rank nearby restaurants/hotels/etc higher
  • Rank old records (or new records) higher
  • Reddit-style tradeoff, recency vs. many votes
  • Leniency in price-range queries (show some things outside the range, but only when they are close; taper heavily for those far outside)

It's one of those foundational features that will solve a lot of "long-tail problems".

When you are planning the roadmap, sometimes it's better to build one slightly larger feature that solves 10+ "smaller requests", than building 10 smaller features to solve individual problems.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html

cc @curquiza @Kerollmops

@Kerollmops
Member

Kerollmops commented Feb 27, 2024

Hello @sandstrom,

Is this something that will be similar to 'Function scoring' in ElasticSearch[1]? If not, I think you should consider it.

It is not the same. Meilisearch isn't a score-based search engine but rather a bucket-sort-based one. It doesn't rely on a global score by document to determine the order of the document but rather on buckets spilling into other smaller buckets. This algorithm is helpful when each ranking rule has a different level of importance.
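The bucket-sort idea described above can be sketched as follows. This is a minimal illustration, not Meilisearch's actual implementation: the `Doc` fields, the three rules, and the `bucket_sort` helper are all hypothetical names chosen for the example.

```ruby
# Hypothetical documents with precomputed per-rule values (illustrative only).
Doc = Struct.new(:id, :matched_words, :typos, :upvotes)

docs = [
  Doc.new(1, 2, 0, 10),
  Doc.new(2, 2, 1, 500),
  Doc.new(3, 1, 0, 900),
  Doc.new(4, 2, 0, 40),
]

# Each rule maps a document to a sort key; lower keys rank first.
# Rules are listed from most to least important.
RULES = [
  ->(d) { -d.matched_words }, # more matched query words first
  ->(d) { d.typos },          # fewer typos first
  ->(d) { -d.upvotes },       # custom numeric field, descending
].freeze

# Split the documents into ordered buckets with the first rule, then
# recurse into each bucket with the remaining rules.
def bucket_sort(docs, rules)
  return docs if rules.empty? || docs.size <= 1

  rule, *rest = rules
  docs.group_by(&rule)
      .sort_by { |key, _bucket| key }
      .flat_map { |_key, bucket| bucket_sort(bucket, rest) }
end

ranked_ids = bucket_sort(docs, RULES).map(&:id)
puts ranked_ids.inspect
```

Document 3 has by far the most upvotes but still ranks last, because a later rule only reorders documents inside a bucket produced by the earlier rules. This is the key difference from a single global score.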

However, thanks to the Score Details feature, Meilisearch can output a global ranking score as a simple number. Our Hybrid Search system uses it to rank documents using the semantic similarity score (a number) and the keyword search score.

To give some concrete use-cases of the above [..]

It is already possible to Rank nearby restaurants/hotels/etc higher with the Geo Sort feature. You can already sort documents by asc/desc creation date with the Search-side Sort or Ranking Rule Sort features.

Regarding the Reddit-style tradeoff (recency vs. many votes), I agree that it is currently impossible without updating the documents regularly. However, even with the Elasticsearch Function Scoring feature, the document score depends on an external $now timestamp parameter to compute the final score. The formula should be something like this: doc.upvotes / (($now - doc.date) * magicScaling) 🤔
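As a rough worked example of that formula, the sketch below computes the score for two documents with equal upvotes but different ages. The scaling constant, the timestamp, and the `recency_score` helper are assumptions made for illustration; the comment above leaves `magicScaling` unspecified.

```ruby
# Hypothetical scaling constant (not specified in the thread).
MAGIC_SCALING = 20.0

# score = upvotes / (age_in_seconds * scaling)
# `now` is passed explicitly because the score depends on an external clock.
def recency_score(upvotes, doc_timestamp, now, scaling = MAGIC_SCALING)
  age = now - doc_timestamp
  raise ArgumentError, "document timestamp must be before `now`" unless age.positive?

  upvotes / (age * scaling)
end

now = 1_709_139_095 # arbitrary Unix timestamp used as `$now`
day = 86_400

fresh = recency_score(100, now - day, now)      # one day old
stale = recency_score(100, now - 30 * day, now) # thirty days old

puts fresh > stale
```

Because `now` is an input, the stored score goes stale as time passes, which is why the thread discusses either recomputing it on a schedule or computing it at query time.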

About the leniency in price-range queries: I understand the use case of showing documents that fall outside a filtered range but not too far from it. However, wouldn't it be possible to widen the range slightly? This way, more documents are shown, and you decide on the distance gap. There is currently no way to do conditional filtering, and this feature will evolve that way. We first want to be able to apply filtering based on the query content: add specific documents in certain conditions...
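A minimal sketch of the "widen the range" workaround, assuming a numeric `price` attribute that has been declared filterable. The helper name and the `gap` parameter are hypothetical; only the resulting filter expression uses Meilisearch's comparison and `AND` operators.

```ruby
# Build a filter expression for [min, max] widened by an absolute gap
# on both sides. The lower bound is clamped at zero.
def lenient_price_filter(min, max, gap:, field: 'price')
  low = [min - gap, 0].max
  high = max + gap
  "#{field} >= #{low} AND #{field} <= #{high}"
end

filter = lenient_price_filter(50, 100, gap: 5)
puts filter
```

Unlike a decay function, this treats every document inside the widened range equally; it does not taper results that are further from the requested range.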

When you are planning the roadmap, sometimes it's better to build one slightly larger feature that solves 10+ "smaller requests", than building 10 smaller features to solve individual problems.

Thank you for your insights and have a great day 🍥

@sandstrom
Contributor

sandstrom commented Feb 27, 2024

Thanks @Kerollmops for an extensive answer!

To be more specific, the main thing we are missing is some kind of decay functionality (in our case based on time, but could be a numerical value for others).

We currently use Elasticsearch decay function scores. See query below.

Having to run a scheduled job every day to calculate e.g. ($now - doc.date) * magicScaling, iterate over all docs (hundreds of thousands?), and store it as a property on each doc seems like a needlessly complicated step compared with having it computed at query time.

There also isn't an atomic update-all document REST endpoint available, afaik.

But maybe there is an inherent restriction in the bucket vs. score-based engine, that makes this very difficult or impossible?

// if you are curious, this is the ElasticSearch query we are using today

query_body = {
  'query' => {
    'bool' => {
      'must' => [{
        'function_score' => {
          'query' => {
            'multi_match' => {
              'query' => query,
              'type' => 'cross_fields',
              'fields' => ['title', 'reference', 'user'],
              'operator' => 'and',
            },
          },
          'functions' => [{
            # reduce score for older reports with exponential decay
            'exp' => {
              'submitted_at' => {
                'origin' => Time.now,
                'offset' => '30d', # no score reduction (i.e. modifier = 1.0) within 30 days
                'scale' => '730d', # 2 year falloff until `decay` value
                'decay' => 0.5, # modifier at `offset`+`scale`, i.e. 2y+30d old => modifier = 0.5
              },
            },
          }],
          'score_mode' => 'multiply', # multiply score modifier values for final score
        },
      }],
      'filter' => {
        'bool' => {
          'must' => [
            { 'term' => { 'company_id' => company.id } }, # scope to company
            { 'bool' => { 'must_not' => { 'exists' => { 'field' => 'archived_at' } } } }, # exclude archived
          ],
        },
      },
    },
  },
}
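For readers unfamiliar with the `exp` function in the query above, the sketch below reproduces the multiplier it computes, following the formula in the Elasticsearch function score documentation: `exp(lambda * max(0, |value - origin| - offset))` with `lambda = ln(decay) / scale`. The `exp_decay` helper and the sample timestamps are illustrative, and durations are expressed in seconds rather than Elasticsearch's `30d` notation.

```ruby
SECONDS_PER_DAY = 86_400

# Exponential decay multiplier: 1.0 within `offset` of `origin`,
# and exactly `decay` at a distance of `offset + scale`.
def exp_decay(value, origin:, offset:, scale:, decay:)
  rate = Math.log(decay) / scale
  distance = [0, (value - origin).abs - offset].max
  Math.exp(rate * distance)
end

now = 1_709_139_095 # arbitrary Unix timestamp standing in for Time.now
params = {
  origin: now,
  offset: 30 * SECONDS_PER_DAY,  # '30d'
  scale: 730 * SECONDS_PER_DAY,  # '730d'
  decay: 0.5,
}

recent   = exp_decay(now - 10 * SECONDS_PER_DAY, **params)   # inside the offset
at_scale = exp_decay(now - 760 * SECONDS_PER_DAY, **params)  # offset + scale
older    = exp_decay(now - 1490 * SECONDS_PER_DAY, **params) # offset + 2 * scale

puts [recent, at_scale, older].inspect
```

With `score_mode` set to `multiply`, this value scales the text-relevance score, so a report that is two years and thirty days old keeps half of its score.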

@Kerollmops
Member

To be more specific, the main thing we are missing is some kind of decay functionality (in our case based on time, but could be a numerical value for others).
[..]
There also isn't an atomic update-all document REST endpoint available, afaik.

Indeed, there is no /update-all route or decay functionality in Meilisearch right now. Would you say that an /indexes/{name}/update route with a body that looks like the following would help you use Meilisearch and decay the document scores?

{
  "filter": "category = gaming", // only update this subset
  // inserts the new score and scoreUpdatedAt fields with the given computation in the filtered documents
  "formulas": {
    // Upvotes and timestamp are fields already in the documents
    "score": "upvotes / ((1709139095 - timestamp) * 20.0)",
    "scoreUpdatedAt": "1709139095"
  }
}
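Until such a route exists, the same effect can be approximated client-side. The sketch below applies the proposed `filter` and `formulas` semantics to an in-memory list of documents and builds a partial-update payload. The sample documents and variable names are hypothetical, and the sketch assumes each document has an `id` primary key; sending the payload (for example through the existing add-or-update documents route) is left out.

```ruby
require 'json'

NOW = 1_709_139_095 # same timestamp as in the proposed body
SCALING = 20.0

# Hypothetical documents already stored in the index.
docs = [
  { 'id' => 1, 'category' => 'gaming', 'upvotes' => 120, 'timestamp' => NOW - 3_600 },
  { 'id' => 2, 'category' => 'gaming', 'upvotes' => 120, 'timestamp' => NOW - 86_400 },
  { 'id' => 3, 'category' => 'music',  'upvotes' => 999, 'timestamp' => NOW - 60 },
]

# Equivalent of "filter": "category = gaming" followed by the two formulas.
# Only the primary key and the computed fields are included, so the payload
# can be used as a partial update.
updates = docs
  .select { |d| d['category'] == 'gaming' }
  .map do |d|
    {
      'id' => d['id'],
      'score' => d['upvotes'] / ((NOW - d['timestamp']) * SCALING),
      'scoreUpdatedAt' => NOW,
    }
  end

payload = JSON.generate(updates)
puts payload
```

Running this as a scheduled job is the approach discussed in the thread; the proposed route would move the same computation into the engine and avoid transferring every document.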

But maybe there is an inherent restriction in the bucket vs. score-based engine, that makes this very difficult or impossible?

No, I don't think so. You can define the right ranking rules to sort your documents and change the numeric field value accordingly. Nothing is inherently impossible or hard to do.

Have a great day 🍡

@sandstrom
Contributor

sandstrom commented Feb 28, 2024

@Kerollmops Yes, it would!

As long as it isn't too expensive to run (we'd be running it on all docs in an index with ~1M rows and growing), this would do the job!

There is a slight win in having scores computed on the fly, in the sense that we'd only need to update a query to change the behavior (developer + ops ergonomics). But in the grander scheme of things, that's a small thing. We could easily set up a scheduled job that does this daily or weekly.


As an aside, I've followed MeiliSearch since ~2001 -- wrote about this back then. It's a great project and we would love to use Meili! This has been the only remaining blocker for a while now, that keeps us from dropping ElasticSearch.

@pepijn-vanvlaanderen

Is there an ETA on the implementation of this feature?

@macraig

macraig commented Mar 13, 2024

Is there an ETA on the implementation of this feature?

Hi @pepijn-vanvlaanderen, we had to deprioritize the feature in favor of more urgent work, so we don't have an ETA yet. You're welcome to add your use case and vote to the Function Scoring discussion if it aligns with your needs, or you can start a new discussion.

@Tadaz

Tadaz commented Mar 22, 2024

This is an important feature for our organization as well. Without it, our search results become less and less relevant. Our website is built on Laravel and we love that it is compatible with Meilisearch. Meilisearch lets us store our search engine's database locally; due to our privacy policy, we can't store it in the cloud.

Anyway, we have so-called publication documents and for our search results to stay relevant we need to introduce time decay on the following parameters:

  • Publication date (older publications are less relevant)
  • Authors' resignation/retirement dates (the longer an author is not with our company anymore, the less relevant their publications are to our clients)
  • Average daily readership (most read publications should be more relevant)
  • As per @sandstrom comments above, documents' promotion would be very useful as well. Sometimes some publications are more important than others and we would like to promote them.

With the current filtering and sorting mechanisms, this is simply not achievable. If we manage to get newer results, they are less relevant query-wise; if we manage to get more relevant results, most of them are old.

Sadly, if this functionality is not something that will be prioritized in the near future, we will need to think of other alternatives.

Thank you for your time reading this.

@curquiza
Member Author

Thank you @Tadaz for your feedback, I informed the product team 👌

@Kerollmops
Member

Kerollmops commented May 9, 2024

Hey @sandstrom and @Tadaz 👋

I just released a first prototype of a way to edit documents by using a Rhai function. You can read more on the Public Usage page. Please tell me what you think. The documentation could be improved, but it's only a first prototype.

Have a nice day 🍾

@sandstrom
Contributor

Sounds interesting!

However, the docs link shows a login screen. Is it still locked?

@Kerollmops
Member

Sorry @sandstrom, fixed it 🤭

@sandstrom
Contributor

@Kerollmops Looks great!

Using Rhai seems like a very good idea.

I've only read the docs (not tried it), but this should do the trick.

I'm away on a short weekend trip, so I cannot easily put together a quick test on my computer right now, but I'll try to get someone on the team to evaluate this approach and hopefully switch over to Meili!

Status: In Progress
6 participants