[Dataset quality] Adjust degraded docs query #179227

yngrdyn · 2024-03-22T09:45:10Z

📓 Summary

When elastic/elasticsearch#59946 lands we would need to improve the query to get degraded logs.

✔️ Acceptance criteria

Perform field caps request to check if they are datastreams not supporting _ignored aggregation in the selected timeframe.
If there are some datastreams not supporting the aggregation let the users know about it, we could use a banner for this with a popup indicating which are the datastream not supporting the aggregation and recommending user how to fix this (through manual rollovers)

❓ Open questions

Changing the query will result in partial failures for old indices. How can we handle such cases? Options that I see:

We let them fail partially (for the shards of the old indices) and just put an indicator that data is incomplete because index X, Y, Z couldn't be checked.
We let them fail but we retry for those indices with another mechanism (runtime field? old query?).
We perform an initial check using field caps API, if an index doesn't support aggregation on _ignored we let users know that this will take more time than expected and perform the query with the aggregation on the indices that support it and find an alternative (related to 2) to get the information for indices that don't support the aggregation

The text was updated successfully, but these errors were encountered:

elasticmachine · 2024-03-22T09:45:12Z

Pinging @elastic/obs-ux-logs-team (Team:obs-ux-logs)

ruflin · 2024-03-26T07:29:17Z

Engineering effort wise, what is the least effort? I'm thinking of starting with 1 and then follow up with 3. Or if 3 is as simple as one, directly jump to 3.

One thing we have to consider is also how much load we put on the cluster. I assume for the default of 24h it should be doable, but for longer periods, it could take a long time and make Elasticsearch read all the data.

yngrdyn · 2024-03-26T09:39:11Z

Thanks for your input @ruflin, I have now another question: is there a way to move those indices that don't support the aggregation to a state where they support it?

ruflin · 2024-03-26T10:10:00Z

is there a way to move those indices that don't support the aggregation to a state where they support it?

For data streams, my understanding is as soon as a rollover is triggered and the new index is created, it will be available on all the new data. @salvatore-campagna Can you confirm?

yngrdyn · 2024-03-26T10:14:51Z

For data streams, my understanding is as soon as a rollover is triggered and the new index is created

We could guide users to do this, so next time they can benefit fully from the aggregation

yngrdyn · 2024-05-06T08:27:22Z

@mdbirnstiehl could you help us here with some communications?

The request might take longer due to datastreams X, Y, Z not supporting _ignored aggregation
In order to support the aggregation for future data, and consequently improving performance in the page, we recommend users to perform manual rollovers to the aforementioned datastreams

mdbirnstiehl · 2024-05-06T14:44:40Z

@mdbirnstiehl could you help us here with some communications?

The request might take longer due to datastreams X, Y, Z not supporting _ignored aggregation

In order to support the aggregation for future data, and consequently improving performance in the page, we recommend users to perform manual rollovers to the aforementioned datastreams

How about:

Your request may take longer to complete due to datastreams X, Y, Z not supporting _ignored aggregation. Manually [rollover](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html) these indices to prevent future delays.

My concern is whether or not "supporting ignored aggregation" is meaningful to users, but I'm not sure of a more meaningful way to put that without making the message too long.

felixbarny · 2024-05-07T07:40:55Z

Just my 2c here but I'd go with option 1 and warn users about partial results and encourage them to roll over the data stream. I'd rather not give users a foot-gun that can affect their cluster's performance. If we choose to still do a fallback query for indices that don't support aggregations for _ignored, I'd implement it as an opt-in, i.e. the user needs to click a button to trigger the slow aggregation.

yngrdyn · 2024-05-07T08:37:27Z

Hey Felix, initially I though we would need to change the query to something like

"aggs": {
    "degraded": {
      "terms": {
        "field": "_ignored"
      }
    }
  }

With that type of query we will get failures and partial results, the thing is that query will only tell us how many documents we have containing a value of an specific _ignored property
a bucket will look like this in the response

{
   "key": {
     "dataset": "synth",
     "namespace": "default"
   },
   "doc_count": 34260,
   "degraded": {
     "doc_count_error_upper_bound": 0,
     "sum_other_doc_count": 0,
     "buckets": [
       {
         "key": "cloud.availability_zone",
         "doc_count": 3460
       },
       {
         "key": "log.level",
         "doc_count": 260
       }
     ]
   }
}

The problem with that type of response is that we are not able to tell how many documents in that dataset actually have _ignored properties, a document could have only cloud.availability_zone or only log.level or maybe cloud.availability_zone and log.level ignored.

Maybe I'm missing something but still I feel we would need to make use of exists as in the original query at least at this stage.

felixbarny · 2024-05-07T09:50:07Z

I think I've found a simple solution to get the ratio of degraded documents only for indices that have doc_values for the _ignored field. Instead of doing an exists query, we could use a missing aggregation like so:

{
  "size": 0,
  "aggs": {
    "non_degraded": {
      "missing": {
        "field": "_ignored"
      }
    }
  }
}

Based on some quick testing, the aggregation fails on shards where the field doesn't have doc_values but succeeds on other shards.

Click to expand example

PUT /test-missing-agg
{
  "mappings": {
    "properties": {
      "ignored":{
        "type": "keyword",
        "doc_values": false,
        "store": true
      }
    }
  }
}

PUT /test-missing-agg2
{
  "mappings": {
    "properties": {
      "ignored":{
        "type": "keyword"
      }
    }
  }
}


POST test-missing-agg/_doc?refresh
Authorization: Basic {{username}} {{password}}
Content-Type: application/json
{
  "foo": "bar",
  "ignored": ["foo", "bar"]
}


POST test-missing-agg2/_doc?refresh
Authorization: Basic {{username}} {{password}}
Content-Type: application/json
{
  "foo": "bar",
  "ignored": ["foo", "bar"]
}

POST /test-missing-agg/_doc?refresh
{
  "foo": "bar"
}


POST /test-missing-agg2/_doc?refresh
{
  "foo": "bar"
}

POST /test-missing-agg/_search
{
  "size": 0,
  "aggs": {
    "non_degraded": {
      "missing": {
        "field": "ignored"
      }
    }
  }
}


### response

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 1,
    "skipped": 0,
    "failed": 1,
    "failures": [
      {
        "shard": 0,
        "index": "test-missing-agg",
        "node": "0G1taQ10S6KjBFKEnwdOkA",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Can't load fielddata on [ignored] because fielddata is unsupported on fields of type [keyword]. Use doc values instead."
        }
      }
    ]
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "non_degraded": {
      "doc_count": 1
    }
  }
}

yngrdyn · 2024-05-07T13:18:24Z

Thanks @felixbarny, I did some quick testings in edge-lite.
The following tables will contain the queries used, the amount of time (ms) that took every run (from 1st to 10th), and lastly the avg of time took in ms

For 5d, including non aggregatable indices

query	1st	2nd	3rd	4rd	5th	6th	7th	8th	9th	10th	avg (ms)
exists	87990	57915	98406	59884	92024	48997	73397	79090	74474	94948	76712
missing*	75216	68595	94793	49559	90212	55243	96095	92396	64222	87188	77351

*missing is not returning complete data, instead it returns failures for the indices not supporting the _ignored aggregation

For latest 24h, all indices supporting the aggregation

query	1st	2nd	3rd	4rd	5th	6th	7th	8th	9th	10th	avg (ms)
exists	31388	49289	32990	24559	43409	56830	32525	55583	60163	69522	45625
missing	63874	33511	45171	34414	35407	36441	51615	57189	39663	63487	46077

Avg in performance is very similar, do you think for higher amount of data we could conclude that using missing will be more beneficial?

I was also trying to split the query because I think that will speed things up for most of the cases (I expect total of documents with ignored property to be a reduced number compared to the rest) if we could request the queries in parallel

Queries

##### Number of documents with ignored property per dataset
GET /logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
          {"range":
            {
              "@timestamp":{
                "gte": "now-24h",
                "lte": "now",
                "format":"epoch_millis"
              }
            }
          },
          {"term":{"data_stream.type":"logs"}
        }
      ],
      "must": {
        "exists": {
          "field": "_ignored"
        }
      }
    }
  },
  "aggs": {
    "datasets":{
      "composite": {
        "size":10000,
        "sources":[
          {"dataset":{"terms":{"field":"data_stream.dataset"}}},
          {"namespace":{"terms":{"field":"data_stream.namespace"}}}
        ]
      }
    }
  }
}

##### Total documents per dataset
GET /logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
          {"range":
            {
              "@timestamp":{
                "gte": "now-24h",
                "lte": "now",
                "format":"epoch_millis"
              }
            }
          },
          {"term":{"data_stream.type":"logs"}
        }
      ]
    }
  },
  "aggs": {
    "datasets":{
      "composite": {
        "size":10000,
        "sources":[
          {"dataset":{"terms":{"field":"data_stream.dataset"}}},
          {"namespace":{"terms":{"field":"data_stream.namespace"}}}
        ]
      }
    }
  }
}

Doing that I got the following numbers

For 5d, including non aggregatable indices

query	1st	2nd	3rd	4rd	5th	6th	7th	8th	9th	10th	avg (ms)
documents with ignored	4622	1535	2437	36	118	88	744	42	24	30	967
total documents	20698	33684	41278	40503	27794	36216	22184	22100	32606	34497	31156

For latest 24h, all indices supporting the aggregation

query	1st	2nd	3rd	4rd	5th	6th	7th	8th	9th	10th	avg (ms)
documents with ignored	204	22	31	173	66	19	45	3339	12	22	393
total documents	30441	17446	12658	22563	17267	21037	13381	24503	26429	15642	20136

If we could do that, parallelise, we might lowering the time taken by elastic search to the ~half because the critical path will be on the longest query. wdyt?

Relates to #179227. After gathering some numbers around possible tweaks of the current degradedDocs query ([more information](#179227 (comment))), I decided to move forward and split the query to reduce the time taken by elastic search aggregating on data streams. This PR contains the following changes: - `mSearch` method was added to `DatasetQualityESClient` to allow the usage of [multi search](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html). - `degradedDocsRt` was changed to now include not only the amount of degradedDocs but also the total docs for the datastreams within the timerange selected Nothing visible has changed in terms of functionality https://github.com/elastic/kibana/assets/1313018/1d836446-7d35-4ff2-8356-16a8087ab505

…ation (#183183) Closes #179227. ## 📝 Summary This PR adds a warning to main page and flyout whenever a dataStream has indices that doesn't support `_ignored` aggregation ## 🎥 Demo https://github.com/elastic/kibana/assets/1313018/c6d5fd81-d9a9-4fcc-92d0-8e65b996df9c #### Flyout https://github.com/elastic/kibana/assets/1313018/2972e104-8b10-413a-a430-3efc5252b757

salvatore-campagna · 2024-05-20T07:21:24Z

I think we should run the query only for indices that have doc values on _ignored, discovering them using the field caps api. For other indices we can have the slow option to be triggered explicitly by the user.

ruflin · 2024-05-21T07:59:42Z

@salvatore-campagna What you propose above adds a lot of complexity on the UI side for a "temporary" problem for our users. We opted to go for a banner on the top.

yngrdyn added Team:obs-ux-logs Observability Logs User Experience Team Feature:Dataset Health labels Mar 22, 2024

yngrdyn mentioned this issue Apr 17, 2024

[Dataset quality] Enable: navigation from Logs explorer, sideNav and home in stack management #179230

Open

yngrdyn self-assigned this Apr 30, 2024

mohamedhamed-ahmed mentioned this issue May 7, 2024

[Dataset quality] Reactive Estimated Data Summary Panel #182873

Closed

yngrdyn mentioned this issue May 9, 2024

[Dataset quality] using msearch to speed up degradedDocs query #183023

Merged

yngrdyn mentioned this issue May 10, 2024

[Dataset quality] Warning for datasets not supporting _ignored aggregation #183183

Merged

yngrdyn closed this as completed in #183183 May 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Dataset quality] Adjust degraded docs query #179227

[Dataset quality] Adjust degraded docs query #179227

yngrdyn commented Mar 22, 2024 •

edited

elasticmachine commented Mar 22, 2024

ruflin commented Mar 26, 2024

yngrdyn commented Mar 26, 2024

ruflin commented Mar 26, 2024

yngrdyn commented Mar 26, 2024

yngrdyn commented May 6, 2024

mdbirnstiehl commented May 6, 2024

felixbarny commented May 7, 2024

yngrdyn commented May 7, 2024 •

edited

felixbarny commented May 7, 2024

yngrdyn commented May 7, 2024 •

edited

salvatore-campagna commented May 20, 2024

ruflin commented May 21, 2024

[Dataset quality] Adjust degraded docs query #179227

[Dataset quality] Adjust degraded docs query #179227

Comments

yngrdyn commented Mar 22, 2024 • edited

📓 Summary

✔️ Acceptance criteria

❓ Open questions

elasticmachine commented Mar 22, 2024

ruflin commented Mar 26, 2024

yngrdyn commented Mar 26, 2024

ruflin commented Mar 26, 2024

yngrdyn commented Mar 26, 2024

yngrdyn commented May 6, 2024

mdbirnstiehl commented May 6, 2024

felixbarny commented May 7, 2024

yngrdyn commented May 7, 2024 • edited

felixbarny commented May 7, 2024

yngrdyn commented May 7, 2024 • edited

salvatore-campagna commented May 20, 2024

ruflin commented May 21, 2024

yngrdyn commented Mar 22, 2024 •

edited

yngrdyn commented May 7, 2024 •

edited

yngrdyn commented May 7, 2024 •

edited