sidecar: Add /api/v1/flush endpoint #7358

Closed: Nashluffy wants to merge 42 commits

@Nashluffy Nashluffy commented May 14, 2024

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Adds a sidecar API with a single endpoint, /api/v1/flush. It calls the TSDB snapshot endpoint on the Prometheus instance, then uploads every block in the snapshot that is not already present in the object store.

There are a few issues that explain the motivation:

Essentially, if this is the last time the sidecar will run (e.g. the cluster is being deleted or a shard is being removed), then without some flushing mechanism you permanently lose up to 2 hours of data that has not yet been written to a block and uploaded.

Verification

Besides the unit tests, I ran Prometheus locally and called the endpoint; it works as expected.

Nashluffy and others added 29 commits May 14, 2024 17:25
Signed-off-by: mluffman <[email protected]>
If the Prometheus that belongs to a sidecar is down, we don't need to
query the sidecar. This PR makes it so that we take the sidecar out of
the endpoint set in that case. We do the same for all other store APIs by
returning an error in the info/Info gRPC call if they are marked as not
ready.

Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: mluffman <[email protected]>
…nos-io#7305)

* Query|Receiver|Store: Do not log full request on ProxyStore by default

We had a problem in production where a sudden increase in requests with long matchers put our receivers under a lot of pressure.
Upon checking profiles, we saw that the problem was calls to Log().

Signed-off-by: Pedro Tanaka <[email protected]>

* Adding changelog

Signed-off-by: Pedro Tanaka <[email protected]>

---------

Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: mluffman <[email protected]>
* *: Updating hashicorp LRU cache to v2

Signed-off-by: Pedro Tanaka <[email protected]>

* Adding some new comments regarding removing complexity of TTL

Signed-off-by: Pedro Tanaka <[email protected]>

* Using new version everywhere

Signed-off-by: Pedro Tanaka <[email protected]>

* rephrase the comment

Signed-off-by: Pedro Tanaka <[email protected]>

---------

Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: mluffman <[email protected]>
Remove a long-standing TODO item in the code - let's use the great loser
tree implementation by Bryan. It is faster than the heap because fewer
comparisons are needed. Should be a nice improvement given that the heap
is used in a lot of hot paths.

Since Prometheus also uses this library, it's tricky to import the "any"
version. I tried doing bboreham/go-loser#3 but
it's still impossible to do that. Let's just copy/paste the code, it's
not a lot.

Bench:

```
goos: linux
goarch: amd64
pkg: github.com/thanos-io/thanos/pkg/store
cpu: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
             │   oldkway   │               newkway               │
             │   sec/op    │    sec/op     vs base               │
KWayMerge-16   2.292m ± 3%   2.075m ± 15%  -9.47% (p=0.023 n=10)

             │   oldkway    │               newkway               │
             │     B/op     │     B/op      vs base               │
KWayMerge-16   1.553Mi ± 0%   1.585Mi ± 0%  +2.04% (p=0.000 n=10)

             │   oldkway   │              newkway               │
             │  allocs/op  │  allocs/op   vs base               │
KWayMerge-16   27.26k ± 0%   26.27k ± 0%  -3.66% (p=0.000 n=10)
```

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: mluffman <[email protected]>
Batch TSDB Infos for bucket store for blocks with overlapping ranges.

Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: mluffman <[email protected]>
…io#7310)

* Proxy: acceptance test for proxy store with replica labels

Signed-off-by: Michael Hoffmann <[email protected]>

* Stores: handle replica labels in label_value and label_names grpcs

Signed-off-by: Michael Hoffmann <[email protected]>

---------

Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Kartikay <[email protected]>
Signed-off-by: mluffman <[email protected]>
This commit adds a resource_attributes field to the OTLP tracing configuration.

Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: mluffman <[email protected]>
For thanos-io#6775, it would be useful
to know the exact block IDs to aid debugging.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: mluffman <[email protected]>
Adding a minimal test case for issue thanos-io#6775 - reproduces the panic in the
compactor.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: mluffman <[email protected]>
This commit adds a new tracing span for remotely delegated queries
with attributes related to the query and remote engine.

Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: mluffman <[email protected]>
* Adding repro case for broken query with distributed engine

Signed-off-by: Pedro Tanaka <[email protected]>

* Fixing problem with distributed queries and xfunctions

Signed-off-by: Pedro Tanaka <[email protected]>

* Adding support for extended functions in tenancy enforcement

Signed-off-by: Pedro Tanaka <[email protected]>

* Moving custom parser to new package

Signed-off-by: Pedro Tanaka <[email protected]>

* fixing go-lint

Signed-off-by: Pedro Tanaka <[email protected]>

* Using same opts and reorganize imports

Signed-off-by: Pedro Tanaka <[email protected]>

* fixing problem with query format

Signed-off-by: Pedro Tanaka <[email protected]>

* fixing flaky tests

Signed-off-by: Pedro Tanaka <[email protected]>

* removing extra test

Signed-off-by: Pedro Tanaka <[email protected]>

* yet another flaky test

Signed-off-by: Pedro Tanaka <[email protected]>

---------

Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Vanshikav123 <[email protected]>
Signed-off-by: mluffman <[email protected]>
* rule

Signed-off-by: Vanshikav123 <[email protected]>

* rule-changes

Signed-off-by: Vanshikav123 <[email protected]>

* prettier

Signed-off-by: Vanshikav123 <[email protected]>

* Rebuild

Signed-off-by: Vanshikav123 <[email protected]>

* changes after make react-app

Signed-off-by: Vanshikav123 <[email protected]>

---------

Signed-off-by: Vanshikav123 <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: mluffman <[email protected]>
When using the exemplars proxy to search for exemplars on receivers, a receiver with any tenant that did not match the selector on the external label was
skipped completely, even if it had another tenant that did match.

Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: mluffman <[email protected]>
pedro-stanaka and others added 12 commits May 14, 2024 17:25
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: mluffman <[email protected]>
* Update minio-go to v7.0.70

Add support for EKS Pod Identity
fix issue: thanos-io#7157

Signed-off-by: farhad <[email protected]>

* Changelog - support for EKS Pod Identity

Updated changelog

Signed-off-by: farhad <[email protected]>

---------

Signed-off-by: farhad <[email protected]>
Signed-off-by: mluffman <[email protected]>
thanos-io#7338)

* fixing extended functions support in more places

Signed-off-by: Pedro Tanaka <[email protected]>

* Adding new faillint for the Parse() method

Signed-off-by: Pedro Tanaka <[email protected]>

* Adding new method for ParseMetricSelector

Signed-off-by: Pedro Tanaka <[email protected]>

* Fixing missing imports

Extending test to check behavior

More missing imports

Signed-off-by: Pedro Tanaka <[email protected]>

* Fixing method name

Signed-off-by: Pedro Tanaka <[email protected]>

* Solving references to forbidden functions

Signed-off-by: Pedro Tanaka <[email protected]>

* Treating promql validation from ParseExpr

Signed-off-by: Pedro Tanaka <[email protected]>

* fixing funcs

Signed-off-by: Pedro Tanaka <[email protected]>

---------

Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: mluffman <[email protected]>
Bumps [webpack](https://github.com/webpack/webpack) from 5.70.0 to 5.91.0.
- [Release notes](https://github.com/webpack/webpack/releases)
- [Commits](webpack/webpack@v5.70.0...v5.91.0)

---
updated-dependencies:
- dependency-name: webpack
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: mluffman <[email protected]>
* Align tenant pruning according to wall clock.

Pruning a tenant currently acquires a lock on the tenant's TSDB,
which blocks reads from incoming queries. We have noticed spikes in
query latency when tenants get decommissioned, since each receiver will
prune the tenant at a different time.

To reduce the window where queries get degraded, this commit makes sure that
pruning happens at predictable intervals by aligning it to the wall clock, similar
to how head compaction is aligned.

The commit also changes the tenant deletion condition to look at the duration
from the min time of the tenant, rather than from the last append time.

Signed-off-by: Filip Petkovski <[email protected]>

* Improve tests

Signed-off-by: Filip Petkovski <[email protected]>

---------

Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: mluffman <[email protected]>
Bumps [ip](https://github.com/indutny/node-ip) from 1.1.5 to 1.1.9.
- [Commits](indutny/node-ip@v1.1.5...v1.1.9)

---
updated-dependencies:
- dependency-name: ip
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: mluffman <[email protected]>
…hanos-io#7348)

Bumps [webpack-dev-middleware](https://github.com/webpack/webpack-dev-middleware) from 5.3.1 to 5.3.4.
- [Release notes](https://github.com/webpack/webpack-dev-middleware/releases)
- [Changelog](https://github.com/webpack/webpack-dev-middleware/blob/v5.3.4/CHANGELOG.md)
- [Commits](webpack/webpack-dev-middleware@v5.3.1...v5.3.4)

---
updated-dependencies:
- dependency-name: webpack-dev-middleware
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: mluffman <[email protected]>
Signed-off-by: mluffman <[email protected]>
@Nashluffy Nashluffy reopened this May 14, 2024
@Nashluffy Nashluffy closed this May 14, 2024
@Nashluffy Nashluffy deleted the flush-endpoint branch May 14, 2024 16:45