[release-v2.4] [DOC] Update monitoring doc (#3550)
* [DOC] Update monitoring doc (#3535)

* Update monitoring doc

* Updates for typos and formatting

* Create new folder structure for montoring

* Fix page title

* Update docs/sources/tempo/operations/monitor/set-up-monitoring.md

* Update docs/sources/tempo/operations/monitor/set-up-monitoring.md

(cherry picked from commit a824a4e)

* Update docs/sources/tempo/configuration/polling.md

---------

Co-authored-by: Kim Nylander <[email protected]>
github-actions[bot] and knylander-grafana committed Apr 9, 2024
1 parent 46e42aa commit 9fa984a
Showing 7 changed files with 438 additions and 111 deletions.
3 changes: 2 additions & 1 deletion docs/sources/tempo/configuration/polling.md
@@ -35,10 +35,11 @@ storage:
[blocklist_poll_stale_tenant_index: <duration>]
```

Due to the mechanics of the [tenant index]({{< relref "../operations/polling" >}}), the blocklist will be stale by
Due to the mechanics of the [tenant index]({{< relref "../operations/monitor/polling" >}}), the blocklist will be stale by
at most 2 times the configured `blocklist_poll` duration. There are two configuration options that need to be balanced
against the `blocklist_poll` to handle this:

The ingester `complete_block_timeout` is used to hold a block in the ingester for a given period of time after
it has been flushed. This allows the ingester to return traces to the queriers while they are still unaware
of the newly flushed blocks.
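
As a rough illustration of that balance, here is a minimal configuration sketch. The values are examples only (the commonly cited defaults of `5m` and `15m` are assumed as a starting point), and `complete_block_timeout` sits under the ingester block:

```yaml
# Sketch only: values are illustrative, not recommendations.
storage:
  trace:
    blocklist_poll: 5m          # how often the blocklist / tenant index is refreshed
ingester:
  complete_block_timeout: 15m   # keep flushed blocks queryable in the ingester long enough
                                # to cover a blocklist that may be up to 2x blocklist_poll stale
```
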
4 changes: 2 additions & 2 deletions docs/sources/tempo/operations/backend_search.md
@@ -7,7 +7,7 @@ weight: 90

# Tune search performance

Regardless of whether or not you are using TraceQL or the original search API, Tempo will search all of the blocks
Regardless of whether you use TraceQL or the original search API, Tempo searches all of the blocks
in the specified time range.
Depending on your volume, this may result in slow queries.
This document contains suggestions for tuning your backend to improve performance.
@@ -111,7 +111,7 @@ query_frontend:

## Serverless environment

Serverless is not required, but with larger loads, serverless can be used to reduce costs.
Tempo has support for Google Cloud Run and AWS Lambda. In both cases, you will use the following
settings to configure Tempo to use a serverless environment:

2 changes: 1 addition & 1 deletion docs/sources/tempo/operations/caching.md
@@ -32,7 +32,7 @@ sum by (status_code) (
)
```

This metric is also shown in [the monitoring dashboards]({{< relref "./monitoring" >}}) (the left panel):
This metric is also shown in [the monitoring dashboards]({{< relref "./monitor" >}}) (the left panel):

<p align="center"><img src="../caching_memcached_connection_limit.png" alt="QPS and latency of requests to memcached"></p>

111 changes: 111 additions & 0 deletions docs/sources/tempo/operations/monitor/_index.md
@@ -0,0 +1,111 @@
---
title: Monitor Tempo
menuTitle: Monitor Tempo
description: Use polling, alerts, and dashboards to monitor Tempo in production.
weight: 20
aliases:
- ./monitoring ## https://grafana.com/docs/tempo/latest/operations/monitoring/
---

# Monitor Tempo

Tempo is instrumented to expose metrics, logs, and traces.
Furthermore, the Tempo repository has a [mixin](https://github.com/grafana/tempo/tree/main/operations/tempo-mixin) that includes a
set of dashboards, rules, and alerts.
Together, these can be used to monitor Tempo in production.

## Instrumentation

Metrics, logs, and traces from Tempo can be collected to observe its services and functions.

### Metrics

Tempo is instrumented with [Prometheus metrics](https://prometheus.io/) and emits RED metrics for most services and backends.
RED metrics are a standardized format for monitoring microservices, where R stands for requests, E stands for errors, and D stands for duration.

The [Tempo mixin](#dashboards) provides several dashboards using these metrics.
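
For example, a request-rate query of the kind these dashboards are built from might look like the following. The `tempo_request_duration_seconds` histogram and its `route`/`status_code` labels are assumed here; substitute the series your deployment actually exposes:

```promql
# Query-path request rate by status code, assuming the
# tempo_request_duration_seconds histogram is being scraped.
sum by (status_code) (
  rate(tempo_request_duration_seconds_count{route=~"api_.*"}[5m])
)
```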

### Logs

Tempo emits logs in the `key=value` ([logfmt](https://brandur.org/logfmt)) format.
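
A line in this format looks roughly like the following (an illustrative example, not verbatim Tempo output):

```
level=info ts=2024-04-09T12:00:00.000Z caller=main.go:123 msg="module initialized" module=ingester
```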

### Traces

Tempo uses the [Jaeger Golang SDK](https://github.com/jaegertracing/jaeger-client-go) for tracing instrumentation.
The complete read path and some parts of the write path of Tempo are instrumented for tracing.

You can configure the tracer [using environment variables](https://github.com/jaegertracing/jaeger-client-go#environment-variables).
To enable tracing, set one of the following: `JAEGER_AGENT_HOST` and `JAEGER_AGENT_PORT`, or `JAEGER_ENDPOINT`.

The Jaeger client uses remote sampling by default; if the management server is not available, no traces are sent.
To always send traces (no sampling), set the following environment variables:

```
JAEGER_SAMPLER_TYPE=const
JAEGER_SAMPLER_PARAM=1
```
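
For example, the endpoint variables might be set as follows (hostnames and ports are placeholders for your own Jaeger agent or collector):

```
# Either report spans through an agent...
JAEGER_AGENT_HOST=jaeger-agent.example.internal
JAEGER_AGENT_PORT=6831
# ...or send them directly to a collector endpoint instead.
JAEGER_ENDPOINT=http://jaeger-collector.example.internal:14268/api/traces
```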

## Polling

Tempo maintains knowledge of the state of the backend by polling it at regular intervals. There are currently only two components that need this knowledge and, consequently, only two that poll the backend: compactors and queriers.

Refer to [Use polling to monitor Tempo's backend status]({{< relref "./polling" >}}).

## Dashboards

The [Tempo mixin](https://github.com/grafana/tempo/tree/main/operations/tempo-mixin) has four Grafana dashboards in the `yamls` folder that you can download and import into your Grafana UI.
These dashboards work well when you run Tempo in a Kubernetes (k8s) environment and the scraped metrics have the
`cluster` and `namespace` labels.
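
One way to load them is Grafana's file-based dashboard provisioning; the sketch below assumes the dashboard JSON files have already been copied to a local directory (provider name, folder, and path are placeholders):

```yaml
# grafana/provisioning/dashboards/tempo.yaml (sketch)
apiVersion: 1
providers:
  - name: tempo-mixin
    folder: Tempo
    type: file
    options:
      path: /var/lib/grafana/dashboards/tempo   # contains tempo-reads.json, tempo-writes.json, ...
```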

### Tempo Reads dashboard

> This is available as `tempo-reads.json`.

The Reads dashboard gives information on Requests, Errors, and Duration (RED) on the query path of Tempo.
Each query touches the Gateway, Tempo-Query, Query-Frontend, Queriers, Ingesters, the backend, and Cache, if present.

Use this dashboard to monitor the performance of each of the mentioned components and to decide the number of
replicas in each deployment.

### Tempo Writes dashboard

> This is available as `tempo-writes.json`.

The Writes dashboard gives information on RED on the write/ingest path of Tempo.
A write query touches the Gateway, Distributors, Ingesters, and the backend.
This dashboard also gives information on the number of operations performed by the Compactor against the backend.

Use this dashboard to monitor the performance of each of the mentioned components and to decide the number of
replicas in each deployment.

### Tempo Resources dashboard

> This is available as `tempo-resources.json`.

The Resources dashboard provides information on `CPU`, `Container Memory`, and `Go Heap Inuse`.
This dashboard is useful for resource provisioning for the different Tempo components.

Use this dashboard to see if any components are running close to their assigned limits.

### Tempo Operational dashboard

> This is available as `tempo-operational.json`.

The Tempo Operational dashboard deserves special mention because it is probably a stack of dashboard anti-patterns.
It's big and complex, doesn't use `jsonnet`, and displays far too many metrics in one place.
If you are just getting started, the RED dashboards are a better place to learn how to monitor Tempo while treating it as an opaque system.

This dashboard is included in the Tempo repository for two reasons:

- The dashboard provides a stack of metrics for other operators to consider monitoring while running Tempo.
- We want the dashboard in our internal infrastructure and we vendor the `tempo-mixin` to do this.

## Rules and alerts

The Rules and Alerts are available as [YAML files in the compiled mixin](https://github.com/grafana/tempo/tree/main/operations/tempo-mixin-compiled) on the repository.

To set up alerting, download the provided YAML files and configure them for use on your Prometheus monitoring server.
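
For example, a Prometheus configuration could reference them through `rule_files`; the file names below are assumptions, so use whatever you saved the compiled rules and alerts as:

```yaml
# prometheus.yml (fragment)
rule_files:
  - /etc/prometheus/rules/tempo/rules.yaml
  - /etc/prometheus/rules/tempo/alerts.yaml
```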

Check the [runbook](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md) to understand the
various steps that can be taken to fix firing alerts.
@@ -5,6 +5,7 @@ description: Monitor Tempo's backend using polling
weight: 30
aliases:
- /docs/tempo/operations/polling
- ../polling
---

# Use polling to monitor Tempo's backend status
@@ -22,14 +23,16 @@ This is done once every `blocklist_poll` duration.
All other compactors and all queriers then rely on downloading this file, unzipping it and using the contained list.
Again, this is done once every `blocklist_poll` duration.

Due to this behavior, a given compactor or querier will often have an out-of-date blocklist.
Due to this behavior, a given compactor or querier often has an out-of-date blocklist.
During normal operation, it will be stale by at most twice the configured `blocklist_poll`.

>**Note**: For details about configuring polling, see [polling configuration]({{< relref "../configuration/polling" >}}).
{{% admonition type="note" %}}
For details about configuring polling, refer to [polling configuration]({{< relref "../../configuration/polling" >}}).
{{% /admonition %}}

## Monitor polling with dashboards and alerts

See our Jsonnet for example [alerts](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/alerts.libsonnet) and [runbook entries](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md)
Refer to the Jsonnet for example [alerts](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/alerts.libsonnet) and [runbook entries](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md)
related to polling.

If you are building your own dashboards or alerts, here are a few relevant metrics:
