[release-v2.4] [DOC] Update monitoring doc (#3550)
* [DOC] Update monitoring doc (#3535)

* Update monitoring doc

* Updates for typos and formatting

* Create new folder structure for montoring

* Fix page title

* Update docs/sources/tempo/operations/monitor/set-up-monitoring.md

* Update docs/sources/tempo/operations/monitor/set-up-monitoring.md

(cherry picked from commit a824a4e)

* Update docs/sources/tempo/configuration/polling.md

---------

Co-authored-by: Kim Nylander <[email protected]>
github-actions[bot] and knylander-grafana committed Apr 9, 2024
1 parent 46e42aa commit 9fa984a
Showing 7 changed files with 438 additions and 111 deletions.
3 changes: 2 additions & 1 deletion docs/sources/tempo/configuration/polling.md
@@ -35,10 +35,11 @@ storage:
[blocklist_poll_stale_tenant_index: <duration>]
```

Due to the mechanics of the [tenant index]({{< relref "../operations/polling" >}}), the blocklist will be stale by
Due to the mechanics of the [tenant index]({{< relref "../operations/monitor/polling" >}}), the blocklist will be stale by
at most 2 times the configured `blocklist_poll` duration. There are two configuration options that need to be balanced
against the `blocklist_poll` to handle this:

The ingester `complete_block_timeout` is used to hold a block in the ingester for a given period of time after
it has been flushed. This allows the ingester to return traces to the queriers while they are still unaware
of the newly flushed blocks.
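
As a rough illustration of that balance, here is a minimal configuration sketch. The values are examples only (the commonly cited defaults of `5m` and `15m` are assumed as a starting point), and `complete_block_timeout` sits under the ingester block:

```yaml
# Sketch only: values are illustrative, not recommendations.
storage:
  trace:
    blocklist_poll: 5m          # how often the blocklist / tenant index is refreshed
ingester:
  complete_block_timeout: 15m   # keep flushed blocks queryable in the ingester long enough
                                # to cover a blocklist that may be up to 2x blocklist_poll stale
```
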
4 changes: 2 additions & 2 deletions docs/sources/tempo/operations/backend_search.md
@@ -7,7 +7,7 @@ weight: 90

# Tune search performance

Regardless of whether or not you are using TraceQL or the original search API, Tempo will search all of the blocks
Regardless of whether you use TraceQL or the original search API, Tempo searches all of the blocks
in the specified time range.
Depending on your volume, this may result in slow queries.
This document contains suggestions for tuning your backend to improve performance.
@@ -111,7 +111,7 @@ query_frontend:

## Serverless environment

Serverless is not required, but with larger loads, serverless can be used to reduce costs.
Tempo has support for Google Cloud Run and AWS Lambda. In both cases, you will use the following
settings to configure Tempo to use a serverless environment:

2 changes: 1 addition & 1 deletion docs/sources/tempo/operations/caching.md
@@ -32,7 +32,7 @@ sum by (status_code) (
)
```

This metric is also shown in [the monitoring dashboards]({{< relref "./monitoring" >}}) (the left panel):
This metric is also shown in [the monitoring dashboards]({{< relref "./monitor" >}}) (the left panel):

<p align="center"><img src="../caching_memcached_connection_limit.png" alt="QPS and latency of requests to memcached"></p>

111 changes: 111 additions & 0 deletions docs/sources/tempo/operations/monitor/_index.md
@@ -0,0 +1,111 @@
---
title: Monitor Tempo
menuTitle: Monitor Tempo
description: Use polling, alerts, and dashboards to monitor Tempo in production.
weight: 20
aliases:
- ./monitoring ## https://grafana.com/docs/tempo/latest/operations/monitoring/
---

# Monitor Tempo

Tempo is instrumented to expose metrics, logs, and traces.
Furthermore, the Tempo repository has a [mixin](https://github.com/grafana/tempo/tree/main/operations/tempo-mixin) that includes a
set of dashboards, rules, and alerts.
Together, these can be used to monitor Tempo in production.

## Instrumentation

Metrics, logs, and traces from Tempo can be collected to observe its services and functions.

### Metrics

Tempo is instrumented with [Prometheus metrics](https://prometheus.io/) and emits RED metrics for most services and backends.
RED metrics are a standardized format for monitoring microservices, where R stands for requests, E stands for errors, and D stands for duration.

The [Tempo mixin](#dashboards) provides several dashboards using these metrics.
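
For example, a request-rate query of the kind these dashboards are built from might look like the following. The `tempo_request_duration_seconds` histogram and its `route`/`status_code` labels are assumed here; substitute the series your deployment actually exposes:

```promql
# Query-path request rate by status code, assuming the
# tempo_request_duration_seconds histogram is being scraped.
sum by (status_code) (
  rate(tempo_request_duration_seconds_count{route=~"api_.*"}[5m])
)
```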

### Logs

Tempo emits logs in the `key=value` ([logfmt](https://brandur.org/logfmt)) format.
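
A line in this format looks roughly like the following (an illustrative example, not verbatim Tempo output):

```
level=info ts=2024-04-09T12:00:00.000Z caller=main.go:123 msg="module initialized" module=ingester
```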

### Traces

Tempo uses the [Jaeger Golang SDK](https://github.com/jaegertracing/jaeger-client-go) for tracing instrumentation.
The complete read path and some parts of the write path of Tempo are instrumented for tracing.

You can configure the tracer [using environment variables](https://github.com/jaegertracing/jaeger-client-go#environment-variables).
To enable tracing, set one of the following: `JAEGER_AGENT_HOST` and `JAEGER_AGENT_PORT`, or `JAEGER_ENDPOINT`.

The Jaeger client uses remote sampling by default; if the management server is not available, no traces are sent.
To always send traces (no sampling), set the following environment variables:

```
JAEGER_SAMPLER_TYPE=const
JAEGER_SAMPLER_PARAM=1
```
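
For example, the endpoint variables might be set as follows (hostnames and ports are placeholders for your own Jaeger agent or collector):

```
# Either report spans through an agent...
JAEGER_AGENT_HOST=jaeger-agent.example.internal
JAEGER_AGENT_PORT=6831
# ...or send them directly to a collector endpoint instead.
JAEGER_ENDPOINT=http://jaeger-collector.example.internal:14268/api/traces
```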

## Polling

Tempo maintains knowledge of the state of the backend by polling it at regular intervals. There are currently only two components that need this knowledge and, consequently, only two that poll the backend: compactors and queriers.

Refer to [Use polling to monitor Tempo's backend status]({{< relref "./polling" >}}).

## Dashboards

The [Tempo mixin](https://github.com/grafana/tempo/tree/main/operations/tempo-mixin) has four Grafana dashboards in the `yamls` folder that you can download and import into your Grafana UI.
These dashboards work well when you run Tempo in a Kubernetes (k8s) environment and the scraped metrics have the
`cluster` and `namespace` labels.
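
One way to load them is Grafana's file-based dashboard provisioning; the sketch below assumes the dashboard JSON files have already been copied to a local directory (provider name, folder, and path are placeholders):

```yaml
# grafana/provisioning/dashboards/tempo.yaml (sketch)
apiVersion: 1
providers:
  - name: tempo-mixin
    folder: Tempo
    type: file
    options:
      path: /var/lib/grafana/dashboards/tempo   # contains tempo-reads.json, tempo-writes.json, ...
```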

### Tempo Reads dashboard

> This is available as `tempo-reads.json`.

The Reads dashboard gives information on Requests, Errors, and Duration (RED) on the query path of Tempo.
Each query touches the Gateway, Tempo-Query, Query-Frontend, Queriers, Ingesters, the backend, and Cache, if present.

Use this dashboard to monitor the performance of each of the mentioned components and to decide the number of
replicas in each deployment.

### Tempo Writes dashboard

> This is available as `tempo-writes.json`.

The Writes dashboard gives information on RED on the write/ingest path of Tempo.
A write query touches the Gateway, Distributors, Ingesters, and the backend.
This dashboard also gives information on the number of operations performed by the Compactor against the backend.

Use this dashboard to monitor the performance of each of the mentioned components and to decide the number of
replicas in each deployment.

### Tempo Resources dashboard

> This is available as `tempo-resources.json`.

The Resources dashboard provides information on `CPU`, `Container Memory`, and `Go Heap Inuse`.
This dashboard is useful for resource provisioning for the different Tempo components.

Use this dashboard to see if any components are running close to their assigned limits.

### Tempo Operational dashboard

> This is available as `tempo-operational.json`.

The Tempo Operational dashboard deserves special mention because it is probably a stack of dashboard anti-patterns.
It's big and complex, doesn't use `jsonnet`, and displays far too many metrics in one place.
If you are just getting started, the RED dashboards are a better place to learn how to monitor Tempo while treating it as an opaque system.

This dashboard is included in the Tempo repository for two reasons:

- The dashboard provides a stack of metrics for other operators to consider monitoring while running Tempo.
- We want the dashboard in our internal infrastructure and we vendor the `tempo-mixin` to do this.

## Rules and alerts

The Rules and Alerts are available as [YAML files in the compiled mixin](https://github.com/grafana/tempo/tree/main/operations/tempo-mixin-compiled) on the repository.

To set up alerting, download the provided YAML files and configure them for use on your Prometheus monitoring server.
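
For example, a Prometheus configuration could reference them through `rule_files`; the file names below are assumptions, so use whatever you saved the compiled rules and alerts as:

```yaml
# prometheus.yml (fragment)
rule_files:
  - /etc/prometheus/rules/tempo/rules.yaml
  - /etc/prometheus/rules/tempo/alerts.yaml
```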

Check the [runbook](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md) to understand the
various steps that can be taken to fix firing alerts.
@@ -5,6 +5,7 @@ description: Monitor Tempo's backend using polling
weight: 30
aliases:
- /docs/tempo/operations/polling
- ../polling
---

# Use polling to monitor Tempo's backend status
@@ -22,14 +23,16 @@ This is done once every `blocklist_poll` duration.
All other compactors and all queriers then rely on downloading this file, unzipping it and using the contained list.
Again, this is done once every `blocklist_poll` duration.

Due to this behavior, a given compactor or querier will often have an out-of-date blocklist.
Due to this behavior, a given compactor or querier often has an out-of-date blocklist.
During normal operation, it will be stale by at most twice the configured `blocklist_poll`.

>**Note**: For details about configuring polling, see [polling configuration]({{< relref "../configuration/polling" >}}).
{{% admonition type="note" %}}
For details about configuring polling, refer to [polling configuration]({{< relref "../../configuration/polling" >}}).
{{% /admonition %}}

## Monitor polling with dashboards and alerts

See our Jsonnet for example [alerts](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/alerts.libsonnet) and [runbook entries](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md)
Refer to the Jsonnet for example [alerts](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/alerts.libsonnet) and [runbook entries](https://github.com/grafana/tempo/blob/main/operations/tempo-mixin/runbook.md)
related to polling.

If you are building your own dashboards or alerts, here are a few relevant metrics:
