feat: ✨ add otelcol internal metrics dashboard and update docs #14

Open · wants to merge 1 commit into base `main`
1 change: 1 addition & 0 deletions README.md
@@ -9,3 +9,4 @@ Check out the directories in the repository root for following dashboards:
- [Hostmetrics Dashboard and related files](./hostmetrics/)
- [Kubernetes Infra Dashboard](./k8s-infra-metrics/)
- [Key Operation Dashboard](./key-operations/)
- [OtelCol Internal Metrics](./otelcol-metrics/)
80 changes: 80 additions & 0 deletions otelcol-metrics/README.md
@@ -0,0 +1,80 @@
# OtelCol Internal Metrics

The OtelCol Internal dashboards consist of charts that are useful for monitoring the health of the collector.

## Importing Dashboard

For a generic dashboard with a `host_name` variable, you can import the
`otelcol-metrics.json` file in the SigNoz UI.

If you have multiple environments and are using the `deployment.environment`
label, you can import the `otelcol-metrics-env.json` file instead.
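
To confirm that the collector is actually exposing the internal metrics these dashboards chart, a quick check like the sketch below can help. It assumes the collector's default Prometheus-style internal telemetry endpoint on port 8888 (`http://localhost:8888/metrics`); adjust the URL, and the `otelcol_` metric prefix, if your deployment differs.

```python
# Minimal sketch: verify the collector's internal metrics are exposed.
# Assumes the default internal telemetry endpoint on port 8888; adjust
# METRICS_URL for your deployment.
import urllib.request

METRICS_URL = "http://localhost:8888/metrics"  # assumption: default endpoint

with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
    body = resp.read().decode("utf-8")

# Print only the otelcol_* series that the dashboards rely on.
for line in body.splitlines():
    if line.startswith("otelcol_"):
        print(line)
```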

## Metrics to look at for OtelCol

### Batch processor

- `batch_size_trigger_send` - Incremented when a batch is sent because it reached the configured batch size.
- `timeout_trigger_send` - Incremented when a batch is sent because the timeout expired.
- `batch_send_size` - The number of items in the batch when it is pushed.
- `batch_send_size_bytes` - The size of the batch in bytes when it is pushed.

Together, these metrics help in understanding the typical batch size that is maintained in memory.
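
As a rough illustration of how these counters are read, the sketch below compares the rates of size-triggered and timeout-triggered sends from two counter samples. The numbers are made-up placeholders; in a real setup they would come from the collector's internal metrics endpoint (usually exposed with an `otelcol_` prefix and processor labels).

```python
# Illustrative arithmetic only: the counter values below are made up.
# 'prev' and 'curr' stand for two scrapes of the batch processor counters,
# taken interval_s seconds apart.
interval_s = 60
prev = {"batch_size_trigger_send": 1200, "timeout_trigger_send": 300}
curr = {"batch_size_trigger_send": 1500, "timeout_trigger_send": 340}

size_rate = (curr["batch_size_trigger_send"] - prev["batch_size_trigger_send"]) / interval_s
timeout_rate = (curr["timeout_trigger_send"] - prev["timeout_trigger_send"]) / interval_s
total = size_rate + timeout_rate

print(f"size-triggered sends/s:    {size_rate:.2f}")
print(f"timeout-triggered sends/s: {timeout_rate:.2f}")

# A large timeout share usually means batches rarely fill up, i.e. the
# configured batch size is generous for the current ingestion rate.
if total:
    print(f"timeout share: {timeout_rate / total:.0%}")
```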

### Receiver

- `accepted_spans`
- `refused_spans`
- `accepted_metric_points`
- `refused_metric_points`
- `accepted_log_records`
- `refused_log_records`

The rate of change of `accepted_{spans,metric_points,log_records}` indicates
the data received in its original form. This is the ingestion metric to look
at when assessing load, and it should be read together with the instance and
receiver type to understand the trend.

A non-zero rate of change of `refused_spans` probably indicates data loss,
depending on the retry mechanism on the client. Whenever someone reports
a loss of data, you may want to confirm here first.
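
A minimal sketch of the rate-of-change reading described above, using hypothetical counter samples for a single receiver; real values would carry the receiver and instance labels mentioned earlier.

```python
# Illustrative only: per-second accepted and refused rates from two
# hypothetical scrapes of one receiver's counters, 60 seconds apart.
interval_s = 60
prev = {"accepted_spans": 500_000, "refused_spans": 0}
curr = {"accepted_spans": 560_000, "refused_spans": 120}

accepted_rate = (curr["accepted_spans"] - prev["accepted_spans"]) / interval_s
refused_rate = (curr["refused_spans"] - prev["refused_spans"]) / interval_s

print(f"accepted spans/s: {accepted_rate:.1f}")
print(f"refused spans/s:  {refused_rate:.1f}")

# A sustained non-zero refused rate is the first thing to check when
# someone reports missing data.
if refused_rate > 0:
    print("warning: receiver is refusing spans; possible data loss upstream")
```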

### Exporter

- `sent_spans`
- `failed_spans`
- `enqueue_failed_spans`
- `sent_metric_points`
- `failed_metric_points`
- `enqueue_failed_metric_points`
- `sent_log_records`
- `failed_log_records`
- `enqueue_failed_log_records`

`enqueue_failed_{spans,metric_points,log_records}` indicate the number of
spans/metric points/log records that failed to be added to the sending queue.
This may be caused by a queue full of unsettled elements, so you may need
to decrease your sending rate or horizontally scale collectors.

A non-zero rate of `failed_{spans,metric_points,log_records}` indicates the
collector is not able to send data. This could mean the DB is performing
poorly. It doesn't necessarily mean there is data loss, because of the
retry mechanism in place.

The rate of `sent_{spans,metric_points,log_records}` shows the rate at which
data is getting written to the DB.

This needs to be compared with the receiver rate and possibly related
to the queue settings.
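
The comparison between exporter and receiver rates can be sketched with the same kind of arithmetic. The deltas below are hypothetical; a persistent gap between the accepted and sent rates shows up as a growing queue or as enqueue failures.

```python
# Illustrative only: compare the receiver ingestion rate with the exporter
# send rate over one scrape interval, using made-up counter deltas.
interval_s = 60
accepted_delta = 60_000     # change in accepted_spans over the interval
sent_delta = 48_000         # change in sent_spans over the interval
failed_delta = 0            # change in failed_spans over the interval
enqueue_failed_delta = 900  # change in enqueue_failed_spans over the interval

accepted_rate = accepted_delta / interval_s
sent_rate = sent_delta / interval_s

print(f"receiver accepted spans/s: {accepted_rate:.1f}")
print(f"exporter sent spans/s:     {sent_rate:.1f}")

if enqueue_failed_delta > 0:
    print("sending queue is overflowing: reduce the sending rate or scale collectors")
elif failed_delta > 0:
    print("exporter failing to send: check the DB; retries may still recover the data")
elif sent_rate < accepted_rate:
    print("exporter lagging behind the receiver: the queue is likely filling up")
```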

### Queue length

`A/B`, where `A` is `exporter_queue_size` and `B` is `exporter_queue_capacity`,
indicates how occupied the queue typically is. This should show whether the
queue size is enough for the ingestion rate.

A non-zero rate of `enqueue_failed_{spans, metric_points, log_records}` indicates
that the queue is full: either ingestion is happening at a high rate in the
receiver, or the exporter is not transmitting at the same rate.

If the queue size is already reasonable, the solution might be to increase the number of collector instances.
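
A short sketch of the `A/B` occupancy check, assuming the two gauges have already been scraped (they are commonly exposed per exporter, e.g. as `otelcol_exporter_queue_size` and `otelcol_exporter_queue_capacity`):

```python
# Illustrative only: the gauge values below are placeholders for
# exporter_queue_size (A) and exporter_queue_capacity (B).
queue_size = 4200.0      # A: items currently waiting in the sending queue
queue_capacity = 5000.0  # B: configured maximum queue length

occupancy = queue_size / queue_capacity
print(f"queue occupancy: {occupancy:.0%}")

# A queue that stays near capacity means the exporter cannot keep up with
# the receiver; if the capacity is already reasonable, add collector instances.
if occupancy > 0.8:
    print("queue is close to full: expect enqueue_failed_* to rise soon")
```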