CRI-O Metrics

To enable the Prometheus metrics exporter for CRI-O, either start crio with --metrics-enable or add the corresponding option to a configuration drop-in file, for example /etc/crio/crio.conf.d/01-metrics.conf:

[crio.metrics]
enable_metrics = true

By default, the metrics endpoint is served on port 9090 via HTTP. The port can be changed via the --metrics-port command line argument or via the configuration file:

metrics_port = 9090

If CRI-O runs with metrics enabled, this can be verified by querying the endpoint manually via curl:

curl localhost:9090/metrics
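The same check can be scripted. The following is a minimal sketch, assuming the default endpoint from above and metric lines without trailing timestamps; the helper names `fetch_metrics` and `crio_samples` and the `SAMPLE` payload are illustrative and not part of CRI-O:

```python
import urllib.request


def fetch_metrics(url="http://localhost:9090/metrics", timeout=5.0):
    """Download the raw Prometheus text exposition from the CRI-O endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def crio_samples(text):
    """Return {sample name including labels: value} for all crio_* samples."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the value is the last field.
        name, _, value = line.rpartition(" ")
        if name.startswith("crio_"):
            samples[name] = float(value)
    return samples


# Illustrative payload only; real output depends on the node.
SAMPLE = """\
# TYPE crio_image_pulls_success_total counter
crio_image_pulls_success_total 3
# TYPE crio_processes_defunct gauge
crio_processes_defunct 0
go_goroutines 17
"""

print(crio_samples(SAMPLE))
```

On a node with metrics enabled, `crio_samples(fetch_metrics())` would return the live values instead of the sample payload.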

It is also possible to serve the metrics via HTTPS by providing an additional certificate and key:

[crio.metrics]
enable_metrics = true
metrics_cert = "/path/to/cert.pem"
metrics_key = "/path/to/key.pem"
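A client then has to trust the configured certificate. The sketch below shows one way to query the HTTPS endpoint from Python. It assumes that `ca_file` points to the CA certificate (or to the self-signed certificate itself) and that the certificate is valid for the host name being queried; `build_tls_context` and `fetch_metrics_https` are hypothetical helper names:

```python
import ssl
import urllib.request


def build_tls_context(ca_file=None):
    """Create a verifying TLS context.

    ca_file is an assumption of this sketch: a PEM file with the CA (or the
    self-signed certificate) to trust. With None, the system trust store is used.
    """
    return ssl.create_default_context(cafile=ca_file)


def fetch_metrics_https(host="localhost", port=9090, ca_file=None, timeout=5.0):
    """Fetch the metrics over HTTPS, verifying the server certificate."""
    url = f"https://{host}:{port}/metrics"
    context = build_tls_context(ca_file)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A call such as `fetch_metrics_https(ca_file="/path/to/cert.pem")` would fail with a certificate error if the certificate does not cover `localhost`, which is usually preferable to disabling verification.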

Available Metrics

Besides the default Go-based metrics, CRI-O provides the following additional metrics:

| Metric Key | Possible Labels or Buckets | Type | Purpose |
| --- | --- | --- | --- |
| crio_operations_total | every CRI-O RPC* operation | Counter | Cumulative number of CRI-O operations by operation type. |
| crio_operations_latency_seconds_total | every CRI-O RPC* operation, network_setup_pod (CNI pod network setup time), network_setup_overall (overall network setup time) | Summary | Latency in seconds of CRI-O operations, split up by operation type. |
| crio_operations_latency_seconds | every CRI-O RPC* operation | Gauge | Latency in seconds of individual CRI calls for CRI-O operations, broken down by operation type. |
| crio_operations_errors_total | every CRI-O RPC* operation | Counter | Cumulative number of CRI-O operation errors by operation type. |
| crio_image_pulls_bytes_total | mediatype, size (sizes are bucketed in bytes for layer sizes of 1 KiB, 1 MiB, 10 MiB, 50 MiB, 100 MiB, 200 MiB, 300 MiB, 400 MiB, 500 MiB, 1 GiB, 10 GiB) | Counter | Bytes transferred by CRI-O image pulls. |
| crio_image_pulls_skipped_bytes_total | size (sizes are bucketed in bytes for layer sizes of 1 KiB, 1 MiB, 10 MiB, 50 MiB, 100 MiB, 200 MiB, 300 MiB, 400 MiB, 500 MiB, 1 GiB, 10 GiB) | Counter | Bytes skipped by CRI-O image pulls by name. The ratio of skipped bytes to total bytes can be used to determine the cache reuse ratio. |
| crio_image_pulls_success_total | | Counter | Successful image pulls. |
| crio_image_pulls_failure_total | error | Counter | Failed image pulls by their error category. |
| crio_image_pulls_layer_size_{sum,count,bucket} | buckets in bytes for layer sizes of 1 KiB, 1 MiB, 10 MiB, 50 MiB, 100 MiB, 200 MiB, 300 MiB, 400 MiB, 500 MiB, 1 GiB, 10 GiB | Histogram | Bytes transferred by CRI-O image pulls per layer. |
| crio_image_layer_reuse_total | | Counter | Reused (not pulled) local image layer count by name. |
| crio_containers_dropped_events_total | | Counter | The total number of container events dropped. |
| crio_containers_oom_total | | Counter | Total number of containers killed because they ran out of memory (OOM). |
| crio_containers_oom_count_total | name | Counter | Containers killed because they ran out of memory (OOM) by their name. The name label can have high cardinality, but it makes it easy for users to identify which containers are running out of memory. Ideally, very few containers should OOM, which keeps the cardinality of name reasonably low. |
| crio_containers_seccomp_notifier_count_total | name, syscall | Counter | Forbidden syscall count resulting in killed containers by name. |
| crio_processes_defunct | | Gauge | Total number of defunct processes on the node. |
| crio_operations | every CRI-O RPC* | Counter | (DEPRECATED: in favour of crio_operations_total) Cumulative number of CRI-O operations by operation type. |
| crio_operations_latency_microseconds_total | every CRI-O RPC*, network_setup_pod (CNI pod network setup time), network_setup_overall (overall network setup time) | Summary | (DEPRECATED: in favour of crio_operations_latency_seconds_total) Latency in microseconds of CRI-O operations, split up by operation type. |
| crio_operations_latency_microseconds | every CRI-O RPC* | Gauge | (DEPRECATED: in favour of crio_operations_latency_seconds) Latency in microseconds of individual CRI calls for CRI-O operations, broken down by operation type. |
| crio_operations_errors | every CRI-O RPC* | Counter | (DEPRECATED: in favour of crio_operations_errors_total) Cumulative number of CRI-O operation errors by operation type. |
| crio_image_pulls_by_digest | name, digest, mediatype, size | Counter | (DEPRECATED: in favour of crio_image_pulls_bytes_total) Bytes transferred by CRI-O image pulls by digest. |
| crio_image_pulls_by_name | name, size | Counter | (DEPRECATED: in favour of crio_image_pulls_bytes_total) Bytes transferred by CRI-O image pulls by name. |
| crio_image_pulls_by_name_skipped | name | Counter | (DEPRECATED: in favour of crio_image_pulls_skipped_bytes_total) Bytes skipped by CRI-O image pulls by name. |
| crio_image_pulls_successes | name | Counter | (DEPRECATED: in favour of crio_image_pulls_success_total) Successful image pulls by image name. |
| crio_image_pulls_failures | name, error | Counter | (DEPRECATED: in favour of crio_image_pulls_failure_total) Failed image pulls by image name and their error category. |
| crio_image_layer_reuse | name | Counter | (DEPRECATED: in favour of crio_image_layer_reuse_total) Reused (not pulled) local image layer count by name. |
| crio_containers_oom | name | Counter | (DEPRECATED: in favour of crio_containers_oom_count_total) Containers killed because they ran out of memory (OOM) by their name. |

  • Available CRI-O RPCs from the gRPC API: Attach, ContainerStats, ContainerStatus, CreateContainer, Exec, ExecSync, ImageFsInfo, ImageStatus, ListContainerStats, ListContainers, ListImages, ListPodSandbox, PodSandboxStatus, PortForward, PullImage, RemoveContainer, RemoveImage, RemovePodSandbox, ReopenContainerLog, RunPodSandbox, StartContainer, Status, StopContainer, StopPodSandbox, UpdateContainerResources, UpdateRuntimeConfig, Version

  • Available error categories for crio_image_pulls_failure_total (and the deprecated crio_image_pulls_failures):

    • UNKNOWN: The default label which gets applied if the error is not known.
    • CONNECTION_REFUSED: The local network is down or the registry refused the connection.
    • CONNECTION_TIMEOUT: The connection timed out during the image download.
    • NOT_FOUND: The registry does not exist at the specified resource.
    • BLOB_UNKNOWN: This error may be returned when a blob is unknown to the registry in a specified repository. This can be returned with a standard get or if a manifest references an unknown layer during upload.
    • BLOB_UPLOAD_INVALID: The blob upload encountered an error and can no longer proceed.
    • BLOB_UPLOAD_UNKNOWN: If a blob upload has been cancelled or was never started, this error code may be returned.
    • DENIED: The access controller denied access for the operation on a resource.
    • DIGEST_INVALID: When a blob is uploaded, the registry will check that the content matches the digest provided by the client. The error may include a detail structure with the key "digest", including the invalid digest string. This error may also be returned when a manifest includes an invalid layer digest.
    • MANIFEST_BLOB_UNKNOWN: This error may be returned when a manifest blob is unknown to the registry.
    • MANIFEST_INVALID: During upload, manifests undergo several checks ensuring validity. If those checks fail, this error may be returned, unless a more specific error is included. The detail will contain information about the failed validation.
    • MANIFEST_UNKNOWN: This error is returned when the manifest, identified by name and tag, is unknown to the repository.
    • MANIFEST_UNVERIFIED: During manifest upload, if the manifest fails signature verification, this error will be returned.
    • NAME_INVALID: Invalid repository name encountered either during manifest validation or any API operation.
    • NAME_UNKNOWN: This is returned if the name used during an operation is unknown to the registry.
    • SIZE_INVALID: When a layer is uploaded, the provided size will be checked against the uploaded content. If they do not match, this error will be returned.
    • TAG_INVALID: During a manifest upload, if the tag in the manifest does not match the URI tag, this error will be returned.
    • TOOMANYREQUESTS: Returned when a client attempts to contact a service too many times.
    • UNAUTHORIZED: The access controller was unable to authenticate the client. Often this will be accompanied by a Www-Authenticate HTTP response header indicating how to authenticate.
    • UNAVAILABLE: Returned when a service is not available.
    • UNSUPPORTED: The operation was unsupported due to a missing implementation or invalid set of parameters.
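
The cache reuse ratio mentioned for crio_image_pulls_skipped_bytes_total can be computed from scraped samples. The following is a rough sketch, assuming that "total bytes" means transferred plus skipped bytes (the source does not spell this out) and that samples are available as a dictionary keyed by sample name including labels. The label values in `EXAMPLE` are placeholders, not real CRI-O output:

```python
def sum_metric(samples, metric):
    """Sum all samples of one metric across its label sets."""
    return sum(
        value
        for name, value in samples.items()
        if name == metric or name.startswith(metric + "{")
    )


def cache_reuse_ratio(samples):
    """Skipped bytes divided by (skipped + transferred) bytes.

    Assumption of this sketch: total bytes = skipped + transferred,
    so the result always lies between 0.0 and 1.0.
    """
    skipped = sum_metric(samples, "crio_image_pulls_skipped_bytes_total")
    transferred = sum_metric(samples, "crio_image_pulls_bytes_total")
    total = skipped + transferred
    return skipped / total if total else 0.0


# Placeholder samples; label values are made up for illustration.
EXAMPLE = {
    'crio_image_pulls_bytes_total{mediatype="m",size="s1"}': 600.0,
    'crio_image_pulls_skipped_bytes_total{size="s1"}': 300.0,
    'crio_image_pulls_skipped_bytes_total{size="s2"}': 100.0,
    "crio_image_pulls_success_total": 2.0,
}

print(cache_reuse_ratio(EXAMPLE))
```

In a Prometheus setup, the same ratio would more commonly be expressed as a query over both counters; the Python form is only meant to make the arithmetic explicit.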

Exporting Metrics via Prometheus

The CRI-O metrics exporter can be used to provide a cluster-wide scraping endpoint for Prometheus. The container image can either be built manually via make metrics-exporter or consumed directly from quay.io.

The deployment requires RBAC to be enabled in the target Kubernetes environment and creates a new ClusterRole that allows listing the available nodes. In addition, a new Role is created that allows updating a config-map within the cri-o-metrics-exporter namespace. Note that the exporter only works if its pod can reach the node IPs from its namespace. This generally works but might be restricted by network configuration or policies.

To deploy the metrics exporter within a new cri-o-metrics-exporter namespace, apply cluster.yaml from the root directory of this repository:

kubectl create -f contrib/metrics-exporter/cluster.yaml

The CRIO_METRICS_PORT environment variable defaults to "9090" and can be used to customize the metrics port for the nodes. Once the deployment is up and running, it should log the registered nodes and report that a new config-map has been created:

$ kubectl logs -f cri-o-metrics-exporter-65c9b7b867-7qmsb
level=info msg="Getting cluster configuration"
level=info msg="Creating Kubernetes client"
level=info msg="Retrieving nodes"
level=info msg="Registering handler /master (for 172.1.2.0)"
level=info msg="Registering handler /node-0 (for 172.1.3.0)"
level=info msg="Registering handler /node-1 (for 172.1.3.1)"
level=info msg="Registering handler /node-2 (for 172.1.3.2)"
level=info msg="Registering handler /node-3 (for 172.1.3.3)"
level=info msg="Registering handler /node-4 (for 172.1.3.4)"
level=info msg="Updated scrape configs in configMap cri-o-metrics-exporter"
level=info msg="Wrote scrape configs to configMap cri-o-metrics-exporter"
level=info msg="Serving HTTP on :8080"
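
To check that every expected node was registered, the handler lines can be extracted from the log. This is a small sketch that assumes the log format shown in the sample output above; the regular expression and the helper name `registered_handlers` are illustrative and may need adjusting if the exporter's log format differs:

```python
import re

# Matches lines such as:
#   level=info msg="Registering handler /node-0 (for 172.1.3.0)"
HANDLER_RE = re.compile(r'msg="Registering handler (/\S+) \(for ([^)]+)\)"')


def registered_handlers(log_text):
    """Map each registered handler path to the node IP it was registered for."""
    return dict(HANDLER_RE.findall(log_text))


# Excerpt from the sample log output above.
LOG_EXCERPT = """\
level=info msg="Retrieving nodes"
level=info msg="Registering handler /master (for 172.1.2.0)"
level=info msg="Registering handler /node-0 (for 172.1.3.0)"
level=info msg="Serving HTTP on :8080"
"""

print(registered_handlers(LOG_EXCERPT))
```

The log text could be obtained with the `kubectl logs` command shown above and passed to `registered_handlers` as a string.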

The config-map now contains the scrape configuration, which can be used for Prometheus:

$ kubectl get cm cri-o-metrics-exporter -o yaml
apiVersion: v1
data:
  config: |
    scrape_configs:
    - job_name: "cri-o-exporter-master"
      scrape_interval: 1s
      metrics_path: /master
      static_configs:
        - targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
          labels:
            instance: "master"
    - job_name: "cri-o-exporter-node-0"
      scrape_interval: 1s
      metrics_path: /node-0
      static_configs:
        - targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
          labels:
            instance: "node-0"
    - job_name: "cri-o-exporter-node-1"
      scrape_interval: 1s
      metrics_path: /node-1
      static_configs:
        - targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
          labels:
            instance: "node-1"
    - job_name: "cri-o-exporter-node-2"
      scrape_interval: 1s
      metrics_path: /node-2
      static_configs:
        - targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
          labels:
            instance: "node-2"
    - job_name: "cri-o-exporter-node-3"
      scrape_interval: 1s
      metrics_path: /node-3
      static_configs:
        - targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
          labels:
            instance: "node-3"
    - job_name: "cri-o-exporter-node-4"
      scrape_interval: 1s
      metrics_path: /node-4
      static_configs:
        - targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]
          labels:
            instance: "node-4"
kind: ConfigMap
metadata:
  creationTimestamp: "2020-05-12T08:29:06Z"
  name: cri-o-metrics-exporter
  namespace: cri-o-metrics-exporter
  resourceVersion: "2862950"
  selfLink: /api/v1/namespaces/cri-o-metrics-exporter/configmaps/cri-o-metrics-exporter
  uid: 1409804a-78a2-4961-8205-c5f383626b4b
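
The scrape configuration is stored as a string under the config key of the config-map, so it has to be extracted before it can be merged into a Prometheus configuration. The sketch below assumes the config-map was retrieved in JSON form (for example with `-o json` instead of `-o yaml`); `extract_scrape_config` is a hypothetical helper and the embedded document is an abbreviated version of the output above:

```python
import json


def extract_scrape_config(configmap_json):
    """Return the text stored under data.config of a ConfigMap JSON document."""
    configmap = json.loads(configmap_json)
    return configmap["data"]["config"]


# Abbreviated example document with a single scrape job.
CONFIGMAP_JSON = json.dumps(
    {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "cri-o-metrics-exporter",
            "namespace": "cri-o-metrics-exporter",
        },
        "data": {
            "config": (
                "scrape_configs:\n"
                '- job_name: "cri-o-exporter-master"\n'
                "  scrape_interval: 1s\n"
                "  metrics_path: /master\n"
                "  static_configs:\n"
                '    - targets: ["cri-o-metrics-exporter.cri-o-metrics-exporter"]\n'
                "      labels:\n"
                '        instance: "master"\n'
            )
        },
    }
)

print(extract_scrape_config(CONFIGMAP_JSON))
```

The returned text is a `scrape_configs` fragment whose jobs can be copied into the `scrape_configs` section of the Prometheus server configuration.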

If the scrape configuration has been added to the Prometheus server, the Grafana dashboard provided within this repository can be set up as well:

[Image: grafana-setup]