Prometheus Performance Tuning - What is the high cardinality problem and how to solve it?

This article was last updated on: February 7, 2024

Background

Recently, I found that my experimental Prometheus had hit a performance bottleneck, and the following alerts kept appearing:

  • PrometheusMissingRuleEvaluations
  • PrometheusRuleFailures

After a slow investigation, it turned out to be caused by the high cardinality of some of Prometheus' series. This article is a comprehensive summary of the Prometheus high cardinality problem.

What is cardinality?

Cardinality, in its basic definition, refers to the number of elements in a given set.

In the world of Prometheus and observability, label cardinality is very important because it affects the performance and resource usage of your monitoring system.

The following figure clearly reflects the importance of cardinality:

Cardinality explosion: a basic illustration of cardinality in Prometheus.

Simply put, cardinality refers to the number of distinct values of a label. In the example above, the cardinality of the label status_code is 5 (i.e. 1xx 2xx 3xx 4xx 5xx), the cardinality of environment is 2 (i.e. prod dev), and the overall cardinality of the metric server_responses is 10.
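
A label's cardinality can also be checked directly with PromQL. A minimal sketch, using the hypothetical metric and label names from the figure above (server_responses, status_code):

count(count by (status_code) (server_responses))

The inner count by (status_code) yields one result per distinct status_code value, and the outer count tallies how many there are.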

How high is "high" cardinality?

Generally speaking:

  • Low cardinality: 1:5 label-to-value ratio
  • Standard cardinality: 1:80 label-to-value ratio
  • High cardinality: 1:10000 label-to-value ratio

Taking the example above: if status_code were the detailed code, such as 200, 404, …, its cardinality could reach into the hundreds; if environment had a few more values as well, the overall cardinality of the metric server_responses would expand rapidly.

Typical cases of high cardinality

If that is not concrete enough, here are 2 particularly typical examples:

  1. There is a metric called http_request_duration_seconds_bucket:
    1. It has an instance label, corresponding to 100 instances;
    2. It has an le label, corresponding to different buckets; there are 10 buckets, such as (0.002 0.004 0.008 … +Inf);
    3. It also has a url label, corresponding to countless URLs:
      1. Even at a small scale, there may easily be 400 URLs;
      2. An especially scary pitfall here: for large-scale systems, the number of URLs can be nearly infinite!!!
    4. It also has an http_method label, corresponding to 5 HTTP methods;
    5. In this case, the cardinality of the metric:
      1. Even at a small scale: 100 * 10 * 400 * 5 = 2,000,000, i.e. 2 million series 💀💀💀
      2. If the scale is nearly infinite, this cardinality cannot be calculated 💥💥💥 at all
  2. The other situation is using parameters with very large (perhaps even infinite) cardinality, such as user_id, session_id, or latitude/longitude, as labels; that is a disaster 💥💥💥 for Prometheus.

The negative effects of high cardinality

When Prometheus has a high cardinality, various problems arise:

  • The monitoring system is unstable or even crashes
    • The dashboard loads slowly or even fails
    • Monitoring queries are slow or even failing
  • Compute and storage resources become expensive
  • Monitoring is drowned in noise and distractions
    • The SRE team has to wade through massive amounts of alert data, delaying root cause analysis and localization

📝Notes:

Cardinality corresponds to the number of metric series, so in this blog post "number of series" and "cardinality" are used interchangeably.
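
As a quick sanity check, the total number of series currently in the head block is exposed by Prometheus' own self-monitoring metric, which can be queried as-is:

prometheus_tsdb_head_series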

How to analyze a high cardinality problem?

There are several ways to analyze high cardinality problems:

  1. Use the Prometheus UI for analysis
  2. Use Prometheus PromQL for analysis
  3. Use the Prometheus API for analysis
  4. Use Grafana Mimirtool to analyze unused metrics

Use the Prometheus UI for analysis

Starting with Prometheus v2.14.0, the UI directly provides a Head Cardinality Stats page. This greatly facilitates the analysis of high cardinality problems! 👍️👍️👍️

Located at: Prometheus UI -> Status -> TSDB Status -> Head Cardinality Stats, screenshot below:

📝Notes:

For a sense of scale in the following screenshots: this is my experimental environment, with only 4 1c2g nodes.

Prometheus UI - Head Cardinality Stats

Prometheus UI - Head Cardinality Stats - 2

From the above figure, you can intuitively see:

  1. The label with the most values is url
  2. The metrics with the most series are:
    1. apiserver_request_duration_seconds_bucket 45524
    2. rest_client_rate_limiter_duration_seconds_bucket 36971
    3. rest_client_request_duration_seconds_bucket 10032
  3. The label with the highest memory usage: url
  4. By label key-value pair, the pairs involving the most series are (this view has been less useful to me):
    1. endpoint=metrics 105406
    2. service=pushprox-k3s-server-client 101548
    3. job=k3s-server 101543
    4. namespace=cattle-monitoring-system 101120
    5. metrics_path=/metrics 91761

Use Prometheus PromQL for analysis

If your Prometheus version is older than v2.14.0, you need to analyze via:

  • Prometheus PromQL
  • Prometheus API

Here are some useful PromQLs:

topk(10, count by (__name__)({__name__=~".+"}))

The query result is the Top 10 metrics with the most series.

Once you know the Top 10, you can query further details. Because the cardinality is huge, range queries may keep failing, so it is recommended to use instant queries for the details.
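
For example, to see which values of a suspicious label dominate a given metric, an instant query such as the following can be used (the metric and label here are simply the biggest offenders from my environment):

topk(10, count by (url) (apiserver_request_duration_seconds_bucket))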

If you want to query the cardinality of a specific label on a metric, you can use PromQL as follows:

count(count by (label_name) (metric_name))

For example:

count(count by (url) (apiserver_request_duration_seconds_bucket))

There are also some other useful PromQL queries, listed below (see the sketch after this list for how the first two can be combined):

  • sum(scrape_series_added) by (job): analyze series growth per job
  • sum(scrape_samples_scraped) by (job): analyze the total number of series per job
  • prometheus_tsdb_symbol_table_size_bytes
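
As a sketch of how the first two can be combined, the jobs contributing the most series can be ranked like this:

topk(10, sum(scrape_samples_scraped) by (job))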

Use the Prometheus API for analysis

Because of the high cardinality problem, queries through Prometheus PromQL may often time out or fail. In that case, you can analyze through the Prometheus API instead:

Analyze the number of series for each metric

# Find the ClusterIP of the Prometheus SVC
kubectl get svc -n cattle-monitoring-system
export url=http://10.43.85.24:9090
export now=$(date +%s)

# For every metric name, run an instant query that counts its active series
curl -s $url/api/v1/label/__name__/values \
  | jq -r ".data[]" \
  | while read metric; do
      count=$(curl -s \
        --data-urlencode 'query=count({__name__="'$metric'"})' \
        --data-urlencode "time=$now" \
        $url/api/v1/query \
        | jq -r ".data.result[0].value[1]")
      echo "$count $metric"
    done

The top results from my own experimental cluster are as follows (null may mean there is no current data, although the amount of historical data may still be large):

Number of active series Metric name
null apiserver_admission_webhook_rejection_count
null apiserver_registered_watchers
null apiserver_request_aborts_total
null apiserver_request_duration_seconds_bucket
null cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
null cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
null kube_pod_container_status_waiting_reason
null prometheus_target_scrape_pool_target_limit
null rest_client_rate_limiter_duration_seconds_bucket
5786 rest_client_request_duration_seconds_bucket
3660 etcd_request_duration_seconds_bucket
2938 rest_client_rate_limiter_duration_seconds_count
2938 rest_client_rate_limiter_duration_seconds_sum
2840 apiserver_response_sizes_bucket
1809 apiserver_watch_events_sizes_bucket

Get the active series for a specified metric

Here, rest_client_request_duration_seconds_bucket is used as an example:

export metric=rest_client_request_duration_seconds_bucket
curl -s \
  --data-urlencode "query=$metric" \
  --data-urlencode "time=$now" \
  $url/api/v1/query \
  | jq -c ".data.result[].metric"

The result is as follows (the main cause is that the url label has too many values):

Active series of the specified metric: the url label has too many values

Get a list of all metrics

curl -s $url/api/v1/label/__name__/values | jq -r ".data[]" | sort

Get a list of labels and their cardinality

curl -s $url/api/v1/labels \
  | jq -r ".data[]" \
  | while read label; do
      count=$(curl -s $url/api/v1/label/$label/values \
        | jq -r ".data|length")
      echo "$count $label"
    done \
  | sort -n

The result is as follows (again, because the url label has far too many values!):

Cardinality Label
2199 url
1706 __name__
854 name
729 id
729 path
657 filename
652 container_id
420 resource
407 le
351 secret
302 type
182 kind

Use Grafana Mimirtool to analyze unused metrics

📚️Reference:

Grafana Mimirtool | Grafana Mimir documentation

Grafana Mimir’s introduction can be found here: Intro to Grafana Mimir: The open source time series database that scales to 1 billion metrics & beyond | Grafana Labs

Mimir ships with a utility called mimirtool, which can identify unused metrics by comparing the metrics present in Prometheus with the ones actually used by Alertmanager and Grafana. It can take the following inputs:

  • Grafana Dashboards from a Grafana instance
  • Recording rules and alerting rules from a Prometheus instance
  • Grafana Dashboard JSON files
  • Prometheus recording and alerting rules YAML files

I won’t go into detail here, but the full introduction is here: Analyzing and reducing metrics usage with Grafana Mimirtool | Grafana Cloud documentation
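
For orientation only, a rough sketch of the workflow looks like the following; the exact flags and output file names may differ across mimirtool versions, so treat this as an assumption to verify against the documentation above:

# Export the metrics referenced by the dashboards of a Grafana instance
mimirtool analyze grafana --address=<your-grafana-url> --key=<grafana-api-key>
# Compare them against what a Prometheus instance actually holds
mimirtool analyze prometheus --address=<your-prometheus-url>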

Solve high cardinality problems

For high cardinality problems, there are several typical cases:

  1. Some labels are unreasonable, with a huge or even unbounded number of values;
  2. Some metrics are unreasonable, producing a huge number of series;
  3. Prometheus' total number of series is simply too large.

For the third case, the following 2 approaches can help:

High cardinality caused by highly available Prometheus

One high cardinality situation arises when Prometheus is deployed in HA mode and sends data via remote_write to VictoriaMetrics (VM), Mimir, or Thanos.

In this case, you can add external_labels as instructed by the official documentation of VM, Mimir, or Thanos, so that these systems can handle the resulting high cardinality (replica deduplication) automatically.

An example configuration adds the following external_labels (see the sketch below):

  1. cluster
  2. __replica__
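
A minimal sketch of what this could look like in prometheus.yml, assuming a Mimir-style HA deduplication setup (the label names and values must match what your backend expects, so adjust per its documentation):

global:
  external_labels:
    cluster: my-cluster      # identifies the Prometheus HA pair
    __replica__: replica-1   # unique per replica; the backend strips it when deduplicating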

Increase the scrape interval

Increase Prometheus' global scrape_interval (adjust this parameter globally; for the few jobs that genuinely need a smaller scrape interval, configure it per job).

The default is usually scrape_interval: 15s; it is recommended to raise it to scrape_interval: 1m or even larger.
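
A minimal sketch, with a hypothetical job that genuinely needs finer resolution overriding the global setting:

global:
  scrape_interval: 1m          # raised from the 15s default to reduce sample volume
scrape_configs:
  - job_name: critical-app     # hypothetical job that needs finer resolution
    scrape_interval: 15s
    static_configs:
      - targets: ["critical-app:9090"]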

Filter and persist kubernetes-mixin metrics

For projects such as kubernetes-mixin, Prometheus Operator, and kube-prometheus, the following are available out of the box:

  • scrape metrics
  • recording rules
  • alerting rules
  • Grafana Dashboards

In this case, based on the Grafana Dashboards and alerting rules, you can keep only the metrics that are actually used via relabeling.

📚️Reference:

“Translate” Reduces Prometheus Metric Use with Relabel - Dongfeng Weiming Technology Blog (e-whisper.com)

Examples are as follows:

remoteWrite:
  - url: "<Your Metrics instance remote_write endpoint>"
    basicAuth:
      username:
        name: your_grafanacloud_secret
        key: your_grafanacloud_secret_username_key
      password:
        name: your_grafanacloud_secret
        key: your_grafanacloud_secret_password_key
    writeRelabelConfigs:
      - sourceLabels:
          - "__name__"
        regex: "apiserver_request_total|kubelet_node_config_error|kubelet_runtime_operations_errors_total|kubeproxy_network_programming_duration_seconds_bucket|container_cpu_usage_seconds_total|kube_statefulset_status_replicas|kube_statefulset_status_replicas_ready|node_namespace_pod_container:container_memory_swap|kubelet_runtime_operations_total|kube_statefulset_metadata_generation|node_cpu_seconds_total|kube_pod_container_resource_limits_cpu_cores|node_namespace_pod_container:container_memory_cache|kubelet_pleg_relist_duration_seconds_bucket|scheduler_binding_duration_seconds_bucket|container_network_transmit_bytes_total|kube_pod_container_resource_requests_memory_bytes|namespace_workload_pod:kube_pod_owner:relabel|kube_statefulset_status_observed_generation|process_resident_memory_bytes|container_network_receive_packets_dropped_total|kubelet_running_containers|kubelet_pod_worker_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|workqueue_queue_duration_seconds_bucket|container_network_transmit_packets_total|rest_client_request_duration_seconds_bucket|node_namespace_pod_container:container_memory_rss|container_cpu_cfs_throttled_periods_total|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes_used|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|kube_node_status_allocatable_memory_bytes|container_memory_cache|go_goroutines|kubelet_runtime_operations_duration_seconds_bucket|kube_statefulset_replicas|kube_pod_owner|rest_client_requests_total|container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|storage_operation_errors_total|scheduler_e2e_scheduling_duration_seconds_bucket|container_network_transmit_packets_dropped_total|kube_pod_container_resource_limits_memory_bytes|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate|storage_operation_duration_seconds_count|node_netstat_TcpExt_TCPSynRetrans|node_netstat_Tcp_OutSegs|container_cpu_cfs_periods_total|kubelet_pod_start_duration_seconds_count|kubeproxy_network_programming_duration_seconds_count|container_network_receive_bytes_total|node_netstat_Tcp_RetransSegs|up|storage_operation_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_volume_stats_available_bytes|scheduler_scheduling_algorithm_duration_seconds_bucket|kube_statefulset_status_replicas_current|code_resource:apiserver_request_total:rate5m|kube_statefulset_status_replicas_updated|process_cpu_seconds_total|kube_pod_container_resource_requests_cpu_cores|kubelet_pod_worker_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|container_memory_usage_bytes|workqueue_adds_total|container_network_receive_packets_total|container_memory_working_set_bytes|kube_resourcequota|kubelet_running_pods|kubelet_volume_stats_inodes|kubeproxy_sync_proxy_rules_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_count|apiserver_request:availability30d|container_memory_rss|kubelet_pleg_relist_interval_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_count|workqueue_depth|:node_memory_MemAvailable_bytes:sum|volume_manager_total_volumes|kube_node_status_allocatable_cpu_cores"
        action: "keep"

🐾Warning:

The above configuration may vary across versions; refer to it and adapt as appropriate.
Alternatively, use mimirtool as described above to analyze and generate your own configuration.

Reduce the use of Prometheus metrics with Relabel

A simple example is as follows:

write_relabel_configs:
  - source_labels: [__name__]
    regex: "apiserver_request_duration_seconds_bucket"
    action: drop

Aggregate metrics through recording rules, combined with relabel drop

Take apiserver_request_duration_seconds_bucket as an example: what I actually need are a few high-level aggregated metrics, such as the availability of the API Server. These can be computed and stored through recording rules, for example:

groups:
  - interval: 3m
    name: kube-apiserver-availability.rules
    rules:
      - expr: >-
          avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) *
          24 * 30
        record: code_verb:apiserver_request_total:increase30d
      - expr: >-
          sum by (cluster, code, verb)
          (increase(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"2.."}[1h]))
        record: code_verb:apiserver_request_total:increase1h
      - expr: >-
          sum by (cluster, code, verb)
          (increase(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"5.."}[1h]))
        record: code_verb:apiserver_request_total:increase1h

After that, the original metric can be dropped at the remote_write (or similar) stage:

write_relabel_configs:
  - source_labels: [__name__]
    regex: "apiserver_request_duration_seconds_bucket"
    action: drop

💪💪💪
