Prometheus Performance Tuning - What is the high cardinality problem and how to solve it?

This article was last updated on: July 24, 2024

Background

Recently I noticed that my experimental Prometheus instance had hit a performance bottleneck, and the following alerts fired frequently:

  • PrometheusMissingRuleEvaluations
  • PrometheusRuleFailures

After a lengthy investigation, I found that the cause was the high cardinality of some of Prometheus's series. This article is a comprehensive summary of the Prometheus high cardinality problem.

What is cardinality?

The basic definition of cardinality is the number of elements in a given set.

In Prometheus and the observability world, label cardinality is very important because it affects the performance and resource usage of your monitoring system.

The following figure clearly reflects the importance of cardinality:

Cardinality spike: a basic illustration of cardinality in Prometheus.

Simply put, cardinality is the number of distinct values a label takes. In the example above, the label status_code has a cardinality of 5 (1xx, 2xx, 3xx, 4xx, 5xx), the label environment has a cardinality of 2 (prod, dev), and the metric server_responses has an overall cardinality of 10 (5 × 2).
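
As a rough illustration of that arithmetic, the following Python sketch enumerates every label combination for the server_responses example. The label values come from the example; the variable names are only illustrative, and the product is an upper bound, since a series exists only for combinations that are actually observed.

```python
from itertools import product

# Label values taken from the server_responses example.
labels = {
    "status_code": ["1xx", "2xx", "3xx", "4xx", "5xx"],
    "environment": ["prod", "dev"],
}

# Each distinct combination of label values is one potential time series.
series = [dict(zip(labels, combo)) for combo in product(*labels.values())]

for name, values in labels.items():
    print(f"cardinality of {name}: {len(values)}")
print(f"series for server_responses (upper bound): {len(series)}")
```

Adding one more label multiplies the total by the number of values of that label, which is why a single badly chosen label can dominate the series count.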

How much is a high cardinality?

Generally speaking:

  • Low cardinality: a 1:5 label-to-value ratio
  • Standard cardinality: a 1:80 label-to-value ratio
  • High cardinality: a 1:10000 label-to-value ratio

Taking the example above again: if status_code holds detailed codes such as 200, 404…, its cardinality may reach the hundreds. If environment also has a few more values, the overall cardinality of the metric server_responses expands rapidly.

Typical case of high cardinality

That may still be too abstract, so here are two particularly typical examples:

  1. There is a metric called http_request_duration_seconds_bucket:
    1. It has an instance label, corresponding to 100 instances;
    2. It has an le label, corresponding to the histogram buckets; there are 10 buckets (for example 0.002, 0.004, 0.008 ... +Inf);
    3. It also has a url label, corresponding to the different URLs:
      1. Even at a small scale, there may be 400 URLs;
      2. A particularly scary pitfall here is that for large-scale systems, the number of URLs may be nearly infinite!!!
    4. It also has an http_method label, corresponding to 5 HTTP methods;
    5. In this case, the cardinality of the metric is:
      1. At a small scale: 100*10*400*5 = 2,000,000, that is, 2 million series 💀💀💀
      2. If the number of URLs is nearly infinite, the cardinality cannot be calculated at all 💥💥💥
  2. The other case is setting parameters with very large or even unbounded cardinality, such as user_id, session_id, or even latitude and longitude, as labels. That is a disaster 💥💥💥 for Prometheus.
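
The multiplication in the first example can be sketched in a few lines of Python. The per-label counts are the ones quoted in the example; the function name is an illustrative assumption, and the result is an upper bound rather than a measured series count.

```python
from math import prod

def max_series(label_cardinalities):
    """Upper bound on series count: the product of per-label cardinalities."""
    return prod(label_cardinalities.values())

# Numbers from the small-scale http_request_duration_seconds_bucket example.
histogram_labels = {"instance": 100, "le": 10, "url": 400, "http_method": 5}

with_url = max_series(histogram_labels)
without_url = max_series(
    {name: n for name, n in histogram_labels.items() if name != "url"}
)

print(f"with url label:    {with_url}")
print(f"without url label: {without_url}")
```

Removing the url label alone shrinks the bound by a factor of 400, which shows how much a single unbounded label contributes.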

The negative effects of high cardinality

When Prometheus has high cardinality, various problems arise:

  • The monitoring system becomes unstable or even crashes
    • Dashboards load slowly or fail to load
    • Monitoring queries are slow or fail
  • Compute and storage resources become expensive
  • Monitoring is flooded with noise
    • The SRE team has to wade through a massive amount of alert data, which delays root cause analysis and localization

📝Notes:

Cardinality corresponds to the number of metric series, so in this post "number of series" and "cardinality" are used interchangeably.

How to analyze a high cardinality problem?

There are several ways to analyze high cardinality problems:

  1. Use the Prometheus UI
  2. Use Prometheus PromQL
  3. Use the Prometheus API
  4. Use Grafana Mimirtool to analyze unused metrics

Use the Prometheus UI for analysis

Starting with Prometheus v2.14.0, the UI has a Head Cardinality Stats section built in. This greatly simplifies the analysis of high cardinality problems! 👍️👍️👍️

It is located at: Prometheus UI -> Status -> TSDB Status -> Head Cardinality Stats. Screenshots below:

📝Notes:

For a sense of the system scale behind the screenshots below: this is my experimental environment, with only four 1c2g nodes (1 CPU core and 2 GB of memory each).

Prometheus UI - Head Cardinality Stats

Prometheus UI - Head Cardinality Stats - 2

From the figures above, you can see at a glance:

  1. The label with the most values is url
  2. The metrics with the most series are:
    1. apiserver_request_duration_seconds_bucket 45524
    2. rest_client_rate_limiter_duration_seconds_bucket 36971
    3. rest_client_request_duration_seconds_bucket 10032
  3. The label with the most memory usage: url
  4. Grouped by label key-value pair, the pairs with the most series are (I had not found this item very useful before):
    1. endpoint=metrics 105406
    2. service=pushprox-k3s-server-client 101548
    3. job=k3s-server 101543
    4. namespace=cattle-monitoring-system 101120
    5. metrics_path=/metrics 91761
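
The same statistics can usually be read without the UI: Prometheus serves them over HTTP at /api/v1/status/tsdb. The Python sketch below assumes that endpoint is reachable and that the response contains a seriesCountByMetricName list of name/value entries, as in the hand-written sample. The base URL and the helper names are placeholders for illustration.

```python
import json
from urllib.request import urlopen

def fetch_tsdb_status(base_url):
    """Fetch head cardinality stats, e.g. base_url='http://localhost:9090'."""
    with urlopen(f"{base_url}/api/v1/status/tsdb") as resp:
        return json.load(resp)

def top_entries(status, field, n=3):
    """Return up to n (name, value) pairs from one stats list, largest first."""
    entries = status["data"].get(field, [])
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:n]]

# Hand-written sample in the shape of the API response, reusing the numbers
# from the screenshots, so the sketch runs without a live Prometheus.
sample = {
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "rest_client_request_duration_seconds_bucket", "value": 10032},
            {"name": "apiserver_request_duration_seconds_bucket", "value": 45524},
            {"name": "rest_client_rate_limiter_duration_seconds_bucket", "value": 36971},
        ]
    },
}

for name, value in top_entries(sample, "seriesCountByMetricName"):
    print(value, name)
```

Against a real server, `top_entries(fetch_tsdb_status(base_url), "seriesCountByMetricName")` would replace the sample.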

Analysis using Prometheus PromQL

If your Prometheus version is lower than v2.14.0, you need to analyze the problem via:

  • Prometheus PromQL
  • Prometheus API

Here are some useful PromQL queries:

topk(10, count by (__name__)({__name__=~".+"}))

The query result is the top 10 metrics with the most series, matching the list above.

Once you know the top 10, you can query further details. Because the cardinality is huge, range queries may always fail, so it is recommended to use instant queries for the details.
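
A minimal Python sketch of running such an instant query through the HTTP API follows. It assumes the standard /api/v1/query endpoint and the usual instant-vector response shape; the base URL, helper names, and the timestamp in the sample are placeholders, and the sample response is hand-written so the parsing part runs without a server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TOP_METRICS_QUERY = 'topk(10, count by (__name__)({__name__=~".+"}))'

def instant_query(base_url, query):
    """Run an instant query against /api/v1/query and return the parsed JSON."""
    params = urlencode({"query": query})
    with urlopen(f"{base_url}/api/v1/query?{params}") as resp:
        return json.load(resp)

def series_counts(response):
    """Turn an instant-vector response into (metric name, count) pairs."""
    rows = [
        (item["metric"].get("__name__", ""), int(float(item["value"][1])))
        for item in response["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hand-written sample response (placeholder timestamp).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "etcd_request_duration_seconds_bucket"},
             "value": [1661126400, "3660"]},
            {"metric": {"__name__": "rest_client_request_duration_seconds_bucket"},
             "value": [1661126400, "5786"]},
        ],
    },
}

for name, count in series_counts(sample):
    print(count, name)
```

With a live server, `series_counts(instant_query(base_url, TOP_METRICS_QUERY))` would produce the same kind of list.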

If you want to query the cardinality of a label, you can use PromQL as follows:

count(count by (label_name) (metric_name))

For example:

count(count by (url) (apiserver_request_duration_seconds_bucket))

There are also some other useful PromQL queries, listed below:

  • sum(scrape_series_added) by (job): analyze series growth by the job label
  • sum(scrape_samples_scraped) by (job): analyze the total number of scraped samples (roughly, series) by the job label
  • prometheus_tsdb_symbol_table_size_bytes

Use the Prometheus API for analysis

Because of the high cardinality problem, PromQL queries may often time out or fail. In that case, you can analyze the problem through the Prometheus API:

Analyze the number of series for each metric

# Find the ClusterIP of the Prometheus Service
kubectl get svc -n cattle-monitoring-system
# Replace with your own Prometheus address
export url=http://10.43.85.24:9090
export now=$(date +%s)
# For every metric name, run an instant count() query and print "<count> <metric>"
curl -s "$url/api/v1/label/__name__/values" \
  | jq -r ".data[]" \
  | while read -r metric; do
      count=$(curl -s \
        --data-urlencode 'query=count({__name__="'"$metric"'"})' \
        --data-urlencode "time=$now" \
        "$url/api/v1/query" \
        | jq -r ".data.result[0].value[1]")
      echo "$count $metric"
    done

The top results from my own experimental cluster are as follows (null may mean there is no current data, although the amount of historical data may still be large):

Number of active series Metric name
null apiserver_admission_webhook_rejection_count
null apiserver_registered_watchers
null apiserver_request_aborts_total
null apiserver_request_duration_seconds_bucket
null cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
null cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
null kube_pod_container_status_waiting_reason
null prometheus_target_scrape_pool_target_limit
null rest_client_rate_limiter_duration_seconds_bucket
5786 rest_client_request_duration_seconds_bucket
3660 etcd_request_duration_seconds_bucket
2938 rest_client_rate_limiter_duration_seconds_count
2938 rest_client_rate_limiter_duration_seconds_sum
2840 apiserver_response_sizes_bucket
1809 apiserver_watch_events_sizes_bucket

Get the active series for a specified metric

Using rest_client_request_duration_seconds_bucket as an example:

export metric=rest_client_request_duration_seconds_bucket
# Instant query for the metric; print the label set of each series on one line
curl -s \
  --data-urlencode "query=$metric" \
  --data-urlencode "time=$now" \
  "$url/api/v1/query" \
  | jq -c ".data.result[].metric"

The result is as follows (the main cause is that the url label has too many values):

Active series for the specified metric: too many url values
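
To see which label is responsible, the label sets printed by the command above can be tallied per label. The Python sketch below does that for a list of label dictionaries. The sample series and their url values are made up for illustration, and the function name is an assumption.

```python
from collections import defaultdict

def label_value_counts(series):
    """Count distinct values per label across a list of series label sets."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    # Highest cardinality first.
    return sorted(((len(v), name) for name, v in values.items()), reverse=True)

# Hypothetical label sets (the url values are invented for this sketch).
metric = "rest_client_request_duration_seconds_bucket"
series = [
    {"__name__": metric, "le": "0.1", "verb": "GET", "url": "https://example.invalid/a"},
    {"__name__": metric, "le": "0.1", "verb": "GET", "url": "https://example.invalid/b"},
    {"__name__": metric, "le": "0.5", "verb": "GET", "url": "https://example.invalid/c"},
]

for count, name in label_value_counts(series):
    print(count, name)
```

The label at the top of the output is the first candidate for dropping or aggregating away.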

Get a list of all metrics

curl -s "$url/api/v1/label/__name__/values" | jq -r ".data[]" | sort

Get a list of labels and their cardinality

# For every label name, count its distinct values and print "<count> <label>"
curl -s "$url/api/v1/labels" \
  | jq -r ".data[]" \
  | while read -r label; do
      count=$(curl -s "$url/api/v1/label/$label/values" \
        | jq -r ".data|length")
      echo "$count $label"
    done \
  | sort -n

The result is as follows (again, because the url label has too many values!):

Cardinality Label
2199 url
1706 __name__
854 name
729 id
729 path
657 filename
652 container_id
420 resource
407 le
351 secret
302 type
182 kind

Use Grafana Mimirtool to analyze unused metrics

📚️Reference:

Grafana Mimirtool | Grafana Mimir documentation

Grafana Mimir’s introduction can be found here: Intro to Grafana Mimir: The open source time series database that scales to 1 billion metrics & beyond | Grafana Labs

Mimir has a utility called mimirtool, which can identify unused metrics by comparing the metrics in Prometheus with those used by Alertmanager and Grafana. It can analyze the following inputs:

  • Grafana Dashboards for Grafana instances
  • Recording rules and alerting rules for Prometheus instances
  • Grafana Dashboard JSON files
  • Prometheus recording and alerting rules YAML files

I won’t go into detail here, but the full introduction is here: Analyzing and reducing metrics usage with Grafana Mimirtool | Grafana Cloud documentation

Solve high cardinality problems

High cardinality problems fall into several cases:

  1. Some labels are unreasonable, with a great many or even infinite values;
  2. Some metrics are unreasonable, with a great many series;
  3. Prometheus as a whole has too many series.

For the third case, the following two approaches can help:

High cardinality with highly available Prometheus

One high cardinality situation arises when Prometheus is deployed in HA mode and sends data via remote_write to VM (VictoriaMetrics), Mimir, or Thanos.

In this case, follow the official documentation for VM, Mimir, or Thanos and add external_labels so that this software can handle the high cardinality problem automatically, for example by deduplicating the series sent by each replica.

An example configuration is as follows:

Add the following external_labels:

  1. cluster
  2. __replica__ (the name Mimir expects by default; use whichever replica label your backend's documentation specifies)

Increase the scrape interval

Increase Prometheus's global scrape_interval (adjust this parameter globally; for the few jobs that really need a shorter interval, set it in the job-level configuration).

It is often set to scrape_interval: 15s.
It is recommended to increase it to scrape_interval: 1m or even larger.
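
As a back-of-the-envelope check of what this buys, the Python sketch below estimates the ingestion rate for the two intervals. It assumes every active series yields one sample per scrape and uses the 2,000,000-series figure from the histogram example earlier in the article; the function name is illustrative.

```python
def samples_per_second(active_series, scrape_interval_seconds):
    """Approximate ingestion rate: one sample per active series per scrape."""
    return active_series / scrape_interval_seconds

active_series = 2_000_000  # figure from the earlier histogram example

for interval in (15, 60):
    rate = samples_per_second(active_series, interval)
    print(f"scrape_interval={interval}s -> ~{rate:,.0f} samples/s")
```

Note that a longer interval reduces the volume of samples to ingest and store, not the number of series themselves.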

Filter and retain kubernetes-mixin metrics

Projects such as kubernetes-mixin, Prometheus Operator, and kube-prometheus provide the following out of the box:

  • scrape metrics
  • recording rules
  • alerting rules
  • Grafana Dashboards

In this case, you can use relabeling to keep only the metrics that the Grafana Dashboards and alerting rules actually use.

📚️Reference:

[Translation] Reduce Prometheus Metrics Usage with Relabel - Dongfeng Weiming Technology Blog (e-whisper.com)

Examples are as follows:

remoteWrite:
  - url: "<Your Metrics instance remote_write endpoint>"
    basicAuth:
      username:
        name: your_grafanacloud_secret
        key: your_grafanacloud_secret_username_key
      password:
        name: your_grafanacloud_secret
        key: your_grafanacloud_secret_password_key
    writeRelabelConfigs:
      - sourceLabels:
          - "__name__"
        regex: "apiserver_request_total|kubelet_node_config_error|kubelet_runtime_operations_errors_total|kubeproxy_network_programming_duration_seconds_bucket|container_cpu_usage_seconds_total|kube_statefulset_status_replicas|kube_statefulset_status_replicas_ready|node_namespace_pod_container:container_memory_swap|kubelet_runtime_operations_total|kube_statefulset_metadata_generation|node_cpu_seconds_total|kube_pod_container_resource_limits_cpu_cores|node_namespace_pod_container:container_memory_cache|kubelet_pleg_relist_duration_seconds_bucket|scheduler_binding_duration_seconds_bucket|container_network_transmit_bytes_total|kube_pod_container_resource_requests_memory_bytes|namespace_workload_pod:kube_pod_owner:relabel|kube_statefulset_status_observed_generation|process_resident_memory_bytes|container_network_receive_packets_dropped_total|kubelet_running_containers|kubelet_pod_worker_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|workqueue_queue_duration_seconds_bucket|container_network_transmit_packets_total|rest_client_request_duration_seconds_bucket|node_namespace_pod_container:container_memory_rss|container_cpu_cfs_throttled_periods_total|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes_used|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|kube_node_status_allocatable_memory_bytes|container_memory_cache|go_goroutines|kubelet_runtime_operations_duration_seconds_bucket|kube_statefulset_replicas|kube_pod_owner|rest_client_requests_total|container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|storage_operation_errors_total|scheduler_e2e_scheduling_duration_seconds_bucket|container_network_transmit_packets_dropped_total|kube_pod_container_resource_limits_memory_bytes|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate|storage_operation_duration_seconds_count|node_netstat_TcpExt_TCPSynRetrans|node_netstat_Tcp_OutSegs|container_cpu_cfs_periods_total|kubelet_pod_start_duration_seconds_count|kubeproxy_network_programming_duration_seconds_count|container_network_receive_bytes_total|node_netstat_Tcp_RetransSegs|up|storage_operation_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_volume_stats_available_bytes|scheduler_scheduling_algorithm_duration_seconds_bucket|kube_statefulset_status_replicas_current|code_resource:apiserver_request_total:rate5m|kube_statefulset_status_replicas_updated|process_cpu_seconds_total|kube_pod_container_resource_requests_cpu_cores|kubelet_pod_worker_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|container_memory_usage_bytes|workqueue_adds_total|container_network_receive_packets_total|container_memory_working_set_bytes|kube_resourcequota|kubelet_running_pods|kubelet_volume_stats_inodes|kubeproxy_sync_proxy_rules_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_count|apiserver_request:availability30d|container_memory_rss|kubelet_pleg_relist_interval_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_count|workqueue_depth|:node_memory_MemAvailable_bytes:sum|volume_manager_total_volumes|kube_node_status_allocatable_cpu_cores"
        action: "keep"

🐾Warning:

The configuration above may vary between versions; treat it as a reference and adapt it as appropriate.
Alternatively, use mimirtool as described above to analyze your setup and generate your own configuration.
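
If you maintain the list of used metrics yourself, the keep regex can be generated rather than hand-edited. The Python sketch below joins metric names into one alternation and checks which names would survive a keep rule. It relies on the fact that Prometheus anchors relabel regexes at both ends, which re.fullmatch approximates for plain alternations like this one. The helper names and the kubelet_running_pods_extra metric are made up for illustration.

```python
import re

def build_keep_regex(metric_names):
    """Join metric names into one alternation for a `keep` relabel rule."""
    return "|".join(sorted(set(metric_names)))

def is_kept(metric_name, keep_regex):
    # Relabel regexes are fully anchored, so a name must match the whole
    # pattern; a mere prefix or substring match is not enough.
    return re.fullmatch(keep_regex, metric_name) is not None

used = ["up", "container_memory_rss", "kubelet_running_pods"]
keep_regex = build_keep_regex(used)
print(keep_regex)

candidates = [
    "up",
    "kubelet_running_pods",
    "kubelet_running_pods_extra",  # hypothetical name, not a real metric
    "apiserver_request_duration_seconds_bucket",
]
for name in candidates:
    print(name, "->", "keep" if is_kept(name, keep_regex) else "drop")
```

Sorting and de-duplicating the names keeps the generated regex stable between runs, which makes configuration diffs easier to review.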

Reduce the use of Prometheus metrics with Relabel

A simple example is as follows:

write_relabel_configs:
  - source_labels: [__name__]
    regex: "apiserver_request_duration_seconds_bucket"
    action: drop

Aggregate metrics through recording rules and combine them with relabel drop

Take apiserver_request_duration_seconds_bucket as an example: what I need are higher-level metrics, such as the availability of the API Server. These can be computed and stored through recording rules, for example:

groups:
  - interval: 3m
    name: kube-apiserver-availability.rules
    rules:
      - expr: >-
          avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) *
          24 * 30
        record: code_verb:apiserver_request_total:increase30d
      - expr: >-
          sum by (cluster, code, verb)
          (increase(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"2.."}[1h]))
        record: code_verb:apiserver_request_total:increase1h
      - expr: >-
          sum by (cluster, code, verb)
          (increase(apiserver_request_total{job="apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"5.."}[1h]))
        record: code_verb:apiserver_request_total:increase1h
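
The first rule scales an average hourly increase up to 30 days (24 hours × 30 days = 720 hours). The Python sketch below mirrors that arithmetic on a plain list of numbers; the sample values are invented, and the function is only an illustration of the expression, not of how Prometheus evaluates it internally.

```python
def increase_30d(hourly_increases):
    """Mirror avg_over_time(increase1h[30d]) * 24 * 30 on a list of samples."""
    average_per_hour = sum(hourly_increases) / len(hourly_increases)
    return average_per_hour * 24 * 30

# Hypothetical samples of code_verb:apiserver_request_total:increase1h.
samples = [1200.0, 900.0, 1500.0]
print(increase_30d(samples))
```

Because the rule averages whatever samples exist in the window before scaling, the 30-day figure remains an estimate when some hourly samples are missing.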

After that, the original metric can be dropped at the remote_write stage (or a similar stage):

write_relabel_configs:
  - source_labels: [__name__]
    regex: "apiserver_request_duration_seconds_bucket"
    action: drop

💪💪💪



Prometheus Performance Tuning - What is the high cardinality problem and how to solve it?
https://e-whisper.com/posts/16339/
Author
east4ming
Posted on
August 22, 2022
Licensed under