Prometheus basic high-availability architecture
This article was last updated on: July 24, 2024
“Reprint” Prometheus’ basic high-availability architecture
As a metrics system, Prometheus itself must be stable and available. A single Prometheus server is a single point of failure, and as the monitoring scale grows, the volume of reads and writes slows its responses and makes it hard to scale, so a highly available Prometheus deployment becomes necessary. This article describes three common Prometheus HA architectures: simple HA, simple HA + remote storage, and simple HA + remote storage + federation.
Simple HA architecture
Simple HA runs multiple Prometheus instances with the same configuration that scrape the same targets, and places a load balancer such as Nginx in front of them to route queries to any instance. Even if one instance goes down, the Prometheus monitoring service remains available. Alertmanager is the alerting component of the Prometheus ecosystem; it runs as a separate service, and all Prometheus instances in the cluster send their alerts to the same Alertmanager. Simple HA is shown in the following figure.
This is the most basic form of HA: deploy several identical instances that do the same work, so the monitoring service as a whole has no single point of failure. However, even instances with identical configuration will hold inconsistent data. Every instance in simple HA uses the same scrape interval, but each one starts scraping at a different moment, and network latency varies, so the samples stored by different instances are never exactly the same. Stability is also a concern: lost data cannot be recovered, and the architecture is not flexible when Prometheus instances are frequently migrated or dynamically scaled. In addition, because there is no remote storage, it cannot hold large volumes of data or data over long periods. In short, this architecture has problems with consistency, storage disaster recovery, migration, dynamic scaling, and long-term storage, so it suits small-scale monitoring that does not need long-term or large amounts of data.
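The inconsistency between replicas can be illustrated with a minimal sketch. It assumes a steadily increasing counter and two replicas that share a 15-second interval but start 4 seconds apart; the function names and all numbers are hypothetical and only serve the illustration.

```python
def scrape_times(start_offset, interval, count):
    """Timestamps (seconds) at which one replica scrapes a target."""
    return [start_offset + i * interval for i in range(count)]


def observe_counter(times, rate_per_second):
    """Value of a steadily increasing counter seen at each scrape time."""
    return [t * rate_per_second for t in times]


# Same interval, different start time (assumed values).
replica_a = observe_counter(scrape_times(0, 15, 4), rate_per_second=2)
replica_b = observe_counter(scrape_times(4, 15, 4), rate_per_second=2)

for a, b in zip(replica_a, replica_b):
    # Each pair is the "same" sample as stored by the two replicas.
    print(a, b, b - a)
```

Both replicas are correct, yet every stored value differs, so a query answered by one replica can differ slightly from the same query answered by the other.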
Simple HA + remote storage
This architecture adds remote storage on top of simple HA. Prometheus uses its built-in remote write and remote read interfaces to connect to third-party storage such as InfluxDB or OpenTSDB and writes its data to that remote store. When querying, Prometheus merges the data read from remote storage with its local data before processing it. This addresses the data consistency, persistence, migration, and scalability problems of the simple HA architecture. The following figure shows simple HA + remote storage.
However, the memory Prometheus needs can spike to 3-4 times its previous level after remote storage is turned on. A Prometheus committer considers a memory overhead of 25%-35% relatively normal, and some people recommend lowering the maximum number of shards (max_shards in the remote write queue_config) to 50 to reduce the memory footprint. The reason is that data is first written to the WAL and remote write sends it from there; the WAL generally holds about 2 hours of data, so many parallel senders (shards) may be started, and their number needs to be capped.
This architecture can meet the monitoring needs of many small and medium-sized enterprises. Short-term data (for example, the last 15 days) can be retrieved locally, long-term data can be read from remote storage, and because the data also lives in remote storage, it can be recovered quickly after a Prometheus migration or a restart following downtime.
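The split between short-term local data and long-term remote data can be sketched as follows. This is a conceptual illustration only, not how Prometheus implements remote read; the function name `backends_for_range` and the 15-day retention constant are assumptions taken from the example above.

```python
from datetime import datetime, timedelta, timezone

# Assumed local retention; the article's example is 15 days.
LOCAL_RETENTION = timedelta(days=15)


def backends_for_range(start, end, now, local_retention=LOCAL_RETENTION):
    """Return which stores a query over [start, end] has to touch."""
    oldest_local = now - local_retention
    backends = []
    if end >= oldest_local:
        backends.append("local")   # part of the range is still on local disk
    if start < oldest_local:
        backends.append("remote")  # part of the range is older than retention
    return backends


now = datetime(2024, 7, 24, tzinfo=timezone.utc)
print(backends_for_range(now - timedelta(days=7), now, now))
print(backends_for_range(now - timedelta(days=90), now, now))
print(backends_for_range(now - timedelta(days=90), now - timedelta(days=30), now))
```

A query over the last week touches only local storage, a 90-day query needs both stores, and a query entirely older than the retention window depends on remote storage alone.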
However, this architecture still has problems. At large monitoring scale, the limited number of Prometheus servers becomes an obvious bottleneck for collection, and the sheer volume of data is also a major challenge for remote storage. Running with remote storage consumes more memory and CPU than local storage alone; you can reduce the number of labels and lengthen the scrape interval to lower resource consumption, or raise the resource limits.
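A rough way to see why fewer series or a longer scrape interval lowers the load: the number of samples ingested (and forwarded to remote storage) per second is approximately the number of active series divided by the scrape interval. The sketch below assumes a single uniform interval, and the figures are hypothetical.

```python
def samples_per_second(active_series, scrape_interval_s):
    """Approximate ingestion rate for a uniform scrape interval."""
    return active_series / scrape_interval_s


baseline = samples_per_second(600_000, 15)         # assumed starting point
fewer_series = samples_per_second(300_000, 15)     # half the label combinations
longer_interval = samples_per_second(600_000, 30)  # scrape half as often

print(baseline, fewer_series, longer_interval)
```

Halving the series count and doubling the interval have the same effect on throughput here, although only reducing series also reduces the number of series held in memory.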
Simple HA + remote storage + federated cluster
This architecture extends the previous one, and the main problem it solves is the collection bottleneck of a single Prometheus. A federated cluster splits the collection work across different Prometheus instances in a divide-and-conquer fashion, giving functional partitioning that makes horizontal scaling easier. The following figure shows simple HA + remote storage + federated cluster.
Compared with a single Prometheus, this federated architecture greatly increases collection and storage capacity. As the figure shows, the lower-level Prometheus instances can collect in different regions and machine rooms, while the upper-level Prometheus instances act as federation nodes that periodically pull data from the lower-level nodes and aggregate it. Running multiple federation nodes keeps the upper layer available. Note that latency-sensitive alerts should not be triggered from the Global node (the upper-level node, which is only responsible for collecting and storing aggregated data): the stability of the link from the shard nodes to the Global node affects how quickly data arrives, which makes alerts less timely. For example, alerts on service up/down status and API request errors should be placed on the shard nodes.
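A federation node pulls data from a lower-level node through that node's `/federate` endpoint, passing one or more `match[]` series selectors. The sketch below only builds such a URL; the host name `prometheus-shard-1` and the selectors are assumptions for illustration.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical shard address and selectors.
url = federate_url("http://prometheus-shard-1:9090", ['{job="node"}', "up"])
print(url)
```

Keeping the selectors narrow (for example, only aggregated series) limits how much data each federation scrape has to transfer.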
This scheme mainly solves the collection bottleneck of a single Prometheus. It reduces the collection pressure on each instance, aggregates the key data on the federation nodes to relieve local storage, and also helps avoid single points of failure.
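Functional partitioning can be sketched as grouping targets by a label and giving each group to its own lower-level Prometheus. The `datacenter` label, the addresses, and the function name below are hypothetical; a real deployment would express the same split in each instance's scrape configuration.

```python
from collections import defaultdict


def partition_by_label(targets, label):
    """Group target addresses by the value of one label."""
    groups = defaultdict(list)
    for target in targets:
        groups[target["labels"][label]].append(target["address"])
    return dict(groups)


# Hypothetical targets in two machine rooms.
targets = [
    {"address": "10.0.1.5:9100", "labels": {"datacenter": "dc-east"}},
    {"address": "10.0.1.6:9100", "labels": {"datacenter": "dc-east"}},
    {"address": "10.1.2.7:9100", "labels": {"datacenter": "dc-west"}},
]

assignment = partition_by_label(targets, "datacenter")
print(assignment)
```

Each key of `assignment` corresponds to one lower-level Prometheus, which scrapes only the addresses listed under it.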
However, this architecture also has problems. The main ones are:
- Each cluster or department deploys its own set of Prometheus instances, so there is no unified global view when browsing the data of each cluster through visualization tools such as Grafana.
- The configuration is more complex: the work of the lower-level Prometheus layer has to be split, with different collection targets assigned to each lower-level instance.
- A fairly complete data backup scheme and historical data storage scheme are needed to keep the monitoring storage highly available.
- There is no ability to downsample historical data.
- Facing massive volumes of monitoring data, a series of further optimizations is still required.
- Data consistency and accuracy may decrease. A lower-level node scrapes its targets at the configured interval, and the upper-level node in turn scrapes the lower-level nodes, so data reaches the upper-level node late, which causes data skew and delayed alerts.
- As a rule of thumb, federation adds a memory overhead of about 5% on the nodes being collected from; actual resource usage needs to be evaluated in practice.
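The delay described in the list above can be estimated with simple arithmetic: in the worst case a change on a target waits one full scrape interval before the lower-level node sees it, then one full federation interval before the upper-level node sees it. The sketch assumes 15-second and 60-second intervals and ignores network latency, rule evaluation intervals, and alert `for` durations.

```python
def worst_case_delay(shard_interval_s, federation_interval_s=0):
    """Upper bound (seconds) before a target change is visible on a node."""
    return shard_interval_s + federation_interval_s


on_shard = worst_case_delay(15)        # alert evaluated on the shard node
on_global = worst_case_delay(15, 60)   # alert evaluated on the upper-level node

print(on_shard, on_global)
```

With these assumed intervals, the same condition can become visible up to a minute later on the upper-level node than on the shard node.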
With Prometheus federation, Prometheus can also monitor Prometheus, but two principles should be followed:
- Mesh mode: each Prometheus in the same data center monitors the other Prometheus instances.
- Hierarchical (top-down) mode: the upper-level Prometheus monitors the data-center-level Prometheus instances.
In addition, to avoid a single point of failure in the lower-level Prometheus layer, you can deploy multiple Prometheus nodes there, but efficiency is much worse: every monitored target is scraped more than once and the data is stored more than once.
In an active/standby federated cluster, keepalived is usually used to switch between primary and standby. The primary node pulls data from each collection-layer Prometheus, and the standby node does not pull any data. If the primary node goes down, keepalived automatically moves the VIP to the standby node, the target-related scrape settings are removed from the primary node, and the standby node is started with those settings to take over collection.
The simple HA + remote storage + federated cluster solution suits medium and large enterprises, especially for data collection tasks in a single data center or across multiple data centers. By comparison, Thanos may be a better choice.
📓 Reference documentation:
Introduction to the working principle and components of Thanos
Monitoring cluster optimization
Choosing the deployment architecture that best fits the scenario is optimization from the outside; optimizing from the inside makes sure the system itself also runs at its best.
- Remove high-dimensional data as early as possible. In metric monitoring systems such as Prometheus there is a very important concept, cardinality, which is the number of possible values a label can take. Normally, the cardinality for a single instance should be around 10. High-dimensional data means a metric with many labels or many label values. Each new label value creates a new time series in storage, so a metric with too many label values takes up a lot of storage space and also consumes excessive resources during aggregation. Values such as email addresses, user addresses, and IP addresses are therefore not suitable as labels. Use alerting rules to find bad high-dimensional metrics, then drop the ones whose cardinality is too high in your scrape configuration:
# Count the time series of each metric and alert when a metric exceeds 10000
count by (__name__)({__name__=~".+"}) > 10000
- For each monitored system, focus on the 20 or so metrics you need most. It is best not to collect everything; go through the metrics the system exposes and pick out the core ones that help you most. Reducing the amount of metric data Prometheus collects is the best optimization for a Prometheus cluster deployment.
- Use PromQL properly and correctly. For each metric type, the relevant functions must be applied in the right order; for example, apply rate() to a counter before aggregating with sum().
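The cardinality check from the first item can be mirrored outside PromQL to show how one unbounded label inflates the series count. The sketch below counts series per metric name, as `count by (__name__)` does; the metric names, the `user_email` label, and the limit of 100 are hypothetical.

```python
from collections import Counter


def series_per_metric(series):
    """Count time series per metric name; each item is one series' label set."""
    return Counter(labels["__name__"] for labels in series)


def metrics_over_limit(series, limit):
    """Metric names whose series count exceeds the limit."""
    return {name: n for name, n in series_per_metric(series).items() if n > limit}


# Hypothetical data: 50 users x 3 status codes = 150 series for one metric.
series = [
    {"__name__": "http_requests_total", "status": status,
     "user_email": f"user{i}@example.com"}
    for i in range(50)
    for status in ("200", "404", "500")
]
series += [{"__name__": "up", "instance": f"node{i}:9100"} for i in range(2)]

print(metrics_over_limit(series, limit=100))
```

Without the `user_email` label the first metric would have only 3 series, one per status code, which is why such values are better kept out of labels.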
Original author
👉️URL: https://mp.weixin.qq.com/s/hjx3GOtDfwVmAOz4qTeSFw
Author/Zhao Yihan