Prometheus Alertmanager: a summary of pitfalls encountered in production configuration

This article was last updated on: July 24, 2024

Brief introduction

Alertmanager handles alerts sent by client applications such as the Prometheus server. It is responsible for deduplicating, grouping, and routing them to the correct receiver integration, such as email, WeChat, or DingTalk. It also takes care of silencing, muting (timed send / do not send), and inhibition of alerts.

As an open source alerting application designed for Prometheus, Alertmanager already offers a rich, flexible, and customizable set of alerting features:

  • Deduplication: For example, in a high-availability Alertmanager deployment, the same alert is sent to all HA nodes at the same time and is deduplicated based on its hash.
  • Grouping: For example, alerts can be grouped by labels such as AlertName, Instance, and Job. Typically, when a large number of pods suddenly fire AlertName: InstanceDown alerts, grouping by AlertName means the user receives a single "xxx Pods InstanceDown" alert email, greatly reducing the volume the alert recipient has to handle.
  • Routing: Send alerts that match certain filter conditions to a specified receiver. For example: alerts matching job=db are routed to the DBA; alerts matching team=team-A are routed to team A's mail group…
  • Receiver: A specific notification channel plus recipient. For example: email channel + DBA mailbox; DingTalk channel + SRE contact.
  • Silencing: For example, block related alerts during an application release window.
  • Timed send / do not send (Mute): For example, during working hours (9-to-6, five days a week) alerts are sent through the mail channel; during non-working hours (nights, weekends, holidays) the normal channel is muted and alerts go only to on-call personnel through the on-call channel.
  • Inhibition: In a common scenario, once a higher-severity alert is firing, lower-severity alerts no longer need to be sent. For example: when the disk-space critical alert is already firing (usage above 90%), the warning level alert (usage above 80%) is suppressed. A minimal sketch of such an inhibit rule follows this list.
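
As a concrete illustration of the inhibition feature above, here is a minimal sketch of an inhibit_rules entry matching the disk-space example; the alert and label names (DiskSpaceUsage, severity, instance, mountpoint) are illustrative assumptions, not something prescribed by Alertmanager itself:

# Minimal sketch: suppress the disk-space "warning" alert while the matching
# "critical" alert is firing. Alert/label names here are illustrative.
inhibit_rules:
  - source_matchers:
      - severity = critical
      - alertname = DiskSpaceUsage
    target_matchers:
      - severity = warning
      - alertname = DiskSpaceUsage
    # Only inhibit when both alerts refer to the same instance and mountpoint
    equal: ['instance', 'mountpoint']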

Apart from lacking multi-tenancy, a decent UI, and alert-history statistics, Alertmanager is already a very powerful alerting application. 👍️👍️👍️

Pitfalls encountered in Alertmanager production configuration

Let's get straight to the point: the pitfalls that Alertmanager production configuration has run into.

📝Notes:

Everything below is based on the latest version of Alertmanager as of 2022-07-23: v0.24

Requirements

  1. The URL displayed to users looks like this: https://e-whisper.com/alertmanager/#/alerts
    1. It uses a domain name
    2. It has an /alertmanager/ prefix
  2. When the request reaches the Alertmanager component itself, the default form stays unchanged, e.g. https://10.0.0.1:9093/#/alerts

In other words: the path the reverse proxy forwards to Alertmanager differs from the one the user sees.

Implementation

Ingress-level implementation

This is implemented directly with Traefik. It has been written about before; see here:
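
The referenced article is not reproduced in this post. As a rough sketch of what the Traefik side could look like (Traefik v2 CRDs; resource names, namespace, and entrypoint are illustrative assumptions), an IngressRoute combined with a StripPrefix middleware strips the /alertmanager prefix before the request is forwarded to Alertmanager:

# Sketch only: strip /alertmanager at the proxy so Alertmanager sees plain /.
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: alertmanager-stripprefix
spec:
  stripPrefix:
    prefixes:
      - /alertmanager
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: alertmanager
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`e-whisper.com`) && PathPrefix(`/alertmanager`)
      kind: Rule
      middlewares:
        - name: alertmanager-stripprefix
      services:
        - name: alertmanager
          port: 9093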

AlertManager configuration

Alertmanager needs 2 static parameters configured here, which are added as alertmanager --<flags> in the Alertmanager StatefulSet.

By default, Prometheus (and Alertmanager) assume that the external URL (--web.external-url) is a prefix path that will appear in all requests sent to it. However, this is not always the case; the --web.route-prefix flag lets you control this more granularly.

With the following configuration, the /alertmanager/ prefix is stripped before the request is delivered to Alertmanager. In this case you need to specify that the URL users type into their browser is https://e-whisper.com/alertmanager/, while the prefix Alertmanager sees in its HTTP requests is not /alertmanager/ but simply /.

alertmanager \
--web.external-url https://e-whisper.com/alertmanager/ \
--web.route-prefix=/
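
In Kubernetes, these two flags end up in the Alertmanager container's args in the StatefulSet. A minimal sketch of that part of the pod spec (the image tag and the remaining args are placeholders, not a complete manifest):

# Sketch only: where the two flags live in the StatefulSet pod spec.
containers:
  - name: alertmanager
    image: quay.io/prometheus/alertmanager:v0.24.0
    args:
      - --config.file=/etc/alertmanager/alertmanager.yml
      - --web.external-url=https://e-whisper.com/alertmanager/
      - --web.route-prefix=/
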
A small pitfall

After the above configuration everything looked perfect, but when I checked the content of the alert emails I found a small pitfall:

  • The Sent by Alertmanager link at the bottom of the default mail template points to a URL without the trailing /, so clicking it does not navigate correctly.

(Figure: clicking Sent by Alertmanager fails to navigate)

This was fixed with the automatic trailing-/ feature of Ingress-Traefik; see another article:
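
That article is not reproduced here. One possible way to get the "automatically add /" behavior in Traefik is a redirectRegex middleware along these lines (the middleware name and the regex are illustrative assumptions):

# Sketch: redirect /alertmanager (no trailing slash) to /alertmanager/
# so that the "Sent by Alertmanager" link resolves. Names/regex are illustrative.
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: alertmanager-add-slash
spec:
  redirectRegex:
    regex: ^(https?://[^/]+/alertmanager)$
    replacement: ${1}/
    permanent: true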

The pitfall of alerts being automatically Resolved by default

If you haven't read the documentation carefully, use the default configuration, and Alertmanager receives alerts from monitoring software other than Prometheus, you will run into this situation: every 5 minutes, alerts that are still firing get automatically resolved!

This is because the default Alertmanager configuration contains a resolve_timeout parameter, whose default value is: resolve_timeout: 5m.

📚️Reference:

ResolveTimeout is the default value used by Alertmanager when an alert does not include EndsAt: if the alert has not been updated after this time, Alertmanager declares it resolved.
This parameter has no effect on alerts from Prometheus, as they always include EndsAt.

This is the cause of the automatic Resolved. When I first ran into it I thought: easy, I'll just disable this feature. It turns out the parameter cannot be disabled outright, so, "if I had to put a time limit on it, I'd want it to be ten thousand years" — let's just set 10000y.

The result was wrong… 😂😂😂

The documentation clearly says this <duration> is:

📚️Reference:

<duration>: a duration matching the regular expression ((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0), e.g. 1d, 1h30m, 5m, 10s

But setting 10000y, which satisfies (([0-9]+)y)?, still reported an error. 😂

Then I dug through the source code and finally found that this <duration> cannot be set to a three-digit number of years: 100y errors out, while 99y works fine.

So, if you want to disable Alertmanager's automatic Resolved behavior, just set it like this: resolve_timeout: 99y
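
In the configuration file that is simply:

global:
  # Effectively disables automatic Resolved for alerts without EndsAt (non-Prometheus sources)
  resolve_timeout: 99y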

The pitfall of unresolved alerts automatically re-sending notifications every 4h by default

No matter who you are, I imagine you would prefer to receive as few alert emails per day as possible, so you can follow up and resolve them one by one.
And yet Alertmanager "thoughtfully" provides the feature of automatically re-sending notifications for alerts that remain unresolved after 4h. 😂😂😂

This is because the default Alertmanager configuration contains a repeat_interval parameter, whose default value is: repeat_interval: 4h

As last time, I wanted to disable this function. The parameter cannot be disabled either (setting it to 0 does not disable it, it reports an error: repeat_interval cannot be zero), so once again, "if I had to put a time limit on it, I'd want it to be ten thousand years" — set 10000y and be done with it.

The result was wrong again (strictly speaking, a warning)… 😂😂😂

repeat_interval is greater than the data retention period. 
It can lead to notifications being repeated more often than expected.

📚️Reference:

repeat_interval is greater than the data retention period. This can cause notifications to repeat more often than expected.

In other words, setting repeat_interval too large may cause notifications to repeat more often than expected. If you want to set it as large as possible, do not make it larger than the data retention time.

So, if you want Alertmanager to re-send notifications for unresolved alerts as infrequently as possible, just set it like this: repeat_interval: <as large as possible, but not greater than the data retention time>

Setting the data retention period of Alertmanager

Continuing from the above: what is Alertmanager's default data retention period? And how do you adjust it?

I searched the documentation but couldn't find it 😂😂😂

Here is why I couldn't find it:

📚️Reference:

The Alertmanager is configured via command-line flags and a configuration file.
Command-line flags configure immutable system parameters, while the configuration file defines inhibition rules, notification routing, and notification receivers.

The documentation does not cover the command-line flags. So where can you find them? Run the alertmanager -h command; the result is as follows:

📝Notes:

Only part of the output is shown; there are many cluster-related flags that are omitted here.

$ alertmanager -h
usage: alertmanager [<flags>]

Flags:
...
--config.file="alertmanager.yml"
Alertmanager configuration file name.
--storage.path="data/" Base path for data storage.
--data.retention=120h How long to keep data for.
--alerts.gc-interval=30m Interval between alert GC.
--web.config.file="" [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
--web.external-url=WEB.EXTERNAL-URL
The URL under which Alertmanager is externally reachable (for example, if Alertmanager is served via a reverse proxy). Used for generating relative and absolute links back to Alertmanager itself.
If the URL has a path portion, it will be used to prefix all HTTP endpoints served by Alertmanager. If omitted, relevant URL components will be derived automatically.
--web.route-prefix=WEB.ROUTE-PREFIX
Prefix for the internal routes of web endpoints. Defaults to path of --web.external-url.
--web.listen-address=":9093"
Address to listen on for the web interface and API.
--web.get-concurrency=0 Maximum number of GET requests processed concurrently. If negative or zero, the limit is GOMAXPROC or 8, whichever is larger.
--web.timeout=0 Timeout for HTTP requests. If negative or zero, no timeout is set.
...
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn, error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]

You can see: --data.retention=120h — the default is 120h, i.e. 5 days.

That's too short, so I changed it to --data.retention=90d. Wrong again 😂😂😂, this time because of the format; it has to be written as: --data.retention=2160h

Correspondingly, the parameter above can be set to: repeat_interval: 90d (yes, you read that right, here it can be written as 90d… 😅😅😅)
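
Putting the two together, the pairing used in this article looks roughly like this (the flag only accepts plain hour-style durations, as found above, while the configuration file accepts 90d):

# Command-line flag:
#   --data.retention=2160h   # 90 days
# Configuration file:
route:
  repeat_interval: 90d       # keep this <= the data retention period (2160h)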

Complete production-practice Alertmanager configuration

Finally, here is a complete production-practice Alertmanager configuration for reference:

Immutable parameters (command-line flags)

  • '--storage.path=/alertmanager' (storage location; in production this directory needs persistent storage)
  • '--config.file=/etc/alertmanager/alertmanager.yml' (configuration file location, which in production can be stored and mounted via a ConfigMap)
  • '--cluster.advertise-address=[$(POD_IP)]:9094' (high-availability cluster port 9094)
  • '--cluster.listen-address=0.0.0.0:9094' (high-availability cluster port 9094)
  • --cluster.peer=alertmanager-0.alertmanager-headless:9094 (one peer of the high-availability cluster; a headless Service needs to be created, see the sketch after this list)
  • --cluster.peer=alertmanager-1.alertmanager-headless:9094 (another peer of the high-availability cluster)
  • --cluster.peer=alertmanager-2.alertmanager-headless:9094 (third peer of the high-availability cluster)
  • '--data.retention=2160h'
  • '--log.level=info' (log level; Dev/Test environments can be set to debug)
  • '--web.external-url=https://e-whisper.com/alertmanager/'
  • '--web.route-prefix=/'
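
A minimal sketch of the headless Service referenced by the --cluster.peer flags above (the Service name matches the alertmanager-N.alertmanager-headless peer addresses; the selector labels and port name are illustrative assumptions):

# Sketch: headless Service giving each replica a stable DNS name
# (alertmanager-<n>.alertmanager-headless) for --cluster.peer.
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-headless
spec:
  clusterIP: None
  selector:
    app: alertmanager          # must match the StatefulSet's pod labels
  ports:
    - name: cluster-tcp
      port: 9094
      targetPort: 9094

The StatefulSet also needs serviceName: alertmanager-headless for these per-pod DNS names to exist.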

Mutable parameters (parameters in the configuration file)

global:
  resolve_timeout: 99y
receivers:
  # jiralert plugin, which can send alerts to Jira
  - name: jiralert
    webhook_configs:
      - send_resolved: true
        url: http://jiralert:9097/alert
route:
  group_by:
    - alertname
  group_interval: 5m
  group_wait: 1m
  receiver: dev-receiver
  repeat_interval: 90d
templates:
  - /etc/alertmanager/*.tmpl

Specific route, receiver, inhibit_rules, and other configurations are not shown here.
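
For completeness, a sketch of what such a route tree might look like, reusing the job=db → DBA example from the introduction (receiver names and matchers are illustrative):

# Illustrative only: database alerts go to a DBA receiver,
# everything else falls through to the default receiver.
route:
  receiver: dev-receiver
  routes:
    - matchers:
        - job = db
      receiver: dba-mail
receivers:
  - name: dev-receiver
  - name: dba-mail
    # email_configs for the DBA mailbox would go here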

Above.

🎉🎉🎉

