Description of problem:
The alerting rule for AlertmanagerReceiversNotConfigured has a bug.
The alerting expression is `cluster:alertmanager_routing_enabled:max == 0`
`cluster:alertmanager_routing_enabled:max` is the recording rule `clamp_max(sum(alertmanager_notifications_total), 1)`
`alertmanager_notifications_total` is simply the number of notifications an alertmanager instance has sent since it started. In a newly started alertmanager instance, this counter is 0. Thus, until alertmanager has sent its first notification, the alert expression is true and AlertmanagerReceiversNotConfigured will fire.
Hilariously enough, AlertmanagerReceiversNotConfigured firing and triggering an alert increases alertmanager_notifications_total and resolves the alert.
The end result is that every time a customer upgrades or rolls out a new MachineConfig to workers (i.e. anything that causes all the alertmanager instances to restart), they will get this alert.
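The behaviour described above can be sketched in a few lines of Python. This is only a rough model, not the actual Prometheus evaluation: `clamp_max` here mimics the PromQL function for a single value, the per-instance counter values are made up for illustration, and the alert's 10m pending period is ignored.

```python
def clamp_max(value, maximum):
    """Mimic PromQL clamp_max for a single value: cap it at `maximum`."""
    return min(value, maximum)


def routing_enabled(notifications_per_instance):
    """Model of clamp_max(sum(alertmanager_notifications_total), 1)."""
    return clamp_max(sum(notifications_per_instance), 1)


def alert_condition(notifications_per_instance):
    """Model of cluster:alertmanager_routing_enabled:max == 0."""
    return routing_enabled(notifications_per_instance) == 0


# Hypothetical counters for three replicas that have been running for a while.
before_restart = [12, 7, 3]
# Every replica restarted, so every counter is back at zero.
after_restart = [0, 0, 0]
# The alert itself was delivered once, which increments one counter.
after_first_notification = [1, 0, 0]

print(alert_condition(before_restart))            # False
print(alert_condition(after_restart))             # True
print(alert_condition(after_first_notification))  # False
```

The middle case is the bug: the expression cannot tell "no receivers configured" apart from "nothing has been sent yet".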
Version-Release number of selected component (if applicable):
4.6.6
How reproducible:
Always
Steps to Reproduce:
1. Configure alertmanager receivers
2. oc scale statefulset alertmanager-main --replicas=0
3. CVO will override and scale alertmanager back up
4. After 10m, AlertmanagerReceiversNotConfigured will fire
Actual results:
AlertmanagerReceiversNotConfigured fires when receivers are configured
Expected results:
AlertmanagerReceiversNotConfigured should not fire when receivers are configured
Additional info:
Tested with 4.7.0-0.nightly-2020-12-09-112139, following the steps in comment 0. The AlertmanagerReceiversNotConfigured alert did not fire again. The recording rule in that build is:
- expr: clamp_max(sum(alertmanager_integrations), 1)
  record: cluster:alertmanager_routing_enabled:max
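To illustrate why summing `alertmanager_integrations` avoids the false alert, here is a small Python comparison of the old and new rule inputs right after a restart. It is a sketch under assumptions: `alertmanager_integrations` is taken to report the number of configured integrations per instance, and the sample values are invented.

```python
def clamp_max(value, maximum):
    """Mimic PromQL clamp_max for a single value: cap it at `maximum`."""
    return min(value, maximum)


def alert_condition(samples):
    """Model of `clamp_max(sum(<metric>), 1) == 0` for one metric's samples."""
    return clamp_max(sum(samples), 1) == 0


# Hypothetical state right after all three replicas restart with
# receivers configured: nothing sent yet, but integrations are loaded.
notifications_total = [0, 0, 0]  # old rule input
integrations = [2, 2, 2]         # new rule input (assumed values)

old_rule_fires = alert_condition(notifications_total)
new_rule_fires = alert_condition(integrations)

print(old_rule_fires)  # True: false alert after restart
print(new_rule_fires)  # False: configuration is visible immediately

# With no receivers configured at all, the new rule still fires.
print(alert_condition([0, 0, 0]))  # True
```

The new input reflects configuration state rather than send history, so it does not reset to a misleading value on restart.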
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5633