Bug 1734266 - Duplicate alerts in Alertmanager web console
Summary: Duplicate alerts in Alertmanager web console
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-07-30 06:07 UTC by Junqi Zhao
Modified: 2019-10-16 06:34 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:33:52 UTC
Target Upstream Version:


Attachments
Duplicate Watchdog alerts in Alertmanager web console (117.82 KB, image/png), 2019-07-30 06:07 UTC, Junqi Zhao
alertmanager pod logs (5.22 KB, text/plain), 2019-07-30 06:09 UTC, Junqi Zhao
another duplicate KubePodNotReady alert (89.78 KB, image/png), 2019-07-30 06:14 UTC, Junqi Zhao
AlertManager configuration (601 bytes, text/plain), 2019-08-07 08:16 UTC, Simon Pasquier
info for Comment 8 (106.39 KB, application/gzip), 2019-08-07 09:13 UTC, Junqi Zhao
response from /api/v2/alerts/groups (13.17 KB, text/plain), 2019-08-07 12:06 UTC, Junqi Zhao
duplicate alerts for "Not grouped" alerts (135.29 KB, image/png), 2019-09-03 09:12 UTC, Junqi Zhao


Links
Github openshift prometheus-alertmanager pull 27 (closed): Bug 1734266: Bump 0.19.0 (last updated 2020-10-08 12:37:10 UTC)
Github prometheus alertmanager issue 1875 (closed): duplicate alert groups returned by api (last updated 2020-10-08 12:37:01 UTC)
Github prometheus alertmanager pull 2012 (closed): Grouping label's expand button with grouping id (last updated 2020-10-08 12:37:01 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:34:02 UTC)

Description Junqi Zhao 2019-07-30 06:07:53 UTC
Created attachment 1594482 [details]
Duplicate Watchdog alerts in Alertmanager web console

Description of problem:
After the cluster has been running for a few hours, the API returns only one Watchdog alert, but the Alertmanager web console shows two Watchdog alerts; see the attached picture.


$ oc -n openshift-monitoring get pod -o wide | grep alertmanager
alertmanager-main-0                            3/3     Running   0          23h     10.131.0.10    zhsun3-gzhhs-worker-centralus3-rj4cp   <none>           <none>
alertmanager-main-1                            3/3     Running   0          23h     10.129.2.11    zhsun3-gzhhs-worker-centralus2-84kws   <none>           <none>
alertmanager-main-2                            3/3     Running   0          23h     10.128.2.10    zhsun3-gzhhs-worker-centralus1-njxsq   <none>           <none>

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -k -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v1/alerts | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1709  100  1709    0     0   5837      0 --:--:-- --:--:-- --:--:--  5852
{
    "data": [
        {
            "annotations": {
                "message": "In the last minute, rsyslog 10.128.3.79:24231 queue length increased more than 32. Current value is 1622.",
                "summary": "Rsyslog is overwhelmed"
            },
            "endsAt": "2019-07-30T05:52:59.265000656Z",
            "fingerprint": "446056d9ac32f75f",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=delta%28rsyslog_queue_size%5B1m%5D%29+%3E+32&g0.tab=1",
            "labels": {
                "alertname": "RsyslogQueueLengthBurst",
                "endpoint": "metrics",
                "instance": "10.128.3.79:24231",
                "job": "rsyslog",
                "namespace": "openshift-logging",
                "pod": "rsyslog-299q4",
                "prometheus": "openshift-monitoring/k8s",
                "queue": "main Q",
                "service": "rsyslog",
                "severity": "warning"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T05:49:29.265000656Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "unprocessed"
            }
        },
        {
            "annotations": {
                "message": "Pod default/recycler-for-pv-um5pl has been in a non-ready state for longer than an hour."
            },
            "endsAt": "2019-07-30T05:53:10.59085788Z",
            "fingerprint": "8e9b7daa0916e6db",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=sum+by%28namespace%2C+pod%29+%28kube_pod_status_phase%7Bjob%3D%22kube-state-metrics%22%2Cnamespace%3D~%22%28openshift-.%2A%7Ckube-.%2A%7Cdefault%7Clogging%29%22%2Cphase%3D~%22Failed%7CPending%7CUnknown%22%7D%29+%3E+0&g0.tab=1",
            "labels": {
                "alertname": "KubePodNotReady",
                "namespace": "default",
                "pod": "recycler-for-pv-um5pl",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "critical"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T02:49:40.59085788Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        },
        {
            "annotations": {
                "message": "14.285714285714285% of the rsyslog targets are down."
            },
            "endsAt": "2019-07-30T05:53:30.163677339Z",
            "fingerprint": "9993e10881e80f98",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=100+%2A+%28count+by%28job%29+%28up+%3D%3D+0%29+%2F+count+by%28job%29+%28up%29%29+%3E+10&g0.tab=1",
            "labels": {
                "alertname": "TargetDown",
                "job": "rsyslog",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "warning"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T05:47:30.163677339Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        },
        {
            "annotations": {
                "message": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"
            },
            "endsAt": "2019-07-30T05:53:30.163677339Z",
            "fingerprint": "e25963d69425c836",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=vector%281%29&g0.tab=1",
            "labels": {
                "alertname": "Watchdog",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "none"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-29T06:04:00.163677339Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        }
    ],
    "status": "success"
}


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-28-222114

How reproducible:
Wait a few hours

Steps to Reproduce:
1. See the description

Actual results:
Duplicate Watchdog alerts in Alertmanager web console

Expected results:
No duplicate alerts in Alertmanager web console

Additional info:

Comment 1 Junqi Zhao 2019-07-30 06:09:39 UTC
Created attachment 1594494 [details]
alertmanager pod logs

Comment 2 Junqi Zhao 2019-07-30 06:14:05 UTC
Created attachment 1594508 [details]
another duplicate KubePodNotReady alert

Comment 4 Andrew Pickering 2019-08-05 04:13:33 UTC
Looks like this Alertmanager UI issue: https://github.com/prometheus/alertmanager/issues/1875

Comment 8 Simon Pasquier 2019-08-07 08:16:40 UTC
Created attachment 1601263 [details]
AlertManager configuration

Though it looks similar to https://github.com/prometheus/alertmanager/issues/1875, I'm not convinced it is the same bug. The upstream issue occurs when an alert can match multiple groups in the routing tree, but the default AlertManager configuration pushed by CMO is very simple and an alert should only match one group (see attached file).
I've started a temporary cluster but I'm afraid it will be turned off before I can see the bug. To debug further, it would be great to have a copy of the AlertManager configuration and the response from the /api/v2/alerts/groups endpoint (which is used by the UI).
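
(A rough sketch of how that information could be gathered, following the same pattern as the /api/v1/alerts query in the description; the alertmanager-main secret name, the alertmanager.yaml key, and the pod IP/port are assumptions based on this environment, not confirmed here:)

# Dump the running Alertmanager configuration (assumed to be stored in the alertmanager-main secret under the alertmanager.yaml key):
$ oc -n openshift-monitoring get secret alertmanager-main -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Fetch the response from the v2 groups endpoint used by the UI:
$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -k -s -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v2/alerts/groups | python -mjson.tool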

Comment 9 Junqi Zhao 2019-08-07 09:13:00 UTC
Created attachment 1601280 [details]
info for Comment 8

Comment 13 Junqi Zhao 2019-08-07 12:06:38 UTC
Created attachment 1601353 [details]
response from /api/v2/alerts/groups

Comment 17 Junqi Zhao 2019-09-03 09:12:51 UTC
Created attachment 1611058 [details]
duplicate alerts for "Not grouped" alerts

If the alert is not grouped (i.e. it does not have a job label), there may be a duplicate alert; a query sketch for checking this against the groups API follows the result below.
e.g.:
************************************
alert: KubePodNotReady
expr: sum
  by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"})
  > 0
for: 15m
labels:
  severity: critical
annotations:
  message: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state
    for longer than 15 minutes.
************************************
Result for the expression:
sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"}) > 0

Element	Value
{namespace="openshift-machine-config-operator",pod="etcd-quorum-guard-7c7dc46d74-ntc95"}	1
{namespace="openshift-monitoring",pod="alertmanager-main-1"}	1

Comment 23 errata-xmlrpc 2019-10-16 06:33:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

