Created attachment 1594482 [details]
Duplicate Watchdog alerts in Alertmanager web console

Description of problem:
Wait a few hours and check via the API: there is only one Watchdog alert, but two Watchdog alerts appear in the Alertmanager web console; see the attached picture.

$ oc -n openshift-monitoring get pod -o wide | grep alertmanager
alertmanager-main-0   3/3   Running   0   23h   10.131.0.10   zhsun3-gzhhs-worker-centralus3-rj4cp   <none>   <none>
alertmanager-main-1   3/3   Running   0   23h   10.129.2.11   zhsun3-gzhhs-worker-centralus2-84kws   <none>   <none>
alertmanager-main-2   3/3   Running   0   23h   10.128.2.10   zhsun3-gzhhs-worker-centralus1-njxsq   <none>   <none>

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -k -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v1/alerts | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1709  100  1709    0     0   5837      0 --:--:-- --:--:-- --:--:--  5852
{
    "data": [
        {
            "annotations": {
                "message": "In the last minute, rsyslog 10.128.3.79:24231 queue length increased more than 32. Current value is 1622.",
                "summary": "Rsyslog is overwhelmed"
            },
            "endsAt": "2019-07-30T05:52:59.265000656Z",
            "fingerprint": "446056d9ac32f75f",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=delta%28rsyslog_queue_size%5B1m%5D%29+%3E+32&g0.tab=1",
            "labels": {
                "alertname": "RsyslogQueueLengthBurst",
                "endpoint": "metrics",
                "instance": "10.128.3.79:24231",
                "job": "rsyslog",
                "namespace": "openshift-logging",
                "pod": "rsyslog-299q4",
                "prometheus": "openshift-monitoring/k8s",
                "queue": "main Q",
                "service": "rsyslog",
                "severity": "warning"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T05:49:29.265000656Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "unprocessed"
            }
        },
        {
            "annotations": {
                "message": "Pod default/recycler-for-pv-um5pl has been in a non-ready state for longer than an hour."
            },
            "endsAt": "2019-07-30T05:53:10.59085788Z",
            "fingerprint": "8e9b7daa0916e6db",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=sum+by%28namespace%2C+pod%29+%28kube_pod_status_phase%7Bjob%3D%22kube-state-metrics%22%2Cnamespace%3D~%22%28openshift-.%2A%7Ckube-.%2A%7Cdefault%7Clogging%29%22%2Cphase%3D~%22Failed%7CPending%7CUnknown%22%7D%29+%3E+0&g0.tab=1",
            "labels": {
                "alertname": "KubePodNotReady",
                "namespace": "default",
                "pod": "recycler-for-pv-um5pl",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "critical"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T02:49:40.59085788Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        },
        {
            "annotations": {
                "message": "14.285714285714285% of the rsyslog targets are down."
            },
            "endsAt": "2019-07-30T05:53:30.163677339Z",
            "fingerprint": "9993e10881e80f98",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=100+%2A+%28count+by%28job%29+%28up+%3D%3D+0%29+%2F+count+by%28job%29+%28up%29%29+%3E+10&g0.tab=1",
            "labels": {
                "alertname": "TargetDown",
                "job": "rsyslog",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "warning"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T05:47:30.163677339Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        },
        {
            "annotations": {
                "message": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"
            },
            "endsAt": "2019-07-30T05:53:30.163677339Z",
            "fingerprint": "e25963d69425c836",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=vector%281%29&g0.tab=1",
            "labels": {
                "alertname": "Watchdog",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "none"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-29T06:04:00.163677339Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        }
    ],
    "status": "success"
}

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-28-222114

How reproducible:
Wait a few hours

Steps to Reproduce:
1. See the description
2.
3.

Actual results:
Duplicate Watchdog alerts in Alertmanager web console

Expected results:
No duplicate alerts in Alertmanager web console

Additional info:
Created attachment 1594494 [details] alertmanager pod logs
Created attachment 1594508 [details] another duplicate KubePodNotReady alert
Looks like this Alertmanager UI issue: https://github.com/prometheus/alertmanager/issues/1875
Created attachment 1601263 [details]
AlertManager configuration

Though it looks similar to https://github.com/prometheus/alertmanager/issues/1875, I'm not convinced it is the same bug. The upstream issue occurs when an alert can match multiple groups in the routing tree, but the default AlertManager configuration pushed by CMO is very simple and an alert should only match one group (see the attached file).

I've started a temporary cluster, but I'm afraid it will be turned off before I can see the bug. To debug further, it would be great to have a copy of the AlertManager configuration and the response from the /api/v2/alerts/groups endpoint (which is used by the UI).
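The requested information can be pulled roughly like this. This is only a sketch: it assumes the default CMO deployment, where the configuration is stored in the alertmanager-main secret under the alertmanager.yaml key, and it reuses the token and pod from the description.

# AlertManager configuration as deployed (assumption: default secret name/key)
oc -n openshift-monitoring get secret alertmanager-main \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Response from the endpoint the UI uses
token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- \
  curl -s -k -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v2/alerts/groups \
  | python -mjson.tool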
Created attachment 1601280 [details] info for Comment 8
Created attachment 1601353 [details] response from /api/v2/alerts/groups
Created attachment 1611058 [details]
duplicate alerts for "Not grouped" alerts

If the alert is not grouped (i.e. it does not have a job label), there may be a duplicate alert, e.g.:

************************************
alert: KubePodNotReady
expr: sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"}) > 0
for: 15m
labels:
  severity: critical
annotations:
  message: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.
************************************

Result for the expression sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"}) > 0:

Element                                                                                     Value
{namespace="openshift-machine-config-operator",pod="etcd-quorum-guard-7c7dc46d74-ntc95"}    1
{namespace="openshift-monitoring",pod="alertmanager-main-1"}                                1
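One way to check this against the /api/v2/alerts/groups response (a sketch reusing the token and pod from the description) is to print every alert together with the labels of the group it belongs to. An alert whose fingerprint shows up under more than one set of group labels, or more than once under empty group labels (the "Not grouped" bucket in the UI), is what the console renders twice.

token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- \
  curl -s -k -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v2/alerts/groups \
  | python -c '
import json, sys
for group in json.load(sys.stdin):
    # group["labels"] is {} for alerts that do not carry the grouping label(s)
    for alert in group["alerts"]:
        print("%s %s %s" % (alert["labels"]["alertname"], alert["fingerprint"], group["labels"]))
'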
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922