Bug 1734266

Summary: Duplicate alerts in Alertmanager web console

Product: OpenShift Container Platform
Component: Monitoring
Version: 4.2.0
Target Release: 4.2.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Keywords: Regression, Reopened
Reporter: Junqi Zhao <juzhao>
Assignee: Simon Pasquier <spasquie>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, erooth, mloibl, pkrupa, spasquie, surbania
Type: Bug
Last Closed: 2019-10-16 06:33:52 UTC

Attachments (all flagged "none"):
- Duplicate Watchdog alerts in Alertmanager web console
- alertmanager pod logs
- another duplicate KubePodNotReady alert
- AlertManager configuration
- info for Comment 8
- response from /api/v2/alerts/groups
- duplicate alerts for "Not grouped" alerts

Description Junqi Zhao 2019-07-30 06:07:53 UTC
Created attachment 1594482 [details]
Duplicate Watchdog alerts in Alertmanager web console

Description of problem:
After the cluster has been running for a few hours, the Alertmanager API returns only one Watchdog alert, but the Alertmanager web console shows two Watchdog alerts (see the attached screenshot).


$ oc -n openshift-monitoring get pod -o wide | grep alertmanager
alertmanager-main-0                            3/3     Running   0          23h     10.131.0.10    zhsun3-gzhhs-worker-centralus3-rj4cp   <none>           <none>
alertmanager-main-1                            3/3     Running   0          23h     10.129.2.11    zhsun3-gzhhs-worker-centralus2-84kws   <none>           <none>
alertmanager-main-2                            3/3     Running   0          23h     10.128.2.10    zhsun3-gzhhs-worker-centralus1-njxsq   <none>           <none>

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -k -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v1/alerts | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1709  100  1709    0     0   5837      0 --:--:-- --:--:-- --:--:--  5852
{
    "data": [
        {
            "annotations": {
                "message": "In the last minute, rsyslog 10.128.3.79:24231 queue length increased more than 32. Current value is 1622.",
                "summary": "Rsyslog is overwhelmed"
            },
            "endsAt": "2019-07-30T05:52:59.265000656Z",
            "fingerprint": "446056d9ac32f75f",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=delta%28rsyslog_queue_size%5B1m%5D%29+%3E+32&g0.tab=1",
            "labels": {
                "alertname": "RsyslogQueueLengthBurst",
                "endpoint": "metrics",
                "instance": "10.128.3.79:24231",
                "job": "rsyslog",
                "namespace": "openshift-logging",
                "pod": "rsyslog-299q4",
                "prometheus": "openshift-monitoring/k8s",
                "queue": "main Q",
                "service": "rsyslog",
                "severity": "warning"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T05:49:29.265000656Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "unprocessed"
            }
        },
        {
            "annotations": {
                "message": "Pod default/recycler-for-pv-um5pl has been in a non-ready state for longer than an hour."
            },
            "endsAt": "2019-07-30T05:53:10.59085788Z",
            "fingerprint": "8e9b7daa0916e6db",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=sum+by%28namespace%2C+pod%29+%28kube_pod_status_phase%7Bjob%3D%22kube-state-metrics%22%2Cnamespace%3D~%22%28openshift-.%2A%7Ckube-.%2A%7Cdefault%7Clogging%29%22%2Cphase%3D~%22Failed%7CPending%7CUnknown%22%7D%29+%3E+0&g0.tab=1",
            "labels": {
                "alertname": "KubePodNotReady",
                "namespace": "default",
                "pod": "recycler-for-pv-um5pl",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "critical"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T02:49:40.59085788Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        },
        {
            "annotations": {
                "message": "14.285714285714285% of the rsyslog targets are down."
            },
            "endsAt": "2019-07-30T05:53:30.163677339Z",
            "fingerprint": "9993e10881e80f98",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=100+%2A+%28count+by%28job%29+%28up+%3D%3D+0%29+%2F+count+by%28job%29+%28up%29%29+%3E+10&g0.tab=1",
            "labels": {
                "alertname": "TargetDown",
                "job": "rsyslog",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "warning"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T05:47:30.163677339Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        },
        {
            "annotations": {
                "message": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"
            },
            "endsAt": "2019-07-30T05:53:30.163677339Z",
            "fingerprint": "e25963d69425c836",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=vector%281%29&g0.tab=1",
            "labels": {
                "alertname": "Watchdog",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "none"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-29T06:04:00.163677339Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        }
    ],
    "status": "success"
}


Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-28-222114

How reproducible:
Consistently, after the cluster has been running for a few hours.

Steps to Reproduce:
1. Let the cluster run for a few hours.
2. Query the Alertmanager API for firing alerts (see the Description).
3. Compare the API result with the alerts shown in the Alertmanager web console.

Actual results:
Duplicate Watchdog alerts in Alertmanager web console

Expected results:
No duplicate alerts in Alertmanager web console

Additional info:

Comment 1 Junqi Zhao 2019-07-30 06:09:39 UTC
Created attachment 1594494 [details]
alertmanager pod logs

Comment 2 Junqi Zhao 2019-07-30 06:14:05 UTC
Created attachment 1594508 [details]
another duplicate KubePodNotReady alert

Comment 4 Andrew Pickering 2019-08-05 04:13:33 UTC
Looks like this Alertmanager UI issue: https://github.com/prometheus/alertmanager/issues/1875

Comment 8 Simon Pasquier 2019-08-07 08:16:40 UTC
Created attachment 1601263 [details]
AlertManager configuration

Though it looks similar to https://github.com/prometheus/alertmanager/issues/1875, I'm not convinced it is the same bug. The upstream issue occurs when an alert can match multiple groups in the routing tree. The default AlertManager configuration pushed by CMO is very simple and an alert should only match one group (see the attached file).
I've started a temporary cluster, but I'm afraid it will be turned off before I can observe the bug. To debug further, it would be great to have a copy of the AlertManager configuration and the response from the /api/v2/alerts/groups endpoint (which is the endpoint the UI uses).
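
For reference, the grouped view that the UI renders should be retrievable the same way as the /api/v1/alerts query in the Description (a sketch, not part of the original report; the pod IP and prometheus-k8s token are reused from that earlier example and will differ on other clusters):

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -s -k -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v2/alerts/groups | python -mjson.tool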

Comment 9 Junqi Zhao 2019-08-07 09:13:00 UTC
Created attachment 1601280 [details]
info for Comment 8

Comment 13 Junqi Zhao 2019-08-07 12:06:38 UTC
Created attachment 1601353 [details]
response from /api/v2/alerts/groups

Comment 17 Junqi Zhao 2019-09-03 09:12:51 UTC
Created attachment 1611058 [details]
duplicate alerts for "Not grouped" alerts

If an alert is not grouped (i.e. it has no job label), a duplicate alert may appear.
For example:
************************************
alert: KubePodNotReady
expr: sum
  by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"})
  > 0
for: 15m
labels:
  severity: critical
annotations:
  message: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state
    for longer than 15 minutes.
************************************
Result of evaluating the expression
sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"}) > 0

Element	Value
{namespace="openshift-machine-config-operator",pod="etcd-quorum-guard-7c7dc46d74-ntc95"}	1
{namespace="openshift-monitoring",pod="alertmanager-main-1"}	1

Comment 23 errata-xmlrpc 2019-10-16 06:33:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922