Created attachment 1594482 [details]
Duplicate Watchdog alerts in Alertmanager web console

Description of problem:
Wait a few hours and check via the API: there is only one Watchdog alert, but two Watchdog alerts appear in the Alertmanager web console; see the attached picture.

$ oc -n openshift-monitoring get pod -o wide | grep alertmanager
alertmanager-main-0   3/3   Running   0   23h   10.131.0.10   zhsun3-gzhhs-worker-centralus3-rj4cp   <none>   <none>
alertmanager-main-1   3/3   Running   0   23h   10.129.2.11   zhsun3-gzhhs-worker-centralus2-84kws   <none>   <none>
alertmanager-main-2   3/3   Running   0   23h   10.128.2.10   zhsun3-gzhhs-worker-centralus1-njxsq   <none>   <none>

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- curl -k -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v1/alerts | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1709  100  1709    0     0   5837      0 --:--:-- --:--:-- --:--:--  5852
{
    "data": [
        {
            "annotations": {
                "message": "In the last minute, rsyslog 10.128.3.79:24231 queue length increased more than 32. Current value is 1622.",
                "summary": "Rsyslog is overwhelmed"
            },
            "endsAt": "2019-07-30T05:52:59.265000656Z",
            "fingerprint": "446056d9ac32f75f",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=delta%28rsyslog_queue_size%5B1m%5D%29+%3E+32&g0.tab=1",
            "labels": {
                "alertname": "RsyslogQueueLengthBurst",
                "endpoint": "metrics",
                "instance": "10.128.3.79:24231",
                "job": "rsyslog",
                "namespace": "openshift-logging",
                "pod": "rsyslog-299q4",
                "prometheus": "openshift-monitoring/k8s",
                "queue": "main Q",
                "service": "rsyslog",
                "severity": "warning"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T05:49:29.265000656Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "unprocessed"
            }
        },
        {
            "annotations": {
                "message": "Pod default/recycler-for-pv-um5pl has been in a non-ready state for longer than an hour."
            },
            "endsAt": "2019-07-30T05:53:10.59085788Z",
            "fingerprint": "8e9b7daa0916e6db",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=sum+by%28namespace%2C+pod%29+%28kube_pod_status_phase%7Bjob%3D%22kube-state-metrics%22%2Cnamespace%3D~%22%28openshift-.%2A%7Ckube-.%2A%7Cdefault%7Clogging%29%22%2Cphase%3D~%22Failed%7CPending%7CUnknown%22%7D%29+%3E+0&g0.tab=1",
            "labels": {
                "alertname": "KubePodNotReady",
                "namespace": "default",
                "pod": "recycler-for-pv-um5pl",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "critical"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T02:49:40.59085788Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        },
        {
            "annotations": {
                "message": "14.285714285714285% of the rsyslog targets are down."
            },
            "endsAt": "2019-07-30T05:53:30.163677339Z",
            "fingerprint": "9993e10881e80f98",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=100+%2A+%28count+by%28job%29+%28up+%3D%3D+0%29+%2F+count+by%28job%29+%28up%29%29+%3E+10&g0.tab=1",
            "labels": {
                "alertname": "TargetDown",
                "job": "rsyslog",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "warning"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-30T05:47:30.163677339Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        },
        {
            "annotations": {
                "message": "This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"
            },
            "endsAt": "2019-07-30T05:53:30.163677339Z",
            "fingerprint": "e25963d69425c836",
            "generatorURL": "https://prometheus-k8s-openshift-monitoring.apps.zhsun3.qe.azure.devcluster.openshift.com/graph?g0.expr=vector%281%29&g0.tab=1",
            "labels": {
                "alertname": "Watchdog",
                "prometheus": "openshift-monitoring/k8s",
                "severity": "none"
            },
            "receivers": [
                "null"
            ],
            "startsAt": "2019-07-29T06:04:00.163677339Z",
            "status": {
                "inhibitedBy": [],
                "silencedBy": [],
                "state": "active"
            }
        }
    ],
    "status": "success"
}

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-28-222114

How reproducible:
Wait a few hours

Steps to Reproduce:
1. See the description
2.
3.

Actual results:
Duplicate Watchdog alerts in Alertmanager web console

Expected results:
No duplicate alerts in Alertmanager web console

Additional info:
Created attachment 1594494 [details] alertmanager pod logs
Created attachment 1594508 [details] another duplicate KubePodNotReady alert
Looks like this Alertmanager UI issue: https://github.com/prometheus/alertmanager/issues/1875
Created attachment 1601263 [details]
AlertManager configuration

Though it looks similar to https://github.com/prometheus/alertmanager/issues/1875, I'm not convinced it is the same bug. The upstream issue occurs when an alert can match multiple groups in the routing tree, but the default AlertManager configuration pushed by CMO is very simple and an alert should only match one group (see the attached file).

I've started a temporary cluster, but I'm afraid it will be turned off before I can see the bug. To debug further, it would be great to have a copy of the AlertManager configuration and the response from the /api/v2/alerts/groups endpoint (which is used by the UI).
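The requested information can be pulled roughly like this. This is only a sketch: it assumes the default CMO deployment, where the configuration is stored in the alertmanager-main secret under the alertmanager.yaml key, and it reuses the token and pod from the description.

# AlertManager configuration as deployed (assumption: default secret name/key)
oc -n openshift-monitoring get secret alertmanager-main \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Response from the endpoint the UI uses
token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- \
  curl -s -k -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v2/alerts/groups \
  | python -mjson.tool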
Created attachment 1601280 [details] info for Comment 8
Created attachment 1601353 [details] response from /api/v2/alerts/groups
Created attachment 1611058 [details]
duplicate alerts for "Not grouped" alerts

If the alert is not grouped (i.e. it does not have a job label), there may be a duplicate alert, e.g.:

************************************
alert: KubePodNotReady
expr: sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"}) > 0
for: 15m
labels:
  severity: critical
annotations:
  message: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.
************************************

Result for the expression sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"}) > 0:

Element                                                                                     Value
{namespace="openshift-machine-config-operator",pod="etcd-quorum-guard-7c7dc46d74-ntc95"}    1
{namespace="openshift-monitoring",pod="alertmanager-main-1"}                                1
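One way to check this against the /api/v2/alerts/groups response (a sketch reusing the token and pod from the description) is to print every alert together with the labels of the group it belongs to. An alert whose fingerprint shows up under more than one set of group labels, or more than once under empty group labels (the "Not grouped" bucket in the UI), is what the console renders twice.

token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c alertmanager alertmanager-main-0 -- \
  curl -s -k -H "Authorization: Bearer $token" https://10.131.0.10:9095/api/v2/alerts/groups \
  | python -c '
import json, sys
for group in json.load(sys.stdin):
    # group["labels"] is {} for alerts that do not carry the grouping label(s)
    for alert in group["alerts"]:
        print("%s %s %s" % (alert["labels"]["alertname"], alert["fingerprint"], group["labels"]))
'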
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922