Bug 1734266
Summary: | Duplicate alerts in Alertmanager web console
---|---
Product: | OpenShift Container Platform
Reporter: | Junqi Zhao <juzhao>
Component: | Monitoring
Assignee: | Simon Pasquier <spasquie>
Status: | CLOSED ERRATA
QA Contact: | Junqi Zhao <juzhao>
Severity: | medium
Docs Contact: |
Priority: | medium
Version: | 4.2.0
CC: | alegrand, anpicker, erooth, mloibl, pkrupa, spasquie, surbania
Target Milestone: | ---
Keywords: | Regression, Reopened
Target Release: | 4.2.0
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: |
Doc Type: | If docs needed, set a value
Doc Text: |
Story Points: | ---
Clone Of: |
Environment: |
Last Closed: | 2019-10-16 06:33:52 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Attachments: |
Description
Junqi Zhao
2019-07-30 06:07:53 UTC
Created attachment 1594494 [details]
alertmanager pod logs
Created attachment 1594508 [details]
another duplicate KubePodNotReady alert
Looks like this Alertmanager UI issue: https://github.com/prometheus/alertmanager/issues/1875
Created attachment 1601263 [details]
AlertManager configuration
Though it looks similar to https://github.com/prometheus/alertmanager/issues/1875, I'm not convinced it is the same bug. The upstream issue occurs when an alert can match multiple groups in the routing tree. The default AlertManager configuration pushed by CMO is very simple and an alert should only match one group (see attached file).
I've started a temporary cluster but I'm afraid it will be turned off before I can see the bug. To debug further, it would be great to have a copy of the AlertManager configuration and the response to the /api/v2/alerts/groups endpoint (which is used by the UI).
Created attachment 1601280 [details]
info for Comment 8
Created attachment 1601353 [details]
response from /api/v2/alerts/groups
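To check a saved /api/v2/alerts/groups response for duplicates without going through the UI, something along these lines might help. It is only a sketch: it assumes the response is a JSON list of groups, each with "labels", "receiver" and "alerts" keys, and that two entries with identical label sets count as the same alert. The sample payload, the receiver name "example-receiver" and the alert "ExampleAlert" are made up for illustration and are not taken from the attachment.

```python
from collections import defaultdict


def find_duplicate_alerts(groups):
    """Map each alert label set that is listed more than once to where it appears."""
    seen = defaultdict(list)
    for group in groups:
        receiver = group.get("receiver", {}).get("name", "")
        group_labels = tuple(sorted(group.get("labels", {}).items()))
        for alert in group.get("alerts", []):
            # Identify an alert by its full, sorted label set.
            key = tuple(sorted(alert.get("labels", {}).items()))
            seen[key].append((receiver, group_labels))
    return {key: places for key, places in seen.items() if len(places) > 1}


# Hypothetical payload (assumed shape, invented values): the same alert is
# listed twice inside a group whose own label set is empty.
pod_alert = {
    "labels": {
        "alertname": "KubePodNotReady",
        "namespace": "openshift-monitoring",
        "pod": "alertmanager-main-1",
        "severity": "critical",
    }
}
sample_groups = [
    {
        "labels": {},
        "receiver": {"name": "example-receiver"},
        "alerts": [pod_alert, dict(pod_alert)],
    },
    {
        "labels": {"job": "kube-state-metrics"},
        "receiver": {"name": "example-receiver"},
        "alerts": [{"labels": {"alertname": "ExampleAlert", "job": "kube-state-metrics"}}],
    },
]

duplicates = find_duplicate_alerts(sample_groups)
for labels, places in duplicates.items():
    print(dict(labels), "is listed", len(places), "times")
```

For a real check, the attached response could be loaded with json.load() and passed to find_duplicate_alerts() in place of sample_groups.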
Created attachment 1611058 [details]
duplicate alerts for "Not grouped" alerts
If the alert is not grouped (it has no job label), a duplicate alert may appear.
e.g.:
************************************
alert: KubePodNotReady
expr: sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"}) > 0
for: 15m
labels:
  severity: critical
annotations:
  message: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.
************************************
result for expression
sum by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Failed|Pending|Unknown"}) > 0
Element Value
{namespace="openshift-machine-config-operator",pod="etcd-quorum-guard-7c7dc46d74-ntc95"} 1
{namespace="openshift-monitoring",pod="alertmanager-main-1"} 1
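The result above carries only the namespace and pod labels, because `sum by(namespace, pod)` drops every other label, including job. A rough Python sketch of why such an alert ends up "Not grouped" follows. It assumes the Alertmanager route groups by the job label only (this is inferred from the comment above, not read from the attached configuration), and the helper names sum_by, alert_labels and group_key are invented for this illustration.

```python
def sum_by(series, keep):
    """Mimic PromQL `sum by(...)`: keep only the listed labels and add up values."""
    out = {}
    for labels, value in series:
        key = tuple((name, labels[name]) for name in sorted(keep) if name in labels)
        out[key] = out.get(key, 0) + value
    return out


def alert_labels(result_labels, alertname, rule_labels):
    """Labels on the fired alert: result labels, then alertname, then rule labels."""
    labels = dict(result_labels)
    labels["alertname"] = alertname
    labels.update(rule_labels)
    return labels


def group_key(labels, group_by):
    """Subset of the alert labels that the route groups on."""
    return {name: labels[name] for name in group_by if name in labels}


# One input series shaped like kube_pod_status_phase (phase value assumed).
series = [
    (
        {
            "job": "kube-state-metrics",
            "namespace": "openshift-monitoring",
            "pod": "alertmanager-main-1",
            "phase": "Pending",
        },
        1,
    )
]

aggregated = {key: value for key, value in sum_by(series, ["namespace", "pod"]).items() if value > 0}
result_labels = next(iter(aggregated))
labels = alert_labels(result_labels, "KubePodNotReady", {"severity": "critical"})

print(labels)
print(group_key(labels, ["job"]))  # assumed group_by: ["job"]
```

With these inputs the alert has no job label, so its group key is empty and it lands in the group with no labels.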
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922