Bug 1936919

Summary: AlertmanagerMembersInconsistent fires too quickly, causing serial-test noise
Product: OpenShift Container Platform
Component: Monitoring
Reporter: ravig <rgudimet>
Assignee: Filip Petkovski <fpetkovs>
QA Contact: Junqi Zhao <juzhao>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Version: 4.8
Target Release: 4.9.0
CC: amcdermo, anpicker, aos-bugs, ccoleman, dgrisonn, erooth, lcosic, spasquie, wking
Doc Type: No Doc Update
Type: Bug
Bug Blocks: 1976765
Environment: [Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed: 2021-10-18 17:29:21 UTC
Attachments: alertmanager_cluster_peer_info metric

Description ravig 2021-03-09 13:40:44 UTC
Description of problem:

[Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

is failing with the error:

s: "query failed: \nsort_desc((\ncount_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\",severity!=\"info\"}[51m3s:1s]) unless\ncount_over_time(ALERTS{alertname=\"AggregatedAPIDown\",name=\"v1alpha1.wardle.example.com\",alertstate=\"firing\"}[51m3s:1s])\n) > 0): promQL query: \nsort_desc((\ncount_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\",severity!=\"info\"}[51m3s:1s]) unless\ncount_over_time(ALERTS{alertname=\"AggregatedAPIDown\",name=\"v1alpha1.wardle.example.com\",alertstate=\"firing\"}[51m3s:1s])\n) > 0) had reported incorrect results:\n[{\"metric\":{\"alertname\":\"AlertmanagerMembersInconsistent\",\"alertstate\":\"firing\",\"container\":\"alertmanager-proxy\",\"endpoint\":\"web\",\"instance\":\"10.129.2.6:9095\",\"job\":\"alertmanager-main\",\"namespace\":\"openshift-monitoring\",\"pod\":\"alertmanager-main-0\",\"service\":\"alertmanager-main\",\"severity\":\"critical\"},\"value\":[1615290631.212,\"90\"]},{\"metric\":{\"alertname\":\"AlertmanagerMembersInconsistent\",\"alertstate\":\"firing\",\"container\":\"alertmanager-proxy\",\"endpoint\":\"web\",\"instance\":\"10.131.0.22:9095\",\"job\":\"alertmanager-main\",\"namespace\":\"openshift-monitoring\",\"pod\":\"alertmanager-main-1\",\"service\":\"alertmanager-main\",\"severity\":\"critical\"},\"value\":[1615290631.212,\"90\"]}]",
        },


Version-Release number of selected component (if applicable):


How reproducible:

pull-ci-openshift-origin-master-e2e-aws-serial (all) - 42 runs, 48% failed, 10% of failures match = 5% impact




Steps to Reproduce:
1.
2.
3.

Actual results:
Alert firing.

Expected results:

Alert not firing.

Additional info:

Comment 1 Damien Grisonnet 2021-03-09 14:13:53 UTC
From the alertmanager oauth-proxy logs, AlertmanagerMembersInconsistent looks like a DNS issue inside the cluster, so it might be worth sending the bug over to the network team:
```
level=warn ts=2021-03-09T11:00:50.024Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-main-0.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
```
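
The record that fails to resolve is the per-pod DNS entry of the alertmanager-operated headless Service. As a rough illustration only (a hypothetical sketch, not code from this bug), the same lookup can be reproduced from a pod inside the cluster with a plain Go resolution against cluster DNS:
```
package main

import (
	"fmt"
	"net"
)

func main() {
	// Resolve the per-pod record of the headless Service, as the
	// Alertmanager cluster component does when refreshing its peers.
	host := "alertmanager-main-0.alertmanager-operated"
	addrs, err := net.LookupHost(host)
	if err != nil {
		// This corresponds to the "no such host" failure in the log above.
		fmt.Printf("lookup %s failed: %v\n", host, err)
		return
	}
	fmt.Printf("%s resolves to %v\n", host, addrs)
}
```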

As for the other alerts:
AggregatedAPIDown: https://bugzilla.redhat.com/show_bug.cgi?id=1933144
ThanosSidecarUnhealthy: https://bugzilla.redhat.com/show_bug.cgi?id=1921335

Comment 2 Damien Grisonnet 2021-03-10 12:17:26 UTC
Example job where the network error that triggers the AlertmanagerMembersInconsistent alert occurs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25030/pull-ci-openshift-origin-master-e2e-aws-serial/1369516609869189120

Comment 4 Clayton Coleman 2021-03-16 14:45:52 UTC
This is also happening briefly during upgrades (I'm going to list this bug in the new "no alerts during upgrade" test; if we believe the serial failure is due to the new machine but the upgrade failure is unrelated, please split this bug into two).


https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032

alert ThanosSidecarUnhealthy fired for 30 seconds with labels: map[job:prometheus-k8s-thanos-sidecar pod:prometheus-k8s-0 severity:critical]

Logs after startup are shown below; it's possible that the disruption before startup plus the brief errors afterwards triggered the alert (a rough sketch of that retry loop follows at the end of this comment).

level=info ts=2021-03-15T23:14:48.487315228Z caller=grpc.go:116 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=[10.129.2.22]:10901
level=warn ts=2021-03-15T23:14:48.48974186Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": dial tcp [::1]:9090: connect: connection refused"
level=warn ts=2021-03-15T23:14:48.489840232Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": dial tcp [::1]:9090: connect: connection refused"
level=warn ts=2021-03-15T23:14:50.490841313Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:50.490931391Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:52.487657406Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:52.487740899Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:54.509136245Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:54.509193144Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:56.496098461Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:56.49617417Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:58.500054069Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:58.505758295Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:15:00.510658051Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:15:00.510717195Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=info ts=2021-03-15T23:15:02.535641945Z caller=sidecar.go:155 msg="successfully loaded prometheus external labels" external_labels="{prometheus=\"openshift-monitoring/k8s\", prometheus_replica=\"prometheus-k8s-0\"}"
level=info ts=2021-03-15T23:15:02.535712066Z caller=intrumentation.go:48 msg="changing probe status" status=ready


https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_thanos-sidecar.log

Unfortunately, the Prometheus logs from that same instance are lost. Once this merges, we should eventually get lucky and catch the alert firing from the other replica, prometheus-k8s-1.
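
The retry loop in the sidecar log above is the sidecar polling Prometheus' /api/v1/status/config endpoint until it answers with a 2xx, staying not-ready in the meantime. A minimal sketch of that startup probe (hypothetical, not the actual Thanos implementation) looks like:
```
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForPrometheus polls the given URL until it returns a 2xx response,
// mirroring the "failed to fetch initial external labels ... Retrying"
// behaviour seen in the sidecar log.
func waitForPrometheus(url string, interval time.Duration, attempts int) error {
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode/100 == 2 {
				return nil // Prometheus answered; flip the probe to ready.
			}
			err = fmt.Errorf("expected 2xx response, got %d", resp.StatusCode)
		}
		fmt.Printf("failed to fetch initial external labels, retrying: %v\n", err)
		time.Sleep(interval)
	}
	return fmt.Errorf("prometheus did not become ready after %d attempts", attempts)
}

func main() {
	if err := waitForPrometheus("http://localhost:9090/api/v1/status/config", 2*time.Second, 30); err != nil {
		fmt.Println(err)
	}
}
```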

Comment 5 Simon Pasquier 2021-04-16 14:27:16 UTC
Created attachment 1772451 [details]
alertmanager_cluster_peer_info metric

For some reason that I can't explain right now, two Alertmanager pods ended up with the same peer name ("01F393X2Y3KZYW2J5ED0Y1JVXS" in the attached screenshot). The peer name is generated randomly using the current time (in nanoseconds) as the seed [1], so I'm not sure why two pods would generate the exact same name even when they're started at roughly the same time...

[1] https://github.com/prometheus/alertmanager/blob/22ac6dff21901bfce14545da59b37a1aaca0db3a/cluster/cluster.go#L190-L193
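
For illustration only, here is a minimal Go sketch of the pattern referenced in [1] (assuming the oklog/ulid package; this is not the upstream code): two processes that seed math/rand with the same nanosecond timestamp read identical entropy and therefore generate identical ULIDs, i.e. identical peer names.
```
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/oklog/ulid"
)

func main() {
	// Both "pods" observe the same startup timestamp and use it as the seed.
	seed := time.Now().UnixNano()
	entropyA := rand.New(rand.NewSource(seed))
	entropyB := rand.New(rand.NewSource(seed))

	// Same millisecond timestamp plus identical entropy yields the same ULID.
	ms := ulid.Now()
	nameA := ulid.MustNew(ms, entropyA)
	nameB := ulid.MustNew(ms, entropyB)

	fmt.Println(nameA, nameB, nameA == nameB) // prints the same name twice, then true
}
```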

Comment 7 Filip Petkovski 2021-06-28 07:39:35 UTC
The issue was caused by pod evictions triggered by e2e tests in the Node suite. We extended the alert's "for" clause from 10m to 15m, which fixes the problem.

Comment 9 Junqi Zhao 2021-07-01 06:22:47 UTC
Checked recent CI jobs; no AlertmanagerMembersInconsistent alert fired, but found a test flake: alert KubeAPIErrorBudgetBurn pending for **** seconds with labels: {long="3d", severity="warning", short="6h"}, tracked in bug 1978085.

Comment 10 W. Trevor King 2021-07-14 16:36:31 UTC
I'm updating the bug title (and moving the test-case name into the Environment field for Sippy) to make it clearer what is getting fixed here.

Comment 16 errata-xmlrpc 2021-10-18 17:29:21 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759