Bug 1936919 - AlertmanagerMembersInconsistent fires too quickly, causing serial-test noise
Summary: AlertmanagerMembersInconsistent fires too quickly, causing serial-test noise
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Filip Petkovski
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1976765
 
Reported: 2021-03-09 13:40 UTC by ravig
Modified: 2021-10-18 17:29 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1976765 (view as bug list)
Environment:
[Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed: 2021-10-18 17:29:21 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
alertmanager_cluster_peer_info metric (52.50 KB, image/png)
2021-04-16 14:27 UTC, Simon Pasquier


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1244 0 None closed jsonnet: pull latest deps 2021-06-28 08:51:34 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:29:59 UTC

Description ravig 2021-03-09 13:40:44 UTC
Description of problem:

[Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]

is failing with the error:

s: "query failed: \nsort_desc((\ncount_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\",severity!=\"info\"}[51m3s:1s]) unless\ncount_over_time(ALERTS{alertname=\"AggregatedAPIDown\",name=\"v1alpha1.wardle.example.com\",alertstate=\"firing\"}[51m3s:1s])\n) > 0): promQL query: \nsort_desc((\ncount_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\",severity!=\"info\"}[51m3s:1s]) unless\ncount_over_time(ALERTS{alertname=\"AggregatedAPIDown\",name=\"v1alpha1.wardle.example.com\",alertstate=\"firing\"}[51m3s:1s])\n) > 0) had reported incorrect results:\n[{\"metric\":{\"alertname\":\"AlertmanagerMembersInconsistent\",\"alertstate\":\"firing\",\"container\":\"alertmanager-proxy\",\"endpoint\":\"web\",\"instance\":\"10.129.2.6:9095\",\"job\":\"alertmanager-main\",\"namespace\":\"openshift-monitoring\",\"pod\":\"alertmanager-main-0\",\"service\":\"alertmanager-main\",\"severity\":\"critical\"},\"value\":[1615290631.212,\"90\"]},{\"metric\":{\"alertname\":\"AlertmanagerMembersInconsistent\",\"alertstate\":\"firing\",\"container\":\"alertmanager-proxy\",\"endpoint\":\"web\",\"instance\":\"10.131.0.22:9095\",\"job\":\"alertmanager-main\",\"namespace\":\"openshift-monitoring\",\"pod\":\"alertmanager-main-1\",\"service\":\"alertmanager-main\",\"severity\":\"critical\"},\"value\":[1615290631.212,\"90\"]}]",
        },
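
For readability, here is the same query from the failure message above with the string escaping removed (formatting only, nothing changed):

```
sort_desc((
  count_over_time(ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured",alertstate="firing",severity!="info"}[51m3s:1s])
  unless
  count_over_time(ALERTS{alertname="AggregatedAPIDown",name="v1alpha1.wardle.example.com",alertstate="firing"}[51m3s:1s])
) > 0)
```

The returned series show AlertmanagerMembersInconsistent in firing state on both alertmanager-main-0 and alertmanager-main-1 with a value of 90 (at the query's 1s subquery resolution, roughly 90 seconds of the sampled window spent firing).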


Version-Release number of selected component (if applicable):


How reproducible:

pull-ci-openshift-origin-master-e2e-aws-serial (all) - 42 runs, 48% failed, 10% of failures match = 5% impact




Steps to Reproduce:
1.
2.
3.

Actual results:
Alert firing.

Expected results:

Alert not firing (only Watchdog and AlertmanagerReceiversNotConfigured are expected to fire).

Additional info:

Comment 1 Damien Grisonnet 2021-03-09 14:13:53 UTC
From the alertmanager oauth proxy logs, AlertmanagerMembersInconsistent looks like a DNS issue inside the cluster, so it might be worth sending this bug over to the network team:
```
level=warn ts=2021-03-09T11:00:50.024Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-main-0.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
```
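
If it helps triage, the cluster membership each Alertmanager replica currently sees can also be checked from the Prometheus side. This is a suggested check (not taken from the CI artifacts), using the standard Alertmanager cluster metric and the job/namespace labels from the failing query above:

```
# Number of cluster peers each Alertmanager replica currently sees.
# With N replicas, every instance should settle at N; lower values on some
# instances are what drives AlertmanagerMembersInconsistent.
alertmanager_cluster_members{job="alertmanager-main", namespace="openshift-monitoring"}
```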

As for the other alerts:
AggregatedAPIDown: https://bugzilla.redhat.com/show_bug.cgi?id=1933144
ThanosSidecarUnhealthy: https://bugzilla.redhat.com/show_bug.cgi?id=1921335

Comment 2 Damien Grisonnet 2021-03-10 12:17:26 UTC
Example job where the network error that triggers the AlertmanagerMembersInconsistent alert occurs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25030/pull-ci-openshift-origin-master-e2e-aws-serial/1369516609869189120

Comment 4 Clayton Coleman 2021-03-16 14:45:52 UTC
This is also happening briefly during upgrades (I'm going to list this bug in the new "no alerts during upgrade" test; if we believe the serial failure is due to the new machine but the upgrade failure is unrelated, please split this bug into two).


https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032

alert ThanosSidecarUnhealthy fired for 30 seconds with labels: map[job:prometheus-k8s-thanos-sidecar pod:prometheus-k8s-0 severity:critical]

The logs after startup (below) suggest that the disruption before startup, plus the brief errors afterwards, may have triggered it.

level=info ts=2021-03-15T23:14:48.487315228Z caller=grpc.go:116 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=[10.129.2.22]:10901
level=warn ts=2021-03-15T23:14:48.48974186Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": dial tcp [::1]:9090: connect: connection refused"
level=warn ts=2021-03-15T23:14:48.489840232Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": dial tcp [::1]:9090: connect: connection refused"
level=warn ts=2021-03-15T23:14:50.490841313Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:50.490931391Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:52.487657406Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:52.487740899Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:54.509136245Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:54.509193144Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:56.496098461Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:56.49617417Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:58.500054069Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:58.505758295Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:15:00.510658051Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:15:00.510717195Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=info ts=2021-03-15T23:15:02.535641945Z caller=sidecar.go:155 msg="successfully loaded prometheus external labels" external_labels="{prometheus=\"openshift-monitoring/k8s\", prometheus_replica=\"prometheus-k8s-0\"}"
level=info ts=2021-03-15T23:15:02.535712066Z caller=intrumentation.go:48 msg="changing probe status" status=ready


https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_thanos-sidecar.log

Unfortunately, the Prometheus logs are from the same pod as this Thanos sidecar and so are lost. Once this merges, we should eventually get lucky and see the alert fire from the other replica, prometheus-k8s-1.
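
To confirm whether the alert fired on the other replica (and for how long), one option is to query the built-in ALERTS series from the surviving Prometheus; the one-hour window below is only illustrative:

```
# Series for which ThanosSidecarUnhealthy was in firing state at some point
# during the last hour; the value is the number of firing samples in the window.
count_over_time(ALERTS{alertname="ThanosSidecarUnhealthy", alertstate="firing"}[1h]) > 0
```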

Comment 5 Simon Pasquier 2021-04-16 14:27:16 UTC
Created attachment 1772451 [details]
alertmanager_cluster_peer_info metric

For some reason that I can't explain right now, 2 Alertmanager pods ended up with the same peer name ("01F393X2Y3KZYW2J5ED0Y1JVXS" in the attached screenshot). The peer name is generated randomly, using the current time (in nanoseconds) as the seed [1], so I'm not sure why 2 pods would generate the exact same name even if they started at roughly the same time...

[1] https://github.com/prometheus/alertmanager/blob/22ac6dff21901bfce14545da59b37a1aaca0db3a/cluster/cluster.go#L190-L193
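
One way to spot the duplicate-name condition directly (a suggested check, assuming the peer name is exposed in the `peer` label of the metric shown in the screenshot):

```
# One series per advertised peer name; a count > 1 means two replicas are
# advertising the same name, which breaks cluster membership accounting.
count by (peer) (alertmanager_cluster_peer_info{job="alertmanager-main"}) > 1
```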

Comment 7 Filip Petkovski 2021-06-28 07:39:35 UTC
The issue was caused by evictions triggered by e2e tests in the Node suite. We extended the alert's `for` clause from 10m to 15m, which fixes the problem.
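
For context, the change only touches the rule's `for` duration; below is an approximate sketch of the upstream expression (the exact selectors come from the alertmanager mixin pulled in by the linked cluster-monitoring-operator PR), with the new duration noted in a comment:

```
# AlertmanagerMembersInconsistent (approximate): each replica should see as
# many members as there are replicas in the cluster. The expression is
# unchanged; the rule's `for:` was raised from 10m to 15m so that short-lived
# disruptions such as evictions no longer trip the alert.
  max_over_time(alertmanager_cluster_members{job="alertmanager-main"}[5m])
< on (namespace, service) group_left
  count by (namespace, service) (max_over_time(alertmanager_cluster_members{job="alertmanager-main"}[5m]))
```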

Comment 9 Junqi Zhao 2021-07-01 06:22:47 UTC
Checked recent CI jobs: no AlertmanagerMembersInconsistent alert. Found a different test flake instead, "alert KubeAPIErrorBudgetBurn pending for **** seconds with labels: {long="3d", severity="warning", short="6h"}", which is tracked in bug 1978085.

Comment 10 W. Trevor King 2021-07-14 16:36:31 UTC
I'm updating the bug title (and moving the test-case name into the environment field for Sippy) to make it more clear what is getting fixed here.

Comment 16 errata-xmlrpc 2021-10-18 17:29:21 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

