Description of problem:

[Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] is failing with the error:

```
s: "query failed: \nsort_desc((\ncount_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\",severity!=\"info\"}[51m3s:1s]) unless\ncount_over_time(ALERTS{alertname=\"AggregatedAPIDown\",name=\"v1alpha1.wardle.example.com\",alertstate=\"firing\"}[51m3s:1s])\n) > 0): promQL query: \nsort_desc((\ncount_over_time(ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured\",alertstate=\"firing\",severity!=\"info\"}[51m3s:1s]) unless\ncount_over_time(ALERTS{alertname=\"AggregatedAPIDown\",name=\"v1alpha1.wardle.example.com\",alertstate=\"firing\"}[51m3s:1s])\n) > 0) had reported incorrect results:
[{"metric":{"alertname":"AlertmanagerMembersInconsistent","alertstate":"firing","container":"alertmanager-proxy","endpoint":"web","instance":"10.129.2.6:9095","job":"alertmanager-main","namespace":"openshift-monitoring","pod":"alertmanager-main-0","service":"alertmanager-main","severity":"critical"},"value":[1615290631.212,"90"]},
 {"metric":{"alertname":"AlertmanagerMembersInconsistent","alertstate":"firing","container":"alertmanager-proxy","endpoint":"web","instance":"10.131.0.22:9095","job":"alertmanager-main","namespace":"openshift-monitoring","pod":"alertmanager-main-1","service":"alertmanager-main","severity":"critical"},"value":[1615290631.212,"90"]}]"
```

Version-Release number of selected component (if applicable):

How reproducible:
pull-ci-openshift-origin-master-e2e-aws-serial (all) - 42 runs, 48% failed, 10% of failures match = 5% impact

Steps to Reproduce:
1.
2.
3.

Actual results:
AlertmanagerMembersInconsistent alert firing.

Expected results:
No alert firing (apart from Watchdog and AlertmanagerReceiversNotConfigured).

Additional info:
From the alertmanager oauth proxy logs, AlertmanagerMembersInconsistent sounds like a DNS issue inside of the cluster, so it might be worth sending the bug over to the network team:

```
level=warn ts=2021-03-09T11:00:50.024Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-main-0.alertmanager-operated:9094 err="1 error occurred:\n\t* Failed to resolve alertmanager-main-0.alertmanager-operated:9094: lookup alertmanager-main-0.alertmanager-operated on 172.30.0.10:53: no such host\n\n"
```

As for the other alerts:
- AggregatedAPIDown: https://bugzilla.redhat.com/show_bug.cgi?id=1933144
- ThanosSidecarUnhealthy: https://bugzilla.redhat.com/show_bug.cgi?id=1921335
Example job where the network error triggering the AlertmanagerMembersInconsistent alert occurs: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25030/pull-ci-openshift-origin-master-e2e-aws-serial/1369516609869189120
This is also happening briefly during upgrades (I'm going to list this bug in the new "no alerts during upgrade" test; if we believe the serial failure is due to the new machine but the upgrade failure is unrelated, please split this bug into two).

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032

alert ThanosSidecarUnhealthy fired for 30 seconds with labels: map[job:prometheus-k8s-thanos-sidecar pod:prometheus-k8s-0 severity:critical]

Logs after startup show that it's possible the disruption before startup plus the brief errors after triggered it:

```
level=info ts=2021-03-15T23:14:48.487315228Z caller=grpc.go:116 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=[10.129.2.22]:10901
level=warn ts=2021-03-15T23:14:48.48974186Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": dial tcp [::1]:9090: connect: connection refused"
level=warn ts=2021-03-15T23:14:48.489840232Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="perform GET request against http://localhost:9090/api/v1/status/config: Get \"http://localhost:9090/api/v1/status/config\": dial tcp [::1]:9090: connect: connection refused"
level=warn ts=2021-03-15T23:14:50.490841313Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:50.490931391Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:52.487657406Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:52.487740899Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:54.509136245Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:54.509193144Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:56.496098461Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:56.49617417Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:58.500054069Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:14:58.505758295Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:15:00.510658051Z caller=sidecar.go:146 msg="failed to fetch initial external labels. Is Prometheus running? Retrying" err="expected 2xx response, got 503. Body: Service Unavailable"
level=warn ts=2021-03-15T23:15:00.510717195Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason="expected 2xx response, got 503. Body: Service Unavailable"
level=info ts=2021-03-15T23:15:02.535641945Z caller=sidecar.go:155 msg="successfully loaded prometheus external labels" external_labels="{prometheus=\"openshift-monitoring/k8s\", prometheus_replica=\"prometheus-k8s-0\"}"
level=info ts=2021-03-15T23:15:02.535712066Z caller=intrumentation.go:48 msg="changing probe status" status=ready
```

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/25904/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1371561731540652032/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-0_thanos-sidecar.log

Unfortunately, the prometheus logs are from the same thanos instance and so are lost. Once this merges, we should eventually get lucky and get the alert to fire from the other replica, prometheus-k8s-1.
Created attachment 1772451 [details]
alertmanager_cluster_peer_info metric

For some reason that I can't explain right now, two Alertmanager pods ended up with the same peer name ("01F393X2Y3KZYW2J5ED0Y1JVXS" in the attached screenshot). The peer name is generated randomly using the current time (in nanoseconds) as the seed [1], so I'm not sure why two pods would generate the exact same name even if they're started at roughly the same time...

[1] https://github.com/prometheus/alertmanager/blob/22ac6dff21901bfce14545da59b37a1aaca0db3a/cluster/cluster.go#L190-L193
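To illustrate why a time-based seed can collide, here is a minimal Go sketch (this is not Alertmanager's actual code, and `nameFromSeed` and the seed value are hypothetical): two generators seeded with the same nanosecond timestamp produce identical random streams, so two pods that happen to read the clock in the same nanosecond would derive identical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// nameFromSeed loosely mimics deriving a peer name from a time-based
// seed; the hex format is illustrative, not Alertmanager's ULID encoding.
func nameFromSeed(seed int64) string {
	r := rand.New(rand.NewSource(seed))
	return fmt.Sprintf("%016X%016X", r.Uint64(), r.Uint64())
}

func main() {
	// Hypothetical scenario: both pods read the clock in the same nanosecond.
	seed := int64(1615290631000000000)
	fmt.Println(nameFromSeed(seed) == nameFromSeed(seed)) // prints "true"
}
```

Whether two containers can actually observe the same nanosecond timestamp is the open question in this comment; the sketch only shows that identical seeds are sufficient for the collision.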
The issue was caused by pod evictions triggered by e2e tests from the Node suite. We extended the alert's `for` clause from 10m to 15m, which fixes the problem.
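As a rough sketch of what the change amounts to in the rule definition (the `expr` below is an assumption for illustration, not the operator's actual expression; only the `for: 15m` value reflects the fix described here):

```yaml
# Hypothetical PrometheusRule fragment; only the for: duration
# reflects the change described in this comment.
- alert: AlertmanagerMembersInconsistent
  expr: |
    max_over_time(alertmanager_cluster_members{job="alertmanager-main"}[5m])
      < on (namespace, service) group_left()
    count by (namespace, service) (max_over_time(alertmanager_cluster_members{job="alertmanager-main"}[5m]))
  for: 15m  # extended from 10m so brief membership churn during evictions no longer fires
  labels:
    severity: critical
```

The longer `for` duration means the membership mismatch must persist through the entire eviction-and-reschedule window before the alert transitions from pending to firing.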
Checked recent CI jobs: no AlertmanagerMembersInconsistent alert fired, but found a test flake (alert KubeAPIErrorBudgetBurn pending for **** seconds with labels {long="3d", severity="warning", short="6h"}), tracked in bug 1978085.
I'm updating the bug title (and moving the test-case name into the environment field for Sippy) to make it more clear what is getting fixed here.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759