+++ This bug was initially created as a clone of Bug #1936488 +++
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured
is failing frequently in CI, see search results:
Seeing a lot of failures with the queries used. Test grids show that these errors have been popping up for a long time.
--- Additional comment from Simon Pasquier on 2021-03-08 09:33:51 CST ---
Looking at release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8, there's a problem with DNS resolution for Alertmanager pods which leads to "AlertmanagerMembersInconsistent" being fired. It should be redirected to the teams dealing with libvirt and/or ppc64 platforms because it's not something we see in other environments.
Looking at periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt, "TargetDown" is firing for the node-exporter and kubelet targets. It means that for some reason Prometheus fails to scrape metrics, and I can see some failures in node_exporter's kube-rbac-proxy that would match. Again, since it is specific to a given job, it would be best redirected to the folks in charge of this job.
--- Additional comment from Prashanth Sundararaman on 2021-03-09 19:42:22 CST ---
I see this issue on any libvirt deploy on bare metal (it's not specific to multi-arch), where this alert is firing:
33.33% of the machine-api-controllers/machine-api-controllers targets in the openshift-machine-api namespace are down.
this is due to the kube-rbac-proxy-machine-mtrc container reporting these errors:
I0309 17:29:57.278737 1 main.go:159] Reading config file: /etc/kube-rbac-proxy/config-file.yaml
I0309 17:29:57.285997 1 main.go:190] Valid token audiences:
I0309 17:29:57.286071 1 main.go:278] Reading certificate files
I0309 17:29:57.286336 1 main.go:311] Starting TCP socket on 0.0.0.0:8441
I0309 17:29:57.287013 1 main.go:318] Listening securely on 0.0.0.0:8441
2021/03/09 17:43:04 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:45:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused
--- Additional comment from Andy McCrae on 2021-03-10 04:13:11 CST ---
This looks to be happening because a change to metrics went into the machine-api-operator, and the required provider change was not applied to the libvirt provider:
This will impact all branches going back to 4.6. I'm looking into applying the change to the libvirt provider now.
--- Additional comment from Dan Li on 2021-03-11 07:13:45 CST ---
Hi Andy or Prashanth, as part of bug triaging, can we provide a "Severity" for this bug?
--- Additional comment from Andy McCrae on 2021-03-11 10:24:10 CST ---
I've marked this as Medium. This should only impact CI, since the libvirt provider isn't used in supported environments, but nonetheless it impairs our ability to perform accurate CI runs.
We have a PR up for master, but we will need to shepherd this through to 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218
--- Additional comment from Yaakov Selkowitz on 2021-03-12 12:57:50 CST ---
(In reply to Andy McCrae from comment #5)
> We have a PR up for master, but we will need to shepherd this through to
> 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218
Automatic cherry-pick failed, so we'll need manual PRs for 4.7 and 4.6.
Hi Prashanth, do you think this bug will be resolved before the end of the sprint? If not, can we add the "Reviewed-in-Sprint" flag?
Setting the "Reviewed-in-Sprint" flag for the past sprint (ended on March 20th), as the PR is still open and the bug is in POST state.
Tested with the latest nightly and it works.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.5 security and bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.