+++ This bug was initially created as a clone of Bug #1936488 +++
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured
is failing frequently in CI, see search results:
Seeing a lot of failures with the queries used. Test grids show that these errors have been popping up for a long time.
--- Additional comment from Simon Pasquier on 2021-03-08 09:33:51 CST ---
Looking at release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8, there's a problem with DNS resolution for Alertmanager pods which leads to "AlertmanagerMembersInconsistent" being fired. It should be redirected to the teams dealing with libvirt and/or ppc64 platforms because it's not something we see in other environments.
Looking at periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt, "TargetDown" is firing for the node-exporter and kubelet targets. It means that for some reason Prometheus fails to scrape metrics, and I can see some failures in node_exporter's kube-rbac-proxy that would match. Again, since it is specific to a given job, it would be best redirected to the folks in charge of this job.
--- Additional comment from Prashanth Sundararaman on 2021-03-09 19:42:22 CST ---
I see this issue on any libvirt deploy on bare metal (it's not specific to multi-arch), where this alert is firing:
33.33% of the machine-api-controllers/machine-api-controllers targets in the openshift-machine-api namespace are down.
this is due to the kube-rbac-proxy-machine-mtrc container reporting these errors:
I0309 17:29:57.278737 1 main.go:159] Reading config file: /etc/kube-rbac-proxy/config-file.yaml
I0309 17:29:57.285997 1 main.go:190] Valid token audiences:
I0309 17:29:57.286071 1 main.go:278] Reading certificate files
I0309 17:29:57.286336 1 main.go:311] Starting TCP socket on 0.0.0.0:8441
I0309 17:29:57.287013 1 main.go:318] Listening securely on 0.0.0.0:8441
2021/03/09 17:43:04 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:45:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused
--- Additional comment from Andy McCrae on 2021-03-10 04:13:11 CST ---
This looks to be happening because a change to metrics went into the machine-api-operator, and the required provider change was not applied to the libvirt provider:
This will impact all branches going back to 4.6. I'm looking into applying the change to the libvirt provider now.
--- Additional comment from Dan Li on 2021-03-11 07:13:45 CST ---
Hi Andy or Prashanth, as part of bug triaging, can we provide a "Severity" for this bug?
--- Additional comment from Andy McCrae on 2021-03-11 10:24:10 CST ---
I've marked this as Medium. This should only impact CI, since the libvirt provider isn't used in supported environments, but nonetheless it impairs our ability to perform accurate CI runs.
We have a PR up for master, but we will need to shepherd this through to 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218
--- Additional comment from Yaakov Selkowitz on 2021-03-12 12:57:50 CST ---
(In reply to Andy McCrae from comment #5)
> We have a PR up for master, but we will need to shepherd this through to
> 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218
Automatic cherry-pick failed, so we'll need manual PRs for 4.7 and 4.6.
Hi Prashanth, do you think this bug will be resolved before the end of the sprint? If not, can we add the "Reviewed-in-Sprint" flag?
Setting the "Reviewed-in-Sprint" flag for the past sprint (ended on March 20th), as the PR is still open and the bug is in POST state.
Tested with the latest nightly and it works.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.5 security and bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.