Bug 1939259 - [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: Prometheus query error
Summary: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: Prometheus query error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Multi-Arch
Version: 4.6
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Prashanth Sundararaman
QA Contact: Barry Donahue
URL:
Whiteboard:
Depends On: 1938316
Blocks:
 
Reported: 2021-03-15 21:09 UTC by Prashanth Sundararaman
Modified: 2021-04-20 19:27 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1938316
Environment:
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured
Last Closed: 2021-04-20 19:27:22 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-api-provider-libvirt pull 220 (open): Bug 1939259: [release-4.6] Update MAO and set metrics on :8081 address - last updated 2021-03-25 14:35:09 UTC
Red Hat Product Errata RHBA-2021:1153 - last updated 2021-04-20 19:27:39 UTC

Description Prashanth Sundararaman 2021-03-15 21:09:29 UTC
+++ This bug was initially created as a clone of Bug #1938316 +++

+++ This bug was initially created as a clone of Bug #1936488 +++

test:
[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D%5C%5BLate%5C%5D+Alerts+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured

We're seeing a lot of failures with the Prometheus queries the test uses:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8/1368381121678544896

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1368886803049746432

The test grids show that these errors have been popping up for a long time:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt
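
For context, the failing test is essentially a Prometheus query: it asks whether any alert other than Watchdog and AlertmanagerReceiversNotConfigured is in the firing state, and fails if the query errors or returns results. A minimal Go sketch of that kind of query (illustrative only, not the actual origin test code; the PROM_URL and PROM_TOKEN environment variables and the error handling are assumptions):

// firingalerts.go: list alerts firing apart from the two tolerated ones.
// Illustrative sketch only; not the origin test implementation.
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os"
)

func main() {
	promURL := os.Getenv("PROM_URL") // e.g. the openshift-monitoring Prometheus route (assumption)
	token := os.Getenv("PROM_TOKEN") // a token allowed to query Prometheus (assumption)

	// Firing alerts, excluding the two the test tolerates.
	query := `ALERTS{alertstate="firing",alertname!~"Watchdog|AlertmanagerReceiversNotConfigured"}`

	req, err := http.NewRequest("GET", promURL+"/api/v1/query?query="+url.QueryEscape(query), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	// CI clusters often use self-signed certificates; skipping verification is for this sketch only.
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var body struct {
		Data struct {
			Result []struct {
				Metric map[string]string `json:"metric"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}

	for _, r := range body.Data.Result {
		fmt.Printf("unexpected firing alert: %s (namespace=%s)\n", r.Metric["alertname"], r.Metric["namespace"])
	}
}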

--- Additional comment from Simon Pasquier on 2021-03-08 09:33:51 CST ---

Looking at release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8 [1], there's a problem with DNS resolution for Alertmanager pods [2] which leads to "AlertmanagerMembersInconsistent" being fired. It should be redirected to the teams dealing with libvirt and/or ppc64 platforms because it's not something we see in other environments.

Looking at periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt [3], "TargetDown" is firing for the node-exporter and kubelet targets. It means that for some reason, Prometheus fails to scrape metrics and I can see some failures in node_exporter's kube-rbac-proxy [4][5][6] that would match. Again since it is specific to a given job, it would be best redirected to the folks in charge of this job.

[1] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8/1368381121678544896
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8/1368381121678544896/artifacts/e2e-remote-libvirt/pods/openshift-monitoring_alertmanager-main-0_alertmanager.log
[3] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1368886803049746432
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1367656589833539584/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods/openshift-monitoring_node-exporter-dnw48_kube-rbac-proxy.log
[5] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1367656589833539584/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods/openshift-monitoring_node-exporter-5jz89_node-exporter.log
[6] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1367656589833539584/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods/openshift-monitoring_node-exporter-grgw8_kube-rbac-proxy.log

--- Additional comment from Prashanth Sundararaman on 2021-03-09 19:42:22 CST ---

I see this issue on any libvirt deploy on bare metal; it's not specific to multi-arch. The following alert is firing:

33.33% of the machine-api-controllers/machine-api-controllers targets in the openshift-machine-api namespace are down.

This is due to the kube-rbac-proxy-machine-mtrc container reporting these errors:

I0309 17:29:57.278737       1 main.go:159] Reading config file: /etc/kube-rbac-proxy/config-file.yaml
I0309 17:29:57.285997       1 main.go:190] Valid token audiences: 
I0309 17:29:57.286071       1 main.go:278] Reading certificate files
I0309 17:29:57.286336       1 main.go:311] Starting TCP socket on 0.0.0.0:8441
I0309 17:29:57.287013       1 main.go:318] Listening securely on 0.0.0.0:8441
2021/03/09 17:43:04 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:45:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused

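The repeated "connection refused" lines above are the key symptom: kube-rbac-proxy accepts scrapes on :8441 but cannot reach its upstream on localhost:8081, so Prometheus sees the target as down. Stripped of its authentication and authorization, the forwarding kube-rbac-proxy does amounts to a plain reverse proxy; the sketch below illustrates that failure mode and is not kube-rbac-proxy's actual code:

// proxy.go: minimal stand-in for the forwarding part of kube-rbac-proxy.
// Illustrative only; the real proxy also does TokenReview/SubjectAccessReview
// and serves TLS.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// The upstream the deployment points the proxy at. If the machine
	// controller no longer listens here, every scrape ends in
	// "proxy error: dial tcp [::1]:8081: connect: connection refused".
	upstream, err := url.Parse("http://localhost:8081")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Listen where Prometheus scrapes (the real proxy listens securely on :8441).
	log.Fatal(http.ListenAndServe(":8441", proxy))
}
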
--- Additional comment from Andy McCrae on 2021-03-10 04:13:11 CST ---

This looks to be happening because a change to metrics went into the machine-api-operator, and the required provider change was not applied to the libvirt provider:

https://github.com/openshift/machine-api-operator/pull/609

This will impact all branches going back to 4.6 - I'm looking into applying the change to the libvirt provider now.
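
In practical terms, the provider-side fix amounts to the libvirt machine controller serving its /metrics endpoint on the address the kube-rbac-proxy sidecar uses as its upstream (:8081, per the logs above and the linked PR title). A minimal sketch of that, assuming client_golang's promhttp is used; the flag name and wiring are illustrative, not the actual cluster-api-provider-libvirt change:

// metrics.go: expose a Prometheus /metrics endpoint on the proxy's upstream address.
// Illustrative sketch; the flag name and default are assumptions.
package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// If nothing listens on this address, kube-rbac-proxy's scrapes fail and
	// TargetDown eventually fires for the machine-api-controllers target.
	metricsAddr := flag.String("metrics-addr", ":8081", "address for the /metrics endpoint")
	flag.Parse()

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())

	log.Printf("serving metrics on %s", *metricsAddr)
	log.Fatal(http.ListenAndServe(*metricsAddr, mux))
}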

--- Additional comment from Dan Li on 2021-03-11 07:13:45 CST ---

Hi Andy or Prashanth, as part of bug triaging, can we provide a "Severity" for this bug?

--- Additional comment from Andy McCrae on 2021-03-11 10:24:10 CST ---

I've marked this as Medium. This should only impact CI, since the libvirt provider isn't used in supported environments, but nonetheless it impairs our ability to perform accurate CI runs.

We have a PR up for master, but we will need to shepherd this through to 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218

--- Additional comment from Yaakov Selkowitz on 2021-03-12 12:57:50 CST ---

(In reply to Andy McCrae from comment #5)
> We have a PR up for master, but we will need to shepherd this through to
> 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218

Automatic cherry-pick failed, so we'll need manual PRs for 4.7 and 4.6.

Comment 1 Dan Li 2021-03-16 11:55:40 UTC
Changing to "Assigned"

Comment 2 Dan Li 2021-03-17 14:13:53 UTC
Hi Prashanth, do you think this bug will be resolved before the end of this sprint? If not, can we set the "reviewed-in-sprint" flag?

Comment 3 Dan Li 2021-03-22 12:35:45 UTC
Setting the "Reviewed-in-Sprint" flag for the past sprint (ended on March 20th) as the bug was still at ASSIGNED state

Comment 6 Barry Donahue 2021-04-02 13:03:12 UTC
Verified by the last 10 runs in CI.

Comment 11 errata-xmlrpc 2021-04-20 19:27:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.25 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1153

