Bug 1936488
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: Prometheus query error | | |
| Product | OpenShift Container Platform | Reporter | Aditya Narayanaswamy <anarayan> |
| Component | Multi-Arch | Assignee | Prashanth Sundararaman <psundara> |
| Status | CLOSED ERRATA | QA Contact | Jeremy Poulin <jpoulin> |
| Severity | medium | Docs Contact | |
| Priority | medium | | |
| Version | 4.8 | CC | alegrand, amccrae, anpicker, danili, dgrisonn, erooth, kakkoyun, kir, lcosic, pkrupa, psundara, spasquie, surbania, wking, yselkowi |
| Target Milestone | --- | | |
| Target Release | 4.8.0 | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Whiteboard | | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Clone Of | | | |
| | 1938316 (view as bug list) | Environment | [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel] |
| Last Closed | 2021-07-27 22:51:42 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 1938316 | | |
Description
Aditya Narayanaswamy
2021-03-08 15:03:32 UTC
Looking at release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8 [1], there's a problem with DNS resolution for the Alertmanager pods [2], which leads to "AlertmanagerMembersInconsistent" being fired. This should be redirected to the teams dealing with the libvirt and/or ppc64le platforms, because it's not something we see in other environments.

Looking at periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt [3], "TargetDown" is firing for the node-exporter and kubelet targets. It means that for some reason Prometheus fails to scrape metrics, and I can see some failures in node-exporter's kube-rbac-proxy [4][5][6] that would match. Again, since this is specific to a given job, it would be best redirected to the folks in charge of that job.

[1] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8/1368381121678544896
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8/1368381121678544896/artifacts/e2e-remote-libvirt/pods/openshift-monitoring_alertmanager-main-0_alertmanager.log
[3] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1368886803049746432
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1367656589833539584/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods/openshift-monitoring_node-exporter-dnw48_kube-rbac-proxy.log
[5] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1367656589833539584/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods/openshift-monitoring_node-exporter-5jz89_node-exporter.log
[6]
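The test's allow-list behavior can be mimicked with a small filter over a Prometheus `/api/v1/alerts`-style payload. The helper below is an illustrative sketch, not the actual openshift/origin conformance-test code; the function name and field access are assumptions based on the Prometheus alerts API shape:

```python
# Illustrative sketch (not the real test code): given the alert list from
# Prometheus's /api/v1/alerts endpoint, report any firing alert that falls
# outside the test's allow-list of Watchdog and
# AlertmanagerReceiversNotConfigured.
ALLOWED = {"Watchdog", "AlertmanagerReceiversNotConfigured"}

def unexpected_firing(alerts):
    """Return the sorted names of firing alerts outside the allow-list."""
    return sorted(
        a["labels"]["alertname"]
        for a in alerts
        if a.get("state") == "firing"
        and a["labels"]["alertname"] not in ALLOWED
    )
```

Fed the alerts from the runs above, a filter like this would surface "AlertmanagerMembersInconsistent" for the libvirt job and "TargetDown" for the gcp-rt job, which is exactly what fails the test.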
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1367656589833539584/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods/openshift-monitoring_node-exporter-grgw8_kube-rbac-proxy.log

---

I see this issue on any libvirt deploy on bare metal, not just multi-arch. This alert fires:

> 33.33% of the machine-api-controllers/machine-api-controllers targets in the openshift-machine-api namespace are down.

This is due to the kube-rbac-proxy-machine-mtrc container reporting these errors:

```
I0309 17:29:57.278737 1 main.go:159] Reading config file: /etc/kube-rbac-proxy/config-file.yaml
I0309 17:29:57.285997 1 main.go:190] Valid token audiences:
I0309 17:29:57.286071 1 main.go:278] Reading certificate files
I0309 17:29:57.286336 1 main.go:311] Starting TCP socket on 0.0.0.0:8441
I0309 17:29:57.287013 1 main.go:318] Listening securely on 0.0.0.0:8441
2021/03/09 17:43:04 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:43:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:44:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
2021/03/09 17:45:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused
```

This looks to be happening because a change to metrics went into the machine-api-operator, and the required provider change was not applied to the libvirt provider: https://github.com/openshift/machine-api-operator/pull/609

This will impact all branches
going back to 4.6. I'm looking into applying the change to the libvirt provider now.

---

Hi Andy or Prashanth, as part of bug triaging, can we provide a "Severity" for this bug?

---

I've marked this as Medium. This should only impact CI, since the libvirt provider isn't used in supported environments, but it nonetheless impairs our ability to perform accurate CI runs.

We have a PR up for master, but we will need to shepherd this through to 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218

---

(In reply to Andy McCrae from comment #5)
> We have a PR up for master, but we will need to shepherd this through to
> 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218

Automatic cherry-pick failed, so we'll need manual PRs for 4.7 and 4.6.

---

Verified with release: https://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/4.8.0-0.nightly-2021-03-14-134919/

---

Should this one be closed now?

---

Yeah, you can close it.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
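As a postscript on the diagnosis: the repeated "dial tcp [::1]:8081: connect: connection refused" lines quoted earlier are the signature of a proxy whose upstream never bound its port, which is what happens when the provider image lacks the metrics change. A standalone sketch of that failure mode (the helper below is illustrative, not kube-rbac-proxy code; host, port, and timeout defaults are assumptions):

```python
# Illustrative helper, not part of kube-rbac-proxy: probe whether anything is
# listening on the upstream address the proxy dials. When the libvirt provider
# is missing the metrics change, nothing binds the metrics port, so every
# proxied scrape fails with "connection refused" and the targets show as down.
import socket

def upstream_reachable(host="127.0.0.1", port=8081, timeout=0.2):
    """Return True if a TCP listener accepts connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # ConnectionRefusedError when no listener is bound
        return False
```

Running a probe like this inside the affected pod against the proxy's upstream would have distinguished "provider never started its metrics listener" from a scrape-side auth or TLS problem.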