Bug 1959185 - 4.6 CI failures with aws due to prometheus NoRunningOvnMaster alert
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: jamo luhrsen
QA Contact: Ross Brattain
URL:
Whiteboard:
Duplicates: 1960781
Depends On: 1891023
Blocks:
Reported: 2021-05-10 20:51 UTC by Tim Rozet
Modified: 2021-10-01 00:36 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-01 12:10:08 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-network-operator pull 1096 | 0 | None | open | Bug 1959185: Backport rbac proxy init script fixes | 2021-05-17 17:45:55 UTC
Red Hat Product Errata | RHBA-2021:2100 | 0 | None | None | None | 2021-06-01 12:10:42 UTC

Description Tim Rozet 2021-05-10 20:51:05 UTC
Description of problem:
Looking at the logs, it doesn't appear that ovnkube-master was down at all. It looks like the metric reporting itself may have an issue.

Example job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/525/pull-ci-openshift-ovn-kubernetes-release-4.6-e2e-aws-ovn/1390746993151709184

s: "promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"NoRunningOvnMaster\",\"alertstate\":\"firing\",\"severity\":\"warning\"},\"value\":[1620417749.571,\"1\"]


https://search.ci.openshift.org/?search=NoRunningOvnMaster&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
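
For anyone triaging similar failures offline, here is a minimal Python sketch of the filtering that the promQL query above expresses, applied to the result sample from the job output. The function name is hypothetical, and the sample string is the one quoted above with the closing brackets added (the quoted output is cut off). It only mirrors the label matchers in the query; it is not the code the CI test runs.

```python
import json
import re

# Alert names excluded by the alertname!~ matcher in the query above.
# Prometheus regex matchers are fully anchored, hence fullmatch() below.
IGNORED = re.compile(
    r"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards"
)


def unexpected_firing_alerts(samples):
    """Return the names of firing, non-info alerts that are not ignored."""
    names = []
    for sample in samples:
        metric = sample["metric"]
        if metric.get("alertstate") != "firing":
            continue
        if metric.get("severity") == "info":
            continue
        if IGNORED.fullmatch(metric.get("alertname", "")):
            continue
        # The query keeps only series whose value is >= 1.
        if float(sample["value"][1]) >= 1:
            names.append(metric["alertname"])
    return names


# Sample from the job output; the trailing "}]" is added here for valid JSON.
sample_json = (
    '[{"metric":{"__name__":"ALERTS","alertname":"NoRunningOvnMaster",'
    '"alertstate":"firing","severity":"warning"},"value":[1620417749.571,"1"]}]'
)

print(unexpected_firing_alerts(json.loads(sample_json)))
# prints ['NoRunningOvnMaster']
```

Any non-empty result corresponds to the "had reported incorrect results" failure seen in the job.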

Comment 3 jamo luhrsen 2021-05-14 15:18:01 UTC
These failures started when this CNO PR, which fixes the kube-rbac-proxy startup scripts, was merged:
  https://github.com/openshift/cluster-network-operator/pull/1061

I think the scripts now do a better job of reporting a problem when ovn-node-metrics-certs is not mounted.

In 4.7, ovn-node-metrics-certs is mounted fine:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.7/1393028380240121856/artifacts/e2e-aws/pods/openshift-ovn-kubernetes_ovnkube-node-w2fh2_kube-rbac-proxy.log

But in 4.6, before the rbac-proxy script fix, you can see what looks like some trouble (a traceback),
and there is no log message saying that ovn-node-metrics-certs is mounted. I assume this means that we
did not fire any alert when maybe we should have:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.6/1388934081181388800/artifacts/e2e-aws/pods/openshift-ovn-kubernetes_ovnkube-node-p7pxb_kube-rbac-proxy.log

After the rbac-proxy script fix, we can see that it's failing repeatedly, and I'm assuming
that's what fires the alert:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.6/1389008719353745408/artifacts/e2e-aws/pods/openshift-ovn-kubernetes_ovnkube-node-qphc2_kube-rbac-proxy.log

I will be investigating around ovn-node-metrics-certs next.
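
As an illustration of the general pattern under discussion (check that the cert is mounted before starting the proxy, and wait if it is not), here is a minimal Python sketch of a poll-with-timeout loop. This is not the startup script from the CNO PR, which is a shell script; the function name, parameters, and timings are hypothetical.

```python
import os
import time


def wait_for_cert(path, timeout_s, poll_s=1.0):
    """Poll until `path` exists as a file or `timeout_s` elapses.

    Returns True if the file appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.isfile(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A caller would start the proxy only when this returns True and log a clear error otherwise, so a missing mount shows up as an explicit message rather than a traceback.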

Comment 8 jamo luhrsen 2021-05-14 23:36:43 UTC
This PR fixes it for 4.6:
  https://github.com/openshift/cluster-network-operator/pull/1096

I don't know how to do the magic on this BZ, which was filed just for a bug on 4.6, so that I can link the PR. I am
getting a complaint that there needs to be a 4.7 bug if I want to use 4.6.z as the target. I tried moving this bug
to 4.7 to get around that, but that didn't work either.

Comment 9 Ben Bennett 2021-05-17 15:10:26 UTC
*** Bug 1960781 has been marked as a duplicate of this bug. ***

Comment 14 Ross Brattain 2021-05-26 05:05:33 UTC
Verified on 4.6.0-0.nightly-2021-05-24-230019 on ipi-on-aws/versioned-installer-ovn

It seems to take a few minutes to wait for the certs to be mounted.

The kube-rbac-proxy scripts themselves are now inconsistent with each other and with their own comments and messages (we don't actually wait "one hour", for example), but these are cosmetic issues.

2021-05-25T21:14:03+00:00 INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
I0525 21:14:03.461287    2803 main.go:188] Valid token audiences:
I0525 21:14:03.461454    2803 main.go:261] Reading certificate files
I0525 21:14:03.461781    2803 main.go:294] Starting TCP socket on :9103
I0525 21:14:03.462287    2803 main.go:301] Listening securely on :9103
2021-05-25T21:04:50+00:00 INFO: ovn-node-metrics-cert not mounted. Waiting one hour.
2021-05-25T21:08:30+00:00 INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
I0525 21:08:30.646215    2542 main.go:188] Valid token audiences:
I0525 21:08:30.646299    2542 main.go:261] Reading certificate files
I0525 21:08:30.646552    2542 main.go:294] Starting TCP socket on :9103
I0525 21:08:30.646875    2542 main.go:301] Listening securely on :9103
2021-05-25T21:14:04+00:00 INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
I0525 21:14:04.396860    2848 main.go:188] Valid token audiences:
I0525 21:14:04.397075    2848 main.go:261] Reading certificate files
I0525 21:14:04.398162    2848 main.go:294] Starting TCP socket on :9103
I0525 21:14:04.398606    2848 main.go:301] Listening securely on :9103
2021-05-25T21:04:51+00:00 INFO: ovn-node-metrics-cert not mounted. Waiting one hour.
2021-05-25T21:08:51+00:00 INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
I0525 21:08:51.614035    2550 main.go:188] Valid token audiences:
I0525 21:08:51.614128    2550 main.go:261] Reading certificate files
I0525 21:08:51.614435    2550 main.go:294] Starting TCP socket on :9103
I0525 21:08:51.614774    2550 main.go:301] Listening securely on :9103
2021-05-25T21:16:25+00:00 INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
I0525 21:16:25.748801    2832 main.go:188] Valid token audiences:
I0525 21:16:25.748947    2832 main.go:261] Reading certificate files
I0525 21:16:25.749240    2832 main.go:294] Starting TCP socket on :9103
I0525 21:16:25.749995    2832 main.go:301] Listening securely on :9103
2021-05-25T21:04:47+00:00 INFO: ovn-node-metrics-cert not mounted. Waiting one hour.
2021-05-25T21:08:32+00:00 INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy
I0525 21:08:32.399646    2642 main.go:188] Valid token audiences:
I0525 21:08:32.399716    2642 main.go:261] Reading certificate files
I0525 21:08:32.399910    2642 main.go:294] Starting TCP socket on :9103
I0525 21:08:32.400221    2642 main.go:301] Listening securely on :9103
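
To put a number on "a few minutes", the gap between each "not mounted" message and the following "mounted" message can be computed from the timestamps. A minimal Python sketch follows, using the three pairs of lines from the logs above; the function name is hypothetical and the parsing assumes the ISO 8601 timestamp is the first whitespace-separated field, as in these logs.

```python
from datetime import datetime

NOT_MOUNTED = "not mounted"
MOUNTED = "mounted, starting kube-rbac-proxy"


def mount_wait_seconds(lines):
    """Pair each 'not mounted' line with the next 'mounted' line.

    Returns the list of gaps in seconds. Lines without either message
    (for example the kube-rbac-proxy klog lines) are skipped.
    """
    waits = []
    pending = None
    for line in lines:
        stamp, _, rest = line.partition(" ")
        if NOT_MOUNTED in rest:
            pending = datetime.fromisoformat(stamp)
        elif MOUNTED in rest and pending is not None:
            gap = datetime.fromisoformat(stamp) - pending
            waits.append(gap.total_seconds())
            pending = None
    return waits


log_lines = [
    "2021-05-25T21:04:50+00:00 INFO: ovn-node-metrics-cert not mounted. Waiting one hour.",
    "2021-05-25T21:08:30+00:00 INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy",
    "2021-05-25T21:04:51+00:00 INFO: ovn-node-metrics-cert not mounted. Waiting one hour.",
    "2021-05-25T21:08:51+00:00 INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy",
    "2021-05-25T21:04:47+00:00 INFO: ovn-node-metrics-cert not mounted. Waiting one hour.",
    "2021-05-25T21:08:32+00:00 INFO: ovn-node-metrics-certs mounted, starting kube-rbac-proxy",
]

print(mount_wait_seconds(log_lines))
# prints [220.0, 240.0, 225.0]
```

So in this run the certs were mounted roughly four minutes after the first check on each node.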

Comment 16 errata-xmlrpc 2021-06-01 12:10:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.31 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2100

