Bug 1999255 - ovnkube-node always crashes out the first time it starts [NEEDINFO]
Summary: ovnkube-node always crashes out the first time it starts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Christoph Stäbler
QA Contact: Dan Brahaney
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-30 18:11 UTC by Dan Winship
Modified: 2022-03-12 04:38 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:37:58 UTC
Target Upstream Version:
Embargoed:
anusaxen: needinfo? (dbrahane)




Links
System ID                                    Status   Summary                                                   Last Updated
Github ovn-org/ovn-kubernetes pull 2523      Merged   Node wait for Controller before initializing Gateway     2021-10-29 13:48:06 UTC
Red Hat Product Errata RHSA-2022:0056        None     None                                                      2022-03-12 04:38:16 UTC

Description Dan Winship 2021-08-30 18:11:13 UTC
Even on successful e2e ovn runs, each of the ovnkube-node pods has a previous.log that ends with:

F0830 07:32:04.337095    2679 ovnkube.go:131] error waiting for node readiness: failed while waiting on patch port "patch-br-ex_ip-10-0-144-192.us-west-1.compute.internal-to-br-int" to be created by ovn-controller and while getting ofport. stderr: "ovs-vsctl: no row \"patch-br-ex_ip-10-0-144-192.us-west-1.compute.internal-to-br-int\" in table Interface\n", error: exit status 1

That's from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1432239942456578048, specifically https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1432239942456578048/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-7dd8x_ovnkube-node_previous.log.
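For illustration only (a hypothetical helper, not the actual ovn-kubernetes source), the failing step amounts to a one-shot ovs-vsctl lookup of the patch port's ofport; if ovn-controller has not created the port yet, the Interface row is missing, ovs-vsctl exits non-zero, and ovnkube-node bails out:

// Illustrative sketch only; names are assumptions.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// getPatchPortOfport reads the ofport of the gateway patch port with a single
// ovs-vsctl call. A missing row produces the same "no row ... in table
// Interface" failure seen in the log above.
func getPatchPortOfport(patchPort string) (string, error) {
	out, err := exec.Command("ovs-vsctl", "--timeout=15", "get",
		"Interface", patchPort, "ofport").CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("failed while getting ofport of %q, stderr: %q, error: %v",
			patchPort, string(out), err)
	}
	return strings.TrimSpace(string(out)), nil
}

func main() {
	// Hypothetical patch port name, following the patch-br-ex_<node>-to-br-int pattern.
	ofport, err := getPatchPortOfport("patch-br-ex_example-node-to-br-int")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("ofport:", ofport)
}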

The corresponding ovn-controller log suggests ovnkube-node just didn't wait long enough. Despite the claim in the error message, ovnkube-node is not actually *waiting* for ovn-controller; it is simply assuming ovn-controller will already be ready by the time it reaches that check, which is turning out to be false.
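A minimal sketch of the retry approach that the linked PR 2523's title describes ("Node wait for Controller before initializing Gateway"). The function name and the use of wait.PollImmediate are assumptions, not the actual patch; the point is to poll until ovn-controller has created the patch port instead of checking once:

// Sketch of a poll-until-ready check; not the actual PR 2523 code.
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPatchPort polls until ovn-controller has created the patch port and
// its ofport can be read, rather than assuming the port already exists.
func waitForPatchPort(patchPort string, timeout time.Duration) (string, error) {
	var ofport string
	err := wait.PollImmediate(500*time.Millisecond, timeout, func() (bool, error) {
		out, err := exec.Command("ovs-vsctl", "--timeout=15", "get",
			"Interface", patchPort, "ofport").CombinedOutput()
		if err != nil {
			// Port not created by ovn-controller yet; keep polling.
			return false, nil
		}
		ofport = strings.TrimSpace(string(out))
		return true, nil
	})
	if err != nil {
		return "", fmt.Errorf("timed out waiting for patch port %q: %v", patchPort, err)
	}
	return ofport, nil
}

func main() {
	// Hypothetical patch port name and timeout, for illustration only.
	ofport, err := waitForPatchPort("patch-br-ex_example-node-to-br-int", 60*time.Second)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("patch port ready, ofport:", ofport)
}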

Comment 5 zhaozhanqi 2022-01-25 03:39:28 UTC
I think this bug can be moved to VERIFIED on 4.10.0-0.nightly-2022-01-24-020644.

No restarts for the ovnkube-node pods:

$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS      AGE
ovnkube-master-92jk8   6/6     Running   0             43m
ovnkube-master-fnrf6   6/6     Running   6 (42m ago)   43m
ovnkube-master-ncx24   6/6     Running   6 (41m ago)   43m
ovnkube-node-bq8dw     5/5     Running   0             43m
ovnkube-node-fkd2z     5/5     Running   0             26m
ovnkube-node-mqwqb     5/5     Running   0             26m
ovnkube-node-qrlb4     5/5     Running   0             43m
ovnkube-node-s6m6v     5/5     Running   0             26m
ovnkube-node-zbjd7     5/5     Running   0             43m

Comment 12 errata-xmlrpc 2022-03-12 04:37:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

