Bug 1999255 - ovnkube-node always crashes out the first time it starts
Summary: ovnkube-node always crashes out the first time it starts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.10.0
Assignee: Christoph Stäbler
QA Contact: Dan Brahaney
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-30 18:11 UTC by Dan Winship
Modified: 2023-09-15 01:14 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-12 04:37:58 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub: ovn-org/ovn-kubernetes pull 2523 (Merged): "Node wait for Controller before initializing Gateway" (last updated 2021-10-29 13:48:06 UTC)
- Red Hat Product Errata: RHSA-2022:0056 (last updated 2022-03-12 04:38:16 UTC)

Description Dan Winship 2021-08-30 18:11:13 UTC
Even on successful e2e OVN runs, each of the ovnkube-node pods has a previous.log that ends with:

F0830 07:32:04.337095    2679 ovnkube.go:131] error waiting for node readiness: failed while waiting on patch port "patch-br-ex_ip-10-0-144-192.us-west-1.compute.internal-to-br-int" to be created by ovn-controller and while getting ofport. stderr: "ovs-vsctl: no row \"patch-br-ex_ip-10-0-144-192.us-west-1.compute.internal-to-br-int\" in table Interface\n", error: exit status 1

That's from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1432239942456578048, specifically https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn/1432239942456578048/artifacts/e2e-aws-ovn/gather-extra/artifacts/pods/openshift-ovn-kubernetes_ovnkube-node-7dd8x_ovnkube-node_previous.log.

The corresponding ovn-controller log suggests ovnkube-node just didn't wait long enough. In fact, despite the claim in the error message, ovnkube-node is not actually *waiting* for ovn-controller; it simply assumes ovn-controller will already be ready by the time it reaches that check, which is turning out to be false.
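The upstream fix linked above (ovn-kubernetes pull 2523, "Node wait for Controller before initializing Gateway") makes the node poll for the patch port rather than checking once. As a rough shell illustration of that pattern, not the actual ovn-kubernetes code: the port name is the one from the log above, and the 300-second deadline and 0.5-second interval are made-up numbers.

# Poll until ovn-controller has created the patch port and assigned it a
# valid ofport, instead of failing on the first "no row" error.
port="patch-br-ex_ip-10-0-144-192.us-west-1.compute.internal-to-br-int"
deadline=$((SECONDS + 300))  # illustrative deadline, not the fix's value
until ofport=$(ovs-vsctl --timeout=15 get Interface "$port" ofport 2>/dev/null) \
      && [ "${ofport:-0}" -gt 0 ]; do
    if (( SECONDS >= deadline )); then
        echo "timed out waiting for patch port $port" >&2
        exit 1
    fi
    sleep 0.5  # back off briefly before re-checking
done
echo "patch port $port has ofport $ofport"

Retrying against a bounded deadline turns the startup ordering race between ovnkube-node and ovn-controller into, at worst, a short delay rather than a fatal crash and container restart.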

Comment 5 zhaozhanqi 2022-01-25 03:39:28 UTC
I think this bug can be moved to VERIFIED on 4.10.0-0.nightly-2022-01-24-020644.

No restarts for the ovnkube-node pods (see also the per-pod check below the listing):

$ oc get pod -n openshift-ovn-kubernetes
NAME                   READY   STATUS    RESTARTS      AGE
ovnkube-master-92jk8   6/6     Running   0             43m
ovnkube-master-fnrf6   6/6     Running   6 (42m ago)   43m
ovnkube-master-ncx24   6/6     Running   6 (41m ago)   43m
ovnkube-node-bq8dw     5/5     Running   0             43m
ovnkube-node-fkd2z     5/5     Running   0             26m
ovnkube-node-mqwqb     5/5     Running   0             26m
ovnkube-node-qrlb4     5/5     Running   0             43m
ovnkube-node-s6m6v     5/5     Running   0             26m
ovnkube-node-zbjd7     5/5     Running   0             43m
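
To print a per-pod restart count rather than eyeballing the RESTARTS column, something like the following should work; the app=ovnkube-node label selector is an assumption about how the daemonset labels its pods:

$ oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'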

Comment 12 errata-xmlrpc 2022-03-12 04:37:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 13 Red Hat Bugzilla 2023-09-15 01:14:26 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.

