Bug 1908231 - [Migration] The ovnkube-node pods are in CrashLoopBackOff after SDN to OVN migration
Summary: [Migration] The ovnkube-node pods are in CrashLoopBackOff after SDN to OVN migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Peng Liu
QA Contact: huirwang
URL:
Whiteboard:
Duplicates: 1908076 1909187
Depends On:
Blocks: ocp-47-z-tracker
 
Reported: 2020-12-16 07:15 UTC by huirwang
Modified: 2021-02-24 15:45 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:44:54 UTC
Target Upstream Version:
Embargoed:


Attachments
ovnkube-node.log (44.50 KB, text/plain)
2020-12-18 04:22 UTC, huirwang


Links
Github openshift ovn-kubernetes pull 393 (closed): Bug 1908231: Assign readyFunc during local gateway init (last updated 2021-02-02 21:33:24 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:45:16 UTC)

Description huirwang 2020-12-16 07:15:28 UTC
Description of problem:
The ovnkube-node pods are in CrashLoopBackOff after SDN to OVN migration.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-14-165231 

How reproducible:


Steps to Reproduce:
1. oc annotate Network.operator.openshift.io cluster networkoperator.openshift.io/network-migration=""
2.
  oc patch MachineConfigPool master --type=merge --patch '{"spec":{"paused":true}}'
machineconfigpool.machineconfiguration.openshift.io/master patched
oc patch MachineConfigPool worker --type=merge --patch '{"spec":{"paused":true}}'
machineconfigpool.machineconfiguration.openshift.io/worker patched
3.
Wait until the multus DaemonSet pods in the openshift-multus namespace are recreated (see the optional checks after these steps).

4. Manually reboot all the nodes from cloud portal

5. 
oc patch MachineConfigPool master --type=merge --patch '{"spec":{"paused":false}}'
machineconfigpool.machineconfiguration.openshift.io/master patched

oc patch MachineConfigPool worker --type=merge --patch '{"spec":{"paused":false}}'
machineconfigpool.machineconfiguration.openshift.io/worker patched
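
Between the steps above, migration progress can be verified with standard commands. These are optional convenience checks, not part of the original reproducer:

oc get Network.operator.openshift.io cluster -o yaml   # confirm the migration annotation was applied (step 1)
oc -n openshift-multus get pods                        # confirm the multus pods were recreated (step 3)
oc get machineconfigpool                               # after step 5, wait for UPDATED=True on both pools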


Actual results:
MCO did not update the node config.
Then check the openshift-ovn-kubernetes pods:

oc get pods -n openshift-ovn-kubernetes -o wide
NAME                   READY   STATUS             RESTARTS   AGE   IP          NODE                                           NOMINATED NODE   READINESS GATES
ovnkube-master-27x47   6/6     Running            0          53m   10.0.0.7    huirwang-azure-lmnbl-master-0                  <none>           <none>
ovnkube-master-hvgj5   6/6     Running            0          53m   10.0.0.8    huirwang-azure-lmnbl-master-1                  <none>           <none>
ovnkube-master-n4cxt   6/6     Running            0          53m   10.0.0.6    huirwang-azure-lmnbl-master-2                  <none>           <none>
ovnkube-node-5hjzg     2/3     CrashLoopBackOff   13         53m   10.0.32.5   huirwang-azure-lmnbl-worker-centralus2-jccbd   <none>           <none>
ovnkube-node-9fz5d     2/3     CrashLoopBackOff   13         53m   10.0.32.4   huirwang-azure-lmnbl-worker-centralus1-jzzbm   <none>           <none>
ovnkube-node-k5jdn     2/3     CrashLoopBackOff   13         53m   10.0.32.6   huirwang-azure-lmnbl-worker-centralus3-k6rgf   <none>           <none>
ovnkube-node-q2c5h     2/3     CrashLoopBackOff   13         53m   10.0.0.8    huirwang-azure-lmnbl-master-1                  <none>           <none>
ovnkube-node-vvdxs     2/3     CrashLoopBackOff   13         53m   10.0.0.7    huirwang-azure-lmnbl-master-0                  <none>           <none>
ovnkube-node-wsz5q     2/3     CrashLoopBackOff   13         53m   10.0.0.6    huirwang-azure-lmnbl-master-2                  <none>           <none>
ovs-node-6hsfc         1/1     Running            0          54m   10.0.0.8    huirwang-azure-lmnbl-master-1                  <none>           <none>
ovs-node-f8d22         1/1     Running            0          54m   10.0.32.5   huirwang-azure-lmnbl-worker-centralus2-jccbd   <none>           <none>
ovs-node-g6lp5         1/1     Running            0          54m   10.0.0.6    huirwang-azure-lmnbl-master-2                  <none>           <none>
ovs-node-k4jbj         1/1     Running            0          54m   10.0.32.6   huirwang-azure-lmnbl-worker-centralus3-k6rgf   <none>           <none>
ovs-node-ttslv         1/1     Running            0          54m   10.0.32.4   huirwang-azure-lmnbl-worker-centralus1-jzzbm   <none>           <none>
ovs-node-wf2jm         1/1     Running            0          54m   10.0.0.7    huirwang-azure-lmnbl-master-0                  <none>           <none>



oc describe pod ovnkube-node-5hjzg  -n openshift-ovn-kubernetes
..........
State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   oller/pkg/node/startup-waiter.go:44 +0x94
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:42 +0xd4
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
  panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1737744]

goroutine 182 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x1948160, 0x288e420)
  /usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait.func1.1(0x414f9b, 0xc000298460, 0xc0005667b0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:45 +0x24
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection(0xc000566798, 0x1339b00, 0x0, 0x0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:211 +0x69
k8s.io/apimachinery/pkg/util/wait.pollImmediateInternal(0xc00059c6a0, 0xc0005eff98, 0xc00059c6a0, 0xc0003b6540)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:445 +0x2f
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0x1dcd6500, 0x45d964b800, 0xc000566798, 0xc0005667b0, 0x1)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:441 +0x4d
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait.func1(0xc0003a3280, 0xc0003e41e0, 0xc00038f150)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:44 +0x94
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:42 +0xd4

      Exit Code:    2
      Started:      Wed, 16 Dec 2020 14:54:28 +0800
      Finished:     Wed, 16 Dec 2020 14:54:31 +0800
    Ready:          False


Rebooting all the nodes again doesn't help.

Expected results:
SDN is migrated to OVN successfully.

Additional info:
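
The panic originates in startup-waiter.go:45 inside a wait.PollImmediate condition, and the linked fix (GitHub PR 393) is titled "Assign readyFunc during local gateway init". Together these suggest the waiter invokes a readiness callback that the local gateway init path never assigned. A minimal Go sketch of that failure mode, with illustrative names rather than the actual ovn-kubernetes types:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// startupWaiter mirrors the shape implied by the stack trace: Wait polls a
// readiness callback until it reports true or the poll times out.
type startupWaiter struct {
	readyFunc func() (bool, error) // left nil by the buggy init path
}

func (w *startupWaiter) Wait() error {
	// PollImmediate runs the condition right away, so a nil readyFunc is
	// dereferenced on the very first call, matching the SIGSEGV above.
	return wait.PollImmediate(500*time.Millisecond, 30*time.Second,
		func() (bool, error) {
			return w.readyFunc()
		})
}

func main() {
	w := &startupWaiter{} // fix, per the PR title: assign readyFunc during init
	if err := w.Wait(); err != nil {
		fmt.Println("wait failed:", err)
	}
}

Running this panics with "runtime error: invalid memory address or nil pointer dereference", the same signature seen in the pod's termination message.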

Comment 4 huirwang 2020-12-18 04:22:11 UTC
Created attachment 1740116 [details]
ovnkube-node.log

Comment 8 Tom Dale 2021-01-07 14:17:28 UTC
*** Bug 1908076 has been marked as a duplicate of this bug. ***

Comment 9 Surya Seetharaman 2021-01-07 14:30:36 UTC
*** Bug 1909187 has been marked as a duplicate of this bug. ***

Comment 11 Archana Prabhakar 2021-01-12 12:51:33 UTC
On Power, we don't see the ovnkube pods crashlooping anymore, so we can close this bug.
However, we noticed that one of the nodes remains in a SchedulingDisabled state because some pod evictions fail. It could be a new issue; we will check further.
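
For the SchedulingDisabled follow-up mentioned above, generic triage with standard oc commands would be (a sketch, not a confirmed fix for this report):

oc get nodes                   # identify the node stuck in SchedulingDisabled
oc describe node <node-name>   # check events for the failed pod evictions
oc adm drain <node-name> --ignore-daemonsets   # retry the drain once blockers are resolved
oc adm uncordon <node-name>    # re-enable scheduling after the drain completes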

Comment 14 errata-xmlrpc 2021-02-24 15:44:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

