Bug 1908231

Summary: [Migration] ovnkube-node pods are in CrashLoopBackOff after SDN to OVN migration
Product: OpenShift Container Platform
Component: Networking
Sub Component: ovn-kubernetes
Version: 4.7
Target Release: 4.7.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: TestBlocker
Reporter: huirwang
Assignee: Peng Liu <pliu>
QA Contact: huirwang
CC: aconstan, aprabhak, pliu, tdale
Last Closed: 2021-02-24 15:44:54 UTC
Type: Bug
Bug Blocks: 1903544    
Attachments:
  ovnkube-node.log (flags: none)

Description huirwang 2020-12-16 07:15:28 UTC
Description of problem:
The ovnkube-node pods are in CrashLoopBackOff after migrating from OpenShift SDN to OVN-Kubernetes.

Version-Release number of selected component (if applicable):
4.7.0-0.nightly-2020-12-14-165231 

How reproducible:


Steps to Reproduce:
1. oc annotate Network.operator.openshift.io cluster networkoperator.openshift.io/network-migration=""
2. Pause the master and worker MachineConfigPools:
oc patch MachineConfigPool master --type=merge --patch '{"spec":{"paused":true}}'
machineconfigpool.machineconfiguration.openshift.io/master patched
oc patch MachineConfigPool worker --type=merge --patch '{"spec":{"paused":true}}'
machineconfigpool.machineconfiguration.openshift.io/worker patched
3. Wait until the multus DaemonSet pods in the openshift-multus namespace are recreated (see the verification sketch after step 5).

4. Manually reboot all the nodes from the cloud portal.

5. Unpause the MachineConfigPools:
oc patch MachineConfigPool master --type='merge' --patch "{\"spec\":{\"paused\":false}}"
machineconfigpool.machineconfiguration.openshift.io/master patched

oc patch MachineConfigPool worker --type='merge' --patch "{\"spec\":{\"paused\":false}}"
machineconfigpool.machineconfiguration.openshift.io/worker patched
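
For reference (not part of the original report's steps), the intermediate state from steps 2 and 3 can be confirmed before the reboot in step 4; a minimal sketch using standard oc queries, with an arbitrary 300s timeout:

# Confirm both MachineConfigPools are paused and the multus DaemonSet
# has finished rolling out before proceeding to step 4.
oc get machineconfigpool master worker \
  -o jsonpath='{range .items[*]}{.metadata.name}{" paused="}{.spec.paused}{"\n"}{end}'
oc rollout status daemonset/multus -n openshift-multus --timeout=300s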


Actual results:
MCO did not update the node config.
Then check the openshift-ovn-kubernetes pods:

oc get pods -n openshift-ovn-kubernetes -o wide
NAME                   READY   STATUS             RESTARTS   AGE   IP          NODE                                           NOMINATED NODE   READINESS GATES
ovnkube-master-27x47   6/6     Running            0          53m   10.0.0.7    huirwang-azure-lmnbl-master-0                  <none>           <none>
ovnkube-master-hvgj5   6/6     Running            0          53m   10.0.0.8    huirwang-azure-lmnbl-master-1                  <none>           <none>
ovnkube-master-n4cxt   6/6     Running            0          53m   10.0.0.6    huirwang-azure-lmnbl-master-2                  <none>           <none>
ovnkube-node-5hjzg     2/3     CrashLoopBackOff   13         53m   10.0.32.5   huirwang-azure-lmnbl-worker-centralus2-jccbd   <none>           <none>
ovnkube-node-9fz5d     2/3     CrashLoopBackOff   13         53m   10.0.32.4   huirwang-azure-lmnbl-worker-centralus1-jzzbm   <none>           <none>
ovnkube-node-k5jdn     2/3     CrashLoopBackOff   13         53m   10.0.32.6   huirwang-azure-lmnbl-worker-centralus3-k6rgf   <none>           <none>
ovnkube-node-q2c5h     2/3     CrashLoopBackOff   13         53m   10.0.0.8    huirwang-azure-lmnbl-master-1                  <none>           <none>
ovnkube-node-vvdxs     2/3     CrashLoopBackOff   13         53m   10.0.0.7    huirwang-azure-lmnbl-master-0                  <none>           <none>
ovnkube-node-wsz5q     2/3     CrashLoopBackOff   13         53m   10.0.0.6    huirwang-azure-lmnbl-master-2                  <none>           <none>
ovs-node-6hsfc         1/1     Running            0          54m   10.0.0.8    huirwang-azure-lmnbl-master-1                  <none>           <none>
ovs-node-f8d22         1/1     Running            0          54m   10.0.32.5   huirwang-azure-lmnbl-worker-centralus2-jccbd   <none>           <none>
ovs-node-g6lp5         1/1     Running            0          54m   10.0.0.6    huirwang-azure-lmnbl-master-2                  <none>           <none>
ovs-node-k4jbj         1/1     Running            0          54m   10.0.32.6   huirwang-azure-lmnbl-worker-centralus3-k6rgf   <none>           <none>
ovs-node-ttslv         1/1     Running            0          54m   10.0.32.4   huirwang-azure-lmnbl-worker-centralus1-jzzbm   <none>           <none>
ovs-node-wf2jm         1/1     Running            0          54m   10.0.0.7    huirwang-azure-lmnbl-master-0                  <none>           <none>
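
For triage, the panic shown below can be retrieved from the crashed container's previous instance; a sketch, assuming the crashing container within the pod is named ovnkube-node:

oc logs ovnkube-node-5hjzg -n openshift-ovn-kubernetes -c ovnkube-node --previous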



oc describe pod ovnkube-node-5hjzg -n openshift-ovn-kubernetes
..........
State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   oller/pkg/node/startup-waiter.go:44 +0x94
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:42 +0xd4
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
  panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1737744]

goroutine 182 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x1948160, 0x288e420)
  /usr/lib/golang/src/runtime/panic.go:969 +0x175
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait.func1.1(0x414f9b, 0xc000298460, 0xc0005667b0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:45 +0x24
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection(0xc000566798, 0x1339b00, 0x0, 0x0)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:211 +0x69
k8s.io/apimachinery/pkg/util/wait.pollImmediateInternal(0xc00059c6a0, 0xc0005eff98, 0xc00059c6a0, 0xc0003b6540)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:445 +0x2f
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0x1dcd6500, 0x45d964b800, 0xc000566798, 0xc0005667b0, 0x1)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:441 +0x4d
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait.func1(0xc0003a3280, 0xc0003e41e0, 0xc00038f150)
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:44 +0x94
created by github.com/ovn-org/ovn-kubernetes/go-controller/pkg/node.(*startupWaiter).Wait
  /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/node/startup-waiter.go:42 +0xd4

      Exit Code:    2
      Started:      Wed, 16 Dec 2020 14:54:28 +0800
      Finished:     Wed, 16 Dec 2020 14:54:31 +0800
    Ready:          False


Rebooting all the nodes again doesn't help.
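
For context, a minimal Go sketch (not the actual ovn-kubernetes source) of the crash pattern in the trace above: Wait spawns one goroutine per waiter, each polling a readiness condition via wait.PollImmediate; the hex arguments 0x1dcd6500 and 0x45d964b800 in the trace are 500ms and 300s in nanoseconds. If the condition closure calls or dereferences something nil (here a hypothetical nil readyFn; the real value at startup-waiter.go:45 may differ), the goroutine panics with exactly this SIGSEGV, recovered and re-raised by runtime.HandleCrash:

package main

import (
	"fmt"
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waiterEntry is a hypothetical stand-in for whatever the startup waiter
// registers per component; leaving readyFn nil reproduces the panic.
type waiterEntry struct {
	name    string
	readyFn func() (bool, error)
}

// waitAll mirrors the shape of (*startupWaiter).Wait in the trace: one
// goroutine per entry ("created by ...Wait"), each polling its condition.
func waitAll(entries []waiterEntry) error {
	var wg sync.WaitGroup
	errCh := make(chan error, len(entries))
	for _, e := range entries {
		wg.Add(1)
		go func(e waiterEntry) {
			defer wg.Done()
			// Same interval/timeout as the trace's PollImmediate arguments.
			errCh <- wait.PollImmediate(500*time.Millisecond, 300*time.Second,
				func() (bool, error) {
					return e.readyFn() // nil readyFn => SIGSEGV, as in the report
				})
		}(e)
	}
	wg.Wait()
	close(errCh)
	for err := range errCh {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// An entry registered without a readiness function crashes the process
	// the same way ovnkube-node does above.
	if err := waitAll([]waiterEntry{{name: "sbdb"}}); err != nil {
		fmt.Println("wait failed:", err)
	}
}
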
Expected results:
SDN migrated to OVN successfully.

Additional info:

Comment 4 huirwang 2020-12-18 04:22:11 UTC
Created attachment 1740116 [details]
ovnkube-node.log

Comment 8 Tom Dale 2021-01-07 14:17:28 UTC
*** Bug 1908076 has been marked as a duplicate of this bug. ***

Comment 9 Surya Seetharaman 2021-01-07 14:30:36 UTC
*** Bug 1909187 has been marked as a duplicate of this bug. ***

Comment 11 Archana Prabhakar 2021-01-12 12:51:33 UTC
On Power, we don't see the ovnkube pods crashlooping anymore, so we can close this bug.
However, we noticed that one of the nodes remains in SchedulingDisabled state because some pod evictions fail. It could be a new issue; we will check further.

Comment 14 errata-xmlrpc 2021-02-24 15:44:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633