Created attachment 1643381 [details]
ovnkube pod logs

Description of problem:

A pod fails to create after deleting the ovnkube-master pod for the OVN_Southbound raft leader. After a few iterations of creating a test pod and then deleting the ovnkube-master pod, the test pod eventually fails to create.

Version-Release number of selected component (if applicable):

4.3.0-0.nightly-2019-12-09-035405

How reproducible:

Very

Steps to Reproduce:
1. oc new-project t1
2. oc create -f https://raw.githubusercontent.com/anuragthehatter/v3-testfiles/master/networking/pod-for-ping.json
3. Find the ovnkube-master pod for the OVN_Southbound leader:
4. for f in $(oc -n openshift-ovn-kubernetes get pod -l name=ovnkube-master -o jsonpath='{range .items[*]}{.metadata.name}{" "}{end}') ; do echo -e "\n${f}\n" ; oc -n openshift-ovn-kubernetes exec "${f}" -c northd -- ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound ; done
5. oc -n openshift-ovn-kubernetes delete pod $LEADER
6. oc delete pod hello-pod
7. Repeat step 2 to create hello-pod again.
8. Repeat steps 4 and 5 to delete the OVN_Southbound leader again.
9. Repeat until hello-pod fails to create within a reasonable amount of time (< 5 minutes).
10. oc describe pod hello-pod

Actual results:

Observe the following message in oc describe pod hello-pod:

Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_hello-pod_t1_20073c58-568d-4300-a1ff-f88d9b6bfc5c_0(25017624de5cdc4827377f301499f3fa698a6b9f310548a18fa4f3e4a220dd8e): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition

Expected results:

hello-pod should create successfully within a reasonable amount of time.

Additional info:
Hello, I just noticed the same error in my environment. Once I apply a MachineConfig resource and it restarts some worker nodes, I see the same error:

Normal   Scheduled               51s  default-scheduler  Successfully assigned sctptest/checksctppbsls to worker-1
Warning  FailedCreatePodSandBox  23s  kubelet, worker-1  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_checksctppbsls_sctptest_31321e45-0ccb-4de5-bcc6-dcac0f4a73b9_0(25c0a5d3429f8ede19aa65166c7767d84c5028069e3ad9d00c5fc7be7f437d9d): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition

It happens sporadically but quite often: roughly every second pod is stuck in the ContainerCreating state. My environment is libvirt.

[root@cnv-qe-01 ~]# oc version
Client Version: 4.4.0-0.ci-2020-01-15-100606
Server Version: 4.4.0-0.ci-2020-01-15-100606
Kubernetes Version: v1.17.0

I also noticed that the ovn pods are restarting:

openshift-ovn-kubernetes   ovnkube-master-fx8gg   4/4   Running   4    20h
openshift-ovn-kubernetes   ovnkube-master-xczjm   4/4   Running   9    71m
openshift-ovn-kubernetes   ovnkube-master-z7g44   4/4   Running   20   20h

I tried rebooting the entire cluster, after which I noticed the following panics in the OVN master.

The first one is:

I0116 10:19:01.629834       1 leaderelection.go:235] attempting to acquire leader lease  openshift-ovn-kubernetes/ovn-kubernetes-master...
I0116 10:19:01.640741       1 leaderelection.go:245] successfully acquired lease openshift-ovn-kubernetes/ovn-kubernetes-master
time="2020-01-16T10:19:01Z" level=info msg=" I (master-1) won the election. In active mode"
panic: assignment to entry in nil map

goroutine 29 [running]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).getLbEndpoints(0xc000341440, 0xc0008ad3a8, 0xc0007aaa50, 0xc0007aaa80)
	/go-controller/pkg/ovn/endpoints.go:33 +0x421
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).handleNodePortLB(0xc000341440, 0xc00014b000, 0x1, 0xc00003e3f0)
	/go-controller/pkg/ovn/endpoints.go:162 +0x783
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).syncGatewayLogicalNetwork(0xc000341440, 0xc00014b000, 0xc000822db0, 0xc00081eea0, 0xd, 0x1463860, 0x1634fac)
	/go-controller/pkg/ovn/master.go:409 +0x567
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).syncNodeGateway(0xc000341440, 0xc00014b000, 0xc000822750, 0x0, 0x0)
	/go-controller/pkg/ovn/ovn.go:498 +0x11d
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNodes.func1(0x161e500, 0xc00014b000)
	/go-controller/pkg/ovn/ovn.go:524 +0x246
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:195
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd(0xc0004e5b80, 0x18b2ce0, 0xc0007c3640, 0x161e500, 0xc00014b000)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:227 +0x69
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).addHandler(0xc000347a10, 0x18f97c0, 0x161e500, 0x0, 0x0, 0x0, 0x18b2ce0, 0xc0007c3640, 0xc0008ade08, 0x0, ...)
	/go-controller/pkg/factory/factory.go:268 +0x575
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).AddFilteredNodeHandler(...)
	/go-controller/pkg/factory/factory.go:374
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNodes(0xc000341440, 0x0, 0xc0000e7590, 0x7)
	/go-controller/pkg/ovn/ovn.go:509 +0x1d5
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).Run(0xc000341440, 0x0, 0xc00009a780, 0x0, 0x0)
	/go-controller/pkg/ovn/ovn.go:166 +0x52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*HAMasterController).ConfigureAsActive(0xc000717720, 0x7fffa9418bef, 0x8, 0x28, 0xc00042f7b0)
	/go-controller/pkg/ovn/ha_master.go:209 +0x96
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*HAMasterController).StartHAMasterController.func1(0x18b6da0, 0xc0002bc400)
	/go-controller/pkg/ovn/ha_master.go:79 +0xd7
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/go-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

The second one is:

time="2020-01-16T10:24:20Z" level=info msg="Setting annotations map[k8s.ovn.org/pod-networks:{\"default\":{\"ip_address\":\"10.128.2.8/23\",\"mac_address\":\"fa:e8:5b:80:02:09\",\"gateway_ip\":\"10.128.2.1\"}} ovn:{\"ip_address\":\"10.128.2.8/23\",\"mac_address\":\"fa:e8:5b:80:02:09\",\"gateway_ip\":\"10.128.2.1\"}] on pod sctptest/sctpclient2gphn"
time="2020-01-16T10:24:21Z" level=info msg="[sctptest/sctpclient2gphn] addLogicalPort took 517.143088ms"
E0116 10:24:30.357812       1 runtime.go:73] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
goroutine 87 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x148db40, 0x186a0d0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:69 +0x7b
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51 +0x82
panic(0x148db40, 0x186a0d0)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).getLbEndpoints(0xc000aa86c0, 0xc0001857c0, 0xc0007d5290, 0xc0007d52c0)
	/go-controller/pkg/ovn/endpoints.go:33 +0x421
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).AddEndpoints(0xc000aa86c0, 0xc0001857c0, 0x13ea7c0, 0x2510280)
	/go-controller/pkg/ovn/endpoints.go:59 +0x109
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchEndpoints.func2(0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/pkg/ovn/ovn.go:424 +0x19c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:202
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate(0xc00057ed00, 0x18b2ce0, 0xc00038fe20, 0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:236 +0x120
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).newFederatedHandler.func2.1(0xc000bdc1e0)
	/go-controller/pkg/factory/factory.go:151 +0x64
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc000412090, 0x161d700, 0xc0001857c0, 0xc00086fcd8)
	/go-controller/pkg/factory/factory.go:54 +0x16d
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).newFederatedHandler.func2(0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/pkg/factory/factory.go:150 +0x8c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:202
k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0xc000466c00, 0xc000480a00, 0x0)
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:605 +0x1dd
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc00086fe38, 0x42c47f, 0xc000480a30)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:284 +0x51
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:601 +0x79
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00032ef68)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00086ff68, 0xdf8475800, 0x0, 0x42cd01, 0xc0003298c0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0003eac80)
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:599 +0x9c
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc000393170, 0xc0004a6080)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62
panic: assignment to entry in nil map [recovered]
	panic: assignment to entry in nil map

goroutine 87 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x148db40, 0x186a0d0)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).getLbEndpoints(0xc000aa86c0, 0xc0001857c0, 0xc0007d5290, 0xc0007d52c0)
	/go-controller/pkg/ovn/endpoints.go:33 +0x421
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).AddEndpoints(0xc000aa86c0, 0xc0001857c0, 0x13ea7c0, 0x2510280)
	/go-controller/pkg/ovn/endpoints.go:59 +0x109
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchEndpoints.func2(0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/pkg/ovn/ovn.go:424 +0x19c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:202
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate(0xc00057ed00, 0x18b2ce0, 0xc00038fe20, 0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:236 +0x120
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).newFederatedHandler.func2.1(0xc000bdc1e0)
	/go-controller/pkg/factory/factory.go:151 +0x64
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc000412090, 0x161d700, 0xc0001857c0, 0xc00086fcd8)
	/go-controller/pkg/factory/factory.go:54 +0x16d
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).newFederatedHandler.func2(0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/pkg/factory/factory.go:150 +0x8c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:202
k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0xc000466c00, 0xc000480a00, 0x0)
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:605 +0x1dd
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc00086fe38, 0x42c47f, 0xc000480a30)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:284 +0x51
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:601 +0x79
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00032ef68)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00086ff68, 0xdf8475800, 0x0, 0x42cd01, 0xc0003298c0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0003eac80)
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:599 +0x9c
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc000393170, 0xc0004a6080)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62
Please provide an update regarding my comment above.
Hi

This has been fixed in upstream ovn-kubernetes and should already be present downstream in openshift/ovn-kubernetes in 4.3/4.4/4.5. Go ahead and re-test.

-Alex
Unable to reproduce in 4.4.0-0.nightly-2020-03-05-084644
@Alexander, how did you determine that this is fixed in 4.3? We see it failing in 4.3, and apparently there is no 4.3 clone. I can create one, but before doing so I wanted to confirm that it was expected to be fixed and somehow got missed.
Hi

Yes, this issue, together with a bunch of others, should be fixed, because branches 4.4 and 4.3 should be completely aligned (we have basically rebased one on top of the other). What are you experiencing in your tests on 4.3: the panic in #comment 4, or the 400 in #comment 1? And which tests are you actually performing (I assume the steps described in #comment 1)?

-Alex
I'll let Nikita or Marius provide the technical details.
Hi,

We are facing the same issue. Please take a look at the logs below.

We noticed that the ovnkube pods crash all the time:

> oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovnkube-master-lwggs   4/4     Running            14         3d18h
ovnkube-master-q2xht   3/4     Error              14         3d18h
ovnkube-master-s29zx   3/4     CrashLoopBackOff   13         3d18h
ovnkube-node-9nt7j     3/3     Running            0          3d17h
ovnkube-node-ffvjz     3/3     Running            0          3d18h
ovnkube-node-fvqbb     3/3     Running            0          3d18h
ovnkube-node-hchb2     3/3     Running            0          3d17h
ovnkube-node-jlzjd     3/3     Running            0          3d18h

This is the event from the pod I'm trying to start:

Warning  FailedCreatePodSandBox  64s  kubelet, openshift-worker-1.qe1.kni.lab.eng.bos.redhat.com  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_checksctph5z8j_sctptest_881d34ae-fb06-4b3e-98be-a58715b599a2_0(4db0832a72074eb3ffb389132ebcb2e41636e26b2c808d2d8d2c587e6a6113e5): Multus: [sctptest/checksctph5z8j]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[sctptest/checksctph5z8j] failed to get pod annotation: timed out waiting for the condition

This happens before the actual test. I cannot run the pod at all. The test is meant to validate that the SCTP protocol works correctly, but it fails before the test even starts.
Hi

Wait a minute, you're testing SCTP? That is not supported by OVN right now and is a current work in progress: https://github.com/ovn-org/ovn-kubernetes/issues/1120 and https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=5254&view=planning&selectedIssue=SDN-138

-Alex
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581