Bug 1781297 - pod fails to create after delete ovnkube-master OVN_Southbound leader - status 400: 'failed to get pod annotation: timed out waiting for the condition
Summary: pod fails to create after delete ovnkube-master OVN_Southbound leader - statu...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 4.4.0
Assignee: Alexander Constantinescu
QA Contact: Ross Brattain
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-12-09 17:51 UTC by Ross Brattain
Modified: 2021-03-15 20:19 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-13 21:54:24 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-13 21:54:26 UTC

Description Ross Brattain 2019-12-09 17:51:56 UTC
Created attachment 1643381 [details]
ovnkube pod logs

Description of problem:

Pod fails to create after deleting the ovnkube-master pod for the OVN_Southbound raft leader.

After a few iterations of creating a testpod, then delete the ovnkube-master pod, the test pod eventually fails to create.


Version-Release number of selected component (if applicable):

4.3.0-0.nightly-2019-12-09-035405


How reproducible:

very

Steps to Reproduce:
1. oc new-project t1
2. oc create -f https://raw.githubusercontent.com/anuragthehatter/v3-testfiles/master/networking/pod-for-ping.json
3. Find ovnkube-master pod for the OVN_Southbound leader
4. for f in $(oc -n openshift-ovn-kubernetes get pod -l name=ovnkube-master -o jsonpath='{range .items[*]}{.metadata.name}{" "}{end}')  ; do echo -e "\n${f}\n" ; oc  -n openshift-ovn-kubernetes exec "${f}" -c northd -- ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound  ; done
5. oc -n openshift-ovn-kubernetes delete pod $LEADER
6. oc delete pod hello-pod
7. Repeat step 2 to create hello-pod again
8. Repeat step 4 and 5 to delete the OVN_Southbound leader again.
9. Repeat steps until hello-pod fails to create within a reasonable amount of time, < 5 minutes.
10. oc describe pod hello-pod

Actual results:

Observe message in oc describe pod hello-pod

Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_hello-pod_t1_20073c58-568d-4300-a1ff-f88d9b6bfc5c_0(25017624de5cdc4827377f301499f3fa698a6b9f310548a18fa4f3e4a220dd8e):Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition

Expected results:

hello-pod should create successfully within a reasonable amount of time.

Additional info:

Comment 4 Nikita 2020-01-16 10:33:53 UTC
Hello,
I just noticed the same error in my environment. Once I apply MachineConfig resource and it restarts some worker nodes I face with the same error:
  Normal   Scheduled               51s   default-scheduler  Successfully assigned sctptest/checksctppbsls to worker-1
  Warning  FailedCreatePodSandBox  23s   kubelet, worker-1  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_checksctppbsls_sctptest_31321e45-0ccb-4de5-bcc6-dcac0f4a73b9_0(25c0a5d3429f8ede19aa65166c7767d84c5028069e3ad9d00c5fc7be7f437d9d): Multus: error adding pod to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: 'failed to get pod annotation: timed out waiting for the condition

It happened sporadically but pretty common like every second pod is in ContainerCreating state.

My environment is libvirt
[root@cnv-qe-01 ~]# oc version
Client Version: 4.4.0-0.ci-2020-01-15-100606
Server Version: 4.4.0-0.ci-2020-01-15-100606
Kubernetes Version: v1.17.0

Also I notice that ovn pods are restarting 
openshift-ovn-kubernetes                                ovnkube-master-fx8gg                                              4/4     Running             4          20h
openshift-ovn-kubernetes                                ovnkube-master-xczjm                                              4/4     Running             9          71m
openshift-ovn-kubernetes                                ovnkube-master-z7g44                                              4/4     Running             20         20h

I tried to reboot entire cluster and after it I noticed the following panics in OVN master:

The first one is:
I0116 10:19:01.629834       1 leaderelection.go:235] attempting to acquire leader lease  openshift-ovn-kubernetes/ovn-kubernetes-master...
I0116 10:19:01.640741       1 leaderelection.go:245] successfully acquired lease openshift-ovn-kubernetes/ovn-kubernetes-master
time="2020-01-16T10:19:01Z" level=info msg=" I (master-1) won the election. In active mode"

panic: assignment to entry in nil map

goroutine 29 [running]:
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).getLbEndpoints(0xc000341440, 0xc0008ad3a8, 0xc0007aaa50, 0xc0007aaa80)
	/go-controller/pkg/ovn/endpoints.go:33 +0x421
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).handleNodePortLB(0xc000341440, 0xc00014b000, 0x1, 0xc00003e3f0)
	/go-controller/pkg/ovn/endpoints.go:162 +0x783
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).syncGatewayLogicalNetwork(0xc000341440, 0xc00014b000, 0xc000822db0, 0xc00081eea0, 0xd, 0x1463860, 0x1634fac)
	/go-controller/pkg/ovn/master.go:409 +0x567
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).syncNodeGateway(0xc000341440, 0xc00014b000, 0xc000822750, 0x0, 0x0)
	/go-controller/pkg/ovn/ovn.go:498 +0x11d
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNodes.func1(0x161e500, 0xc00014b000)
	/go-controller/pkg/ovn/ovn.go:524 +0x246
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:195
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd(0xc0004e5b80, 0x18b2ce0, 0xc0007c3640, 0x161e500, 0xc00014b000)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:227 +0x69
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).addHandler(0xc000347a10, 0x18f97c0, 0x161e500, 0x0, 0x0, 0x0, 0x18b2ce0, 0xc0007c3640, 0xc0008ade08, 0x0, ...)
	/go-controller/pkg/factory/factory.go:268 +0x575
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).AddFilteredNodeHandler(...)
	/go-controller/pkg/factory/factory.go:374
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchNodes(0xc000341440, 0x0, 0xc0000e7590, 0x7)
	/go-controller/pkg/ovn/ovn.go:509 +0x1d5
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).Run(0xc000341440, 0x0, 0xc00009a780, 0x0, 0x0)
	/go-controller/pkg/ovn/ovn.go:166 +0x52
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*HAMasterController).ConfigureAsActive(0xc000717720, 0x7fffa9418bef, 0x8, 0x28, 0xc00042f7b0)
	/go-controller/pkg/ovn/ha_master.go:209 +0x96
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*HAMasterController).StartHAMasterController.func1(0x18b6da0, 0xc0002bc400)
	/go-controller/pkg/ovn/ha_master.go:79 +0xd7
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/go-controller/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec


The second one is:

time="2020-01-16T10:24:20Z" level=info msg="Setting annotations map[k8s.ovn.org/pod-networks:{\"default\":{\"ip_address\":\"10.128.2.8/23\",\"mac_address\":\"fa:e8:5b:80:02:09\",\"gateway_ip\":\"10.128.2.1\"}} ovn:{\"ip_address\":\"10.128.2.8/23\",\"mac_address\":\"fa:e8:5b:80:02:09\",\"gateway_ip\":\"10.128.2.1\"}] on pod sctptest/sctpclient2gphn"
time="2020-01-16T10:24:21Z" level=info msg="[sctptest/sctpclient2gphn] addLogicalPort took 517.143088ms"
E0116 10:24:30.357812       1 runtime.go:73] Observed a panic: "assignment to entry in nil map" (assignment to entry in nil map)
goroutine 87 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x148db40, 0x186a0d0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:69 +0x7b
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51 +0x82
panic(0x148db40, 0x186a0d0)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).getLbEndpoints(0xc000aa86c0, 0xc0001857c0, 0xc0007d5290, 0xc0007d52c0)
	/go-controller/pkg/ovn/endpoints.go:33 +0x421
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).AddEndpoints(0xc000aa86c0, 0xc0001857c0, 0x13ea7c0, 0x2510280)
	/go-controller/pkg/ovn/endpoints.go:59 +0x109
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchEndpoints.func2(0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/pkg/ovn/ovn.go:424 +0x19c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:202
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate(0xc00057ed00, 0x18b2ce0, 0xc00038fe20, 0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:236 +0x120
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).newFederatedHandler.func2.1(0xc000bdc1e0)
	/go-controller/pkg/factory/factory.go:151 +0x64
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc000412090, 0x161d700, 0xc0001857c0, 0xc00086fcd8)
	/go-controller/pkg/factory/factory.go:54 +0x16d
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).newFederatedHandler.func2(0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/pkg/factory/factory.go:150 +0x8c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:202
k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0xc000466c00, 0xc000480a00, 0x0)
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:605 +0x1dd
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc00086fe38, 0x42c47f, 0xc000480a30)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:284 +0x51
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:601 +0x79
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00032ef68)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00086ff68, 0xdf8475800, 0x0, 0x42cd01, 0xc0003298c0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0003eac80)
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:599 +0x9c
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc000393170, 0xc0004a6080)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62
panic: assignment to entry in nil map [recovered]
	panic: assignment to entry in nil map

goroutine 87 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x148db40, 0x186a0d0)
	/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).getLbEndpoints(0xc000aa86c0, 0xc0001857c0, 0xc0007d5290, 0xc0007d52c0)
	/go-controller/pkg/ovn/endpoints.go:33 +0x421
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).AddEndpoints(0xc000aa86c0, 0xc0001857c0, 0x13ea7c0, 0x2510280)
	/go-controller/pkg/ovn/endpoints.go:59 +0x109
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/ovn.(*Controller).WatchEndpoints.func2(0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/pkg/ovn/ovn.go:424 +0x19c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:202
k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnUpdate(0xc00057ed00, 0x18b2ce0, 0xc00038fe20, 0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:236 +0x120
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).newFederatedHandler.func2.1(0xc000bdc1e0)
	/go-controller/pkg/factory/factory.go:151 +0x64
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*informer).forEachHandler(0xc000412090, 0x161d700, 0xc0001857c0, 0xc00086fcd8)
	/go-controller/pkg/factory/factory.go:54 +0x16d
github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*WatchFactory).newFederatedHandler.func2(0x161d700, 0xc000ad6b40, 0x161d700, 0xc0001857c0)
	/go-controller/pkg/factory/factory.go:150 +0x8c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
	/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:202
k8s.io/client-go/tools/cache.(*processorListener).run.func1.1(0xc000466c00, 0xc000480a00, 0x0)
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:605 +0x1dd
k8s.io/apimachinery/pkg/util/wait.ExponentialBackoff(0x989680, 0x3ff0000000000000, 0x3fb999999999999a, 0x5, 0x0, 0xc00086fe38, 0x42c47f, 0xc000480a30)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:284 +0x51
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:601 +0x79
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc00032ef68)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00086ff68, 0xdf8475800, 0x0, 0x42cd01, 0xc0003298c0)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0003eac80)
	/go-controller/vendor/k8s.io/client-go/tools/cache/shared_informer.go:599 +0x9c
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1(0xc000393170, 0xc0004a6080)
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:71 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
	/go-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:69 +0x62

Comment 5 Nikita 2020-01-16 10:40:47 UTC
Please update regarding my comment above.

Comment 6 Alexander Constantinescu 2020-03-05 14:23:25 UTC
Hi

So this has been fixed in upstreams ovnkubernetes, and should be existing downstreams openshift/ovn-kubernetes in 4.3/4.4/4.5

Go ahead and re-test

-Alex

Comment 7 Ross Brattain 2020-03-06 04:11:14 UTC
Unable to reproduce in 4.4.0-0.nightly-2020-03-05-084644

Comment 9 Nelly Credi 2020-03-10 15:06:00 UTC
@Alexander, how did you figure that this is fixed on 4.3? because we see it failing in 4.3,
And apparently there is no 4.3 clone, so I can create one, but I wanted to make sure before that we were expecting it to be fixed, 
and somehow it got missed

Comment 10 Alexander Constantinescu 2020-03-10 15:15:15 UTC
Hi

Yes this issue together with a bunch of other should be fixed because branch 4.4 and 4.3 should be completely aligned (we have basically rebased them one on top of the other). 

What are you experiencing in your tests on 4.3, is it the panic in #comment 4? Or the 400 in #comment 1? Which tests are you actually performing by the way (I guess the steps described  in #comment 1)?

- Alex

Comment 11 Nelly Credi 2020-03-11 08:46:55 UTC
Ill let Nikita or Marius provide the technical details

Comment 12 Nikita 2020-03-11 09:42:08 UTC
Hi,

We faced with the same issue. Please take a look on the logs below:
We noticed that ovnkube pods crash all the time
> oc get pods -n openshift-ovn-kubernetes
NAME                   READY   STATUS             RESTARTS   AGE
ovnkube-master-lwggs   4/4     Running            14         3d18h
ovnkube-master-q2xht   3/4     Error              14         3d18h
ovnkube-master-s29zx   3/4     CrashLoopBackOff   13         3d18h
ovnkube-node-9nt7j     3/3     Running            0          3d17h
ovnkube-node-ffvjz     3/3     Running            0          3d18h
ovnkube-node-fvqbb     3/3     Running            0          3d18h
ovnkube-node-hchb2     3/3     Running            0          3d17h
ovnkube-node-jlzjd     3/3     Running            0          3d18h

This is log from the pod which I'm trying to start:

Warning  FailedCreatePodSandBox  64s  kubelet, openshift-worker-1.qe1.kni.lab.eng.bos.redhat.com  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_checksctph5z8j_sctptest_881d34ae-fb06-4b3e-98be-a58715b599a2_0(4db0832a72074eb3ffb389132ebcb2e41636e26b2c808d2d8d2c587e6a6113e5): Multus: [sctptest/checksctph5z8j]: error adding container to network "ovn-kubernetes": delegateAdd: error invoking DelegateAdd - "ovn-k8s-cni-overlay": error in getting result from AddNetwork: CNI request failed with status 400: '[sctptest/checksctph5z8j] failed to get pod annotation: timed out waiting for the condition

This happened before actual test. I can not run the pod.
The idea of the test is validate that sctp protocol works correctly but as I noticed it falls before the test.

Comment 13 Alexander Constantinescu 2020-03-11 10:00:15 UTC
Hi

Wait a minute, you're testing SCTP? This is not support by OVN right now and is a current work-in-progress, https://github.com/ovn-org/ovn-kubernetes/issues/1120 and https://issues.redhat.com/secure/RapidBoard.jspa?rapidView=5254&view=planning&selectedIssue=SDN-138

-Alex

Comment 15 errata-xmlrpc 2020-05-13 21:54:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581


Note You need to log in before you can comment on or make changes to this bug.