Created attachment 1615165 [details]
ovn node logs

Description of problem:
This is happening more often on IPI Azure with OVN. Containers are getting stuck in the ContainerCreating state.

$ oc get pods -n x1
NAME            READY   STATUS              RESTARTS   AGE
test-rc-jfc54   0/1     ContainerCreating   0          2m40s
test-rc-n62ks   0/1     ContainerCreating   0          2m40s

[anusaxen@anusaxen Desktop]$ oc get pods -n x2
NAME            READY   STATUS              RESTARTS   AGE
test-rc-j5xwm   0/1     ContainerCreating   0          2m36s
test-rc-qgjq4   0/1     ContainerCreating   0          2m36s

[anusaxen@anusaxen Desktop]$ oc get pods -n x3
NAME            READY   STATUS              RESTARTS   AGE
test-rc-kg4vx   0/1     ContainerCreating   0          2m37s
test-rc-rfdw5   0/1     ContainerCreating   0          2m37s

$ cat /etc/cni/net.d/
100-crio-bridge.conf  200-loopback.conf  87-podman-bridge.conflist

Check "Additional info" below for the events on one of the pods.

Version-Release number of selected component (if applicable): 4.2.0-0.nightly-2019-09-11-202233

How reproducible:
Often

Steps to Reproduce:
1. Create a couple (or more) of projects and create pods inside them.
2. Delete all of the projects one by one.
3. Create a couple (or more) of projects again and create pods inside them.
4. The new pods get stuck in the ContainerCreating state.

Actual results:
Unable to create pods due to CNI request errors.

Expected results:
Should be able to create pods.

Additional info:
oc describe events on one of the pods:

Events:
  Type     Reason                  Age    From                                                      Message
  ----     ------                  ----   ----                                                      -------
  Normal   Scheduled               4m52s  default-scheduler                                         Successfully assigned x1/test-rc-jfc54 to qe-anurag-azure12-klwdn-worker-centralus1-7jwl5
  Warning  FailedCreatePodSandBox  4m26s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(701239727ab417d2db017f1244dc58271c0e27191f2a574a94243977c0198473): CNI request failed with status 400: 'Nil response to CNI request '
  Warning  FailedCreatePodSandBox  3m51s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(5e93cb31a769da56e1b5646d8a6346affa8c13c9d50ee16a0e382526001b87d8): CNI request failed with status 400: 'Nil response to CNI request '
  Warning  FailedCreatePodSandBox  3m14s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(17f5e7b128e4c4284203dfb389ae7b75d3ff389689ad09ae4818bfe4c5c5265e): CNI request failed with status 400: 'Nil response to CNI request '
  Warning  FailedCreatePodSandBox  2m37s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(321c479c9531ecae862eae231cf4b87cff5780e6223fed20be17e6da3c1112c7): CNI request failed with status 400: 'Nil response to CNI request '
  Warning  FailedCreatePodSandBox  118s   kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(f23882309586c75031898c65d388edfc5d6930466ef52bbc557cf01f2aa8fb00): CNI request failed with status 400: 'Nil response to CNI request '
  Warning  FailedCreatePodSandBox  83s    kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(e232e425e99d2e32429795cc927a621a063faf45c3bbbcdfa375d1e75030819f): CNI request failed with status 400: 'Nil response to CNI request '
  Warning  FailedCreatePodSandBox  48s    kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(94acbb959d438d027a5775d4652029124f82598376d3647dff5135327040c2a5): CNI request failed with status 400: 'Nil response to CNI request '
  Warning  FailedCreatePodSandBox  14s    kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(3fb4bd35eff6992c415d1adb95fba30bdf81e451943022f716fc93fe29db6e01): CNI request failed with status 400: 'Nil response to CNI request '
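For anyone else trying to hit this, a minimal sketch of the create/delete/recreate loop is below. The project names and pod image are illustrative only (the original test case uses a replication controller named test-rc; any simple pods should do):

for ns in z1 z2 z3; do
  oc new-project "$ns"
  # ubi8 is just an example image; any pod that schedules on a worker works
  oc run test --image=registry.access.redhat.com/ubi8/ubi --restart=Never -n "$ns" -- sleep 3600
done

for ns in z1 z2 z3; do
  oc delete project "$ns"
done

for ns in x1 x2 x3; do
  oc new-project "$ns"
  oc run test --image=registry.access.redhat.com/ubi8/ubi --restart=Never -n "$ns" -- sleep 3600
done

# on an affected cluster the new pods never leave ContainerCreating
oc get pods -n x1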
This seems to be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1746616.
When this happens, can you get oc logs -n openshift-ovn-kubernetes -c ovn-node ovnkube-node-XXXXXX for the ovnkube-node-XXXXXX pod on the same node as the pod that's failing to start, and oc logs -n openshift-ovn-kubernetes -c ovnkube-master ovnkube-master-XXXXXX for the ovnkube-master pod (there will only be one, but it has a partly random name)?
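For example (the pod name suffixes are random, so substitute whatever the first command shows; these are just the commands described above):

oc get pods -n openshift-ovn-kubernetes -o wide
oc logs -n openshift-ovn-kubernetes -c ovn-node ovnkube-node-<suffix-from-affected-node>
oc logs -n openshift-ovn-kubernetes -c ovnkube-master ovnkube-master-<suffix>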
Created attachment 1615544 [details] ovn_node_logs
Created attachment 1615545 [details] ovnkube master logs
Sure, Dan Winship. Please see attached. The ovn-node logs were already attached, but please refer to the new set of logs from the newly created platform, ovn_node_logs.txt and ovnkube_master_logs.txt. The logs are from an overnight cluster, so let me know if I should gather them from a fresh setup, as they seem verbose to me. The project names are x1, x2 and x3 in the logs, just FYR.
To shed more light on this: the bug is often reproducible when the test case attached under "Links" is executed and its associated projects, say z1, z2, z3, are deleted at the end of testing. Creating 3 new projects afterwards, say x1, x2, x3, with pods under them as described in the bug description then reproduces the failure.
The first error in the node log is:

time="2019-09-15T20:00:59Z" level=error msg="failed to get pod annotation - timed out waiting for the condition"

but the last message in the master log is:

time="2019-09-15T19:59:08Z" level=info msg="Deleting network policy namespace-pod-selector in namespace z1"

so it seems like the master either crashed or wedged after that and stopped processing new pods, and so all the pod creations after that point fail. I guess if it had crashed kubelet would have restarted it, so it must be "wedged".
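(For reference, those two lines can be pulled straight from the attached logs; the filenames below are the attachment names and the grep pattern is just illustrative:

grep 'failed to get pod annotation' ovn_node_logs.txt | head -1
tail -1 ovnkube_master_logs.txt
)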
Thanks Dan. It doesn't seem like the master crashed; it's wedged, I guess, as I see 0 restarts against it:

$ oc get pods -n openshift-ovn-kubernetes
NAME                             READY   STATUS    RESTARTS   AGE
ovnkube-master-76c57ddbd-mw4vx   4/4     Running   0          22h
ovnkube-node-hfnbx               3/3     Running   0          22h
ovnkube-node-hpkfs               3/3     Running   0          22h
ovnkube-node-nm2jd               3/3     Running   0          22h
ovnkube-node-sgf7h               3/3     Running   0          22h
ovnkube-node-twz7z               3/3     Running   0          22h
The mutex handling around NetworkPolicies is pretty weird. Looks like a deadlock, probably specifically involving the "both namespaceSelector and podSelector" code, which is new.
(In reply to Dan Winship from comment #10)
> The mutex handling around NetworkPolicies is pretty weird. Looks like a
> deadlock, probably specifically involving the "both namespaceSelector and
> podSelector" code, which is new.

Dan, I guess to isolate that I can try a couple of tests with simple network policies (ones that don't combine namespaceSelector and podSelector) across multiple projects and see whether the issue still reproduces.
No, I can reproduce the bug. I was just making a note.
Pushing to 4.3. You can work around this by not creating a network policy with both namespaceSelector and podSelector. This is a tech preview product; it should not block the release.
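For clarity, the shape of policy being discussed is one whose single "from" peer combines both selectors. A made-up example (namespace, policy name and labels are purely illustrative) is:

cat <<'EOF' | oc apply -n x1 -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-selected-pods
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    # a single peer carrying BOTH namespaceSelector and podSelector --
    # this is the combination being described as triggering the deadlock
    - namespaceSelector:
        matchLabels:
          team: frontend
      podSelector:
        matchLabels:
          app: web
EOF

Avoiding that combined form is the workaround until the fix lands.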
(In reply to Ben Bennett from comment #13)
> You can work around this by not creating a network policy with both
> namespaceSelector and podSelector. This is a tech preview product, it
> should not block the release.

Yes, but once *anyone* does that, the entire cluster becomes unusable until that policy is deleted and ovnkube-master is restarted. That seems pretty bad even for tech preview. I've already submitted a fix upstream, but I hadn't linked it here since it hasn't even been reviewed yet and we need to cherry-pick it to our repo too.
Agree with Dan. This leaves the cluster unusable.
Hi, this bug was pushed to 4.3, but from comment 14 and comment 15 it sounds like we may want it in 4.2 as well. Has any PR merged? If so, should this bug be moved to ON_QA?
The bug is not fixed in 4.2.0. It will be backported to 4.2.z when it is fixed.
Anurag, can we get this VERIFIED so we can get the 4.2 backport in?
(In reply to Dan Winship from comment #19)
> Anurag, can we get this VERIFIED so we can get the 4.2 backport in?

ping
Hi Dan, I'm looking at it right now and should update within an hour. Sorry for the delay.
Verified on 4.3.0-0.nightly-2019-11-11-060801 using the same steps mentioned in the description. Thanks.
This was verified a while back, but I'm seeing something similar in a 4.2.15 -> 4.3.0-rc.1 update job [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14595/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d24ac732f2fd86150091410623d388ad78196ad7f8072696e85ceaaccb187759/namespaces/openshift-apiserver/core/events.yaml | yaml2json | jq -r '[.items[] | select(.message | contains("Nil response to CNI request"))][-1]' | json2yaml
apiVersion: v1
count: '137'
eventTime: 'null'
firstTimestamp: '2020-01-16T18:44:35Z'
involvedObject:
  apiVersion: v1
  kind: Pod
  name: apiserver-kvc86
  namespace: openshift-apiserver
  resourceVersion: '27022'
  uid: 7c6e75ff-388f-11ea-b6d2-0a992d1f3055
kind: Event
lastTimestamp: '2020-01-16T19:50:03Z'
message: '(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_apiserver-kvc86_openshift-apiserver_7c6e75ff-388f-11ea-b6d2-0a992d1f3055_0(cee486d2fff290f897777ba80767fbc38453f14a8bfb8d1ba97124089ca705fe): Multus: Err adding pod to network "ovn-kubernetes": Multus: error in invoke Delegate add - "ovn-k8s-cni-overlay": CNI request failed with status 400: ''Nil response to CNI request '''
metadata:
  creationTimestamp: '2020-01-16T18:44:35Z'
  name: apiserver-kvc86.15ea7248953ab503
  namespace: openshift-apiserver
  resourceVersion: '69656'
  selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-kvc86.15ea7248953ab503
  uid: 0dedcb0f-e07c-4e2c-9795-b19875995a06
reason: FailedCreatePodSandBox
reportingComponent: ''
reportingInstance: ''
source:
  component: kubelet
  host: ip-10-0-142-130.ec2.internal
type: Warning

You can see it going on for over an hour. Is that this issue, or should I spin it off into a new bug?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14595
Unfortunately, that error message is basically a "check engine" light. It doesn't really tell you anything about *what* is going wrong, just that something went wrong. This bug was about a NetworkPolicy-related deadlock in ovnkube-master, and it's fixed. If you're seeing that error message now, it's a new, unrelated bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0062