Bug 1752222 - [ovn] Failed to create pods, CNI request failed, due to NetworkPolicy deadlock
Summary: [ovn] Failed to create pods, CNI request failed, due to NetworkPolicy deadlock
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Dan Winship
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks: 1755190
 
Reported: 2019-09-14 19:43 UTC by Anurag saxena
Modified: 2020-01-23 11:06 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1755190
Environment:
Last Closed: 2020-01-23 11:05:53 UTC
Target Upstream Version:
Embargoed:
anusaxen: needinfo-


Attachments
ovn node logs (157.55 KB, text/plain) - 2019-09-14 19:43 UTC, Anurag saxena
ovn_node_logs (5.07 MB, text/plain) - 2019-09-16 14:04 UTC, Anurag saxena
ovnkube master logs (45.30 KB, text/plain) - 2019-09-16 14:05 UTC, Anurag saxena


Links
System ID | Private | Priority | Status | Summary | Last Updated
Github openshift ovn-kubernetes pull 30 | 0 | None | closed | rebase to upstream 2019-10-08 | 2021-02-16 22:28:50 UTC
Github ovn-org ovn-kubernetes pull 838 | 0 | None | closed | ovn: fix a deadlock in the NetworkPolicy code | 2021-02-16 22:28:50 UTC
Red Hat Product Errata RHBA-2020:0062 | 0 | None | None | None | 2020-01-23 11:06:22 UTC

Description Anurag saxena 2019-09-14 19:43:03 UTC
Created attachment 1615165 [details]
ovn node logs

Description of problem: This is happening frequently on IPI Azure with OVN. Pods are getting stuck in the ContainerCreating state.

$ oc get pods -n x1
NAME            READY   STATUS              RESTARTS   AGE
test-rc-jfc54   0/1     ContainerCreating   0          2m40s
test-rc-n62ks   0/1     ContainerCreating   0          2m40s
[anusaxen@anusaxen Desktop]$ oc get pods -n x2
NAME            READY   STATUS              RESTARTS   AGE
test-rc-j5xwm   0/1     ContainerCreating   0          2m36s
test-rc-qgjq4   0/1     ContainerCreating   0          2m36s
[anusaxen@anusaxen Desktop]$ oc get pods -n x3
NAME            READY   STATUS              RESTARTS   AGE
test-rc-kg4vx   0/1     ContainerCreating   0          2m37s
test-rc-rfdw5   0/1     ContainerCreating   0          2m37s

$ ls /etc/cni/net.d/
100-crio-bridge.conf       200-loopback.conf          87-podman-bridge.conflist  


Check "Additional Info" for events on one of a pod

Version-Release number of selected component (if applicable): 4.2.0-0.nightly-2019-09-11-202233


How reproducible: Often


Steps to Reproduce:
1. Create a couple (or more) projects and create pods inside them.
2. Delete all of the projects, one by one.
3. Create a couple (or more) new projects and create pods inside them (a shell sketch of these steps follows below).
4. The pods get stuck in the ContainerCreating state.
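
A minimal shell sketch of the above, with hypothetical project names and a placeholder pod template (test-rc.yaml stands in for any simple ReplicationController, such as the test-rc shown above). Per the later comments (7, 10, 13), the trigger also involves NetworkPolicies in the original projects, including one that uses both namespaceSelector and podSelector:

# 1. Create a few projects with pods in each of them (and, per the later
#    comments, the NetworkPolicies from the attached test case).
for ns in z1 z2 z3; do
  oc new-project "$ns"
  oc create -n "$ns" -f test-rc.yaml
done

# 2. Delete all of the projects, one by one.
for ns in z1 z2 z3; do
  oc delete project "$ns"
done

# 3. Create new projects and pods again.
for ns in x1 x2 x3; do
  oc new-project "$ns"
  oc create -n "$ns" -f test-rc.yaml
done

# 4. The new pods stay in ContainerCreating.
oc get pods -n x1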

Actual results: Unable to create pods due to CNI request errors.


Expected results: Pods should be created successfully.


Additional info: events from "oc describe" on one of the stuck pods:

Events:
  Type     Reason                  Age    From                                                      Message
  ----     ------                  ----   ----                                                      -------
  Normal   Scheduled               4m52s  default-scheduler                                         Successfully assigned x1/test-rc-jfc54 to qe-anurag-azure12-klwdn-worker-centralus1-7jwl5
  Warning  FailedCreatePodSandBox  4m26s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(701239727ab417d2db017f1244dc58271c0e27191f2a574a94243977c0198473): CNI request failed with status 400: 'Nil response to CNI request
'
  Warning  FailedCreatePodSandBox  3m51s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(5e93cb31a769da56e1b5646d8a6346affa8c13c9d50ee16a0e382526001b87d8): CNI request failed with status 400: 'Nil response to CNI request
'
  Warning  FailedCreatePodSandBox  3m14s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(17f5e7b128e4c4284203dfb389ae7b75d3ff389689ad09ae4818bfe4c5c5265e): CNI request failed with status 400: 'Nil response to CNI request
'
  Warning  FailedCreatePodSandBox  2m37s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(321c479c9531ecae862eae231cf4b87cff5780e6223fed20be17e6da3c1112c7): CNI request failed with status 400: 'Nil response to CNI request
'
  Warning  FailedCreatePodSandBox  118s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(f23882309586c75031898c65d388edfc5d6930466ef52bbc557cf01f2aa8fb00): CNI request failed with status 400: 'Nil response to CNI request
'
  Warning  FailedCreatePodSandBox  83s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(e232e425e99d2e32429795cc927a621a063faf45c3bbbcdfa375d1e75030819f): CNI request failed with status 400: 'Nil response to CNI request
'
  Warning  FailedCreatePodSandBox  48s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(94acbb959d438d027a5775d4652029124f82598376d3647dff5135327040c2a5): CNI request failed with status 400: 'Nil response to CNI request
'
  Warning  FailedCreatePodSandBox  14s  kubelet, qe-anurag-azure12-klwdn-worker-centralus1-7jwl5  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-jfc54_x1_4e775213-d722-11e9-bfcf-000d3a3e70b6_0(3fb4bd35eff6992c415d1adb95fba30bdf81e451943022f716fc93fe29db6e01): CNI request failed with status 400: 'Nil response to CNI request
'

Comment 2 zhaozhanqi 2019-09-16 01:10:25 UTC
Seems to be the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1746616.

Comment 3 Dan Winship 2019-09-16 13:03:55 UTC
When this happens, can you get

    oc logs -n openshift-ovn-kubernetes -c ovn-node ovnkube-node-XXXXXX

for the ovnkube-node-XXXXXX pod on the same node as the pod that's failing to start, and

    oc logs -n openshift-ovn-kubernetes -c ovnkube-master ovnkube-master-XXXXXX

for the ovnkube-master pod? (There will only be one, but it has a partly random name.)
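
(As a sketch, assuming the failing pod is test-rc-jfc54 in namespace x1; the jsonpath and field-selector usage here are standard oc/kubectl features, not something specific to this bug.)

    # Find the node the failing pod was scheduled to.
    NODE=$(oc get pod test-rc-jfc54 -n x1 -o jsonpath='{.spec.nodeName}')

    # Find the ovnkube-node pod on that node and dump its ovn-node container log.
    OVN_NODE_POD=$(oc get pods -n openshift-ovn-kubernetes -o name \
        --field-selector spec.nodeName="$NODE" | grep ovnkube-node)
    oc logs -n openshift-ovn-kubernetes -c ovn-node "${OVN_NODE_POD#pod/}"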

Comment 4 Anurag saxena 2019-09-16 14:04:57 UTC
Created attachment 1615544 [details]
ovn_node_logs

Comment 5 Anurag saxena 2019-09-16 14:05:20 UTC
Created attachment 1615545 [details]
ovnkube master logs

Comment 6 Anurag saxena 2019-09-16 14:05:33 UTC
Sure, Dan Winship. Please see the attachments. The ovn-node logs were already attached, but please refer to the new set of logs from a newly created platform: ovn_node_logs.txt and ovnkube_master_logs.txt. The logs are from an overnight cluster, so let me know if I should gather them on a fresh setup, as they seem verbose to me. The project names in the logs are x1, x2 and x3, just FYR.

Comment 7 Anurag saxena 2019-09-16 15:39:27 UTC
To shed more light on this: the bug is often reproducible when the test case attached under "Links" is executed and its associated projects (say z1, z2, z3) are deleted at the end of testing. Creating 3 new projects (say x1, x2, x3) and pods under them, as described in the bug description, then fails.

Comment 8 Dan Winship 2019-09-16 16:10:34 UTC
first error in the node log:

    time="2019-09-15T20:00:59Z" level=error msg="failed to get pod annotation - timed out waiting for the condition"

but the last message in the master log is:

    time="2019-09-15T19:59:08Z" level=info msg="Deleting network policy namespace-pod-selector in namespace z1"

so it seems like the master either crashed or wedged after that and stopped processing new pods, and so all the pod creations after that point fail.

I guess if it crashed kubelet would have restarted it, so it must be "wedged".
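
(A quick way to check the same thing on a live cluster, as a sketch: the log commands from comment 3 with --tail and --timestamps, which are standard oc logs flags; pod names are placeholders.)

    # If the master is wedged, its last timestamped line stops advancing while
    # the node keeps logging "failed to get pod annotation" errors.
    oc logs -n openshift-ovn-kubernetes -c ovnkube-master ovnkube-master-XXXXXX --tail=5 --timestamps
    oc logs -n openshift-ovn-kubernetes -c ovn-node ovnkube-node-XXXXXX --tail=5 --timestamps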

Comment 9 Anurag saxena 2019-09-16 17:27:28 UTC
Thanks, Dan. It doesn't look like the master crashed; it appears wedged, I guess, since I see 0 restarts against it:

$ oc get pods -n openshift-ovn-kubernetes
NAME                             READY   STATUS    RESTARTS   AGE
ovnkube-master-76c57ddbd-mw4vx   4/4     Running   0          22h
ovnkube-node-hfnbx               3/3     Running   0          22h
ovnkube-node-hpkfs               3/3     Running   0          22h
ovnkube-node-nm2jd               3/3     Running   0          22h
ovnkube-node-sgf7h               3/3     Running   0          22h
ovnkube-node-twz7z               3/3     Running   0          22h

Comment 10 Dan Winship 2019-09-17 18:24:18 UTC
The mutex handling around NetworkPolicies is pretty weird. Looks like a deadlock, probably specifically involving the "both namespaceSelector and podSelector" code, which is new.

Comment 11 Anurag saxena 2019-09-17 18:36:27 UTC
(In reply to Dan Winship from comment #10)
> The mutex handling around NetworkPolicies is pretty weird. Looks like a
> deadlock, probably specifically involving the "both namespaceSelector and
> podSelector" code, which is new.

Dan, I guess to isolate that I can try a couple of tests with simple NetworkPolicies (which don't involve both namespaceSelector and podSelector) across multiple projects and see whether it still reproduces.

Comment 12 Dan Winship 2019-09-17 19:01:08 UTC
No, I can reproduce the bug. I was just making a note.

Comment 13 Ben Bennett 2019-09-19 13:10:03 UTC
Pushing to 4.3.

You can work around this by not creating a network policy with both namespaceSelector and podSelector. This is a tech preview product; it should not block the release.
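
(For illustration only, not taken from this report: a policy of the shape in question, one "from" peer combining a namespaceSelector and a podSelector, would look roughly like the following; the names and labels are hypothetical.)

oc apply -n x1 -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-team-frontend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    # A single peer with both selectors: pods labeled role=client in
    # namespaces labeled team=frontend. This is the combination the
    # deadlock is suspected to involve (comment 10).
    - namespaceSelector:
        matchLabels:
          team: frontend
      podSelector:
        matchLabels:
          role: client
EOF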

Comment 14 Dan Winship 2019-09-19 14:30:59 UTC
(In reply to Ben Bennett from comment #13)
> You can work around this by not creating a network policy with both
> namespaceSelector and podSelector.  This is a tech preview product, it
> should not block the release.

Yes, but once *anyone* does that, the entire cluster becomes unusable until that policy is deleted and the ovnkube-master is restarted. That seems pretty bad even for tech preview.
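
(A sketch of that recovery, assuming the offending policy is the hypothetical one shown in comment 13 and that the master pod carries an app=ovnkube-master label; its controller recreates the deleted pod.)

# Delete the offending policy (name and namespace are placeholders).
oc delete networkpolicy allow-from-team-frontend -n x1

# Restart ovnkube-master by deleting its pod; alternatively, delete the
# ovnkube-master-XXXXXX pod by name from "oc get pods -n openshift-ovn-kubernetes".
oc delete pod -n openshift-ovn-kubernetes -l app=ovnkube-master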

I've already submitted a fix upstream, but I hadn't linked it here since it hasn't even been reviewed yet, and we need to cherry-pick it to our repo too.

Comment 15 Anurag saxena 2019-09-19 14:33:44 UTC
Agreed with Dan. This leaves the cluster unusable.

Comment 16 Anurag saxena 2019-09-24 15:29:28 UTC
Hi, this bug was pushed to 4.3, but based on comment 14 and comment 15 we may want it in 4.2 as well. Has a PR been merged? If so, should this bug be moved to ON_QA?

Comment 17 Dan Winship 2019-09-24 16:34:11 UTC
The bug is not fixed in 4.2.0. It will be backported to 4.2.z when it is fixed.

Comment 19 Dan Winship 2019-11-04 12:49:31 UTC
Anurag, can we get this VERIFIED so we can get the 4.2 backport in?

Comment 20 Dan Winship 2019-11-10 21:58:43 UTC
(In reply to Dan Winship from comment #19)
> Anurag, can we get this VERIFIED so we can get the 4.2 backport in?

ping

Comment 21 Anurag saxena 2019-11-11 13:27:06 UTC
Hi Dan, I'm looking at it right now and should update within an hour. Sorry for the delay.

Comment 22 Anurag saxena 2019-11-11 13:44:53 UTC
Verified on 4.3.0-0.nightly-2019-11-11-060801 using the same steps mentioned in the description. Thanks.

Comment 23 W. Trevor King 2020-01-17 00:50:28 UTC
This was verified a while back, but I'm seeing something similar in a 4.2.15 -> 4.3.0-rc.1 update job [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14595/artifacts/e2e-aws-upgrade/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-d24ac732f2fd86150091410623d388ad78196ad7f8072696e85ceaaccb187759/namespaces/openshift-apiserver/core/events.yaml | yaml2json | jq -r '[.items[] | select(.message | contains("Nil response to CNI request"))][-1]' | json2yaml 
apiVersion: v1
count: '137'
eventTime: 'null'
firstTimestamp: '2020-01-16T18:44:35Z'
involvedObject:
  apiVersion: v1
  kind: Pod
  name: apiserver-kvc86
  namespace: openshift-apiserver
  resourceVersion: '27022'
  uid: 7c6e75ff-388f-11ea-b6d2-0a992d1f3055
kind: Event
lastTimestamp: '2020-01-16T19:50:03Z'
message: '(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod
  network sandbox k8s_apiserver-kvc86_openshift-apiserver_7c6e75ff-388f-11ea-b6d2-0a992d1f3055_0(cee486d2fff290f897777ba80767fbc38453f14a8bfb8d1ba97124089ca705fe):
  Multus: Err adding pod to network "ovn-kubernetes": Multus: error in invoke Delegate add - "ovn-k8s-cni-overlay": CNI request
  failed with status 400: ''Nil response to CNI request

  '''
metadata:
  creationTimestamp: '2020-01-16T18:44:35Z'
  name: apiserver-kvc86.15ea7248953ab503
  namespace: openshift-apiserver
  resourceVersion: '69656'
  selfLink: /api/v1/namespaces/openshift-apiserver/events/apiserver-kvc86.15ea7248953ab503
  uid: 0dedcb0f-e07c-4e2c-9795-b19875995a06
reason: FailedCreatePodSandBox
reportingComponent: ''
reportingInstance: ''
source:
  component: kubelet
  host: ip-10-0-142-130.ec2.internal
type: Warning

You can see it going on for over an hour.  Is that this issue, or should I spin it off into a new bug?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14595

Comment 25 Dan Winship 2020-01-17 11:18:36 UTC
Unfortunately, that error message is basically a "check engine" light. It doesn't really tell you anything about *what* is going wrong, just that something went wrong.

This bug was about a NetworkPolicy-related deadlock in ovnkube-master, and it's fixed. If you're seeing that error message now, it's a new, unrelated, bug.

Comment 27 errata-xmlrpc 2020-01-23 11:05:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

