Description of problem:

IHAC running OCP 3.11.117 with the CRI-O runtime instead of docker, and they are hitting the following issue:

(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_<obfuscated>_cebf6bd6-d87e-11e9-bf6d-005056beab36_0(b13accb48cde06e15374eef4f3eceb841f36af79725af4f8b773c0b5a68d9b38): CNI request failed with status 400: 'failed to run IPAM for b13accb48cde06e15374eef4f3eceb841f36af79725af4f8b773c0b5a68d9b38: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 172.42.8.1-172.42.9.254 '

The affected node currently has 249 pods running, but its whole IP address pool is reserved:

$ oc get pods --all-namespaces -o wide | awk '{print $8}' | sort | uniq -c
      2 NODE
     17 vg00dodv.example.com
     17 vg00dpdv.example.com
     11 vg00drdv.example.com
     17 vg00dsdv.example.com
     11 vg00dtdv.example.com
     18 vg00dudv.example.com
     67 vg00dvdv.example.com
    173 vg00dwdv.example.com
    249 vg00dxdv.example.com   <--- this is the node suffering the problem

[vg00dxdv ~]$ ls /var/lib/cni/networks/openshift-sdn/ | wc -l
510

$ oc get clusternetwork
NAME      CLUSTER NETWORKS   SERVICE NETWORK   PLUGIN NAME
default   172.42.0.0/16:9    172.30.0.0/16     redhat/openshift-ovs-networkpolicy

$ oc get hostsubnets | grep dxdv
vg00dxdv.example.com   vg00dxdv.example.com   10.47.235.172   172.42.8.0/23   []   [10.47.235.149]

Version-Release number of selected component (if applicable):
OCP 3.11.117
cri-o-1.11.14-2.rhaos3.11.gitd56660e.el7.x86_64
openshift-ovs-networkpolicy

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The node is unable to create more pods because no IP address can be allocated.

Expected results:
A garbage collector runs to keep the /var/lib/cni/networks/openshift-sdn/ directory clean.

Additional info:
Other BZs already exist for this same issue, but not for the CRI-O runtime within OCP 3.11. Maybe the fix for 4.1 needs to be backported to 3.11?

BZ#1532965 for 3.9 (docker)
BZ#1743587 for 4.1
BZ#1735538 for 4.2
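For readers checking the numbers in the description, the following is a minimal sketch of the arithmetic, using only values copied from the report (the /23 host subnet, 249 running pods, 510 directory entries). The function name pool_summary is made up for illustration. The "unaccounted" figure is only a rough estimate: the directory may also hold bookkeeping entries that are not reservations, and pods on the host network do not consume an SDN address.

```python
import ipaddress


def pool_summary(cidr, running_pods, reservation_entries):
    """Compare a host subnet's usable range with the observed counts.

    pool_summary is a hypothetical helper, not part of any OpenShift tooling.
    """
    # hosts() excludes the network and broadcast addresses of an IPv4 subnet.
    hosts = list(ipaddress.ip_network(cidr).hosts())
    return {
        "first": str(hosts[0]),
        "last": str(hosts[-1]),
        "usable": len(hosts),
        # Directory entries that no running pod accounts for (rough estimate).
        "unaccounted": reservation_entries - running_pods,
    }


if __name__ == "__main__":
    # Values taken from the report: hostsubnet, pod count, `ls | wc -l`.
    print(pool_summary("172.42.8.0/23", 249, 510))
```

The usable range matches the one in the error message (172.42.8.1-172.42.9.254), and the entry count equals the size of that range, which is consistent with the pool being exhausted by leftover reservations rather than by running pods.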
FWIW, the workaround is documented here: https://access.redhat.com/solutions/4457521
[To whom it may concern]

If you have come to this bugzilla looking for answers to the same "CNI IPAM ADD" issue, then besides applying the aforementioned workaround[1] if needed, please also check the kernel version of your nodes. Kernel "3.10.0-1062.7.1.el7.x86_64" has a bug that causes unexpected reboots and could also be interfering here. For more information about that kernel problem, please see BZ#1738415 and this solution[2].

[1] - https://access.redhat.com/solutions/4457521
[2] - https://access.redhat.com/solutions/4621451
Created attachment 1690240 [details] python script removing stale files
Created attachment 1690242 [details] python script removing stale files
There are many moving parts to this ticket. 3.11 is largely restricted to small, concise changes (including CVEs). In 4.x, we have improved the error handling across networking, CRI-O, and the kubelet. This patch in CRI-O [1] is one such fix: it causes the kubelet to retry sandbox creations on networking errors. Patching CRI-O with this fix in 3.11 could uncover issues with other networking components, including the kubelet. Given the risk of backporting all of these patches to 3.11, we currently recommend the python script in comment 47 as an acceptable workaround.

1. https://github.com/cri-o/cri-o/pull/3164
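To illustrate the general idea behind a stale-file cleanup, here is a minimal dry-run sketch. It is not the attached script, whose contents are not reproduced in this bug. It assumes that each reservation in the directory is a file named after an IPv4 address whose first line holds the ID of the owning pod sandbox, and that the set of live sandbox IDs has already been collected from the container runtime by some other means. The names is_ip_name and find_stale_reservations are made up for illustration.

```python
import ipaddress
from pathlib import Path


def is_ip_name(name):
    """Return True if a file name parses as an IP address."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def find_stale_reservations(cni_dir, live_sandbox_ids):
    """List reservation files whose owner is not a live sandbox.

    Assumption: the first line of each IP-named file is the owning
    sandbox ID. Nothing is deleted; this only reports candidates.
    """
    stale = []
    for entry in sorted(Path(cni_dir).iterdir()):
        # Skip anything that is not an IP-named regular file,
        # such as bookkeeping entries kept alongside the reservations.
        if not entry.is_file() or not is_ip_name(entry.name):
            continue
        lines = entry.read_text().splitlines()
        # An empty file has no owner and is reported as stale.
        owner = lines[0].strip() if lines else ""
        if owner not in live_sandbox_ids:
            stale.append(entry)
    return stale
```

On an affected node this would be called with /var/lib/cni/networks/openshift-sdn and the set of sandbox IDs currently known to the runtime. The removal step is deliberately left out; the KCS article linked in this bug documents the workaround to follow.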
Thanks Ryan, I have updated the KCS[1] to also include the script as a possible workaround if needed.

[1] - https://access.redhat.com/solutions/4457521

Regards.