Bug 1753706 - Pods stuck in container creating - Failed to run CNI IPAM ADD: failed to allocate for range 0
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard: SDN-CUST-IMPACT
Depends On:
Blocks:
Reported: 2019-09-19 15:30 UTC by Pedro Amoedo
Modified: 2023-10-06 18:35 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-08 15:56:53 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
python script removing stale files (1.56 KB, text/x-python)
2020-05-20 13:14 UTC, Alexander Constantinescu
python script removing stale files (1.56 KB, text/x-python)
2020-05-20 13:17 UTC, Alexander Constantinescu


Links
Red Hat Knowledge Base (Solution) 4457521 (Last Updated: 2019-11-18 10:39:07 UTC)

Description Pedro Amoedo 2019-09-19 15:30:15 UTC
Description of problem:

I have a customer (IHAC) running OCP 3.11.117 with the CRI-O runtime instead of Docker, and they are hitting the following issue:

(combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_<obfuscated>_cebf6bd6-d87e-11e9-bf6d-005056beab36_0(b13accb48cde06e15374eef4f3eceb841f36af79725af4f8b773c0b5a68d9b38): CNI request failed with status 400: 'failed to run IPAM for b13accb48cde06e15374eef4f3eceb841f36af79725af4f8b773c0b5a68d9b38: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 172.42.8.1-172.42.9.254 '

The node suffering the issue currently has 249 pods running but the whole IP address pool reserved:

$ oc get pods --all-namespaces -o wide | awk '{print $8}' | sort | uniq -c
      2 NODE
     17 vg00dodv.example.com
     17 vg00dpdv.example.com
     11 vg00drdv.example.com
     17 vg00dsdv.example.com
     11 vg00dtdv.example.com
     18 vg00dudv.example.com
     67 vg00dvdv.example.com
    173 vg00dwdv.example.com
    249 vg00dxdv.example.com   <--- this is the node suffering the problem

[vg00dxdv ~]$ ls /var/lib/cni/networks/openshift-sdn/ | wc -l
510

$ oc get clusternetwork
NAME      CLUSTER NETWORKS   SERVICE NETWORK   PLUGIN NAME
default   172.42.0.0/16:9    172.30.0.0/16     redhat/openshift-ovs-networkpolicy


$ oc get hostsubnets | grep dxdv
vg00dxdv.example.com   vg00dxdv.example.com   10.47.235.172   172.42.8.0/23    []             [10.47.235.149]
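The numbers are consistent: the node's /23 hostsubnet yields exactly 510 assignable addresses, which is why 510 reservation files under /var/lib/cni/networks/openshift-sdn/ exhaust the pool even though only 249 pods are running. A quick sanity check with Python's standard-library ipaddress module:

```python
import ipaddress

# The node's hostsubnet, as reported by `oc get hostsubnets`
subnet = ipaddress.ip_network("172.42.8.0/23")

# hosts() excludes the network and broadcast addresses,
# leaving the range seen in the error message: 172.42.8.1-172.42.9.254
hosts = list(subnet.hosts())
print(len(hosts))            # 510 assignable IPs
print(hosts[0], hosts[-1])   # 172.42.8.1 172.42.9.254
```

So once 510 files accumulate (stale or not), no further pod sandboxes can get an IP on this node.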

Version-Release number of selected component (if applicable):

OCP 3.11.117
cri-o-1.11.14-2.rhaos3.11.gitd56660e.el7.x86_64
openshift-ovs-networkpolicy


Actual results:

The node is unable to create more pods because no IP addresses can be assigned.

Expected results:

A garbage collector should keep the /var/lib/cni/networks/openshift-sdn/ directory clean of stale IP reservation files.

Additional info:

Other BZs already exist for this same issue, but none for the CRI-O runtime on OCP 3.11; perhaps the fix for 4.1 needs to be backported to 3.11?

BZ#1532965 for 3.9 (docker)
BZ#1743587 for 4.1
BZ#1735538 for 4.2

Comment 16 Pedro Amoedo 2019-10-28 08:59:26 UTC
FWIW, the workaround is documented here: https://access.redhat.com/solutions/4457521

Comment 29 Pedro Amoedo 2019-12-03 09:46:43 UTC
[To whom it may concern]

If you have come to this bugzilla looking for answers to the same "CNI IPAM ADD" issue, then besides applying the aforementioned workaround[1] if needed, please also check the kernel version of your nodes: version 3.10.0-1062.7.1.el7.x86_64 is known to be buggy, causes unexpected reboots, and could be interfering here as well.

For more information about that kernel problem, please check on this BZ#1738415 and this solution[2].

[1] - https://access.redhat.com/solutions/4457521
[2] - https://access.redhat.com/solutions/4621451

Comment 46 Alexander Constantinescu 2020-05-20 13:14:23 UTC
Created attachment 1690240 [details]
python script removing stale files

Comment 47 Alexander Constantinescu 2020-05-20 13:17:29 UTC
Created attachment 1690242 [details]
python script removing stale files
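The attached script is not reproduced inline. As a rough sketch of the general approach (the actual attachment may differ): host-local IPAM keeps one file per allocated IP under /var/lib/cni/networks/openshift-sdn/, named after the IP and containing the owning container/sandbox ID, so files whose recorded ID no longer matches a live container can be deleted to return their IPs to the pool. The function name and the `live_ids` parameter below are illustrative, not from the attachment:

```python
import os

def remove_stale_ip_files(ipam_dir, live_ids):
    """Delete IP reservation files whose recorded container ID is not live.

    ipam_dir: e.g. /var/lib/cni/networks/openshift-sdn/ (hypothetical usage)
    live_ids: set of container/sandbox IDs currently known to the runtime
    """
    removed = []
    for name in os.listdir(ipam_dir):
        path = os.path.join(ipam_dir, name)
        # Skip bookkeeping entries such as last_reserved_ip.* or lock files;
        # IP-named reservation files start with a digit.
        if not name[0].isdigit() or not os.path.isfile(path):
            continue
        with open(path) as f:
            content = f.read().strip()
        container_id = content.splitlines()[0] if content else ""
        if container_id not in live_ids:
            os.remove(path)
            removed.append(name)
    return removed
```

On a real node, `live_ids` would be gathered from the container runtime (for CRI-O, something like the sandbox IDs reported by `crictl pods -q`), and the node should ideally be drained or the SDN paused while cleaning to avoid racing new allocations.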

Comment 52 Ryan Phillips 2020-06-15 21:28:53 UTC
There are many moving parts to this ticket. 3.11 is largely restricted to small, concise changes (including CVEs).

In 4.x, we have improved the error handling across networking, crio, and the kubelet. This patch in crio [1] is one such fix. The patch causes the kubelet to retry sandbox creations on networking errors. Patching crio with this fix in 3.11 could uncover issues with other networking components, including the Kubelet. 

Currently, we recommend the python script in comment 47 as an acceptable workaround, given the risk of backporting all the relevant patches to 3.11.

1. https://github.com/cri-o/cri-o/pull/3164

Comment 53 Pedro Amoedo 2020-06-16 14:45:46 UTC
Thanks Ryan, I have updated the KCS[1] to also include the script as a possible workaround if needed.

[1] - https://access.redhat.com/solutions/4457521

Regards.

