Bug 1735538

Summary: Pods stuck in container creating - Failed to run CNI IPAM ADD: failed to allocate for range 0
Product: OpenShift Container Platform
Component: Networking
Version: 4.2.0
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: Itzik Brown <itbrown>
Assignee: Alexander Constantinescu <aconstan>
QA Contact: zhaozhanqi <zzhao>
CC: aconstan, akrzos, aos-bugs, cdc, eparis, nagrawal, veer, weliang
Type: Bug
Bug Blocks: 1743587 (view as bug list)
Last Closed: 2019-10-16 06:34:15 UTC

Description Itzik Brown 2019-08-01 04:48:53 UTC
Description of problem:
After running some tests, pods are stuck in ContainerCreating.
I see the following error when running oc describe:
  Warning  FailedCreatePodSandBox  89s (x2174 over 11h)  kubelet, master-1  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kubevirt-hyperconverged-cluster-jobbrw79-jczfc_kubevirt-hyperconverged_8482d21d-b3b3-11e9-bf45-00b6291ae442_0(ff39fece449a153d88bf540602a749d69fb8141ecb642d2445c234fca3c8feec): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'failed to run IPAM for ff39fece449a153d88bf540602a749d69fb8141ecb642d2445c234fca3c8feec: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.129.0.1-10.129.1.254

When listing the addresses of all containers starting with 10.129, I get only 34.

On the node I get the following:
[core@master-1 openshift-sdn]$ ls -l /var/lib/cni/networks/openshift-sdn/10.129.* |wc -l
509
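The mismatch above (509 reservation files vs. 34 pods actually holding addresses) is the leak. A minimal sketch of that diagnostic follows; the directory and counts are simulated with a temp dir, since on a real node the host-local IPAM reservation files live under /var/lib/cni/networks/openshift-sdn, one file per reserved IP:

```shell
#!/bin/sh
# Simulate the node's IPAM state dir (hypothetical data, not a real cluster).
DEMO_DIR=$(mktemp -d)

# host-local IPAM keeps one file per reserved IP; simulate 509 reservations.
# (File names here are placeholders, not valid last octets.)
for i in $(seq 1 509); do
  touch "$DEMO_DIR/10.129.0.$i"
done

RESERVED=$(ls "$DEMO_DIR" | wc -l | tr -d ' ')   # files on disk
LIVE=34                                          # pods actually using 10.129.x
LEAKED=$((RESERVED - LIVE))

echo "reserved=$RESERVED live=$LIVE leaked=$LEAKED"
rm -rf "$DEMO_DIR"
```

With the numbers reported in this bug, the sketch prints reserved=509 live=34 leaked=475: reservations that were never released on pod deletion, which eventually exhausts the 10.129.0.1-10.129.1.254 range.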

Version-Release number of selected component (if applicable):
$ oc version
Client Version: version.Info{Major:"4", Minor:"2+", GitVersion:"v4.2.0", GitCommit:"2e9d4a117", GitTreeState:"clean", BuildDate:"2019-07-28T17:15:26Z", GoVersion:"go1.12.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0+743bc2c", GitCommit:"743bc2c", GitTreeState:"clean", BuildDate:"2019-07-21T21:17:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}


How reproducible:


Steps to Reproduce:
1. Create and delete many pods (more than the number of addresses available)

Actual results:


Expected results:


Additional info:

Comment 1 Casey Callendrello 2019-08-06 12:16:27 UTC
Jacob, please work with Itzik to reproduce this issue.

We've had several reports of this problem. It's very urgent; I would argue it's a release blocker.

Comment 2 Veer Muchandi 2019-08-07 21:13:01 UTC
This is frequently coming up on my cluster. Even rebooting the node doesn't help.

I have captured CRIO logs here https://pastebin.com/y2jga9Vw

I have to manually clean up stale IPs at /var/lib/cni/networks/openshift-sdn/
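A minimal sketch of that manual cleanup, with the directory contents and container IDs simulated in a temp dir: each host-local IPAM reservation file is named after the IP and contains the owning container ID, so files whose ID has no running container can be removed to free the address. On a real node the directory would be /var/lib/cni/networks/openshift-sdn and the live IDs would come from something like `crictl ps -q`; all names below are hypothetical.

```shell
#!/bin/sh
# Stand-in for /var/lib/cni/networks/openshift-sdn (simulated).
CNI_DIR=$(mktemp -d)
echo "cid-live"  > "$CNI_DIR/10.129.0.10"   # container still running
echo "cid-stale" > "$CNI_DIR/10.129.0.11"   # container long gone

# In practice: LIVE_IDS=$(crictl ps -q). Simulated here.
LIVE_IDS="cid-live"

for f in "$CNI_DIR"/10.129.*; do
  id=$(head -n1 "$f" | tr -d '\r')
  case " $LIVE_IDS " in
    *" $id "*) ;;                            # ID is live; keep the reservation
    *) echo "removing stale reservation $f ($id)"
       rm -f "$f" ;;
  esac
done

REMAINING=$(ls "$CNI_DIR")                   # only the live reservation is left
echo "remaining: $REMAINING"
rm -rf "$CNI_DIR"
```

In this simulation only 10.129.0.10 survives; the stale file is deleted and its address returns to the pool. This is a workaround only; the actual fix is for the CNI DEL path to release the reservation, as delivered in the errata below.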

Comment 5 Weibin Liang 2019-08-19 19:44:15 UTC
Tested and verified in v4.2.0-0.ci-2019-08-19-054234:

1. Created many pods (more than the number of available addresses); all pods were created and used all IP addresses.
2. Deleted those pods; checking again, both the pods and their IP addresses were released.
3. Recreated many pods (more than the number of available addresses); all pods were created and used all IP addresses.
4. Re-deleted those pods; checking again, both the pods and their IP addresses were released.

Comment 6 Casey Callendrello 2019-08-26 17:26:01 UTC
*** Bug 1688955 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2019-10-16 06:34:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922