Bug 1688955 - CNI fails to allocate IPs due to stale files in /var/lib/cni/networks/openshift-sdn
Summary: CNI fails to allocate IPs due to stale files in /var/lib/cni/networks/openshift-sdn
Keywords:
Status: CLOSED DUPLICATE of bug 1735538
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Dan Williams
QA Contact: zhaozhanqi
URL:
Whiteboard: aos-scalability-41
Depends On:
Blocks:
 
Reported: 2019-03-14 18:50 UTC by Alex Krzos
Modified: 2019-08-26 17:26 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-26 17:26:01 UTC
Target Upstream Version:
Embargoed:



Description Alex Krzos 2019-03-14 18:50:09 UTC
Description of problem:
It is possible to create a project with a few objects, delete that project within a few seconds, and find that leftover "used" IPs remain on the worker nodes in the form of files in "/var/lib/cni/networks/openshift-sdn".

I have now seen two versions of this:
1. Create a single project with several templates resembling the MasterVertical test [0], sleep 2 seconds, delete the project, and review the files on each worker under "/var/lib/cni/networks/openshift-sdn" to find stale or unused IPs compared to the output of oc get pods for that specific worker (see the comparison sketch after this list).

2. A large-scale cluster creating pods at very close to capacity, then deleting and cleaning up the projects/pods, is left with many stale files.
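
A rough way to spot the stale entries on a single worker (a sketch, not a tested procedure): it assumes SSH access to the node as the core user, takes the node name from the error message below, and the awk column index for the pod IP is an assumption about the -o wide output layout:

  NODE=ip-10-0-161-155.us-west-2.compute.internal

  # IPs reserved on disk by host-local IPAM (one file per allocated IP;
  # the grep drops bookkeeping files such as last_reserved_ip.0)
  ssh core@$NODE 'ls /var/lib/cni/networks/openshift-sdn' \
    | grep -E '^[0-9]+\.' | sort > reserved-ips.txt

  # IPs of pods currently scheduled on that node
  oc get pods --all-namespaces -o wide --field-selector spec.nodeName=$NODE \
    | awk 'NR>1 {print $7}' | sort > pod-ips.txt

  # Anything reserved on disk but not backing a running pod is stale
  comm -23 reserved-ips.txt pod-ips.txt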


Once enough stale files accumulate, you receive the following error and pods become stuck in ContainerCreating on the worker nodes:

Warning  FailedCreatePodSandBox  2m34s (x24 over 9m56s)  kubelet, ip-10-0-161-155.us-west-2.compute.internal  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig0-1-deploy_c0_4baa8ff3-45b7-11e9-a8a6-0a191194ab3e_0(9119e9b35c3722ef3f28843d0f65625177e7cd976f7a1352d936cb5c8d948815): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'failed to run IPAM for 9119e9b35c3722ef3f28843d0f65625177e7cd976f7a1352d936cb5c8d948815: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.129.2.1-10.129.3.254
'


Version-Release number of selected component (if applicable):
Installer 0.14.0
Payload HTB2 - 4.0.0-0.nightly-2019-03-04-234414

How reproducible:
Pretty much always. At small scale it seems to be related to how rapidly you delete the project after all of the oc create commands have run.

Steps to Reproduce:
1. Create a project with deployment configs; upon completion of the oc commands, sleep 2 seconds, then delete the project (a loop sketch follows these steps).
2. Retry the above several times and/or adjust the number of objects created in the project before deleting it.
3. Check the worker nodes for stale files. (Alternatively, loop over this iteration many times until the workers exhaust all IPs and hit the error above, at which point the cluster cannot spawn any pod that needs a pod IP.)
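
A minimal loop sketch of the above, assuming a local file mastervertical-objects.yaml holding the deployment configs/services from [0] (the file name and iteration count are placeholders, not from the original test):

  for i in $(seq 1 20); do
    oc new-project "cni-leak-test-$i"
    oc create -n "cni-leak-test-$i" -f mastervertical-objects.yaml
    sleep 2
    oc delete project "cni-leak-test-$i" --wait=false
  done
  # Then compare /var/lib/cni/networks/openshift-sdn on each worker with the
  # pods scheduled there, as in the sketch further above.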

Actual results:
Leftover files can eventually exhaust a worker node's IP range and prevent workloads from running.

Expected results:
Unused IPs should be cleaned up on project deletion.


Additional info:

I tried running the small-scale example below without a sleep statement and the IPs seem to get cleaned up, so I believe it is a race condition on whether deployment configs are still creating objects while the project is being deleted. The large-scale example could be occurring for other reasons related to the fact that it is running a huge load.

Error Gist: https://gist.github.com/akrzos/958a3f8dc6a9f2cfd8dc3b00084a0395#file-gistfile1-txt
Small Scale Example: https://gist.github.com/akrzos/afbadb6af23d5d84dd3b83e5dfa2c26a

To remedy the situation, simply clean out the unused IP files (/var/lib/cni/networks/openshift-sdn) from each worker node. For example:
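
A minimal cleanup sketch, assuming the stale IPs were identified first (see the comparison sketch above); NODE and stale-ips.txt are placeholders:

  # Remove only reservation files whose IP no longer backs a pod on this node
  while read -r ip; do
    ssh core@$NODE "sudo rm -f /var/lib/cni/networks/openshift-sdn/$ip"
  done < stale-ips.txt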

[0] https://github.com/openshift/svt/blob/056d74239ff2d368d68d1dac56b18c67ef108b25/openshift_scalability/config/pyconfigMasterVertScale.yaml#L4

Comment 1 Alex Krzos 2019-03-14 18:54:31 UTC
This bug seems related; however, it appears to be more of a functional issue with two-interface Multus pods rather than single-interface pods.

https://bugzilla.redhat.com/show_bug.cgi?id=1652535

Comment 7 Alex Krzos 2019-04-09 14:33:34 UTC
Keeping the needinfo flag set until I gather the data.

