Description of problem:
It is possible to create a project with a few objects, delete that project within a few seconds, and find left-over "used" IPs remaining on the worker nodes in the form of files in "/var/lib/cni/networks/openshift-sdn".
I have now seen two versions of this:
1. Create a single project with several templates resembling the MasterVertical test, sleep 2 seconds, delete the project, and review the files on each worker under "/var/lib/cni/networks/openshift-sdn" to find stale or unused IPs versus the output of oc get pods for that specific worker.
2. On a large-scale cluster, create pods to very near capacity, then delete and clean up the projects/pods; many stale files are left over.
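The comparison in case 1 above can be scripted. The helper below is a minimal sketch (the function name and the one-IP-per-line input format are my own choices): it takes a file of allocated IPs, e.g. the file names from "/var/lib/cni/networks/openshift-sdn", and a file of in-use pod IPs, e.g. from oc get pods -o wide filtered to the node, and prints the stale ones.

```shell
#!/usr/bin/env bash
# Print IPs that have an allocation file but no running pod (i.e. stale).
# Usage: stale_ips ALLOCATED_LIST INUSE_LIST
#   ALLOCATED_LIST: one IP per line, e.g.
#     ls /var/lib/cni/networks/openshift-sdn  (IP-named files only)
#   INUSE_LIST: one IP per line, e.g. the IP column of
#     oc get pods --all-namespaces -o wide  (rows for the node in question)
stale_ips() {
    # lines present in the allocated list but absent from the in-use list
    grep -vxF -f "$2" "$1"
}
```

Any IP printed by the helper corresponds to a left-over allocation file on that worker.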
Once enough stale files accumulate, you will receive the following error, and pods get stuck in ContainerCreating on the worker nodes:
Warning FailedCreatePodSandBox 2m34s (x24 over 9m56s) kubelet, ip-10-0-161-155.us-west-2.compute.internal (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_deploymentconfig0-1-deploy_c0_4baa8ff3-45b7-11e9-a8a6-0a191194ab3e_0(9119e9b35c3722ef3f28843d0f65625177e7cd976f7a1352d936cb5c8d948815): Multus: Err adding pod to network "openshift-sdn": Multus: error in invoke Delegate add - "openshift-sdn": CNI request failed with status 400: 'failed to run IPAM for 9119e9b35c3722ef3f28843d0f65625177e7cd976f7a1352d936cb5c8d948815: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.129.2.1-10.129.3.254
Version-Release number of selected component (if applicable):
Payload HTB2 - 4.0.0-0.nightly-2019-03-04-234414
How reproducible:
Pretty much always; at small scale it seems related to how quickly the project is deleted after all the oc create commands have run.
Steps to Reproduce:
1. Create a project with deployment configs; upon completion of the oc commands, sleep 2 seconds, then delete the project.
2. Retry the above several times and/or adjust the number of objects created in the project before deleting.
3. Check the worker nodes for stale files. (Alternatively, loop over this iteration many times until the workers exhaust all IPs and hit the error above, at which point the cluster cannot spawn pods that need a pod IP.)
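The steps above can be sketched as a loop. This is illustrative only: the template file name (mastervertical.yaml), the project naming, and the iteration count are assumptions, and the oc binary is passed in as a parameter rather than assumed on PATH.

```shell
#!/usr/bin/env bash
# Sketch of the reproducer loop: create a project, create objects from a
# template, sleep briefly, delete the project, repeat.
# Usage: repro_loop OC_CMD ITERATIONS [DELAY_SECONDS]
repro_loop() {
    local oc_cmd=$1 iters=$2 delay=${3:-2}
    local i
    for i in $(seq 1 "$iters"); do
        "$oc_cmd" new-project "leak-test-$i"
        # create several objects in the project (template name is an example)
        "$oc_cmd" process -f mastervertical.yaml -p IDENTIFIER="$i" | \
            "$oc_cmd" create -f -
        sleep "$delay"    # the short delay before deletion that hits the race
        "$oc_cmd" delete project "leak-test-$i" --wait=false
    done
}
# e.g.: repro_loop oc 10
```

After the loop finishes, inspect /var/lib/cni/networks/openshift-sdn on each worker for files with no matching pod.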
Actual results:
Left-over files can eventually exhaust a worker node's IP range and prevent workloads from running.

Expected results:
Unused IPs should be cleaned up on project deletion.
Additional info:
I tried running the small-scale example below without the sleep statement and the IPs do get cleaned up, so I believe this is a race condition: deployment configs are still creating objects while the project is being deleted. The large-scale example could be occurring for other reasons related to the huge load it runs.
Error Gist: https://gist.github.com/akrzos/958a3f8dc6a9f2cfd8dc3b00084a0395#file-gistfile1-txt
Small Scale Example: https://gist.github.com/akrzos/afbadb6af23d5d84dd3b83e5dfa2c26a
To remedy the situation, simply clean out the unused IP files (/var/lib/cni/networks/openshift-sdn) from each worker node.
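A cleanup along those lines might look like the sketch below. The function name and the one-IP-per-line in-use list are my own conventions; the directory and the in-use list are parameters so the logic can be exercised safely before pointing it at /var/lib/cni/networks/openshift-sdn on a real worker.

```shell
#!/usr/bin/env bash
# Remove allocation files whose name (an IP) is not in the in-use list.
# Usage: clean_stale /var/lib/cni/networks/openshift-sdn inuse.txt
#   inuse.txt: one IP per line, e.g. collected from oc get pods -o wide
clean_stale() {
    local dir=$1 inuse=$2 f ip
    for f in "$dir"/*; do
        ip=$(basename "$f")
        # skip bookkeeping files whose names are not bare IPs
        case $ip in
            *[!0-9.]*) continue ;;
        esac
        # delete the allocation file only if no pod is using that IP
        grep -qxF "$ip" "$inuse" || rm -f "$f"
    done
}
```

Non-IP files in the directory (e.g. the plugin's reservation bookkeeping) are left untouched.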
This bug seems related; however, it appears to be more of a functional issue with two-interface Multus pods rather than single-interface pods.
Keeping the needinfo flag set until I gather the data.