Bug 1532965

Summary: CNI IP files remain on disk after pods are deleted, and IPAM garbage collection is not triggered when the IPs are exhausted
Product: OpenShift Container Platform
Reporter: Meng Bo <bmeng>
Component: Networking
Assignee: Dan Williams <dcbw>
Networking sub component: multus
QA Contact: Weibin Liang <weliang>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
CC: aos-bugs, atragler, bbennett, bmchugh, bmeng, bmorriso, dakini, dcbw, dmoessne, fshaikh, jfiala, jrosenta, palonsor, rbost, ricardo.arguello, snalawad, vizak, wgordon, xmorano
Version: 3.9.0
Keywords: OpsBlocker
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-07-18 14:43:43 UTC
Type: Bug
Attachments:
docker_log
atomic_openshift_node_log
failed app create in online-int

Description Meng Bo 2018-01-10 06:13:05 UTC
Description of problem:
Some of the CNI IP files are not deleted along with their pods after the cluster has been running for a while, and IPAM garbage collection is not triggered when the IPs are exhausted.

Version-Release number of selected component (if applicable):
v3.9.0-0.16.0

How reproducible:
always

Steps to Reproduce (a command sketch follows the list):
1. Set up a multi-node environment with 1 master and 1 node
2. Create an rc with multiple pods
3. Delete the docker containers directly on the node
4. Check the CNI IP files on the node
5. Repeat steps 3 and 4
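
A hedged sketch of steps 2-4 (the rc definition file and replica count are illustrative, not taken from this report):

# oc create -f test-rc.yaml                  <- hypothetical rc with several replicas
# oc get pods -o wide                        <- confirm the pods are running on the node
On the node:
# docker rm -f $(docker ps -aq)              <- step 3: delete the containers directly
# ls /var/lib/cni/networks/openshift-sdn/    <- step 4: check the CNI IP files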

Actual results:
Some IP files that no longer have an associated pod/container remain on disk. Once the IPs are exhausted, no more pods can be created on the node.

Expected results:
The node should remove the unused IP files to release the IPs when the address range is full.

Additional info:
There are 36 IP files but only 5 pods on the node after the environment has been running for about 10 hours.

[root@ose-node2 ~]# ls /var/lib/cni/networks/openshift-sdn/ 
10.128.1.101  10.128.1.171  10.128.1.229  10.128.1.35  10.128.1.67  10.128.1.83  last_reserved_ip.0
10.128.1.121  10.128.1.188  10.128.1.233  10.128.1.38  10.128.1.68  10.128.1.84
10.128.1.125  10.128.1.189  10.128.1.249  10.128.1.51  10.128.1.69  10.128.1.85
10.128.1.129  10.128.1.199  10.128.1.29   10.128.1.53  10.128.1.72  10.128.1.86
10.128.1.135  10.128.1.20   10.128.1.30   10.128.1.59  10.128.1.81  10.128.1.87
10.128.1.168  10.128.1.21   10.128.1.34   10.128.1.66  10.128.1.82  10.128.1.89
[root@ose-node2 ~]# ls /var/lib/cni/networks/openshift-sdn/ | wc -l
37
[root@ose-master ~]# oc adm manage-node ose-node2.bmeng.local --list-pods

Listing matched pods on node: ose-node2.bmeng.local

NAMESPACE   NAME            READY     STATUS    RESTARTS   AGE
u1p1        test-rc-75rdh   1/1       Running   0          17m
u1p1        test-rc-b628c   1/1       Running   0          17m
u1p1        test-rc-jbfxn   1/1       Running   0          17m
u1p1        test-rc-x57qb   1/1       Running   0          17m
u1p1        test-rc-x72jv   1/1       Running   0          17m


After filling up the IPs on the node and trying to create more pods, the following errors appear in the node log:

Jan 10 13:11:15 ose-node2.bmeng.local atomic-openshift-node[10358]: E0110 13:11:15.526321   10358 cni.go:227] Error while adding to cni network: CNI request failed with status 400: 'failed to run IPAM for 9161eb31c5f1383dd7f1b019a9a49e43fcc7dcf6d4d5336d5c67854c70dceb42: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.128.1.1-10.128.1.254
Jan 10 13:11:15 ose-node2.bmeng.local atomic-openshift-node[10358]: '
Jan 10 13:11:16 ose-node2.bmeng.local atomic-openshift-node[10358]: E0110 13:11:16.286726   10358 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "test-rc-xz55f_u1p1" network: CNI request failed with status 400: 'failed to run IPAM for 9161eb31c5f1383dd7f1b019a9a49e43fcc7dcf6d4d5336d5c67854c70dceb42: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.128.1.1-10.128.1.254
Jan 10 13:11:16 ose-node2.bmeng.local atomic-openshift-node[10358]: '
Jan 10 13:11:16 ose-node2.bmeng.local atomic-openshift-node[10358]: E0110 13:11:16.286777   10358 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "test-rc-xz55f_u1p1(4caf06dd-f5b6-11e7-ae1c-525400dd3698)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "test-rc-xz55f_u1p1" network: CNI request failed with status 400: 'failed to run IPAM for 9161eb31c5f1383dd7f1b019a9a49e43fcc7dcf6d4d5336d5c67854c70dceb42: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.128.1.1-10.128.1.254
Jan 10 13:11:16 ose-node2.bmeng.local atomic-openshift-node[10358]: '
Jan 10 13:11:16 ose-node2.bmeng.local atomic-openshift-node[10358]: E0110 13:11:16.286791   10358 kuberuntime_manager.go:647] createPodSandbox for pod "test-rc-xz55f_u1p1(4caf06dd-f5b6-11e7-ae1c-525400dd3698)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "test-rc-xz55f_u1p1" network: CNI request failed with status 400: 'failed to run IPAM for 9161eb31c5f1383dd7f1b019a9a49e43fcc7dcf6d4d5336d5c67854c70dceb42: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.128.1.1-10.128.1.254
Jan 10 13:11:16 ose-node2.bmeng.local atomic-openshift-node[10358]: '
Jan 10 13:11:16 ose-node2.bmeng.local atomic-openshift-node[10358]: E0110 13:11:16.286850   10358 pod_workers.go:186] Error syncing pod 4caf06dd-f5b6-11e7-ae1c-525400dd3698 ("test-rc-xz55f_u1p1(4caf06dd-f5b6-11e7-ae1c-525400dd3698)"), skipping: failed to "CreatePodSandbox" for "test-rc-xz55f_u1p1(4caf06dd-f5b6-11e7-ae1c-525400dd3698)" with CreatePodSandboxError: "CreatePodSandbox for pod \"test-rc-xz55f_u1p1(4caf06dd-f5b6-11e7-ae1c-525400dd3698)\" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"test-rc-xz55f_u1p1\" network: CNI request failed with status 400: 'failed to run IPAM for 9161eb31c5f1383dd7f1b019a9a49e43fcc7dcf6d4d5336d5c67854c70dceb42: failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set: 10.128.1.1-10.128.1.254\n'"

Comment 1 Dan Williams 2018-01-10 18:02:34 UTC
What exact docker command are you using to delete the containers?  Are you deleting the container itself, the "pause" container, or both?

Comment 2 Meng Bo 2018-01-11 05:56:41 UTC
I deleted all the containers on the node with `docker rm -f $(docker ps -aq)`.

I use this method to reproduce the issue (IP files remaining on the node) because I have seen the issue in other environments.

Comment 3 Dan Williams 2018-01-15 22:26:19 UTC
Do you perform any other steps?  Or do you simply leave the node running with containers, and then run the "docker rm" command?

I was unable to reproduce with v3.9.0-0.16.0 by just:

1) start cluster with one node
2) create 4 pods
3) wait until pods have IP address
4) docker rm -f $(docker ps -aq)
5) wait at least a minute, check /var/lib/cni/networks/openshift-sdn/ for stale IPs

I found it sometimes takes up to 30 seconds or so for GC to kick in and garbage collect the sandbox IPs, but the cleanup did happen for me.

Are you able to reproduce at --log-level=5 and attach the node logs so I can debug further?
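
For reference, one way to raise the node log level on an RPM-based install (a hedged sketch; it assumes the usual /etc/sysconfig/atomic-openshift-node file with an OPTIONS=--loglevel=N line, so adjust for your install method):

# sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/atomic-openshift-node
# systemctl restart atomic-openshift-node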

Comment 4 Meng Bo 2018-01-16 02:11:13 UTC
1. No other operations are performed on the node; it just runs containers, which are then deleted and recreated.
2. I deleted the containers without waiting for them to get an IP address.
Here is what I was running on the node:
# while true ; do docker rm -f `docker ps -aq` ; sleep 10 ; if [ `ls /var/lib/cni/networks/openshift-sdn/ | wc -l` -eq 254 ] ; then sleep 36000 ; else ls ; fi ; sleep 5 ; done

I can reproduce it with log-level=5, but I am afraid the log file will be quite large since the issue may not reproduce quickly.

Comment 6 Dan Williams 2018-01-16 15:32:26 UTC
(In reply to Meng Bo from comment #4)
> 1. The node is not performed any other operations. Just running containers
> and delete containers and recreate.
> 2. I deleted the containers without waiting them have IP address.
> Here is the what I was running on the node.
> # while true ; do docker rm -f `docker ps -aq` ; sleep 10 ; if [ `ls
> /var/lib/cni/networks/openshift-sdn/ | wc -l` -eq 254 ] ; then sleep 36000 ;
> else ls ; fi ; sleep 5 ; done
> 
> I can reproduce it with log-level=5 but I am afraid the log file will be
> quite huge since it may not be reproduced easily.

Can you attach your node config too?  I was able to trigger the bug for about 5 minutes twice yesterday, but then the sandbox was garbage collected while I was investigating.  I guess sometimes GC just takes a few minutes.

I'll try to reproduce again with your bash loop.

Comment 7 Dan Williams 2018-01-16 15:51:49 UTC
One more question.  If you run 'docker ps -a', do you see a large number of exited, non-garbage-collected normal containers? (e.g., not containers using the openshift/origin-pod:<version> image)
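
A quick way to check (a hedged one-liner; the image substrings are assumptions based on the pod-infra image names that appear later in this report):

# docker ps -a --filter status=exited | grep -v -e ose-pod -e origin-pod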

Comment 8 Dan Williams 2018-01-16 18:23:44 UTC
I was able to get openshift-node into a state where it did not garbage collect a dead container, which prevents garbage collection of the container's sandbox, which prevents garbage collection of IPAM.

This would show up under "docker ps -a" as a long-exited container that has not been deleted by kubelet.

Comment 9 Dan Williams 2018-01-16 23:03:36 UTC
Digging into this more, when using GC defaults we expect kubelet to keep around one dead container (which also keeps its sandbox around) for a given pod.  Kubelet does this to allow quicker restart of a pod's container if needed.

So for example if we have a pod "nginx1", we would expect up to a maximum of:

a) one running app container
b) one running sandbox container owned by (a)
c) one terminated app container
d) (optional) one terminated sandbox container owned by (c)

(d) is optional; the actual container doesn't need to exist, but just (c) existing will prevent the IPAM from being GC-ed.

You'd think that if the sandbox container had already been deleted from docker/crio/etc that Kube would allow GC of the sandbox to happen, but that's not the case.  That should probably get changed, but unfortunately the CRI API between kubelet and the runtimes (docker/crio/etc) doesn't give us enough information to know whether the sandbox container is terminated or starting up.  Given that, we can't reliably use the sandbox state to GC terminated sandboxes or we'd start killing sandboxes that were in the process of being created.
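
For background, the dead-container retention described above can be tuned through kubeletArguments in node-config.yaml; a minimal sketch, assuming the standard kubelet GC flags (the values are illustrative and this is only a possible mitigation, not the fix discussed in this bug):

kubeletArguments:
  maximum-dead-containers-per-container:
  - "0"
  minimum-container-ttl-duration:
  - "30s"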

In your case, since you're using a replication controller, each pod has a different name, and thus kubelet might keep around one dead sandbox for each individual pod.  How many replicas had you set?

So if this is indeed the cause of what you're seeing, it'll be somewhat hard to fix without some icky changes upstream.  If not, I can try to create some debugging RPMs with a patch that should help us track this down further.

Comment 10 Meng Bo 2018-01-17 10:55:17 UTC
I reproduced the issue by setting the replicas to 8 when I reported the bug, but I did not reproduce it this afternoon with replicas=6 running for about 3 hours.

I saw a lot of exited containers this afternoon, and they were all deleted after a while.

I will try to reproduce it again tonight.

I suspect that when I delete/create the containers repeatedly, the docker daemon may come under heavy load, which prevents the kubelet from talking to it normally.



Here is the network part of the master config:
networkConfig:
  clusterNetworkCIDR: 10.128.0.0/14
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostSubnetLength: 8
  externalIPNetworkCIDRs: null
  hostSubnetLength: 8
  ingressIPNetworkCIDR: 172.29.0.0/16
  networkPluginName: redhat/openshift-ovs-networkpolicy
  serviceNetworkCIDR: 172.30.0.0/16


Here is the node config:
# cat /etc/origin/node/node-config.yaml
allowDisabledDocker: false
apiVersion: v1
authConfig:
  authenticationCacheSize: 1000
  authenticationCacheTTL: 5m
  authorizationCacheSize: 1000
  authorizationCacheTTL: 5m
dnsBindAddress: ""
dnsDomain: cluster.local
dnsIP: ""
dnsNameservers: null
dnsRecursiveResolvConf: ""
dockerConfig:
  dockerShimRootDirectory: /var/lib/dockershim
  dockerShimSocket: /var/run/dockershim.sock
  execHandlerName: native
enableUnidling: true
imageConfig:
  format: registry.access.redhat.com/openshift3/ose-${component}:${version}
  latest: false
iptablesSyncPeriod: 30s
kind: NodeConfig
masterClientConnectionOverrides:
  acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
  burst: 40
  contentType: application/vnd.kubernetes.protobuf
  qps: 20
masterKubeConfig: node.kubeconfig
networkConfig:
  mtu: 1450
  networkPluginName: redhat/openshift-ovs-networkpolicy
nodeName: ose-node1.bmeng.local
podManifestConfig: null
servingInfo:
  bindAddress: 0.0.0.0:10250
  bindNetwork: tcp4
  certFile: server.crt
  clientCA: node-client-ca.crt
  keyFile: server.key
  namedCertificates: null
volumeConfig:
  localQuota:
    perFSGroup: null
volumeDirectory: /root/openshift.local.volumes
kubeletArguments:
  fail-swap-on:
  - "false"

Comment 11 Dan Williams 2018-01-17 19:44:56 UTC
(In reply to Meng Bo from comment #10)
> I was reproduced the issue by setting the replicas to 8 when I reported the
> bug, and I did not reproduce it this afternoon with a replicas=6 for running
> about 3 hours.
> 
> I saw a lot of exited containers this afternoon and they all be deleted
> after a while. 
> 
> I will try to reproduce it again tonight.

OK, let me know the results.  If you can, please also grab "docker ps -a" on the node that shows the problem, and also run:

for f in /var/lib/cni/networks/openshift-sdn/*; do echo "${f}: `cat ${f}`"; done

so we can match up sandbox containers with their leases.
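
As a related check, leases whose container no longer exists can be flagged like this (a hedged sketch; it assumes docker is the runtime and that each lease file contains the sandbox container ID, as in the output below):

# for f in /var/lib/cni/networks/openshift-sdn/10.*; do id=$(cat ${f}); docker inspect ${id} >/dev/null 2>&1 || echo "stale lease: ${f} -> ${id}"; done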

Comment 12 Meng Bo 2018-01-18 02:51:11 UTC
I have reproduced the issue again after running the bash loop for about 10 hours:

# while true ; do docker rm -f `docker ps -aq` ; sleep 5 ; if [ `ls /var/lib/cni/networks/openshift-sdn | wc -l` -eq 254 ] ; then sleep 36000 ; else echo ; ls -l /var/lib/cni/networks/openshift-sdn ; echo ; docker ps -a ; fi ; done

And the CNI IP files are:
# ls -ltc /var/lib/cni/networks/openshift-sdn/ 
total 256
-rw-r--r--. 1 root root 64 Jan 18 10:29 10.128.1.85
-rw-r--r--. 1 root root 11 Jan 18 10:29 last_reserved_ip.0
-rw-r--r--. 1 root root 64 Jan 18 10:28 10.128.1.84
-rw-r--r--. 1 root root 64 Jan 18 10:28 10.128.1.83
-rw-r--r--. 1 root root 64 Jan 18 10:27 10.128.1.81
-rw-r--r--. 1 root root 64 Jan 18 10:27 10.128.1.77
-rw-r--r--. 1 root root 64 Jan 18 10:18 10.128.1.170
-rw-r--r--. 1 root root 64 Jan 18 10:17 10.128.1.158
-rw-r--r--. 1 root root 64 Jan 18 10:13 10.128.1.92
-rw-r--r--. 1 root root 64 Jan 18 10:11 10.128.1.60
-rw-r--r--. 1 root root 64 Jan 18 10:04 10.128.1.214
-rw-r--r--. 1 root root 64 Jan 18 09:29 10.128.1.121
-rw-r--r--. 1 root root 64 Jan 18 09:29 10.128.1.109
-rw-r--r--. 1 root root 64 Jan 18 09:10 10.128.1.44
-rw-r--r--. 1 root root 64 Jan 18 09:01 10.128.1.156
-rw-r--r--. 1 root root 64 Jan 18 08:58 10.128.1.107
-rw-r--r--. 1 root root 64 Jan 18 08:49 10.128.1.212
-rw-r--r--. 1 root root 64 Jan 18 08:48 10.128.1.193
-rw-r--r--. 1 root root 64 Jan 18 08:42 10.128.1.94
-rw-r--r--. 1 root root 64 Jan 18 08:33 10.128.1.190
-rw-r--r--. 1 root root 64 Jan 18 08:14 10.128.1.123
-rw-r--r--. 1 root root 64 Jan 18 08:03 10.128.1.184
-rw-r--r--. 1 root root 64 Jan 18 07:49 10.128.1.207
-rw-r--r--. 1 root root 64 Jan 18 07:34 10.128.1.218
-rw-r--r--. 1 root root 64 Jan 18 07:33 10.128.1.201
-rw-r--r--. 1 root root 64 Jan 18 07:19 10.128.1.228
-rw-r--r--. 1 root root 64 Jan 18 06:56 10.128.1.114
-rw-r--r--. 1 root root 64 Jan 18 06:43 10.128.1.155
-rw-r--r--. 1 root root 64 Jan 18 06:22 10.128.1.79
-rw-r--r--. 1 root root 64 Jan 18 06:14 10.128.1.205
-rw-r--r--. 1 root root 64 Jan 18 05:57 10.128.1.178
-rw-r--r--. 1 root root 64 Jan 18 05:33 10.128.1.50
-rw-r--r--. 1 root root 64 Jan 18 05:31 10.128.1.17
-rw-r--r--. 1 root root 64 Jan 18 05:23 10.128.1.139
-rw-r--r--. 1 root root 64 Jan 18 05:19 10.128.1.80
-rw-r--r--. 1 root root 64 Jan 18 05:10 10.128.1.199
-rw-r--r--. 1 root root 64 Jan 18 04:52 10.128.1.172
-rw-r--r--. 1 root root 64 Jan 18 04:43 10.128.1.24
-rw-r--r--. 1 root root 64 Jan 18 04:36 10.128.1.161
-rw-r--r--. 1 root root 64 Jan 18 04:05 10.128.1.179
-rw-r--r--. 1 root root 64 Jan 18 03:58 10.128.1.65
-rw-r--r--. 1 root root 64 Jan 18 03:54 10.128.1.10
-rw-r--r--. 1 root root 64 Jan 18 03:43 10.128.1.98
-rw-r--r--. 1 root root 64 Jan 18 03:23 10.128.1.31
-rw-r--r--. 1 root root 64 Jan 18 02:59 10.128.1.169
-rw-r--r--. 1 root root 64 Jan 18 02:53 10.128.1.78
-rw-r--r--. 1 root root 64 Jan 18 02:40 10.128.1.113
-rw-r--r--. 1 root root 64 Jan 18 02:17 10.128.1.239
-rw-r--r--. 1 root root 64 Jan 18 02:02 10.128.1.247
-rw-r--r--. 1 root root 64 Jan 18 01:59 10.128.1.208
-rw-r--r--. 1 root root 64 Jan 18 01:46 10.128.1.7
-rw-r--r--. 1 root root 64 Jan 18 00:17 10.128.1.124
-rw-r--r--. 1 root root 64 Jan 18 00:02 10.128.1.182
-rw-r--r--. 1 root root 64 Jan 17 23:23 10.128.1.70
-rw-r--r--. 1 root root 64 Jan 17 23:22 10.128.1.52
-rw-r--r--. 1 root root 64 Jan 17 23:02 10.128.1.250
-rw-r--r--. 1 root root 64 Jan 17 22:41 10.128.1.188
-rw-r--r--. 1 root root 64 Jan 17 22:35 10.128.1.82
-rw-r--r--. 1 root root 64 Jan 17 21:47 10.128.1.76
-rw-r--r--. 1 root root 64 Jan 17 21:10 10.128.1.240
-rw-r--r--. 1 root root 64 Jan 17 21:07 10.128.1.191
-rw-r--r--. 1 root root 64 Jan 17 20:45 10.128.1.101
-rw-r--r--. 1 root root 64 Jan 17 20:09 10.128.1.57
-rw-r--r--. 1 root root 64 Jan 17 19:58 10.128.1.233


Container list on the node:
# docker ps -a 
CONTAINER ID        IMAGE                                                                                                  COMMAND              CREATED             STATUS              PORTS               NAMES
a2427882699a        docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab   "/hello-openshift"   18 minutes ago      Up 18 minutes                           k8s_test-pod_test-rc-wb2t4_bmengp1_0c289d1e-fb5e-11e7-b872-525400dd3698_0
8ace01b49b4c        docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab   "/hello-openshift"   18 minutes ago      Up 18 minutes                           k8s_test-pod_test-rc-xrk5l_bmengp1_0c0f08df-fb5e-11e7-b872-525400dd3698_0
a8de6f3b2e7b        docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab   "/hello-openshift"   19 minutes ago      Up 19 minutes                           k8s_test-pod_test-rc-ztzjl_bmengp1_86f0e87e-fb54-11e7-b872-525400dd3698_0
113db611dfa3        docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab   "/hello-openshift"   19 minutes ago      Up 19 minutes                           k8s_test-pod_test-rc-2pd9t_bmengp1_5b22e7a3-fb67-11e7-b872-525400dd3698_0
2e15bd406db3        docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab   "/hello-openshift"   20 minutes ago      Up 19 minutes                           k8s_test-pod_test-rc-wkzgc_bmengp1_7c83752a-fb4f-11e7-b872-525400dd3698_0
7acb690e8577        registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0                                            "/usr/bin/pod"       20 minutes ago      Up 20 minutes                           k8s_POD_test-rc-wb2t4_bmengp1_0c289d1e-fb5e-11e7-b872-525400dd3698_0
904dadbcb4c2        registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0                                            "/usr/bin/pod"       20 minutes ago      Up 20 minutes                           k8s_POD_test-rc-xrk5l_bmengp1_0c0f08df-fb5e-11e7-b872-525400dd3698_0
6c668d2703c4        registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0                                            "/usr/bin/pod"       20 minutes ago      Up 20 minutes                           k8s_POD_test-rc-2pd9t_bmengp1_5b22e7a3-fb67-11e7-b872-525400dd3698_0
207e6b791576        registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0                                            "/usr/bin/pod"       20 minutes ago      Up 20 minutes                           k8s_POD_test-rc-ztzjl_bmengp1_86f0e87e-fb54-11e7-b872-525400dd3698_0
dd8d314e1851        registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0                                            "/usr/bin/pod"       20 minutes ago      Up 20 minutes                           k8s_POD_test-rc-wkzgc_bmengp1_7c83752a-fb4f-11e7-b872-525400dd3698_0



Container IDs in the IP files:
# for f in `ls /var/lib/cni/networks/openshift-sdn/`; do printf "${f}:  " ; cat ${f} ; echo ; done
10.128.1.10:  406c2d8b268cc16ae17d8c2d7bc1d80da412568d56fb19672cfd1f74c2e8ee4d
10.128.1.101:  cf8f853b004c6535a0a7356cbb2c487512bb48a2582d7a2e8b3dfe3fbc5bd013
10.128.1.107:  25d05a673251e571fe9ef308f31b513efabb4ce9536b5677dbbfe58348431a1b
10.128.1.109:  79a30f0c9871ef12698b8cbb06446f2c84702a0e8e2f8e79da9e489a2b9aec5f
10.128.1.113:  241b2aaad11481541b1a351a2eaa1f36404eb13232fde36209bbc29799798d24
10.128.1.114:  6ea0ddea5cf62564abf25afa845a5c42886823a29cc0a6020b5c85214ba8e303
10.128.1.121:  5888e0f412aad21b2ae23e8938f54b2f8a73df9a843b422cf8f6a5d18b056f0d
10.128.1.123:  0011d60fb9b65b08ccdb788c8023dd2ce559d9f54287784032072bd37430a861
10.128.1.124:  a4564ab7ae5c48483cd709d604da1b76f6c598c8d985a645ee9cceb05a5dab49
10.128.1.139:  fefd02c3e140bc7e2617f085a4feeb071f622fa8cb16f07925ab07ba69aa4763
10.128.1.155:  4effcf697b199619d06f941b71583404ee18ea477424477df08f1ac349f20441
10.128.1.156:  d44c1a53e63c2806686be0424d8060717fde61bb0aa14817269e16ddfda54360
10.128.1.158:  6ded324d6f3a6f50c7e93a377d57c11c366b73beac69a948f0556d5c6124d25d
10.128.1.161:  9b5f85324ce9b325c5f8bf802e7ad21a6b31632aef212f557f16eaa4b3c7e839
10.128.1.169:  0bce1eeafb4e82710e50f03acd0a8b6a6bbc02f5d1d5b151596bd2331ab6a661
10.128.1.17:  c404ffa8655d4c4ea2285e20ccad641d185f7234ba321604663557335c161d1d
10.128.1.170:  46e86e5bca489b3acb88b788a18f6710f32dcb1ce6bdc81e9e14632ae6938654
10.128.1.172:  d62589d7c8d87b2f3bca8b20131d004a17fbb817cf391f54c366a7ebb16cd96a
10.128.1.178:  bf2680c04efe4209d6dd48666fb704077479c4ba9c39d504884715e05f326e14
10.128.1.179:  d166650e51542d0db3ce9930946c87791cf0efff7b25c26c3e921d111264bc7a
10.128.1.182:  9bf3abac8d3be141c8c890c84868405af1668af78c91c22fcaee2e90cc154261
10.128.1.184:  0e5966dd2638331b5516165096d07f5f8b2a22a1a302fe0db6674cd2345c9c7d
10.128.1.188:  fda7d7e81166fc03821c264f4c34840466cba8ef0fcfb712fbf33f42bc99143e
10.128.1.190:  09efa7423ad540533671874f3c37992c7359c9a838f4a075aa69cbf02bfe344e
10.128.1.191:  3651b70ccbd478463b28ab3e44ad2b5a3382c146bb17593862a7e1606e367735
10.128.1.193:  2963662a56a91056244da239cc4c73545cb31113fb02fe5465613ec41919284e
10.128.1.199:  aad32556f65fbe315a346955461ec61e74c45c3a14620506020e7a332dc68901
10.128.1.201:  9358bd9af4f21b0877b3cc161c4d86560de1baa97fa6ed608ed4859033a8f57f
10.128.1.205:  c8eab9ef86bdc4139e1f2bb26f05aa77758e5e1a3bb122444177d51d08341145
10.128.1.207:  596dcff6911e31194fcd48d2fc180928d018476ead4e7c545e6d5da6a4cc2c36
10.128.1.208:  7812c187533d354347cb876e06d9ab2fde09a267ea853c77a5baee048f550b23
10.128.1.212:  17049599e640a644b247633770725848925aced4b605bb2ec6881dc1ff7c6e5a
10.128.1.214:  fcd021b90e17791acfe6c63c9fab055a944fc99a50b5ac0b221c5be5f8281b3c
10.128.1.218:  3c92da3558c033283ca086e23ced2704731102608cc8cd31807915e10b0ea665
10.128.1.228:  eca086a33ddfb432f0509449e0ab9a5283196874eaa6b118587d4fb9a829315f
10.128.1.233:  b56760d7a143f854e77fecbc2fdc8a283d30fe94cc817e2ff0a8d517230a4825
10.128.1.239:  1519f4a7081d1702c9394dd85f9348684b7548ca3ca94a5619017e9a98490c4a
10.128.1.24:  eb9ccfb80e572d5bc62bf9373bd0d725d9673df5eae3af19025213f3a26f5254
10.128.1.240:  c0446bc0dab3ff0fe87b9594fd85fe121f65eac48b4e72595d9ddc7f737aae31
10.128.1.247:  fbd8ab45495e796581a7ee4004098e2e2578e396e392354ef76180805c881c61
10.128.1.250:  9bee76c3278f100dff02c0ea54a3ac751a156f8188843115a3cad2661ffcec62
10.128.1.31:  cf6557b30d3cbd2d8fa8f5ea6280485b6bdca891f92a533773afbb344b171c2b
10.128.1.44:  3e0dca968a515d32ddc44534e62b9c2f8bf1c9ca5ae0856aeca0757bf574c169
10.128.1.50:  ff41d9931e95f77ef068fd9131f9633895b22fb904f9f67abbe3a9fa8ca78470
10.128.1.52:  7e1527734b571b464a5ca657d3c7dd2196b8057a55c0d242659a0dd50d49bd24
10.128.1.57:  fdfcb5bbf2ba20489a217b928fc8052c2b48738b1e8f352868452a21847e107d
10.128.1.60:  c6980e5e28d7ad43c08307d02e842273318f8fd51dbb7107fd985e067c181107
10.128.1.65:  8d3e781ce4cacf77441891d00591fe4ea6b13db63913be281c23b6ad569f9d77
10.128.1.7:  c2af3c2dc556f17e4fccf2c30d4bdf10d378c9030adec77de6bcf1ac72a10c2b
10.128.1.70:  ccbca2661f11122e2aec47c4c318985f899df58c1bcb976a8341264fdf50b33c
10.128.1.76:  f1e4a73d1b380e22d8c3949288529f32c5dbecc56778aa28a81f495765a63939
10.128.1.77:  dd8d314e1851c015283530aa208fdd629da4b9722d6cac343d0fd60954a3f402
10.128.1.78:  4e492de4cafa760809ca2574b6469d56634b84f6291679600b4d4bb98b3bf23e
10.128.1.79:  c7b4ed93a532781d96f9d80aa8dfd24a3615519ea03c29759abe548a488f3f52
10.128.1.80:  cddcb0caee1dfe79a26680b049bab210724f27b497264d7f04f9654809049245
10.128.1.81:  6c668d2703c48c2850cf8a75e878365cfc59520eb22a432d7703f16c087ffa63
10.128.1.82:  c91d88914bd6a80c0bb5c26140d28a5e5c2a1621c643996acbd544c593d7bf07
10.128.1.83:  207e6b7915761f7837ee481930147f033cb27896e8838e3faed787efcb1cfa72
10.128.1.84:  904dadbcb4c26b16b3b997e9f7517d43569e5c06b1c58e664424903ba3ec8e43
10.128.1.85:  7acb690e85779940dd13dc7ceb9aa71bf7d9f8ebbf927a2f4909904c9bbd771b
10.128.1.92:  e956dd81b8f1062b070e1e7b1fc78b2a45a271b778128a176ca4f39b065a5002
10.128.1.94:  e2d8e8fe4cae0e57be12356a86cdc52085263824334f600929754ae1777fbd08
10.128.1.98:  d75feeff101f476ddc4e66d3365ac472c7629c5dd766547d650b0adcbe177005
last_reserved_ip.0:  10.128.1.85


Unfortunately, both the docker log and the atomic-openshift-node log have been rotated, leaving only about 1 hour of history. I will attach them later.

Comment 13 Meng Bo 2018-01-18 02:53:18 UTC
Created attachment 1382724 [details]
docker_log

Comment 14 Meng Bo 2018-01-18 03:18:34 UTC
Created attachment 1382725 [details]
atomic_openshift_node_log

Comment 15 Dan Williams 2018-01-19 16:38:57 UTC
(In reply to Meng Bo from comment #14)
> Created attachment 1382725 [details]
> atomic_openshift_node_log

Any chance we can get the older rotated logs too?

Comment 16 Meng Bo 2018-01-22 15:30:56 UTC
OK, I am trying to reproduce it again with logrotate set to keep more history. Hopefully I can get more logs later.

Comment 19 Meng Bo 2018-02-01 02:27:45 UTC
Yeah, sure.

I think rpms are easy for me to install.

Comment 29 Stefanie Forrester 2018-04-11 21:44:56 UTC
Created attachment 1420518 [details]
failed app create in online-int

Creating an app fails with the error "failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set".

Comment 30 Dan Williams 2018-04-19 19:53:46 UTC
For the original bug as reported by Meng Bo, I've pushed some upstream PRs to fix the issue he identified, which I was able to reproduce:

https://github.com/kubernetes-incubator/cri-o/pull/1531
https://github.com/kubernetes/kubernetes/pull/62874

For the OpenShift Online case, we'd need detailed node logs to see whether that case matches what I've fixed here.

The fix ensures that networking is cleaned up if the network plugin itself returns success but some other error happens afterwards, before the runtime (e.g. dockershim or CRI-O) returns to kubelet.

Comment 31 Dan Williams 2018-05-01 14:43:38 UTC
Origin fixes are:

master: https://github.com/openshift/origin/pull/19576
3.9: https://github.com/openshift/origin/pull/19577

Comment 32 Dan Williams 2018-05-09 17:55:46 UTC
Master PR has merged, 3.9 PR is now https://github.com/openshift/origin/pull/19665

Comment 36 Meng Bo 2018-06-01 02:41:09 UTC
Tried on OCP v3.10.0-0.54.0.

After deleting/recreating the docker containers multiple times, only the IPAM files for the existing containers remain.

Verified the bug.
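
For anyone re-checking this later, a rough way to compare the lease count with the number of running sandboxes (a hedged sketch; it assumes the docker runtime and the default k8s_POD sandbox naming):

# ls /var/lib/cni/networks/openshift-sdn/ | grep -vc last_reserved_ip
# docker ps --filter "name=k8s_POD" -q | wc -l

The two counts should match when no stale leases are left behind.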