Bug 1532965
Summary: CNI IP files remain on disk after the pod is deleted, and IPAM garbage collection is not triggered when the IPs are exhausted

| Product: | OpenShift Container Platform | Reporter: | Meng Bo <bmeng> |
|---|---|---|---|
| Component: | Networking | Assignee: | Dan Williams <dcbw> |
| Networking sub component: | multus | QA Contact: | Weibin Liang <weliang> |
| Status: | CLOSED CURRENTRELEASE | Severity: | urgent |
| Priority: | urgent | Keywords: | OpsBlocker |
| Version: | 3.9.0 | Target Release: | 3.10.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Last Closed: | 2018-07-18 14:43:43 UTC | Type: | Bug |
| CC: | aos-bugs, atragler, bbennett, bmchugh, bmeng, bmorriso, dakini, dcbw, dmoessne, fshaikh, jfiala, jrosenta, palonsor, rbost, ricardo.arguello, snalawad, vizak, wgordon, xmorano | | |
Description
Meng Bo, 2018-01-10 06:13:05 UTC
What exact docker command are you using to delete the containers? Are you deleting the container itself, the "pause" container, or both?

I deleted all the containers on the node with `docker rm -f $(docker ps -aq)`. I use this method to reproduce the issue (IP files remaining on the node) since I have seen the issue in other environments.

Do you perform any other steps? Or do you simply leave the node running with containers and then run the "docker rm" command? I was unable to reproduce with v3.9.0-0.16.0 by just:

1) start a cluster with one node
2) create 4 pods
3) wait until the pods have IP addresses
4) docker rm -f $(docker ps -aq)
5) wait at least a minute, then check /var/lib/cni/networks/openshift-sdn/ for stale IPs

I found it sometimes takes up to 30 seconds or so for GC to kick in and garbage collect the sandbox IPs, but it did happen for me. Are you able to reproduce at --log-level=5 and attach the node logs so I can debug further?

1. No other operations are performed on the node; it just runs containers, deletes them, and recreates them.
2. I deleted the containers without waiting for them to get an IP address.

Here is what I was running on the node:

```
# while true ; do docker rm -f `docker ps -aq` ; sleep 10 ; if [ `ls /var/lib/cni/networks/openshift-sdn/ | wc -l` -eq 254 ] ; then sleep 36000 ; else ls ; fi ; sleep 5 ; done
```

I can reproduce it with log-level=5, but I am afraid the log file will be quite huge since the issue may not be reproduced easily.

(In reply to Meng Bo from comment #4)
> 1. No other operations are performed on the node; it just runs containers,
> deletes them, and recreates them.
> 2. I deleted the containers without waiting for them to get an IP address.
> Here is what I was running on the node:
> # while true ; do docker rm -f `docker ps -aq` ; sleep 10 ; if [ `ls
> /var/lib/cni/networks/openshift-sdn/ | wc -l` -eq 254 ] ; then sleep 36000 ;
> else ls ; fi ; sleep 5 ; done
>
> I can reproduce it with log-level=5, but I am afraid the log file will be
> quite huge since the issue may not be reproduced easily.

Can you attach your node config too? I was able to trigger the bug for about 5 minutes twice yesterday, but then the sandbox was garbage collected while I was investigating. I guess sometimes GC just takes a few minutes. I'll try to reproduce again with your bash loop.

One more question: if you run `docker ps -a`, do you see a large number of exited, non-garbage-collected normal containers (i.e., not containers using the openshift/origin-pod:<version> image)? I was able to get openshift-node into a state where it did not garbage collect a dead container, which prevents garbage collection of the container's sandbox, which in turn prevents garbage collection of the IPAM lease. This would show up under `docker ps -a` as a long-exited container that has not been deleted by the kubelet.

Digging into this more: when using the GC defaults, we expect the kubelet to keep around one dead container (which also keeps its sandbox around) for a given pod. The kubelet does this to allow quicker restart of a pod's container if needed. So for example, if we have a pod "nginx1", we would expect up to a maximum of:

a) one running app container
b) one running sandbox container owned by (a)
c) one terminated app container
d) (optional) one terminated sandbox container owned by (c)

(d) is optional; the actual container doesn't need to exist, but just (c) existing will prevent the IPAM from being GC-ed.
You'd think that if the sandbox container had already been deleted from docker/crio/etc., Kube would allow GC of the sandbox to happen, but that's not the case. That should probably be changed, but unfortunately the CRI API between the kubelet and the runtimes (docker/crio/etc.) doesn't give us enough information to know whether a sandbox container is terminated or starting up. Given that, we can't reliably use the sandbox state to GC terminated sandboxes, or we'd start killing sandboxes that were in the process of being created.

In your case, since you're using a replication controller, each pod has a different name, and thus the kubelet might keep around one dead sandbox for each individual pod. How many replicas had you set?

So if this is indeed the cause of what you're seeing, it'll be somewhat hard to fix without some icky changes upstream. If not, I can try to create some debugging RPMs with a patch that should help us track this down further.

I reproduced the issue by setting the replicas to 8 when I reported the bug, but I could not reproduce it this afternoon with replicas=6 after running for about 3 hours. I saw a lot of exited containers this afternoon, and they were all deleted after a while. I will try to reproduce it again tonight. I suspect that when I delete/create the containers repeatedly, the docker daemon may come under high load, which prevents the kubelet from talking to the docker daemon normally.

Here is the network part of the master config:

```yaml
networkConfig:
  clusterNetworkCIDR: 10.128.0.0/14
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostSubnetLength: 8
  externalIPNetworkCIDRs: null
  hostSubnetLength: 8
  ingressIPNetworkCIDR: 172.29.0.0/16
  networkPluginName: redhat/openshift-ovs-networkpolicy
  serviceNetworkCIDR: 172.30.0.0/16
```

Here is the node config:

```yaml
# cat /etc/origin/node/node-config.yaml
allowDisabledDocker: false
apiVersion: v1
authConfig:
  authenticationCacheSize: 1000
  authenticationCacheTTL: 5m
  authorizationCacheSize: 1000
  authorizationCacheTTL: 5m
dnsBindAddress: ""
dnsDomain: cluster.local
dnsIP: ""
dnsNameservers: null
dnsRecursiveResolvConf: ""
dockerConfig:
  dockerShimRootDirectory: /var/lib/dockershim
  dockerShimSocket: /var/run/dockershim.sock
  execHandlerName: native
enableUnidling: true
imageConfig:
  format: registry.access.redhat.com/openshift3/ose-${component}:${version}
  latest: false
iptablesSyncPeriod: 30s
kind: NodeConfig
masterClientConnectionOverrides:
  acceptContentTypes: application/vnd.kubernetes.protobuf,application/json
  burst: 40
  contentType: application/vnd.kubernetes.protobuf
  qps: 20
masterKubeConfig: node.kubeconfig
networkConfig:
  mtu: 1450
  networkPluginName: redhat/openshift-ovs-networkpolicy
nodeName: ose-node1.bmeng.local
podManifestConfig: null
servingInfo:
  bindAddress: 0.0.0.0:10250
  bindNetwork: tcp4
  certFile: server.crt
  clientCA: node-client-ca.crt
  keyFile: server.key
  namedCertificates: null
volumeConfig:
  localQuota:
    perFSGroup: null
  volumeDirectory: /root/openshift.local.volumes
kubeletArguments:
  fail-swap-on:
  - "false"
```

(In reply to Meng Bo from comment #10)
> I reproduced the issue by setting the replicas to 8 when I reported the
> bug, but I could not reproduce it this afternoon with replicas=6 after
> running for about 3 hours.
>
> I saw a lot of exited containers this afternoon, and they were all deleted
> after a while.
>
> I will try to reproduce it again tonight.

Ok, let me know the results.
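Since each host-local IPAM lease file under /var/lib/cni/networks/openshift-sdn is named for the reserved IP and contains the ID of the container that reserved it (as the lease dumps later in this report show), stale leases can be spotted by cross-referencing the files against docker. A minimal diagnostic sketch, assuming the docker runtime and the default openshift-sdn paths shown in this report (`check-stale-leases.sh` is a hypothetical helper, not part of OpenShift):

```bash
#!/bin/bash
# check-stale-leases.sh (hypothetical diagnostic helper): each host-local
# IPAM lease file is named for a reserved IP and contains the ID of the
# container that reserved it. Report leases whose container is gone.
CNI_DIR=/var/lib/cni/networks/openshift-sdn
for f in "${CNI_DIR}"/*; do
    ip=$(basename "${f}")
    # last_reserved_ip.0 is allocator bookkeeping, not a lease
    [ "${ip}" = "last_reserved_ip.0" ] && continue
    id=$(tr -d '[:space:]' < "${f}")
    if ! docker inspect "${id}" >/dev/null 2>&1; then
        echo "stale lease: ${ip} -> ${id}"
    fi
done
```

On a node exhibiting the bug, every IP this loop prints should correspond to a long-gone container, matching the behavior described above.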
If you can also grab "docker ps -a" on the node that shows the problem, and also: for f in /var/lib/cni/networks/openshift-sdn/*; do echo "${f}: `cat ${f}`; done so we can match up sandbox containers with their leases. I have reproduced the issue again with about 10 hours running the bash loop: # while true ; do docker rm -f `docker ps -aq` ; sleep 5 ; if [ `ls /var/lib/cni/networks/openshift-sdn | wc -l` -eq 254 ] ; then sleep 36000 ; else echo ; ls -l /var/lib/cni/networks/openshift-sdn ; echo ; docker ps -a ; fi ; done And the ip file for CNI are: # ls -ltc /var/lib/cni/networks/openshift-sdn/ total 256 -rw-r--r--. 1 root root 64 Jan 18 10:29 10.128.1.85 -rw-r--r--. 1 root root 11 Jan 18 10:29 last_reserved_ip.0 -rw-r--r--. 1 root root 64 Jan 18 10:28 10.128.1.84 -rw-r--r--. 1 root root 64 Jan 18 10:28 10.128.1.83 -rw-r--r--. 1 root root 64 Jan 18 10:27 10.128.1.81 -rw-r--r--. 1 root root 64 Jan 18 10:27 10.128.1.77 -rw-r--r--. 1 root root 64 Jan 18 10:18 10.128.1.170 -rw-r--r--. 1 root root 64 Jan 18 10:17 10.128.1.158 -rw-r--r--. 1 root root 64 Jan 18 10:13 10.128.1.92 -rw-r--r--. 1 root root 64 Jan 18 10:11 10.128.1.60 -rw-r--r--. 1 root root 64 Jan 18 10:04 10.128.1.214 -rw-r--r--. 1 root root 64 Jan 18 09:29 10.128.1.121 -rw-r--r--. 1 root root 64 Jan 18 09:29 10.128.1.109 -rw-r--r--. 1 root root 64 Jan 18 09:10 10.128.1.44 -rw-r--r--. 1 root root 64 Jan 18 09:01 10.128.1.156 -rw-r--r--. 1 root root 64 Jan 18 08:58 10.128.1.107 -rw-r--r--. 1 root root 64 Jan 18 08:49 10.128.1.212 -rw-r--r--. 1 root root 64 Jan 18 08:48 10.128.1.193 -rw-r--r--. 1 root root 64 Jan 18 08:42 10.128.1.94 -rw-r--r--. 1 root root 64 Jan 18 08:33 10.128.1.190 -rw-r--r--. 1 root root 64 Jan 18 08:14 10.128.1.123 -rw-r--r--. 1 root root 64 Jan 18 08:03 10.128.1.184 -rw-r--r--. 1 root root 64 Jan 18 07:49 10.128.1.207 -rw-r--r--. 1 root root 64 Jan 18 07:34 10.128.1.218 -rw-r--r--. 1 root root 64 Jan 18 07:33 10.128.1.201 -rw-r--r--. 1 root root 64 Jan 18 07:19 10.128.1.228 -rw-r--r--. 1 root root 64 Jan 18 06:56 10.128.1.114 -rw-r--r--. 1 root root 64 Jan 18 06:43 10.128.1.155 -rw-r--r--. 1 root root 64 Jan 18 06:22 10.128.1.79 -rw-r--r--. 1 root root 64 Jan 18 06:14 10.128.1.205 -rw-r--r--. 1 root root 64 Jan 18 05:57 10.128.1.178 -rw-r--r--. 1 root root 64 Jan 18 05:33 10.128.1.50 -rw-r--r--. 1 root root 64 Jan 18 05:31 10.128.1.17 -rw-r--r--. 1 root root 64 Jan 18 05:23 10.128.1.139 -rw-r--r--. 1 root root 64 Jan 18 05:19 10.128.1.80 -rw-r--r--. 1 root root 64 Jan 18 05:10 10.128.1.199 -rw-r--r--. 1 root root 64 Jan 18 04:52 10.128.1.172 -rw-r--r--. 1 root root 64 Jan 18 04:43 10.128.1.24 -rw-r--r--. 1 root root 64 Jan 18 04:36 10.128.1.161 -rw-r--r--. 1 root root 64 Jan 18 04:05 10.128.1.179 -rw-r--r--. 1 root root 64 Jan 18 03:58 10.128.1.65 -rw-r--r--. 1 root root 64 Jan 18 03:54 10.128.1.10 -rw-r--r--. 1 root root 64 Jan 18 03:43 10.128.1.98 -rw-r--r--. 1 root root 64 Jan 18 03:23 10.128.1.31 -rw-r--r--. 1 root root 64 Jan 18 02:59 10.128.1.169 -rw-r--r--. 1 root root 64 Jan 18 02:53 10.128.1.78 -rw-r--r--. 1 root root 64 Jan 18 02:40 10.128.1.113 -rw-r--r--. 1 root root 64 Jan 18 02:17 10.128.1.239 -rw-r--r--. 1 root root 64 Jan 18 02:02 10.128.1.247 -rw-r--r--. 1 root root 64 Jan 18 01:59 10.128.1.208 -rw-r--r--. 1 root root 64 Jan 18 01:46 10.128.1.7 -rw-r--r--. 1 root root 64 Jan 18 00:17 10.128.1.124 -rw-r--r--. 1 root root 64 Jan 18 00:02 10.128.1.182 -rw-r--r--. 1 root root 64 Jan 17 23:23 10.128.1.70 -rw-r--r--. 1 root root 64 Jan 17 23:22 10.128.1.52 -rw-r--r--. 
1 root root 64 Jan 17 23:02 10.128.1.250 -rw-r--r--. 1 root root 64 Jan 17 22:41 10.128.1.188 -rw-r--r--. 1 root root 64 Jan 17 22:35 10.128.1.82 -rw-r--r--. 1 root root 64 Jan 17 21:47 10.128.1.76 -rw-r--r--. 1 root root 64 Jan 17 21:10 10.128.1.240 -rw-r--r--. 1 root root 64 Jan 17 21:07 10.128.1.191 -rw-r--r--. 1 root root 64 Jan 17 20:45 10.128.1.101 -rw-r--r--. 1 root root 64 Jan 17 20:09 10.128.1.57 -rw-r--r--. 1 root root 64 Jan 17 19:58 10.128.1.233 container list on the node: # docker ps -a CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES a2427882699a docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab "/hello-openshift" 18 minutes ago Up 18 minutes k8s_test-pod_test-rc-wb2t4_bmengp1_0c289d1e-fb5e-11e7-b872-525400dd3698_0 8ace01b49b4c docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab "/hello-openshift" 18 minutes ago Up 18 minutes k8s_test-pod_test-rc-xrk5l_bmengp1_0c0f08df-fb5e-11e7-b872-525400dd3698_0 a8de6f3b2e7b docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab "/hello-openshift" 19 minutes ago Up 19 minutes k8s_test-pod_test-rc-ztzjl_bmengp1_86f0e87e-fb54-11e7-b872-525400dd3698_0 113db611dfa3 docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab "/hello-openshift" 19 minutes ago Up 19 minutes k8s_test-pod_test-rc-2pd9t_bmengp1_5b22e7a3-fb67-11e7-b872-525400dd3698_0 2e15bd406db3 docker.io/aosqe/pod-for-ping@sha256:23ff71f5a8774055faedb0ae5aa882fcf8c1dd93e4a19b09ca1085478ad315ab "/hello-openshift" 20 minutes ago Up 19 minutes k8s_test-pod_test-rc-wkzgc_bmengp1_7c83752a-fb4f-11e7-b872-525400dd3698_0 7acb690e8577 registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0 "/usr/bin/pod" 20 minutes ago Up 20 minutes k8s_POD_test-rc-wb2t4_bmengp1_0c289d1e-fb5e-11e7-b872-525400dd3698_0 904dadbcb4c2 registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0 "/usr/bin/pod" 20 minutes ago Up 20 minutes k8s_POD_test-rc-xrk5l_bmengp1_0c0f08df-fb5e-11e7-b872-525400dd3698_0 6c668d2703c4 registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0 "/usr/bin/pod" 20 minutes ago Up 20 minutes k8s_POD_test-rc-2pd9t_bmengp1_5b22e7a3-fb67-11e7-b872-525400dd3698_0 207e6b791576 registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0 "/usr/bin/pod" 20 minutes ago Up 20 minutes k8s_POD_test-rc-ztzjl_bmengp1_86f0e87e-fb54-11e7-b872-525400dd3698_0 dd8d314e1851 registry.access.redhat.com/openshift3/ose-pod:v3.9.0-0.20.0 "/usr/bin/pod" 20 minutes ago Up 20 minutes k8s_POD_test-rc-wkzgc_bmengp1_7c83752a-fb4f-11e7-b872-525400dd3698_0 Container ids in the ip files: # for f in `ls /var/lib/cni/networks/openshift-sdn/`; do printf "${f}: " ; cat ${f} ; echo ; done 10.128.1.10: 406c2d8b268cc16ae17d8c2d7bc1d80da412568d56fb19672cfd1f74c2e8ee4d 10.128.1.101: cf8f853b004c6535a0a7356cbb2c487512bb48a2582d7a2e8b3dfe3fbc5bd013 10.128.1.107: 25d05a673251e571fe9ef308f31b513efabb4ce9536b5677dbbfe58348431a1b 10.128.1.109: 79a30f0c9871ef12698b8cbb06446f2c84702a0e8e2f8e79da9e489a2b9aec5f 10.128.1.113: 241b2aaad11481541b1a351a2eaa1f36404eb13232fde36209bbc29799798d24 10.128.1.114: 6ea0ddea5cf62564abf25afa845a5c42886823a29cc0a6020b5c85214ba8e303 10.128.1.121: 5888e0f412aad21b2ae23e8938f54b2f8a73df9a843b422cf8f6a5d18b056f0d 10.128.1.123: 0011d60fb9b65b08ccdb788c8023dd2ce559d9f54287784032072bd37430a861 10.128.1.124: a4564ab7ae5c48483cd709d604da1b76f6c598c8d985a645ee9cceb05a5dab49 10.128.1.139: 
fefd02c3e140bc7e2617f085a4feeb071f622fa8cb16f07925ab07ba69aa4763 10.128.1.155: 4effcf697b199619d06f941b71583404ee18ea477424477df08f1ac349f20441 10.128.1.156: d44c1a53e63c2806686be0424d8060717fde61bb0aa14817269e16ddfda54360 10.128.1.158: 6ded324d6f3a6f50c7e93a377d57c11c366b73beac69a948f0556d5c6124d25d 10.128.1.161: 9b5f85324ce9b325c5f8bf802e7ad21a6b31632aef212f557f16eaa4b3c7e839 10.128.1.169: 0bce1eeafb4e82710e50f03acd0a8b6a6bbc02f5d1d5b151596bd2331ab6a661 10.128.1.17: c404ffa8655d4c4ea2285e20ccad641d185f7234ba321604663557335c161d1d 10.128.1.170: 46e86e5bca489b3acb88b788a18f6710f32dcb1ce6bdc81e9e14632ae6938654 10.128.1.172: d62589d7c8d87b2f3bca8b20131d004a17fbb817cf391f54c366a7ebb16cd96a 10.128.1.178: bf2680c04efe4209d6dd48666fb704077479c4ba9c39d504884715e05f326e14 10.128.1.179: d166650e51542d0db3ce9930946c87791cf0efff7b25c26c3e921d111264bc7a 10.128.1.182: 9bf3abac8d3be141c8c890c84868405af1668af78c91c22fcaee2e90cc154261 10.128.1.184: 0e5966dd2638331b5516165096d07f5f8b2a22a1a302fe0db6674cd2345c9c7d 10.128.1.188: fda7d7e81166fc03821c264f4c34840466cba8ef0fcfb712fbf33f42bc99143e 10.128.1.190: 09efa7423ad540533671874f3c37992c7359c9a838f4a075aa69cbf02bfe344e 10.128.1.191: 3651b70ccbd478463b28ab3e44ad2b5a3382c146bb17593862a7e1606e367735 10.128.1.193: 2963662a56a91056244da239cc4c73545cb31113fb02fe5465613ec41919284e 10.128.1.199: aad32556f65fbe315a346955461ec61e74c45c3a14620506020e7a332dc68901 10.128.1.201: 9358bd9af4f21b0877b3cc161c4d86560de1baa97fa6ed608ed4859033a8f57f 10.128.1.205: c8eab9ef86bdc4139e1f2bb26f05aa77758e5e1a3bb122444177d51d08341145 10.128.1.207: 596dcff6911e31194fcd48d2fc180928d018476ead4e7c545e6d5da6a4cc2c36 10.128.1.208: 7812c187533d354347cb876e06d9ab2fde09a267ea853c77a5baee048f550b23 10.128.1.212: 17049599e640a644b247633770725848925aced4b605bb2ec6881dc1ff7c6e5a 10.128.1.214: fcd021b90e17791acfe6c63c9fab055a944fc99a50b5ac0b221c5be5f8281b3c 10.128.1.218: 3c92da3558c033283ca086e23ced2704731102608cc8cd31807915e10b0ea665 10.128.1.228: eca086a33ddfb432f0509449e0ab9a5283196874eaa6b118587d4fb9a829315f 10.128.1.233: b56760d7a143f854e77fecbc2fdc8a283d30fe94cc817e2ff0a8d517230a4825 10.128.1.239: 1519f4a7081d1702c9394dd85f9348684b7548ca3ca94a5619017e9a98490c4a 10.128.1.24: eb9ccfb80e572d5bc62bf9373bd0d725d9673df5eae3af19025213f3a26f5254 10.128.1.240: c0446bc0dab3ff0fe87b9594fd85fe121f65eac48b4e72595d9ddc7f737aae31 10.128.1.247: fbd8ab45495e796581a7ee4004098e2e2578e396e392354ef76180805c881c61 10.128.1.250: 9bee76c3278f100dff02c0ea54a3ac751a156f8188843115a3cad2661ffcec62 10.128.1.31: cf6557b30d3cbd2d8fa8f5ea6280485b6bdca891f92a533773afbb344b171c2b 10.128.1.44: 3e0dca968a515d32ddc44534e62b9c2f8bf1c9ca5ae0856aeca0757bf574c169 10.128.1.50: ff41d9931e95f77ef068fd9131f9633895b22fb904f9f67abbe3a9fa8ca78470 10.128.1.52: 7e1527734b571b464a5ca657d3c7dd2196b8057a55c0d242659a0dd50d49bd24 10.128.1.57: fdfcb5bbf2ba20489a217b928fc8052c2b48738b1e8f352868452a21847e107d 10.128.1.60: c6980e5e28d7ad43c08307d02e842273318f8fd51dbb7107fd985e067c181107 10.128.1.65: 8d3e781ce4cacf77441891d00591fe4ea6b13db63913be281c23b6ad569f9d77 10.128.1.7: c2af3c2dc556f17e4fccf2c30d4bdf10d378c9030adec77de6bcf1ac72a10c2b 10.128.1.70: ccbca2661f11122e2aec47c4c318985f899df58c1bcb976a8341264fdf50b33c 10.128.1.76: f1e4a73d1b380e22d8c3949288529f32c5dbecc56778aa28a81f495765a63939 10.128.1.77: dd8d314e1851c015283530aa208fdd629da4b9722d6cac343d0fd60954a3f402 10.128.1.78: 4e492de4cafa760809ca2574b6469d56634b84f6291679600b4d4bb98b3bf23e 10.128.1.79: c7b4ed93a532781d96f9d80aa8dfd24a3615519ea03c29759abe548a488f3f52 10.128.1.80: 
cddcb0caee1dfe79a26680b049bab210724f27b497264d7f04f9654809049245 10.128.1.81: 6c668d2703c48c2850cf8a75e878365cfc59520eb22a432d7703f16c087ffa63 10.128.1.82: c91d88914bd6a80c0bb5c26140d28a5e5c2a1621c643996acbd544c593d7bf07 10.128.1.83: 207e6b7915761f7837ee481930147f033cb27896e8838e3faed787efcb1cfa72 10.128.1.84: 904dadbcb4c26b16b3b997e9f7517d43569e5c06b1c58e664424903ba3ec8e43 10.128.1.85: 7acb690e85779940dd13dc7ceb9aa71bf7d9f8ebbf927a2f4909904c9bbd771b 10.128.1.92: e956dd81b8f1062b070e1e7b1fc78b2a45a271b778128a176ca4f39b065a5002 10.128.1.94: e2d8e8fe4cae0e57be12356a86cdc52085263824334f600929754ae1777fbd08 10.128.1.98: d75feeff101f476ddc4e66d3365ac472c7629c5dd766547d650b0adcbe177005 last_reserved_ip.0: 10.128.1.85 Unfortunately, both the docker log and atomic-openshift-node log are rotated, and left only about 1 hour. I will attach them later. Created attachment 1382724 [details]
docker_log
Created attachment 1382725 [details]
atomic_openshift_node_log
(In reply to Meng Bo from comment #14)
> Created attachment 1382725 [details]
> atomic_openshift_node_log

Any chance we can get the older rotated logs too?

Ok, I am trying to reproduce it again with logrotate configured to keep more history. Hopefully I can get more logs later.

Yeah, sure. I think RPMs are easy for me to install.

Created attachment 1420518 [details]
failed app create in online-int
Creating an app fails with the error "failed to run CNI IPAM ADD: failed to allocate for range 0: no IP addresses available in range set".
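This error is what host-local IPAM returns once every address in the node's range has a lease file on disk. A quick exhaustion check, under the assumption of the layout used in this report (one /24 per node from hostSubnetLength 8, so roughly 254 allocatable addresses, with leases under /var/lib/cni/networks/openshift-sdn):

```bash
# Compare reserved leases against live pod sandboxes. On a healthy node the
# two counts track each other; on an exhausted node the lease count sits at
# the range limit (~254 for a /24) while far fewer sandboxes are running.
ls /var/lib/cni/networks/openshift-sdn | grep -cv last_reserved_ip
docker ps --filter "name=k8s_POD" -q | wc -l
```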
For the original bug as reported by Meng Bo, I've pushed some upstream PRs to fix the issue he identified and that I was able to reproduce:

- https://github.com/kubernetes-incubator/cri-o/pull/1531
- https://github.com/kubernetes/kubernetes/pull/62874

For the OpenShift Online case, we'd need to see detailed node logs to determine whether that case matches what I've fixed here. The fix ensures that networking is cleaned up if the actual network plugin returns success but some other error happens after that, before the runtime (e.g. dockershim or CRI-O) returns to the kubelet.

Origin fixes are:

- master: https://github.com/openshift/origin/pull/19576
- 3.9: https://github.com/openshift/origin/pull/19577

The master PR has merged; the 3.9 PR is now https://github.com/openshift/origin/pull/19665

Tried on OCP v3.10.0-0.54.0. After deleting/recreating the docker containers multiple times, only the IPAM files for the existing containers remain. Verified the bug.
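For reference, a verification pass along the lines QE describes could look like the following sketch (hypothetical commands, assuming a test node running disposable pods as in the reproduction above):

```bash
# Churn the containers a few times, allow kubelet GC to catch up, then
# confirm every remaining lease file maps to a container that still exists.
# (Lease files are named for IPs in this cluster's 10.128.0.0/14 range.)
for i in 1 2 3; do docker rm -f $(docker ps -aq); sleep 60; done
sleep 120
for f in /var/lib/cni/networks/openshift-sdn/10.*; do
    docker inspect "$(cat ${f})" >/dev/null 2>&1 || echo "stale: ${f}"
done
```

With the fix in place, the second loop should print nothing once garbage collection has run.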