Bug 1395183
| Summary: | Unable to create pods | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jiří Mencák <jmencak> |
| Component: | Networking | Assignee: | Dan Williams <dcbw> |
| Status: | CLOSED ERRATA | QA Contact: | Meng Bo <bmeng> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.4.0 | CC: | aloughla, aos-bugs, bbennett, dcbw, ekuric, jeder, jmencak, mifiedle, tdawson |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-34 | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-01-18 12:55:05 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description

Jiří Mencák, 2016-11-15 11:09:13 UTC

---

Can we get node logs from the node that ran out of addresses?

---

I've generated a sosreport on a node where a pod failed to start. Please see:

- http://ekuric.usersys.redhat.com/jmencak/BZ1395183/sosreport-1395183-20161116040906.txt
- http://ekuric.usersys.redhat.com/jmencak/BZ1395183/sosreport-JMencak.1395183-20161116040906.tar.xz

---

(In reply to jmencak from comment #0)

> Steps to Reproduce:
> 1. Created 1600 hello-openshift pod-service-routes
>    (https://github.com/jmencak/projects/tree/master/haproxy/apps/hello-openshift)
>    across 16 different projects.
> 2. Tested the router (HAProxy) with varying amounts of load (not sure if
>    necessary to reproduce) until the point of HAProxy's failure-reloads.
> 3. Deleted all 16 projects (verified that all projects were deleted).

How long did you wait between steps 3 and 4?

> 4. EC2 VM shutdown.
> 5. EC2 VM start

Also, on a node that has problems, can you report the contents of the /var/lib/cni/networks/openshift-sdn/ directory? No need to tar it up or anything, just an `ls` would be good.

Next, before you shut the VM down, can you do a quick `journalctl -b -u openshift-node > /tmp/node.log`, grab the node log, and get it to me somehow?

---

Not sure about the time between steps 3 and 4, but probably not very long because of EC2 costs.

http://ekuric.usersys.redhat.com/jmencak/BZ1395183/openshift-sdn-2016116.txt

```
root@ip-172-31-42-75: ~ # journalctl -b -u openshift-node
-- No entries --
root@ip-172-31-42-75: ~ #
```

---

Broke yet another EC2 cluster.

```
$ oc version
oc v3.4.0.26+f7e109e
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-2-138.us-west-2.compute.internal:8443
openshift v3.4.0.26+f7e109e
kubernetes v1.4.0+776c994
```

This time I noticed that I ran out of disk space before hitting this issue. No reboots involved.
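For context on what the requested `ls` inspects: the host-local IPAM plugin keeps one file per reserved IP under /var/lib/cni/networks/<network>/, each file named after the address and containing the ID of the container holding the reservation. A minimal sketch, using a mock directory and a made-up container ID (the real path requires a node):

```shell
# Mock of /var/lib/cni/networks/openshift-sdn/; host-local stores one file
# per reserved IP, named after the address, containing the container ID.
ipam_dir="$(mktemp -d)/openshift-sdn"
mkdir -p "$ipam_dir"

# Simulate three reservations left behind by deleted pods (fake container ID).
for ip in 10.128.0.2 10.128.0.3 10.128.0.4; do
    printf '%s' 'ef0652fdc8ed9e1239ece47f4e38bd1f' > "$ipam_dir/$ip"
done

# On a real node this would be: ls /var/lib/cni/networks/openshift-sdn | wc -l
reserved=$(ls "$ipam_dir" | wc -l)
echo "reserved IPs: $reserved"
```

A count close to the node's subnet size in this directory, with few pods actually running, would indicate leaked reservations.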
```
$ oc get ev
23m 23m 1 hello-openshift-4-bxnea Pod Warning FailedSync {kubelet ip-172-31-2-147.us-west-2.compute.internal} Error syncing pod, skipping: failed to "SetupNetwork" for "hello-openshift-4-bxnea_hello-openshift-128" with SetupNetworkError: "Failed to setup network for pod \"hello-openshift-4-bxnea_hello-openshift-128(4ceb366f-acab-11e6-8c3b-0253c164b30f)\" using network plugins \"cni\": CNI request failed with status 400: 'failed to run IPAM for 8aa9145f6aeeee60ea0e7274e6d3d37e41bcee6d5c6b932a0a83d957baefac56: failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn\n'; Skipping pod"
```

Not ruling out that the same happened in the previous case, i.e. running out of disk space.

---

(In reply to jmencak from comment #5)
> root@ip-172-31-42-75: ~ # journalctl -b -u openshift-node

Can you do:

```
journalctl -b -u atomic-openshift-node
```

---

Interim fix for openshift-sdn here: https://github.com/openshift/origin/pull/11964

Real fix for upstream kubernetes: https://github.com/kubernetes/kubernetes/pull/37036

---

```
root@ip-172-31-42-75: ~ # journalctl -b -u atomic-openshift-node > /tmp/atomic-openshift-node-20161118.log
```

Please find the requested log here: http://ekuric.usersys.redhat.com/jmencak/BZ1395183/atomic-openshift-node-20161118.log

---

Cherry-pick to 1.4: https://github.com/openshift/origin/pull/11983

---

FWIW, the same issue on another cluster:

```
$ oc version
oc v3.4.0.28+dfe3a66
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-13-129.us-west-2.compute.internal:8443
openshift v3.4.0.28+dfe3a66
kubernetes v1.4.0+776c994
```

without running out of disk space or reboots.
```
25m 25m 1 nginx-1-ocsdv Pod Warning FailedSync {kubelet ip-172-31-13-134.us-west-2.compute.internal} Error syncing pod, skipping: failed to "SetupNetwork" for "nginx-1-ocsdv_nginx3" with SetupNetworkError: "Failed to setup network for pod \"nginx-1-ocsdv_nginx3(b79b4358-b168-11e6-bd8a-02153dcdaf69)\" using network plugins \"cni\": CNI request failed with status 400: 'failed to run IPAM for a7f88d071467875f2b70a0ee943c42162142c0c6f9b07e236ae6092d1350f20c: failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn\n'; Skipping pod"
```

---

It finally made it through the merge queue and landed in the 1.4 branch on 2016-11-22 7:09 PM EST. I don't know when that will get to OSE 3.4 or what build that will be in.

---

This has been merged into OCP and is in OCP v3.4.0.29 or newer.

---

Created attachment 1229405 [details]: node_log
Tested this with the following steps on OCP 3.4.0.32:
1. Set up an env with 1 master and 1 node (set the host subnet length to 8)
2. Create one pod on the env
3. Delete the pod above
4. Generate the IP files under /var/lib/cni/networks/openshift-sdn manually:
   ```
   $ for i in {1..254} ; do echo ef0652fdc8ed9e1239ece47f4e38bd1ffb55303c7abd663d2376f29b59fc7f66 > 10.128.0.$i ; done
   ```
5. Try to create another pod
6. Check the node log
Result:

5. The pod cannot start due to:

```
4m 4m 1 {kubelet ose-node1.bmeng.local} Warning FailedSync Error syncing pod, skipping: failed to "SetupNetwork" for "caddy-docker_bmengp1" with SetupNetworkError: "Failed to setup network for pod \"caddy-docker_bmengp1(32292159-bd21-11e6-940f-525400dd3698)\" using network plugins \"cni\": CNI request failed with status 400: 'failed to run IPAM for 6a938e84d63fb13ae295c8431b070e1949bdeb7641f9631b38857ff7cfe1c144: failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn\n'; Skipping pod"
```

6. Node log attached.
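As a cross-check on the setup above, assuming (my reading of the SDN config, not stated in this bug) that a host subnet length of 8 gives each node 2^8 addresses with the network and broadcast addresses unusable, the 254 reservation files written in step 4 cover the node's entire pod IP range:

```shell
# Host subnet length 8 => each node gets a /24 => 2^8 - 2 assignable pod IPs.
# (Subtracting 2 for the network/broadcast addresses is an assumption here.)
host_subnet_length=8
usable=$(( (1 << host_subnet_length) - 2 ))
echo "assignable pod IPs per node: $usable"
```

That matches the `10.128.0.{1..254}` loop in step 4 writing 254 files, which is why the very next pod has no address left to claim.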
Sorry, I forgot to change the status in the above comment.

---

This error would be expected once; the next time the kubelet tries to recreate the pod, it should succeed. The IPAM garbage collection runs when IPAM fails, so after that warning, if you wait a few seconds for the kubelet backoff retry, it should succeed on the second attempt. After the first failure, though, I'd expect all the stale IPs in /var/lib/cni/networks/openshift-sdn/ to have been cleaned up.

---

The failure turned out to be that when echoing random container IDs to the files in /var/lib/cni/networks/openshift-sdn, `echo` puts a newline at the end. The host-local backend does exact matching of the file contents against the container ID, so no matches were found in the IP reservation files for the container IDs being garbage collected, and thus no files were removed.

Arguably container IDs shouldn't contain newlines or whitespace (at the very least, no newlines), and while this is a pretty rare edge case, I've submitted this PR for CNI: https://github.com/containernetworking/cni/pull/341

---

Thanks, that works. Using `printf` instead of `echo` and repeating the steps in comment #20, the manually generated IP files are deleted when the new pod is created. Verifying the bug.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
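The newline mismatch identified in the root-cause comment can be reproduced in isolation; the container ID is taken from the events above and the temp directory is a stand-in for the real reservation directory:

```shell
# `echo` appends a trailing '\n'; `printf '%s'` writes the bytes verbatim.
# host-local compares reservation-file contents byte-for-byte against the
# container ID, so an echo-written file never matches and GC skips it.
cid=6a938e84d63fb13ae295c8431b070e1949bdeb7641f9631b38857ff7cfe1c144
dir=$(mktemp -d)

echo "$cid"        > "$dir/with-echo"     # ID plus trailing newline
printf '%s' "$cid" > "$dir/with-printf"   # ID bytes only

echo_len=$(wc -c < "$dir/with-echo")
printf_len=$(wc -c < "$dir/with-printf")
echo "echo: $echo_len bytes, printf: $printf_len bytes, ID: ${#cid} chars"
```

The one-byte difference is exactly why the reservation files written with `echo` in the verification steps survived garbage collection while the `printf`-written ones were cleaned up.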