Description of problem:
There are two nodes in us-west-2 that are showing this error, leading to high (100%) sandbox creation failure rates:

  failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn

Nodes:
  starter-us-west-2-node-compute-39cac
  starter-us-west-2-node-compute-a0a21

Version-Release number of selected component (if applicable):
3.7.9-1.git.0.7c71a2d.el7
Can we get node logs?
Created attachment 1360515 [details]
node.log

Attached node logs from ip-172-31-48-182.us-west-2.compute.internal (starter-us-west-2-node-compute-a0a21).
starter-us-west-2-node-compute-48549 just joined in as well
Can you upload the complete journal from one of those nodes (not just for atomic-openshift-node, but for everything) since the time of upgrade somewhere? Eric had gotten logs from one of the earlier nodes, but there were already tons of docker errors happening in the earliest logs there.
dcbw had thought this was caused by bad cleanup on restart:

> <dcbw> danw: so for the moment I was concentrating on the us-west-2 case
> <dcbw> there, the node runs into docker stupidity (context deadline exceeded) when trying to do anything. that means that docker didn't respond for over 2 minutes
> <dcbw> so after a while of that, the node process gets restarted by something
> <dcbw> so then, while we're starting up, the node tries to stop/clean up all the pods that used to be running in docker
> <dcbw> of course, the network plugin isn't ready yet, so that all fails
> <dcbw> danw: then eventually the SDN is ready, but by that point we've somehow lost the fact that the sandboxes are still there and should be cleaned up now that we can clean them up

but the logs show the CNI errors happening long before the restart; the first "CNI IPAM ADD" error was at 04:37:22, and OpenShift restarted at 08:03:41.

Uh... and then OpenShift restarted again four minutes later. Nice... Both restarts appear to have been triggered by ansible: it looks like ansible first restarted dnsmasq, which caused systemd to restart atomic-openshift-node as a side effect, and then after that it separately restarted atomic-openshift-node.
Looking at the logs from 6e915, docker is really screwed up. There are numerous pods that OpenShift has been trying to delete for *days*, and every time it tries, it gets:

  atomic-openshift-node: E1130 03:24:35.375048 2759 remote_runtime.go:109] StopPodSandbox "7e7e06281ecb7d632a5ebd8abf45fb4e4006f70ea06af24b0b0f173353a2f65a" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded

(This goes as far back as the logs go. In particular, before the 3.7 update.)

And in fact, at some point it apparently actually succeeded in killing the pod without realizing it, because you occasionally see:

  dockerd-current: time="2017-11-30T03:30:57.870986309Z" level=error msg="Handler for POST /v1.24/containers/7e7e3a38ef6b4505a1a56c639a947b7e0a5f56e41c5b67165c0190a1419ce428/stop?t=10 returned error: Container 7e7e3a38ef6b4505a1a56c639a947b7e0a5f56e41c5b67165c0190a1419ce428 is already stopped"

But over the course of 3 days and 3250 attempts, every single StopPodSandbox attempt has timed out before docker returned an answer, so OpenShift still doesn't know.

I'm not sure how this relates to the IPAM exhaustion, though... each time it tries and fails to call StopPodSandbox, it ought to also be calling TearDownPod, so even if that failed once, it ought to succeed the next time. That makes it seem like the IPAM exhaustion is caused by something else. I can't tell whether the node was already out of IPAM addresses before the upgrade to 3.7, because the master hadn't tried to schedule a pod onto that node since the start of the logs 3 days ago, for whatever reason.
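One quick way to sanity-check the IPAM-exhaustion theory (a sketch, not something from the logs above) is to compare the host-local IPAM allocation files against the sandboxes docker still knows about. This assumes the usual host-local layout, where each file under /var/lib/cni/networks/openshift-sdn/ is named for an allocated IP and contains the owning container/sandbox ID:

  # Count how many IPs the host-local IPAM plugin thinks are allocated
  ls /var/lib/cni/networks/openshift-sdn/ | wc -l

  # Flag allocations whose container docker no longer knows about
  for f in /var/lib/cni/networks/openshift-sdn/*; do
      case "$(basename "$f")" in
          last_reserved_ip*) continue ;;   # bookkeeping file, not an allocation
      esac
      id=$(cat "$f")
      docker inspect "$id" >/dev/null 2>&1 || echo "stale: $(basename "$f") -> $id"
  done

If most of the allocations point at container IDs that docker no longer has, the exhaustion is just leaked IPAM state rather than real pods.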
Oh, fwiw you should be able to fix it by draining the node (if there are even any pods on it), stopping the node service, "rm -rf /var/lib/cni/networks", and then restarting the node service.
Actually, run "ls -l /var/lib/cni/networks/openshift-sdn/" first.
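For reference, a minimal sketch of that recovery sequence (the node name is a placeholder; the atomic-openshift-node service name is the one from the logs above):

  # Drain the node (if anything is even scheduled on it)
  oc adm drain <node-name> --ignore-daemonsets

  # See what was in there first
  ls -l /var/lib/cni/networks/openshift-sdn/

  # Stop the node service, clear the host-local IPAM state, and restart
  systemctl stop atomic-openshift-node
  rm -rf /var/lib/cni/networks
  systemctl start atomic-openshift-node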
*** Bug 1521159 has been marked as a duplicate of this bug. ***
Filed https://github.com/openshift/openshift-ansible/pull/6426 to clean out stale IPAM files during upgrade, but we're going to need to add some cleanup/gc to the code as well.
openshift-ansible PR is now merged
Tested with openshift-ansible 3.9.0-0.41.0, upgrading from 3.7.9 to 3.9.

The task was called during the upgrade, but the IPAM files are not cleaned:

  TASK [openshift_node : Clean up IPAM data] *************************************
  task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/config_changes.yml:25
  ok: [ec2-xxxxxx.compute-1.amazonaws.com] => {"changed": false, "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent"}

  [root@ip-172-18-15-220 ~]# ls /var/lib/cni/networks/openshift-sdn/ | wc -l
  191

Can you help check if anything is wrong?
So does /var/lib/dockershim/sandbox/ get deleted? I just copied that rule and changed the path so I don't see how one would work and the other not... Is there a debug mode or something you can use to make ansible log more about what it's doing?
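(For what it's worth, ansible's built-in verbosity flags should be enough here; the playbook path below is only a placeholder for whichever upgrade playbook is actually being run:)

  # -vvv makes ansible print the full module arguments and results for
  # each task, including the file-module calls behind "Clean up IPAM data"
  ansible-playbook -vvv -i /path/to/inventory \
      /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml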
Seems the fake containers created under /var/lib/dockershim/sandbox/ will be deleted automatically even without upgrading. They will be deleted in a few seconds right after creation. Here is some more log about the tasks Clean up dockershim data and Clean up IPAM data <ec2-xxxxx.compute-1.amazonaws.com> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="/opt/app-root/src/libra.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/opt/app-root/src/.ansible/cp/%h-%r -tt ec2-xxxxx.compute-1.amazonaws.com '/bin/sh -c '"'"'/usr/bin/python /root/.ansible/tmp/ansible-tmp-1518145396.19-261028870478972/file.py; rm -rf "/root/.ansible/tmp/ansible-tmp-1518145396.19-261028870478972/" > /dev/null 2>&1 && sleep 0'"'"'' <ec2-xxxxx.compute-1.amazonaws.com> (0, '\r\n{"invocation": {"module_args": {"directory_mode": null, "force": false, "remote_src": null, "path": "/var/lib/dockershim/sandbox/", "owner": null, "follow": false, "group": null, "unsafe_writes": null, "state": "absent", "content": null, "serole": null, "diff_peek": null, "setype": null, "selevel": null, "original_basename": null, "regexp": null, "validate": null, "src": null, "seuser": null, "recurse": false, "delimiter": null, "mode": null, "attributes": null, "backup": null}}, "path": "/var/lib/dockershim/sandbox/", "state": "absent", "changed": true, "diff": {"after": {"path": "/var/lib/dockershim/sandbox/", "state": "absent"}, "before": {"path": "/var/lib/dockershim/sandbox/", "state": "directory"}}}\r\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 58: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 73\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to ec2-xxxxx.compute-1.amazonaws.com closed.\r\n') changed: [ec2-xxxxx.compute-1.amazonaws.com] => { "changed": true, "diff": { "after": { "path": "/var/lib/dockershim/sandbox/", "state": "absent" }, "before": { "path": "/var/lib/dockershim/sandbox/", "state": "directory" } }, "invocation": { "module_args": { "attributes": null, "backup": null, "content": null, "delimiter": null, "diff_peek": null, "directory_mode": null, "follow": false, "force": false, "group": null, "mode": null, "original_basename": null, "owner": null, "path": "/var/lib/dockershim/sandbox/", "recurse": false, "regexp": null, "remote_src": null, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": null, "state": "absent", "unsafe_writes": null, "validate": null } }, "path": "/var/lib/dockershim/sandbox/", "state": "absent" } <ec2-xxxxx.compute-1.amazonaws.com> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="/opt/app-root/src/libra.pem"' -o KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/opt/app-root/src/.ansible/cp/%h-%r -tt ec2-xxxxx.compute-1.amazonaws.com '/bin/sh -c '"'"'/usr/bin/python /root/.ansible/tmp/ansible-tmp-1518145397.03-53627287305306/file.py; rm -rf "/root/.ansible/tmp/ansible-tmp-1518145397.03-53627287305306/" > /dev/null 2>&1 && sleep 0'"'"'' <ec2-xxxxx.compute-1.amazonaws.com> (0, '\r\n{"invocation": {"module_args": {"directory_mode": null, "force": false, "remote_src": null, "path": "/var/lib/cni/networks/openshift-sdn/", "owner": null, "follow": false, "group": null, "unsafe_writes": null, "state": "absent", "content": null, "serole": null, "diff_peek": null, "setype": null, "selevel": null, "original_basename": null, "regexp": null, "validate": null, "src": null, "seuser": null, "recurse": false, "delimiter": null, "mode": null, "attributes": null, "backup": null}}, "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent", "changed": true, "diff": {"after": {"path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent"}, "before": {"path": "/var/lib/cni/networks/openshift-sdn/", "state": "directory"}}}\r\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 58: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 73\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to ec2-xxxxx.compute-1.amazonaws.com closed.\r\n') changed: [ec2-xxxxx.compute-1.amazonaws.com] => { "changed": true, "diff": { "after": { "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent" }, "before": { "path": "/var/lib/cni/networks/openshift-sdn/", "state": "directory" } }, "invocation": { "module_args": { "attributes": null, "backup": null, "content": null, "delimiter": null, "diff_peek": null, "directory_mode": null, "follow": false, "force": false, "group": null, "mode": null, "original_basename": null, "owner": null, "path": "/var/lib/cni/networks/openshift-sdn/", "recurse": false, "regexp": null, "remote_src": null, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": null, "state": "absent", "unsafe_writes": null, "validate": null } }, "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent" }
"changed": true, "diff": { "after": { "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent" }, "before": { "path": "/var/lib/cni/networks/openshift-sdn/", "state": "directory" } }, That... really looks like it's saying it deleted the directory. Are you sure you ran the "ls /var/lib/cni/networks/openshift-sdn/ | wc -l" on the same host that ansible ran the cleanup task on? (I can't really tell from what you posted since the ansible output seems to be using a public hostname and the shell command is using a private one, or something.) Can you check the timestamp of the directory and the files in it before running the task, and then again after the task?
That's weird... It works when I try it today. The CNI files and the openshift-sdn dir are all deleted and recreated during the upgrade. Not sure what happened before. Will mark the bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489