Description of problem:
There are two nodes in us-west-2 that are showing this error, leading to high (100%) sandbox creation failure rates:

  failed to run CNI IPAM ADD: no IP addresses available in network: openshift-sdn

Nodes:
  starter-us-west-2-node-compute-39cac
  starter-us-west-2-node-compute-a0a21

Version-Release number of selected component (if applicable):
3.7.9-1.git.0.7c71a2d.el7
Can we get node logs?
Created attachment 1360515 [details]
node.log

Attached node logs from ip-172-31-48-182.us-west-2.compute.internal (starter-us-west-2-node-compute-a0a21).
starter-us-west-2-node-compute-48549 just joined in as well
Can you upload the complete journal from one of those nodes (not just for atomic-openshift-node, but for everything) since the time of upgrade somewhere? Eric had gotten logs from one of the earlier nodes, but there were already tons of docker errors happening in the earliest logs there.
dcbw had thought this was caused by bad cleanup on restart:

> <dcbw> danw: so for the moment I was concentrating on the us-west-2 case
> <dcbw> there, the node runs into docker stupidity (context deadline exceeded) when trying to do anything. that means that docker didn't respond for over 2 minutes
> <dcbw> so after a while of that, the node process gets restarted by something
> <dcbw> so then, while we're starting up, the node tries to stop/clean up all the pods that used to be running in docker
> <dcbw> of course, the network plugin isn't ready yet, so that all fails
> <dcbw> danw: then eventually the SDN is ready, but by that point we've somehow lost the fact that the sandboxes are still there and should be cleaned up now that we can clean them up

but the logs show the CNI errors happening long before the restart; the first "CNI IPAM ADD" error was at 04:37:22, and OpenShift restarted at 08:03:41.

Uh... and then OpenShift restarted again four minutes later. Nice... Both restarts appear to have been triggered by ansible: it looks like ansible first restarted dnsmasq, which caused systemd to restart atomic-openshift-node as a side effect, and then after that it separately restarted atomic-openshift-node.
Looking at the logs from 6e915, docker is really screwed up. There are numerous pods that OpenShift has been trying to delete for *days*, and every time it tries, it gets:

  atomic-openshift-node: E1130 03:24:35.375048 2759 remote_runtime.go:109] StopPodSandbox "7e7e06281ecb7d632a5ebd8abf45fb4e4006f70ea06af24b0b0f173353a2f65a" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded

(This goes as far back as the logs go. In particular, before the 3.7 update.)

And in fact, at some point it apparently actually succeeded in killing the pod without realizing it, because you occasionally see:

  dockerd-current: time="2017-11-30T03:30:57.870986309Z" level=error msg="Handler for POST /v1.24/containers/7e7e3a38ef6b4505a1a56c639a947b7e0a5f56e41c5b67165c0190a1419ce428/stop?t=10 returned error: Container 7e7e3a38ef6b4505a1a56c639a947b7e0a5f56e41c5b67165c0190a1419ce428 is already stopped"

But over the course of 3 days and 3250 attempts, every single StopPodSandbox attempt has timed out before docker returned an answer, so OpenShift still doesn't know.

I'm not sure how this relates to the IPAM exhaustion, though... each time it tries and fails to call StopPodSandbox, it ought to also be calling TearDownPod, so even if that failed once, it ought to succeed the next time. That makes it seem like the IPAM exhaustion is caused by something else. I can't tell whether the node was already out of IPAM addresses before the upgrade to 3.7, because the master hadn't tried to schedule a pod onto that node since the start of the logs 3 days ago, for whatever reason.
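One quick way to sanity-check the IPAM-exhaustion theory (a sketch, not something from the logs above) is to compare the host-local IPAM allocation files against the sandboxes docker still knows about. This assumes the usual host-local layout, where each file under /var/lib/cni/networks/openshift-sdn/ is named for an allocated IP and contains the owning container/sandbox ID:

  # Count how many IPs the host-local IPAM plugin thinks are allocated
  ls /var/lib/cni/networks/openshift-sdn/ | wc -l

  # Flag allocations whose container docker no longer knows about
  for f in /var/lib/cni/networks/openshift-sdn/*; do
      case "$(basename "$f")" in
          last_reserved_ip*) continue ;;   # bookkeeping file, not an allocation
      esac
      id=$(cat "$f")
      docker inspect "$id" >/dev/null 2>&1 || echo "stale: $(basename "$f") -> $id"
  done

If most of the allocations point at container IDs that docker no longer has, the exhaustion is just leaked IPAM state rather than real pods.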
Oh, fwiw you should be able to fix it by draining the node (if there are even any pods on it), stopping the node service, "rm -rf /var/lib/cni/networks", and then restarting the node service.
Actually, run "ls -l /var/lib/cni/networks/openshift-sdn/" first.
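For reference, a minimal sketch of that recovery sequence (the node name is a placeholder; the atomic-openshift-node service name is the one from the logs above):

  # Drain the node (if anything is even scheduled on it)
  oc adm drain <node-name> --ignore-daemonsets

  # See what was in there first
  ls -l /var/lib/cni/networks/openshift-sdn/

  # Stop the node service, clear the host-local IPAM state, and restart
  systemctl stop atomic-openshift-node
  rm -rf /var/lib/cni/networks
  systemctl start atomic-openshift-node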
*** Bug 1521159 has been marked as a duplicate of this bug. ***
Filed https://github.com/openshift/openshift-ansible/pull/6426 to clean out stale IPAM files during upgrade, but we're going to need to add some cleanup/gc to the code as well.
openshift-ansible PR is now merged
Tested with openshift-ansible 3.9.0-0.41.0, upgrading from 3.7.9 to 3.9.

The task was called during the upgrade, but the IPAM files are not cleaned:

  TASK [openshift_node : Clean up IPAM data] *************************************
  task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/config_changes.yml:25
  ok: [ec2-xxxxxx.compute-1.amazonaws.com] => {"changed": false, "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent"}

  [root@ip-172-18-15-220 ~]# ls /var/lib/cni/networks/openshift-sdn/ | wc -l
  191

Can you help check if anything is wrong?
So does /var/lib/dockershim/sandbox/ get deleted? I just copied that rule and changed the path so I don't see how one would work and the other not... Is there a debug mode or something you can use to make ansible log more about what it's doing?
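(For what it's worth, ansible's built-in verbosity flags should be enough here; the playbook path below is only a placeholder for whichever upgrade playbook is actually being run:)

  # -vvv makes ansible print the full module arguments and results for
  # each task, including the file-module calls behind "Clean up IPAM data"
  ansible-playbook -vvv -i /path/to/inventory \
      /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_9/upgrade.yml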
Seems the fake containers created under /var/lib/dockershim/sandbox/ will be deleted automatically even without upgrading. They will be deleted in a few seconds right after creation. Here is some more log about the tasks Clean up dockershim data and Clean up IPAM data <ec2-xxxxx.compute-1.amazonaws.com> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="/opt/app-root/src/libra.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/opt/app-root/src/.ansible/cp/%h-%r -tt ec2-xxxxx.compute-1.amazonaws.com '/bin/sh -c '"'"'/usr/bin/python /root/.ansible/tmp/ansible-tmp-1518145396.19-261028870478972/file.py; rm -rf "/root/.ansible/tmp/ansible-tmp-1518145396.19-261028870478972/" > /dev/null 2>&1 && sleep 0'"'"'' <ec2-xxxxx.compute-1.amazonaws.com> (0, '\r\n{"invocation": {"module_args": {"directory_mode": null, "force": false, "remote_src": null, "path": "/var/lib/dockershim/sandbox/", "owner": null, "follow": false, "group": null, "unsafe_writes": null, "state": "absent", "content": null, "serole": null, "diff_peek": null, "setype": null, "selevel": null, "original_basename": null, "regexp": null, "validate": null, "src": null, "seuser": null, "recurse": false, "delimiter": null, "mode": null, "attributes": null, "backup": null}}, "path": "/var/lib/dockershim/sandbox/", "state": "absent", "changed": true, "diff": {"after": {"path": "/var/lib/dockershim/sandbox/", "state": "absent"}, "before": {"path": "/var/lib/dockershim/sandbox/", "state": "directory"}}}\r\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 58: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 73\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to ec2-xxxxx.compute-1.amazonaws.com closed.\r\n') changed: [ec2-xxxxx.compute-1.amazonaws.com] => { "changed": true, "diff": { "after": { "path": "/var/lib/dockershim/sandbox/", "state": "absent" }, "before": { "path": "/var/lib/dockershim/sandbox/", "state": "directory" } }, "invocation": { "module_args": { "attributes": null, "backup": null, "content": null, "delimiter": null, "diff_peek": null, "directory_mode": null, "follow": false, "force": false, "group": null, "mode": null, "original_basename": null, "owner": null, "path": "/var/lib/dockershim/sandbox/", "recurse": false, "regexp": null, "remote_src": null, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": null, "state": "absent", "unsafe_writes": null, "validate": null } }, "path": "/var/lib/dockershim/sandbox/", "state": "absent" } <ec2-xxxxx.compute-1.amazonaws.com> SSH: EXEC ssh -vvv -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o 'IdentityFile="/opt/app-root/src/libra.pem"' -o KbdInteractiveAuthentication=no -o 
PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=root -o ConnectTimeout=10 -o ControlPath=/opt/app-root/src/.ansible/cp/%h-%r -tt ec2-xxxxx.compute-1.amazonaws.com '/bin/sh -c '"'"'/usr/bin/python /root/.ansible/tmp/ansible-tmp-1518145397.03-53627287305306/file.py; rm -rf "/root/.ansible/tmp/ansible-tmp-1518145397.03-53627287305306/" > /dev/null 2>&1 && sleep 0'"'"'' <ec2-xxxxx.compute-1.amazonaws.com> (0, '\r\n{"invocation": {"module_args": {"directory_mode": null, "force": false, "remote_src": null, "path": "/var/lib/cni/networks/openshift-sdn/", "owner": null, "follow": false, "group": null, "unsafe_writes": null, "state": "absent", "content": null, "serole": null, "diff_peek": null, "setype": null, "selevel": null, "original_basename": null, "regexp": null, "validate": null, "src": null, "seuser": null, "recurse": false, "delimiter": null, "mode": null, "attributes": null, "backup": null}}, "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent", "changed": true, "diff": {"after": {"path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent"}, "before": {"path": "/var/lib/cni/networks/openshift-sdn/", "state": "directory"}}}\r\n', 'OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 58: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 73\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to ec2-xxxxx.compute-1.amazonaws.com closed.\r\n') changed: [ec2-xxxxx.compute-1.amazonaws.com] => { "changed": true, "diff": { "after": { "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent" }, "before": { "path": "/var/lib/cni/networks/openshift-sdn/", "state": "directory" } }, "invocation": { "module_args": { "attributes": null, "backup": null, "content": null, "delimiter": null, "diff_peek": null, "directory_mode": null, "follow": false, "force": false, "group": null, "mode": null, "original_basename": null, "owner": null, "path": "/var/lib/cni/networks/openshift-sdn/", "recurse": false, "regexp": null, "remote_src": null, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": null, "state": "absent", "unsafe_writes": null, "validate": null } }, "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent" }
"changed": true, "diff": { "after": { "path": "/var/lib/cni/networks/openshift-sdn/", "state": "absent" }, "before": { "path": "/var/lib/cni/networks/openshift-sdn/", "state": "directory" } }, That... really looks like it's saying it deleted the directory. Are you sure you ran the "ls /var/lib/cni/networks/openshift-sdn/ | wc -l" on the same host that ansible ran the cleanup task on? (I can't really tell from what you posted since the ansible output seems to be using a public hostname and the shell command is using a private one, or something.) Can you check the timestamp of the directory and the files in it before running the task, and then again after the task?
That's weird... It works when I try it today. The CNI files and the openshift-sdn dir are all deleted and recreated during the upgrade. Not sure what happened before. Will mark the bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489