Bug 1566439

Summary: docker ps is hanging
Product: Red Hat Enterprise Linux 7
Reporter: Michele Baldessari <michele>
Component: docker
Assignee: Tom Sweeney <tsweeney>
Status: CLOSED ERRATA
QA Contact: atomic-bugs <atomic-bugs>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 7.5
CC: agurenko, amurdaca, chjones, dciabrin, ddarrah, dwalsh, fkluknav, hdalwadi, jeckersb, lsm5, madgupta, mandreou, mbracho, mburns, michele, tsweeney, yprokule
Target Milestone: rc
Keywords: Extras
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-14 16:52:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Michele Baldessari 2018-04-12 10:31:30 UTC
Description of problem:
QE hit this on OSP13 in an OSP12->13 upgrade scenario. docker ps consistently hangs, while docker exec, docker logs, and docker info all keep working; only docker ps hangs.

And here is the weird part:
1) docker ps hangs
2) We restart the docker service
systemctl restart docker
3) docker ps still hangs
4) We kill the docker cleanup processes (which were also hanging) manually
ps -aux | grep DEAD
kill 848551
kill 848552
5) systemctl restart docker
6) Now docker ps works.

We are not sure how much the cleanup processes matter here, but we did observe
that after killing them a docker service restart made things functional again.
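In script form, the manual recovery above boils down to roughly the following (a sketch of what we did; the PIDs in step 4 were simply whatever the grep for DEAD turned up):

systemctl restart docker    # not enough on its own, docker ps still hangs
# kill the hung docker cleanup processes (the ones matching DEAD in the ps output)
for pid in $(ps aux | grep '[D]EAD' | awk '{print $2}'); do kill "$pid"; done
systemctl restart docker    # after this, docker ps works again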

Version-Release number of selected component (if applicable):
Client:
 Version:         1.13.1
 API version:     1.26
 Package version: docker-1.13.1-58.git87f2fab.el7.x86_64
 Go version:      go1.9.2
 Git commit:      87f2fab/1.13.1
 Built:           Mon Mar 19 18:55:01 2018
 OS/Arch:         linux/amd64

Server:
 Version:         1.13.1
 API version:     1.26 (minimum version 1.12)
 Package version: docker-1.13.1-58.git87f2fab.el7.x86_64
 Go version:      go1.9.2
 Git commit:      87f2fab/1.13.1
 Built:           Mon Mar 19 18:55:01 2018
 OS/Arch:         linux/amd64
 Experimental:    false

I will be attaching a sosreport and a go stack trace
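(As an aside, and not necessarily how the attached trace was produced: a common way to capture a goroutine dump from the daemon is to send it SIGUSR1, which makes dockerd write its stack traces to its log, or to a goroutine-stacks file under /var/run/docker depending on the version:)

kill -USR1 "$(pgrep dockerd)"
journalctl -u docker --no-pager | tail -n 200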

Comment 3 Daniel Walsh 2018-04-12 11:05:51 UTC
Does it go into the hang again, or does it only happen when updating from 1.12 to 1.13?

Comment 4 Michele Baldessari 2018-04-12 11:28:40 UTC
(In reply to Daniel Walsh from comment #3)
> Does it go into the hang again, or does it only happen when updating from 1.12 to 1.13?

Oh no, this time we do not even upgrade the package (we start with 1.13.1-58 on OSP12 and stay at the same version on OSP13).

Comment 5 Michele Baldessari 2018-04-12 15:34:16 UTC
So, some additional investigation; maybe we can narrow the surface of the issue a bit.
 
So what triggers it is a tool of ours that calls 'docker rm -f <...>' on a bunch of containers.
 
Below is the full list of containers on a controller:
CONTAINER ID        IMAGE                                                                      COMMAND                  CREATED             STATUS                   PORTS               NAMES
3ac529a689aa        192.168.24.1:8787/rhceph/rhceph-3-rhel7:latest                             "/entrypoint.sh"         5 minutes ago       Up 5 minutes                                 ceph-mon-controller-2
3e68f8f2153f        192.168.24.1:8787/rhosp13/openstack-keystone:2018-04-03.3                  "/bin/bash -c '/us..."   7 minutes ago       Up 7 minutes                                 keystone_cron
dfd630aedbac        192.168.24.1:8787/rhosp13/openstack-keystone:2018-04-03.3                  "kolla_start"            7 minutes ago       Up 7 minutes (healthy)                       keystone
46fe7ff66584        192.168.24.1:8787/rhosp13/openstack-nova-placement-api:2018-04-03.3        "kolla_start"            7 minutes ago       Up 7 minutes                                 nova_placement
b24bbd944c66        192.168.24.1:8787/rhosp13/openstack-horizon:2018-04-03.3                   "kolla_start"            7 minutes ago       Up 7 minutes                                 horizon
f36eb6868f43        192.168.24.1:8787/rhosp13/openstack-iscsid:2018-04-03.3                    "kolla_start"            4 hours ago         Up 4 hours                                   iscsid
7919ecf7b66a        192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest                     "/bin/bash /usr/lo..."   4 hours ago         Up 4 hours                                   haproxy-bundle-docker-1
ae9e6e924eba        192.168.24.1:8787/rhosp13/openstack-memcached:2018-04-03.3                 "/bin/bash -c 'sou..."   22 hours ago        Up 22 hours                                  memcached
3c0e5ed5590c        192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest                       "/bin/bash /usr/lo..."   22 hours ago        Up 22 hours                                  redis-bundle-docker-2
c160f86ab2e7        192.168.24.1:8787/rhosp12/openstack-nova-api:2018-04-05.1                  "kolla_start"            3 days ago          Up 3 days (healthy)                          nova_metadata
eb39e0ed3b33        192.168.24.1:8787/rhosp12/openstack-nova-api:2018-04-05.1                  "kolla_start"            3 days ago          Up 3 days (healthy)                          nova_api
4f9e30acb9fe        192.168.24.1:8787/rhosp12/openstack-swift-account:2018-04-05.1             "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_account_server
ef4a66e5e952        192.168.24.1:8787/rhosp12/openstack-swift-container:2018-04-05.1           "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_container_auditor
7197007cb871        192.168.24.1:8787/rhosp12/openstack-heat-api:2018-04-05.1                  "kolla_start"            3 days ago          Up 3 days (healthy)                          heat_api_cron
75a0fe29c955        192.168.24.1:8787/rhosp12/openstack-swift-proxy-server:2018-04-05.1        "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_object_expirer
2b52e7e23297        192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_object_updater
c7b511b1b45c        192.168.24.1:8787/rhosp12/openstack-swift-container:2018-04-05.1           "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_container_replicator
53e94f8e6d65        192.168.24.1:8787/rhosp12/openstack-swift-account:2018-04-05.1             "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_account_auditor
87dc2054f9a8        192.168.24.1:8787/rhosp12/openstack-cron:2018-04-05.1                      "kolla_start"            3 days ago          Up 3 days                                    logrotate_crond
da77914a4eff        192.168.24.1:8787/rhosp12/openstack-heat-api-cfn:2018-04-05.1              "kolla_start"            3 days ago          Up 3 days (healthy)                          heat_api_cfn
c790a8f459d5        192.168.24.1:8787/rhosp12/openstack-nova-conductor:2018-04-05.1            "kolla_start"            3 days ago          Up 3 days (healthy)                          nova_conductor
d884d2d9ba3c        192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_object_replicator
3b2b9f3c06be        192.168.24.1:8787/rhosp12/openstack-swift-container:2018-04-05.1           "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_container_server
3d1625f53640        192.168.24.1:8787/rhosp12/openstack-heat-engine:2018-04-05.1               "kolla_start"            3 days ago          Up 3 days (healthy)                          heat_engine
6d05cae4c35b        192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_rsync
cdf72996c60a        192.168.24.1:8787/rhosp12/openstack-nova-novncproxy:2018-04-05.1           "kolla_start"            3 days ago          Up 3 days (healthy)                          nova_vnc_proxy
34940e011178        192.168.24.1:8787/rhosp12/openstack-swift-account:2018-04-05.1             "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_account_reaper
1d5cb65d2f73        192.168.24.1:8787/rhosp12/openstack-nova-consoleauth:2018-04-05.1          "kolla_start"            3 days ago          Up 3 days (healthy)                          nova_consoleauth
a3b52157bacf        192.168.24.1:8787/rhosp12/openstack-nova-api:2018-04-05.1                  "kolla_start"            3 days ago          Up 3 days (healthy)                          nova_api_cron
0dbd07b075bb        192.168.24.1:8787/rhosp12/openstack-swift-account:2018-04-05.1             "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_account_replicator
011a904a4500        192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_object_auditor
4645e3333342        192.168.24.1:8787/rhosp12/openstack-heat-api:2018-04-05.1                  "kolla_start"            3 days ago          Up 3 days (healthy)                          heat_api
47b774d35a17        192.168.24.1:8787/rhosp12/openstack-swift-proxy-server:2018-04-05.1        "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_proxy
e37a755dc48a        192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_object_server
9f29c804ac22        192.168.24.1:8787/rhosp12/openstack-nova-scheduler:2018-04-05.1            "kolla_start"            3 days ago          Up 3 days (healthy)                          nova_scheduler
1abc79025eb1        192.168.24.1:8787/rhosp12/openstack-swift-container:2018-04-05.1           "kolla_start"            3 days ago          Up 3 days (healthy)                          swift_container_updater
7607595eed98        192.168.24.1:8787/rhosp12/openstack-gnocchi-api:2018-04-05.1               "kolla_start"            3 days ago          Up 3 days                                    gnocchi_api
f81002972633        192.168.24.1:8787/rhosp12/openstack-gnocchi-statsd:2018-04-05.1            "kolla_start"            3 days ago          Up 3 days                                    gnocchi_statsd
3094c369ccea        192.168.24.1:8787/rhosp12/openstack-gnocchi-metricd:2018-04-05.1           "kolla_start"            3 days ago          Up 3 days                                    gnocchi_metricd
4e9d6f286db4        192.168.24.1:8787/rhosp12/openstack-panko-api:2018-04-05.1                 "kolla_start"            3 days ago          Up 3 days                                    panko_api
97b00f4f5e81        192.168.24.1:8787/rhosp12/openstack-glance-api:2018-04-05.1                "kolla_start"            3 days ago          Up 3 days (healthy)                          glance_api
0fd920e9331c        192.168.24.1:8787/rhosp12/openstack-aodh-listener:2018-04-05.1             "kolla_start"            3 days ago          Up 3 days (healthy)                          aodh_listener
560682c78e1a        192.168.24.1:8787/rhosp12/openstack-aodh-api:2018-04-05.1                  "kolla_start"            3 days ago          Up 3 days                                    aodh_api
160140b7895a        192.168.24.1:8787/rhosp12/openstack-ceilometer-notification:2018-04-05.1   "kolla_start"            3 days ago          Up 3 days (healthy)                          ceilometer_agent_notification
e2655319a0bb        192.168.24.1:8787/rhosp12/openstack-aodh-notifier:2018-04-05.1             "kolla_start"            3 days ago          Up 3 days (healthy)                          aodh_notifier
f974833d2caf        192.168.24.1:8787/rhosp12/openstack-ceilometer-central:2018-04-05.1        "kolla_start"            3 days ago          Up 3 days (healthy)                          ceilometer_agent_central
f11da1c30f0f        192.168.24.1:8787/rhosp12/openstack-aodh-evaluator:2018-04-05.1            "kolla_start"            3 days ago          Up 3 days (healthy)                          aodh_evaluator
 
Now, just to be sure we start from a clean slate, we restart docker.
# systemctl restart docker
 
At this stage docker works correctly and all docker ps commands work.
Now if we do the following we can reliably trigger the hang:
 
[root@controller-2 paunch]# docker ps -q | xargs -n1 docker rm -f
3ac529a689aa
3e68f8f2153f
<HANG>
 
systemctl restart docker
 
[root@controller-2 paunch]# docker ps -q | xargs -n1 docker rm -f
b5e59a580f1e
dfd630aedbac
<HANG>
 
systemctl restart docker
 
[root@controller-2 paunch]# docker ps -q | xargs -n1 docker rm -f
a7a0a5e991fe
46fe7ff66584
<HANG>
 
Now if we add a sleep after each docker rm -f, we observe that we get further, but eventually it still hangs:
[root@controller-2 paunch]# for i in $(docker ps -q); do docker rm -f "$i"; sleep 5; done
974aaa55590f
b24bbd944c66
f36eb6868f43
7919ecf7b66a
ae9e6e924eba
3c0e5ed5590c
c160f86ab2e7
eb39e0ed3b33
<HANG>

Comment 8 Tom Sweeney 2018-04-13 15:34:59 UTC
Michele,

    First off, thank you very much for the very detailed analysis and reproducer.

    I've just tried to quickly repro the issue on my test system and I'm not able to in the first pass.  However, I'm using RHEL 7.4, not RHEL7.5.  I'll see if I can stand up a 7.5 machine to test this on.  However if you happen to have a RHEL 7.4 machine handy and have the time to test, could you try it on that?  

    That won't be the end solution obviously, but would help to narrow the issue down.

    Thanks!

Comment 9 Michele Baldessari 2018-04-13 15:42:51 UTC
(In reply to Tom Sweeney from comment #8)
> Michele,
> 
>     First off, thank you very much for the very detailed analysis and
> reproducer.
> 
>     I've just tried to quickly repro the issue on my test system and I'm not
> able to in the first pass.  However, I'm using RHEL 7.4, not RHEL7.5.  I'll
> see if I can stand up a 7.5 machine to test this on.  However if you happen
> to have a RHEL 7.4 machine handy and have the time to test, could you try it
> on that?  
> 
>     That won't be the end solution obviously, but would help to narrow the
> issue down.
> 
>     Thanks!

Hi Tom,

It does happen on CentOS 7.4 as well (I tried there too). Make sure the two containers are actually up and running and not restarting (we had a colleague try the reproducer, and either a file permission or an SELinux issue put the containers into an endless restart loop; in that case it did not reproduce).

We always wait until both containers show an httpd process running (the horizon container takes a bit longer due to some startup tasks) and then we can reproduce.

cheers,
Michele

Comment 10 Tom Sweeney 2018-04-13 19:14:49 UTC
Michele,

That's my problem: the containers are just in a continual restart loop. They're not able to get to the kolla config file. I'll have to figure out how to get that installed on the test machine.

Comment 11 John Eckersberg 2018-04-13 19:32:18 UTC
(In reply to Tom Sweeney from comment #10)
> Michele,
> 
> That's my problem, the containers are just in a continual loop of
> restarting.  It's not able to get to  kolla config file.  Will have to
> figure out how to get that installed on the test machine.

After you untar the repro directory, chown -R root: repro. It gets unpacked as some unusable uid, and then kolla can't read the config once it gets bind-mounted in. At least that's what happened the first time I tried it!

Comment 17 John Eckersberg 2018-04-19 21:51:23 UTC
Here's what I think is happening based on my investigation so far.  The crux of the problem is that the shim is losing its stderr fifo, which hangs things later on.

- Container launches, fifos are created for stdin/stdout/stderr (stdin appears unused, probably because containers are non-interactive).

- At this point the shim has stdout and stderr open for write, and dockerd has them open for read.

- dockerd shuts down

- The shim closes its file descriptor to the stderr fifo.  The shim keeps its file descriptor for stdout open.  I'm not sure why there is this discrepancy yet.  If we can make stderr stay open, this likely fixes everything.

- dockerd starts and begins restoring the container

- dockerd successfully reattaches to the stdout fifo

- dockerd calls open() on the stderr fifo, but this blocks *forever*, because opening a fifo does not return until the other end is also open. And the shim already closed the other end and never reopens it. (See the small standalone fifo demo at the end of this comment.)

- Things are mostly "ok" at this point

- Running docker rm -f on the container:

  - Locks the container

  - Waits on the StreamConfig which covers the underlying fifos

  - StreamConfig waits forever because the stderr fifo open() is still blocked

  - Container stays locked forever

At this point, any subsequent container action that needs to acquire the container lock (e.g. docker ps) blocks forever.

On a related note, this upstream change reduces the number of places that take container locks:

https://github.com/moby/moby/pull/31273

Although I don't think backporting it here is very feasible as a fix.
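To make the blocking open() above concrete, here is a standalone fifo sketch (plain POSIX fifo semantics, nothing docker-specific):

mkfifo /tmp/fifo-demo
# this open-for-read blocks, exactly like dockerd's open() on the orphaned stderr fifo ...
cat /tmp/fifo-demo
# ... and it only returns once something opens the write end (run this from a second shell):
echo unblock > /tmp/fifo-demo
rm -f /tmp/fifo-demo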

Comment 18 John Eckersberg 2018-04-19 21:55:09 UTC
One additional note: you can use the above to work around the problem by opening and closing the write end of the stderr fifo, which lets dockerd's blocked open() return, like so:

[root@localhost repro]# for f in $(lsof -p $(pgrep dockerd) | grep stderr | awk '{print $NF}'); do echo -n > $f; done
[root@localhost repro]# docker ps -q | xargs -n1 docker rm -f
ec50f18d2caf
e642c8cb6f64
[root@localhost repro]#

Comment 19 Tom Sweeney 2018-04-19 23:08:06 UTC
I've a probable patch put together for this. I say "probable" because when I first used the patch, I still saw the same issue occurring. However, I don't think I reset the new docker daemon image appropriately. I reconfigured my system and since then have not run into the error. Test run below.

It appears the problem was a lock in Docker logging. I found a patch in Moby whose timing matches when docker-ce was fixed and whose footprint looks right. It is: https://github.com/moby/moby/pull/33897.

My PR is https://github.com/projectatomic/docker/pull/310

I need to test further tomorrow.  Michele (or anyone else) if you'd like me to get you some RHEL binaries to test with, let me know and I'll tar them up.

To be honest, I'm a little leery of this change. It's a patch from a version of Docker that's effectively 3 versions newer than RHEL's, and I had to do a bit of surgery on the logger.

John Eckersberg: Thanks for looking into this further and discovering the workaround. Always nice to have that in the back pocket.


Test Run:

# . break.sh
5500f7c39a571a6a3f010a6176ba73422db4a9298de10a68e8bd9a62fb532d59
a3634065169c2d22bb448cb7766c5d77eb441e7c913a86685e8198ea8a5f3606

# sleep 60

# docker ps
CONTAINER ID        IMAGE                                                                                 COMMAND             CREATED             STATUS              PORTS               NAMES
a3634065169c        rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-keystone:2018-03-29.1   "kolla_start"       4 minutes ago       Up 4 minutes                            stupefied_williams
5500f7c39a57        rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-horizon:2018-03-29.1    "kolla_start"       4 minutes ago       Up 4 minutes                            quirky_minsky

# systemctl restart docker

# docker ps
CONTAINER ID        IMAGE                                                                                 COMMAND             CREATED             STATUS              PORTS               NAMES
a3634065169c        rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-keystone:2018-03-29.1   "kolla_start"       4 minutes ago       Up 14 seconds                           stupefied_williams
5500f7c39a57        rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-horizon:2018-03-29.1    "kolla_start"       4 minutes ago       Up 14 seconds                           quirky_minsky

# docker ps -q | xargs -n1 docker rm -f
a3634065169c
5500f7c39a57

# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Comment 23 Tom Sweeney 2018-04-26 12:44:56 UTC
Michele, thanks for testing it. I'm curious why it seemed to work on my system but not on yours. That's got me scratching my head more than anything at this point.

Can you tell me whether the workaround worked for you on both the "pristine" machine and the machine with the new rpms?

A couple of questions. On the machine with the patch, can you stop the containers and then do "docker rm" for each? Does that succeed? I don't think it did on the original machine.
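(In loop form, mirroring the reproducer earlier in the bug, what I'm asking is roughly:)

# stop each container first, then remove it without -f
for i in $(docker ps -q); do docker stop "$i" && docker rm "$i"; done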

Please let me know if you've any time to play on your machine today.

Comment 24 Tom Sweeney 2018-04-26 17:06:03 UTC
Just a quick update. Michele, John, and I met earlier today. Michele and John believe they can add the '-t' parameter to the 'docker run' command they have in one of their scripts, which will enable a pseudo-terminal to be used. This looks like it alleviates the stderr bashing/corruption of the FIFOs that's going on. Michele will be adding more details after a bit more investigation.
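(Concretely, the change amounts to adding -t to the docker run call in that script, roughly like this; the image and command are placeholders, everything else stays as whatever the script already passes:)

# -t allocates a pseudo-terminal for the container, so its output goes through
# the TTY path rather than the separate stdout/stderr FIFOs discussed above
docker run -d -t <image> <command...>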

Comment 25 John Eckersberg 2018-04-26 19:24:57 UTC
I've got a much simpler reproducer.  This uses bash magic to close fd2 (stderr) and sleep forever.

[root@localhost ~]# id=$(docker run -d fedora:latest bash -c 'exec sleep infinity 2>&-')
[root@localhost ~]# systemctl restart docker
[root@localhost ~]# docker rm -f $id

(hangs here)

This fails in the same way if you use fd1 (stdout) instead of fd2 (stderr).

So basically if the init process in the container closes stdout/stderr, dockerd cannot successfully reattach after startup and hangs if you try to delete the container.

We hit this on OpenStack because we run httpd as PID 1 in many containers, and httpd closes its stderr and reopens it to point at the httpd error_log.
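(For the httpd case specifically, the analogous minimal sketch would redirect rather than close the descriptor; untested here, but by the reasoning above it drops the original fifo end just the same:)

# stand-in for httpd re-pointing stderr at its error_log
id=$(docker run -d fedora:latest bash -c 'exec sleep infinity 2>>/tmp/error_log')
systemctl restart docker
docker rm -f $id    # expected to hang the same way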

Comment 26 Tom Sweeney 2018-04-26 21:14:43 UTC
John, thanks for the miniaturized reproducer. Just to make things faster I tried using alpine rather than fedora, and the problem doesn't replicate with that OS. It does replicate for me with fedora though.

fwiw:

id=$(docker run -d alpine sh -c 'exec sleep infinity 2>&-')

Comment 27 John Eckersberg 2018-04-26 21:29:05 UTC
(In reply to Tom Sweeney from comment #26)
> John thanks for the miniaturized reproducer.  Just to make things faster I
> tried using alpine rather than fedora and the problem doesn't replicate with
> that OS.  It does replicate for me with fedora though.
> 
> fwiw:
> 
> id=$(docker run -d alpine sh -c 'exec sleep infinity 2>&-')

That's just because alpine has a version of sleep that doesn't understand infinity:

[root@localhost ~]# docker run alpine sleep infinity
sleep: invalid number 'infinity'

It reproduces if you replace infinity with 3600:

[root@localhost ~]# id=$(docker run -d alpine sh -c 'exec sleep 3600 2>&-')
[root@localhost ~]# systemctl restart docker
[root@localhost ~]# docker rm -f $id

(hangs here)

Comment 28 Daniel Walsh 2018-04-27 10:35:26 UTC
Antonio do you have any ideas now that we have a simple reproducer?

Comment 30 Tom Sweeney 2018-04-28 03:31:28 UTC
I've just updated my PR at https://github.com/projectatomic/docker/pull/310/files. This is a much smaller fix, and after our investigation and discoveries over the past few days, I feel much better about it. In short, if containers were tagged with live-restore, their FIFOs were not reset as they should be after a restart. This change fixes that. It's also a much smaller PR than my first, so there's a lot less room for added issues.
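(For anyone checking their own setup: live-restore is the daemon option that keeps containers running across a docker restart. A quick way to see whether it is in effect, going by the 1.13 docker info output, is:)

# prints "Live Restore Enabled: true" when the daemon is running with --live-restore
docker info 2>/dev/null | grep -i 'live restore'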

Michele and John, if you could take the PR for a test run, I'd appreciate it. With the minimal test that John set up, it's been working well for me.

Lokesh, if you could spin up another rpm on brew for me, that would be super!

Comment 35 Tom Sweeney 2018-05-01 19:50:10 UTC
Just posted a PR for Docker-1.13.1 with the same fix.  https://github.com/projectatomic/docker/pull/311

Comment 41 errata-xmlrpc 2018-05-14 16:52:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1419