Bug 1566439
| Summary: | docker ps is hanging | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Michele Baldessari <michele> |
| Component: | docker | Assignee: | Tom Sweeney <tsweeney> |
| Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.5 | CC: | agurenko, amurdaca, chjones, dciabrin, ddarrah, dwalsh, fkluknav, hdalwadi, jeckersb, lsm5, madgupta, mandreou, mbracho, mburns, michele, tsweeney, yprokule |
| Target Milestone: | rc | Keywords: | Extras |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-14 16:52:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Michele Baldessari
2018-04-12 10:31:30 UTC
Does it go into hang again, or just when updating from 1.12 to 1.13?

(In reply to Daniel Walsh from comment #3)
> Does it go into hang again, or just when updating from 1.12 to 1.13?

Oh no, this time we do not even upgrade the package (we start with 1.13.1-58 on OSP12 and stay at the same version on OSP13).

Some additional investigation; maybe we can narrow the surface of the issue a bit. What triggers it is a tool of ours that calls 'docker rm -f <...>' on a bunch of containers. Below is the full list of containers on a controller:

```
CONTAINER ID IMAGE                                                                      COMMAND                CREATED        STATUS                  NAMES
3ac529a689aa 192.168.24.1:8787/rhceph/rhceph-3-rhel7:latest                             "/entrypoint.sh"       5 minutes ago  Up 5 minutes            ceph-mon-controller-2
3e68f8f2153f 192.168.24.1:8787/rhosp13/openstack-keystone:2018-04-03.3                  "/bin/bash -c '/us..." 7 minutes ago  Up 7 minutes            keystone_cron
dfd630aedbac 192.168.24.1:8787/rhosp13/openstack-keystone:2018-04-03.3                  "kolla_start"          7 minutes ago  Up 7 minutes (healthy)  keystone
46fe7ff66584 192.168.24.1:8787/rhosp13/openstack-nova-placement-api:2018-04-03.3        "kolla_start"          7 minutes ago  Up 7 minutes            nova_placement
b24bbd944c66 192.168.24.1:8787/rhosp13/openstack-horizon:2018-04-03.3                   "kolla_start"          7 minutes ago  Up 7 minutes            horizon
f36eb6868f43 192.168.24.1:8787/rhosp13/openstack-iscsid:2018-04-03.3                    "kolla_start"          4 hours ago    Up 4 hours              iscsid
7919ecf7b66a 192.168.24.1:8787/rhosp13/openstack-haproxy:pcmklatest                     "/bin/bash /usr/lo..." 4 hours ago    Up 4 hours              haproxy-bundle-docker-1
ae9e6e924eba 192.168.24.1:8787/rhosp13/openstack-memcached:2018-04-03.3                 "/bin/bash -c 'sou..." 22 hours ago   Up 22 hours             memcached
3c0e5ed5590c 192.168.24.1:8787/rhosp13/openstack-redis:pcmklatest                       "/bin/bash /usr/lo..." 22 hours ago   Up 22 hours             redis-bundle-docker-2
c160f86ab2e7 192.168.24.1:8787/rhosp12/openstack-nova-api:2018-04-05.1                  "kolla_start"          3 days ago     Up 3 days (healthy)     nova_metadata
eb39e0ed3b33 192.168.24.1:8787/rhosp12/openstack-nova-api:2018-04-05.1                  "kolla_start"          3 days ago     Up 3 days (healthy)     nova_api
4f9e30acb9fe 192.168.24.1:8787/rhosp12/openstack-swift-account:2018-04-05.1             "kolla_start"          3 days ago     Up 3 days (healthy)     swift_account_server
ef4a66e5e952 192.168.24.1:8787/rhosp12/openstack-swift-container:2018-04-05.1           "kolla_start"          3 days ago     Up 3 days (healthy)     swift_container_auditor
7197007cb871 192.168.24.1:8787/rhosp12/openstack-heat-api:2018-04-05.1                  "kolla_start"          3 days ago     Up 3 days (healthy)     heat_api_cron
75a0fe29c955 192.168.24.1:8787/rhosp12/openstack-swift-proxy-server:2018-04-05.1        "kolla_start"          3 days ago     Up 3 days (healthy)     swift_object_expirer
2b52e7e23297 192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"          3 days ago     Up 3 days (healthy)     swift_object_updater
c7b511b1b45c 192.168.24.1:8787/rhosp12/openstack-swift-container:2018-04-05.1           "kolla_start"          3 days ago     Up 3 days (healthy)     swift_container_replicator
53e94f8e6d65 192.168.24.1:8787/rhosp12/openstack-swift-account:2018-04-05.1             "kolla_start"          3 days ago     Up 3 days (healthy)     swift_account_auditor
87dc2054f9a8 192.168.24.1:8787/rhosp12/openstack-cron:2018-04-05.1                      "kolla_start"          3 days ago     Up 3 days               logrotate_crond
da77914a4eff 192.168.24.1:8787/rhosp12/openstack-heat-api-cfn:2018-04-05.1              "kolla_start"          3 days ago     Up 3 days (healthy)     heat_api_cfn
c790a8f459d5 192.168.24.1:8787/rhosp12/openstack-nova-conductor:2018-04-05.1            "kolla_start"          3 days ago     Up 3 days (healthy)     nova_conductor
d884d2d9ba3c 192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"          3 days ago     Up 3 days (healthy)     swift_object_replicator
3b2b9f3c06be 192.168.24.1:8787/rhosp12/openstack-swift-container:2018-04-05.1           "kolla_start"          3 days ago     Up 3 days (healthy)     swift_container_server
3d1625f53640 192.168.24.1:8787/rhosp12/openstack-heat-engine:2018-04-05.1               "kolla_start"          3 days ago     Up 3 days (healthy)     heat_engine
6d05cae4c35b 192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"          3 days ago     Up 3 days (healthy)     swift_rsync
cdf72996c60a 192.168.24.1:8787/rhosp12/openstack-nova-novncproxy:2018-04-05.1           "kolla_start"          3 days ago     Up 3 days (healthy)     nova_vnc_proxy
34940e011178 192.168.24.1:8787/rhosp12/openstack-swift-account:2018-04-05.1             "kolla_start"          3 days ago     Up 3 days (healthy)     swift_account_reaper
1d5cb65d2f73 192.168.24.1:8787/rhosp12/openstack-nova-consoleauth:2018-04-05.1          "kolla_start"          3 days ago     Up 3 days (healthy)     nova_consoleauth
a3b52157bacf 192.168.24.1:8787/rhosp12/openstack-nova-api:2018-04-05.1                  "kolla_start"          3 days ago     Up 3 days (healthy)     nova_api_cron
0dbd07b075bb 192.168.24.1:8787/rhosp12/openstack-swift-account:2018-04-05.1             "kolla_start"          3 days ago     Up 3 days (healthy)     swift_account_replicator
011a904a4500 192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"          3 days ago     Up 3 days (healthy)     swift_object_auditor
4645e3333342 192.168.24.1:8787/rhosp12/openstack-heat-api:2018-04-05.1                  "kolla_start"          3 days ago     Up 3 days (healthy)     heat_api
47b774d35a17 192.168.24.1:8787/rhosp12/openstack-swift-proxy-server:2018-04-05.1        "kolla_start"          3 days ago     Up 3 days (healthy)     swift_proxy
e37a755dc48a 192.168.24.1:8787/rhosp12/openstack-swift-object:2018-04-05.1              "kolla_start"          3 days ago     Up 3 days (healthy)     swift_object_server
9f29c804ac22 192.168.24.1:8787/rhosp12/openstack-nova-scheduler:2018-04-05.1            "kolla_start"          3 days ago     Up 3 days (healthy)     nova_scheduler
1abc79025eb1 192.168.24.1:8787/rhosp12/openstack-swift-container:2018-04-05.1           "kolla_start"          3 days ago     Up 3 days (healthy)     swift_container_updater
7607595eed98 192.168.24.1:8787/rhosp12/openstack-gnocchi-api:2018-04-05.1               "kolla_start"          3 days ago     Up 3 days               gnocchi_api
f81002972633 192.168.24.1:8787/rhosp12/openstack-gnocchi-statsd:2018-04-05.1            "kolla_start"          3 days ago     Up 3 days               gnocchi_statsd
3094c369ccea 192.168.24.1:8787/rhosp12/openstack-gnocchi-metricd:2018-04-05.1           "kolla_start"          3 days ago     Up 3 days               gnocchi_metricd
4e9d6f286db4 192.168.24.1:8787/rhosp12/openstack-panko-api:2018-04-05.1                 "kolla_start"          3 days ago     Up 3 days               panko_api
97b00f4f5e81 192.168.24.1:8787/rhosp12/openstack-glance-api:2018-04-05.1                "kolla_start"          3 days ago     Up 3 days (healthy)     glance_api
0fd920e9331c 192.168.24.1:8787/rhosp12/openstack-aodh-listener:2018-04-05.1             "kolla_start"          3 days ago     Up 3 days (healthy)     aodh_listener
560682c78e1a 192.168.24.1:8787/rhosp12/openstack-aodh-api:2018-04-05.1                  "kolla_start"          3 days ago     Up 3 days               aodh_api
160140b7895a 192.168.24.1:8787/rhosp12/openstack-ceilometer-notification:2018-04-05.1   "kolla_start"          3 days ago     Up 3 days (healthy)     ceilometer_agent_notification
e2655319a0bb 192.168.24.1:8787/rhosp12/openstack-aodh-notifier:2018-04-05.1             "kolla_start"          3 days ago     Up 3 days (healthy)     aodh_notifier
f974833d2caf 192.168.24.1:8787/rhosp12/openstack-ceilometer-central:2018-04-05.1        "kolla_start"          3 days ago     Up 3 days (healthy)     ceilometer_agent_central
f11da1c30f0f 192.168.24.1:8787/rhosp12/openstack-aodh-evaluator:2018-04-05.1            "kolla_start"          3 days ago     Up 3 days (healthy)     aodh_evaluator
```

Now, just to be sure we start from a clean slate, we restart docker:

```
# systemctl restart docker
```

At this stage docker works correctly and all docker ps commands work.
Now if we do the following, we can reliably trigger the hang:

```
[root@controller-2 paunch]# docker ps -q | xargs -n1 docker rm -f
3ac529a689aa
3e68f8f2153f
<HANG>

systemctl restart docker

[root@controller-2 paunch]# docker ps -q | xargs -n1 docker rm -f
b5e59a580f1e
dfd630aedbac
<HANG>

systemctl restart docker

[root@controller-2 paunch]# docker ps -q | xargs -n1 docker rm -f
a7a0a5e991fe
46fe7ff66584
<HANG>
```

Now if we add a sleep after the docker rm -f, we observe that we get further, but eventually we still hang:

```
[root@controller-2 paunch]# for i in $(docker ps -q); do docker rm -f "$i"; sleep 5; done
974aaa55590f
b24bbd944c66
f36eb6868f43
7919ecf7b66a
ae9e6e924eba
3c0e5ed5590c
c160f86ab2e7
eb39e0ed3b33
<HANG>
```

Michele,

First off, thank you very much for the very detailed analysis and reproducer.

I've just tried to quickly repro the issue on my test system and I'm not able to on the first pass. However, I'm using RHEL 7.4, not RHEL 7.5. I'll see if I can stand up a 7.5 machine to test this on. In the meantime, if you happen to have a RHEL 7.4 machine handy and have the time to test, could you try it on that? That won't be the end solution, obviously, but it would help to narrow the issue down.

Thanks!

(In reply to Tom Sweeney from comment #8)
> Michele,
>
> First off, thank you very much for the very detailed analysis and
> reproducer.
>
> I've just tried to quickly repro the issue on my test system and I'm not
> able to on the first pass. However, I'm using RHEL 7.4, not RHEL 7.5. I'll
> see if I can stand up a 7.5 machine to test this on. In the meantime, if
> you happen to have a RHEL 7.4 machine handy and have the time to test,
> could you try it on that?
>
> That won't be the end solution, obviously, but it would help to narrow the
> issue down.
>
> Thanks!

Hi Tom,

It does happen on CentOS 7.4 as well (I tried there too). Make sure the two containers are actually up and running and not restarting (we had a colleague try the reproducer, and some file or SELinux permission issue put the containers into an endless restart loop; in that case it did not reproduce). We normally wait until both containers show an httpd process running (the horizon container takes a bit longer due to some startup tasks) and can then reproduce.

cheers,
Michele

Michele,

That's my problem: the containers are just in a continual restart loop. They're not able to get to the kolla config file. I'll have to figure out how to get that installed on the test machine.

(In reply to Tom Sweeney from comment #10)
> Michele,
>
> That's my problem: the containers are just in a continual restart loop.
> They're not able to get to the kolla config file. I'll have to figure out
> how to get that installed on the test machine.

After you untar the repro directory, chown -R root: repro. It gets unpacked with some unusable uid, and then kolla can't read the config once it gets bind-mounted in. At least that's what happened the first time I tried it!

Here's what I think is happening, based on my investigation so far. The crux of the problem is that the shim is losing its stderr fifo, which hangs things later on.

- The container launches; fifos are created for stdin/stdout/stderr (stdin appears unused, probably because the containers are non-interactive).
- At this point the shim has stdout and stderr open for write, and dockerd has them open for read.
- dockerd shuts down.
- The shim closes its file descriptor for the stderr fifo, but keeps its file descriptor for stdout open. I'm not sure why there is this discrepancy yet. If we can make stderr stay open, this likely fixes everything.

Then, on restart:

- dockerd starts and begins restoring the container.
- dockerd successfully reattaches to the stdout fifo.
- dockerd calls open() on the stderr fifo, but this blocks *forever*, because opening a fifo does not return until the other end is also connected, and the shim already closed the other end and never reopens it.

Things are mostly "ok" at this point.
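The blocking open() above is plain POSIX fifo behavior and is easy to demonstrate outside Docker. Below is a minimal standalone Go sketch (my own illustration, not Docker code; the fifo path is arbitrary): a blocking open of a fifo's read end does not return until some process opens the write end, and briefly opening the write end, which is exactly what the workaround later in this thread does, releases it.

```go
// fifoblock.go - standalone demo (not Docker code) of the POSIX fifo
// semantics behind the hang. Linux-only; the fifo path is arbitrary.
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

func main() {
	path := "/tmp/demo-stderr-fifo"
	os.Remove(path)
	if err := syscall.Mkfifo(path, 0600); err != nil {
		panic(err)
	}

	opened := make(chan struct{})
	go func() {
		// What dockerd effectively does on restore: a plain blocking
		// open of the fifo's read end. With no writer on the other end
		// this never returns; the shim closed its write end and never
		// reopens it.
		fd, err := syscall.Open(path, syscall.O_RDONLY, 0)
		fmt.Println("read-end open returned:", fd, err)
		close(opened)
	}()

	select {
	case <-opened:
		fmt.Println("unexpected: open returned with no writer")
	case <-time.After(2 * time.Second):
		fmt.Println("read-end open still blocked after 2s, as expected")
	}

	// The workaround shown later in the thread: briefly open the write
	// end (`echo -n > fifo` does the same), which lets the blocked
	// reader's open() complete.
	wfd, err := syscall.Open(path, syscall.O_WRONLY, 0)
	if err != nil {
		panic(err)
	}
	<-opened // the reader is now unblocked
	syscall.Close(wfd)
}
```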
Running docker rm -f on the container then wedges it:

- It locks the container.
- It waits on the StreamConfig, which covers the underlying fifos.
- The StreamConfig wait never finishes, because the stderr fifo open() is still blocked.
- The container stays locked forever.

At this point, any subsequent container action that needs to acquire the container lock, such as docker ps, hangs as well.
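This lock chain is what turns one wedged container into a daemon-wide hang. The sketch below is a minimal analogy in Go, with hypothetical types that are not Docker's actual code: one goroutine takes a per-container mutex and then waits on I/O that never completes, so a later caller that wants the same mutex, the stand-in for docker ps, hangs too.

```go
// lockhang.go - a minimal sketch (hypothetical types, not Docker's actual
// code) of the lock chain described above.
package main

import (
	"fmt"
	"sync"
	"time"
)

// container stands in for dockerd's per-container state, which the real
// daemon guards with a similar lock.
type container struct {
	mu sync.Mutex
}

func (c *container) remove(stderrAttached <-chan struct{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// Analog of waiting on the StreamConfig: this blocks forever because
	// the stderr fifo open() never finished, so the channel never closes.
	<-stderrAttached
}

func (c *container) list() string {
	c.mu.Lock() // the `docker ps` analog also needs the container lock
	defer c.mu.Unlock()
	return "ok"
}

func main() {
	c := &container{}
	neverReady := make(chan struct{}) // never closed, like the dead fifo

	go c.remove(neverReady) // `docker rm -f`: takes the lock, waits forever

	time.Sleep(100 * time.Millisecond) // let remove() grab the lock first

	done := make(chan string, 1)
	go func() { done <- c.list() }()

	select {
	case s := <-done:
		fmt.Println("ps returned:", s)
	case <-time.After(2 * time.Second):
		fmt.Println("`docker ps` analog is still waiting for the container lock")
	}
}
```

The timeout in main is only there so the demo exits; the real daemon has no such timeout, which is why the hang is permanent.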
On a related note, this upstream change reduces the number of places that take container locks: https://github.com/moby/moby/pull/31273. I don't think backporting it is feasible enough to help here, though.

One additional note: you can use the above to work around the problem, by opening and closing the write end of each fifo like so:

```
[root@localhost repro]# for f in $(lsof -p $(pgrep dockerd) | grep stderr | awk '{print $NF}'); do echo -n > $f; done
[root@localhost repro]# docker ps -q | xargs -n1 docker rm -f
ec50f18d2caf
e642c8cb6f64
[root@localhost repro]#
```

I have a probable patch put together for this. I say "probable" because when I first used the patch I still had the same issue occurring; however, I don't think I had reset the new docker daemon image appropriately. I reconfigured my system and have not run into the error since. Test run below.

It appears the problem was a lock with Docker logging. I found a patch in Moby with the right footprint and the right timing for when Docker CE was fixed: https://github.com/moby/moby/pull/33897. My PR is https://github.com/projectatomic/docker/pull/310. I need to test further tomorrow. Michele (or anyone else), if you'd like me to get you some RHEL binaries to test with, let me know and I'll tar them up.

To be honest, I'm a little leery of this change. It's a patch from a version of Docker that's effectively three versions newer than RHEL's, and I had to do a bit of surgery on the logger.

John Eckersberg: thanks for looking into this further and discovering the workaround. Always nice to have that in the back pocket.

Test run:

```
# . break.sh
5500f7c39a571a6a3f010a6176ba73422db4a9298de10a68e8bd9a62fb532d59
a3634065169c2d22bb448cb7766c5d77eb441e7c913a86685e8198ea8a5f3606
# sleep 60
# docker ps
CONTAINER ID IMAGE                                                                                COMMAND       CREATED        STATUS        NAMES
a3634065169c rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-keystone:2018-03-29.1  "kolla_start" 4 minutes ago  Up 4 minutes  stupefied_williams
5500f7c39a57 rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-horizon:2018-03-29.1   "kolla_start" 4 minutes ago  Up 4 minutes  quirky_minsky
# systemctl restart docker
# docker ps
CONTAINER ID IMAGE                                                                                COMMAND       CREATED        STATUS         NAMES
a3634065169c rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-keystone:2018-03-29.1  "kolla_start" 4 minutes ago  Up 14 seconds  stupefied_williams
5500f7c39a57 rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-horizon:2018-03-29.1   "kolla_start" 4 minutes ago  Up 14 seconds  quirky_minsky
# docker ps -q | xargs -n1 docker rm -f
a3634065169c
5500f7c39a57
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
```

Michele, thanks for testing it. I'm curious why it seemed to work on my system but not on yours. That's got me scratching my head more than anything at this point. Can you tell me whether the workaround worked for you on the "pristine" machine, on the machine with the new rpms, or on both?

A couple of questions: on the machine with the patch, can you stop the containers and then do a docker rm for each? Does that succeed? I don't think it did on the original machine. Please let me know if you have any time to play on your machine today.

Just a quick update: Michele, John, and I met earlier today. Michele and John believe they can add the '-t' parameter to the 'docker run' command they have in one of their scripts, which will allocate a pseudo-terminal. This looks like it alleviates the stderr bashing/corruption of the fifos that's going on. Michele will add more details after a bit more investigation.

I've got a much simpler reproducer. This uses bash magic to close fd 2 (stderr) and sleep forever:

```
[root@localhost ~]# id=$(docker run -d fedora:latest bash -c 'exec sleep infinity 2>&-')
[root@localhost ~]# systemctl restart docker
[root@localhost ~]# docker rm -f $id
(hangs here)
```

This fails in the same way if you use fd 1 (stdout) instead of fd 2 (stderr). So basically, if the init process in the container closes stdout or stderr, dockerd cannot successfully reattach after startup and hangs if you try to delete the container. We hit this on OpenStack because we run httpd as pid 1 in many containers, and httpd closes its stderr and reopens it pointed at the httpd error_log.

John, thanks for the miniaturized reproducer. Just to make things faster I tried using alpine rather than fedora, and the problem doesn't replicate with that OS. It does replicate for me with fedora, though. fwiw:

```
id=$(docker run -d alpine sh -c 'exec sleep infinity 2>&-')
```

(In reply to Tom Sweeney from comment #26)
> John, thanks for the miniaturized reproducer. Just to make things faster I
> tried using alpine rather than fedora, and the problem doesn't replicate
> with that OS. It does replicate for me with fedora, though.
>
> fwiw:
>
> id=$(docker run -d alpine sh -c 'exec sleep infinity 2>&-')

That's just because alpine has a version of sleep that doesn't understand infinity:

```
[root@localhost ~]# docker run alpine sleep infinity
sleep: invalid number 'infinity'
```

It reproduces if you replace infinity with 3600:

```
[root@localhost ~]# id=$(docker run -d alpine sh -c 'exec sleep 3600 2>&-')
[root@localhost ~]# systemctl restart docker
[root@localhost ~]# docker rm -f $id
(hangs here)
```

Antonio, do you have any ideas now that we have a simple reproducer?

I've just updated my PR at https://github.com/projectatomic/docker/pull/310/files. This is a much smaller fix, and after our investigation and discoveries over the past few days I feel much better about it. In short: if containers were tagged with live-restore, their fifos were not reset as they should have been after a restart. This change fixes that. It's also a much smaller PR than my first, so there is a lot less room for added issues.

Michele and John, if you could take the PR for a test run, I'd appreciate it. With the minimal test that John set up, it's been working well for me. Lokesh, if you could spin up another rpm on brew for me, that would be super!

Just posted a PR for Docker 1.13.1 with the same fix: https://github.com/projectatomic/docker/pull/311
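For readers who want the shape of the fix without reading the PR: below is a rough sketch of the kind of restore-time reset described above. This is only an illustration of the idea, not the actual patch; every name in it (restoredContainer, resetStdio, openFifo) is hypothetical, and the real change is the one in https://github.com/projectatomic/docker/pull/310. The key point is reattaching to a live-restored container's stdio fifos in a way that cannot block forever, and resetting the stream state instead of waiting when the peer is gone.

```go
// Hypothetical sketch of the fix's idea, NOT the actual patch.
package main

import (
	"errors"
	"fmt"
	"syscall"
)

type restoredContainer struct {
	id         string
	stderrPath string
}

// openFifo opens the read end with O_NONBLOCK, so the open itself can never
// hang waiting for the write end, unlike the plain blocking open that
// wedged dockerd on restore.
func openFifo(path string) (int, error) {
	fd, err := syscall.Open(path, syscall.O_RDONLY|syscall.O_NONBLOCK, 0)
	if err != nil {
		return -1, fmt.Errorf("open %s: %w", path, err)
	}
	return fd, nil
}

// resetStdio is the hypothetical restore-time step: reattach to the fifo
// and treat a dead or missing peer as a recoverable condition, resetting
// the stream config rather than waiting on it forever.
func (c *restoredContainer) resetStdio() error {
	fd, err := openFifo(c.stderrPath)
	if err != nil {
		// e.g. the shim closed its end and never reopened it
		return errors.New(c.id + ": stderr fifo unavailable, resetting stream config")
	}
	syscall.Close(fd)
	return nil
}

func main() {
	c := &restoredContainer{id: "abc123", stderrPath: "/run/demo/stderr"}
	fmt.Println(c.resetStdio())
}
```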
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1419