Bug 1671861

Summary: Director deployed OCP 3.11 deployment fails with openshift-ansible getting stuck when restarting docker service on master nodes
Product: Red Hat Enterprise Linux 7 Reporter: Marius Cornea <mcornea>
Component: dockerAssignee: Valentin Rothberg <vrothber>
Status: CLOSED ERRATA QA Contact: atomic-bugs <atomic-bugs>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.6CC: afariasa, alexander.lackner, amodi, amurdaca, apevec, apmukher, atonner, bendik.paulsrud, chris.liles, cshereme, dbecker, dominik.mierzejewski, dornelas, eaj, fan-wxa, fkluknav, fshaikh, hasuzuki, igreen, jrosenta, kelly.brown1, lsm5, lsu, magnus.heino, m.andre, mburns, mcornea, mnoguera, morazi, mukeshk2, ncurry, nils.ketelsen, palonsor, pasik, pdwyer, rludva, rmanes, rsandu, sasha, subhat, travi, uobergfe, v.nandha, vrothber, wmeng, wwurzbac, yprokule
Target Milestone: rcKeywords: Extras, Regression, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: docker-1.13.1-93.gitb2f74b2.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-03-13 01:52:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1186913, 1647729, 1661804, 1663922    

Description Marius Cornea 2019-02-01 22:19:01 UTC
Description of problem:

Director deployed OCP 3.11 deployment fails with openshift-ansible getting stuck when restarting docker on master nodes.

Snippet from /var/lib/mistral/openshift/openshift/playbook.log:

TASK [container_runtime : Fix SELinux Permissions on /var/lib/containers] ******
ok: [openshift-infra-2]
ok: [openshift-infra-1]
ok: [openshift-infra-0]
ok: [openshift-master-2]
ok: [openshift-master-0]
ok: [openshift-master-1]
ok: [openshift-worker-2]
ok: [openshift-worker-0]
ok: [openshift-worker-1]

RUNNING HANDLER [container_runtime : restart container runtime] ****************
changed: [openshift-infra-2]
changed: [openshift-infra-1]
changed: [openshift-worker-2]
changed: [openshift-infra-0]
changed: [openshift-worker-0]
changed: [openshift-worker-1]

We can see that the task ran fine on non-master nodes. Checking the docker processes on one of the master node we can see:

[root@openshift-master-0 heat-admin]# ps axu | grep docker
root     17174  0.0  0.0 136600 12472 ?        Ssl  16:24   0:00 /usr/libexec/docker/rhel-push-plugin
root     27861  0.0  0.0 134820  1440 ?        S    16:38   0:00 /bin/systemctl restart docker
root     27899  0.0  0.1 677516 24084 ?        Ssl  16:38   0:02 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json --selinux-enabled --signature-verification=False -s overlay2 --mtu=1450 --add-registry registry.redhat.io --insecure-registry 192.168.24.1:8787 --add-registry registry.access.redhat.com --add-registry docker.io --add-registry registry.fedoraproject.org --add-registry quay.io --add-registry registry.centos.org
root     27908  0.0  0.0 394508 13568 ?        Ssl  16:38   0:01 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc --runtime-args --systemd-cgroup=true
root     27997  0.0  0.0 115300  1464 ?        Ss   16:59   0:00 /usr/bin/sh -c DEAD=`docker ps -aq -f status=dead` && [ -n "$DEAD" ] && docker rm $DEAD; exit 0
root     28001  0.0  0.0 115300   652 ?        S    16:59   0:00 /usr/bin/sh -c DEAD=`docker ps -aq -f status=dead` && [ -n "$DEAD" ] && docker rm $DEAD; exit 0
root     28002  0.0  0.0 166116  9248 ?        Sl   16:59   0:00 /usr/bin/docker-current ps -aq -f status=dead


The docker service is stuck in activating state:

[root@openshift-master-0 heat-admin]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─99-unset-mountflags.conf, custom.conf
   Active: activating (start) since Fri 2019-02-01 16:38:48 EST; 36min ago
     Docs: http://docs.docker.com
 Main PID: 27899 (dockerd-current)
    Tasks: 27
   Memory: 879.7M
   CGroup: /system.slice/docker.service
           ├─27899 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy...
           └─27908 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-...

Feb 01 16:38:48 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...
Feb 01 16:38:49 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:49.173839784-05:00" level=info msg="libcontainerd: new containerd process, pid: 27908"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.284350795-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.285523377-05:00" level=info msg="Loading containers: start."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.303415591-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412...6b2943e34278)."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f...1ba97f1c1462)."
Hint: Some lines were ellipsized, use -l to show in full.


Version-Release number of selected component (if applicable):
docker-1.13.1-90.git07f3374.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 3.11 via OpenStack Director

Actual results:
Deployment cannot complete because docker gets stuck.

Expected results:
Deployment doesn't get stuck.

Additional info:

I can see in the job history that deployment passed with a lower version of docker so this could potentially be a regression introduced by the newer docker package.

Working version:
docker-1.13.1-88.git07f3374.el7.x86_64

Comment 1 Martin André 2019-02-04 08:42:14 UTC
The infra and worker nodes do not have any containers running at this step, this could be the reason why they had no issue when restarting docker.

The full logs from the master node provide more context:

Feb 01 16:30:58 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:30:58.875673621-05:00" level=warning msg="d8e8ffa50cb6850aa67e7c3ca064d33ad84f397a53814d0566013f6bc0a52965 cleanup: failed to unmount secrets: invalid argument"
Feb 01 16:38:47 openshift-master-0 systemd[1]: Stopping Docker Application Container Engine...
Feb 01 16:38:47 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:38:47.679732066-05:00" level=info msg="Processing signal 'terminated'"
Feb 01 16:38:47 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:38:47.682139079-05:00" level=info msg="stopping containerd after receiving terminated"
Feb 01 16:38:47 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:38:47.68246551-05:00" level=fatal msg="containerd: serve grpc" error="accept unix /var/run/docker/libcontainerd/docker-containerd.sock: use of closed network connection"
Feb 01 16:38:48 openshift-master-0 systemd[1]: Stopped Docker Application Container Engine.
Feb 01 16:38:48 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...
Feb 01 16:38:49 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:49.173839784-05:00" level=info msg="libcontainerd: new containerd process, pid: 27908"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.284350795-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.285523377-05:00" level=info msg="Loading containers: start."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.303415591-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278)."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)."

Seems like an issue in containerd.

Comment 2 Martin André 2019-02-04 09:10:45 UTC
Additional note, ran "sudo systemctl stop docker" twice and docker finally came up after some time:

Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)."
Feb 04 03:57:41 openshift-master-0 dockerd-current[27899]: time="2019-02-04T03:57:41.136471082-05:00" level=info msg="Processing signal 'terminated'"                                                                
Feb 04 03:59:41 openshift-master-0 systemd[1]: docker.service stop-sigterm timed out. Killing.                                                                                                                       
Feb 04 03:59:41 openshift-master-0 systemd[1]: docker.service: main process exited, code=killed, status=9/KILL                                                                                                       
Feb 04 03:59:41 openshift-master-0 systemd[1]: Unit docker.service entered failed state.                                                                                                                             
Feb 04 03:59:41 openshift-master-0 systemd[1]: docker.service failed.                                                                                                                                                
Feb 04 03:59:41 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...                                                                                                                       
Feb 04 03:59:42 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:42.494812650-05:00" level=info msg="libcontainerd: new containerd process, pid: 14499"                                             
Feb 04 03:59:42 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:42.515315761-05:00" level=error msg="containerd: notify OOM events" error="open /proc/21179/cgroup: no such file or directory"     
Feb 04 03:59:43 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:43.570918238-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"                                   
Feb 04 03:59:43 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:43.571784432-05:00" level=info msg="Loading containers: start."                                                                    
Feb 04 03:59:49 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:49.348632137-05:00" level=info msg="Processing signal 'terminated'"                                                                
Feb 04 04:01:49 openshift-master-0 systemd[1]: docker.service stop-sigterm timed out. Killing.                                                                                                                       
Feb 04 04:01:49 openshift-master-0 systemd[1]: docker.service: main process exited, code=killed, status=9/KILL                                                                                                       
Feb 04 04:01:49 openshift-master-0 systemd[1]: Unit docker.service entered failed state.                                                                                                                             
Feb 04 04:01:49 openshift-master-0 systemd[1]: docker.service failed.                                                                                                                                                
Feb 04 04:01:49 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...                                                                                                                       
Feb 04 04:01:50 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:50.202769479-05:00" level=info msg="libcontainerd: new containerd process, pid: 14681"                                             
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.267182649-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"                                   
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.268272443-05:00" level=info msg="Loading containers: start."                                                                    
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.271485461-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)."                                                                                                                                                        
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.271762013-05:00" level=warning msg="libcontainerd: failed to retrieve container 2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462 state: rpc error: code = 2 desc = containerd: container not found"                                                                                                                                  
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.282410560-05:00" level=warning msg="2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462 cleanup: failed to unmount secrets: invalid argument"                                                                                                                                                                                           
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.286743847-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278)."                                                                                                                                                        
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.286900077-05:00" level=warning msg="libcontainerd: failed to retrieve container 9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278 state: rpc error: code = 2 desc = containerd: container not found"                                                                                                                                  
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.300409683-05:00" level=warning msg="9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278 cleanup: failed to unmount secrets: invalid argument"                                                                                                                                                                                           
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.326403986-05:00" level=info msg="Firewalld running: false"                                                                      
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.414573477-05:00" level=info msg="Removing stale sandbox 4398493a2f2c78b6d76f811fec87decf3a838e177252a2b3e0bfb7f8bada9006 (9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278)"                                                                                                                                                         
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.546565500-05:00" level=info msg="Removing stale sandbox fad8d18437d6522d8bf7ae954044f56725b2e400f3d93efba287789efe7a090e (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)"                                                                                                                                                         
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.572361662-05:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.31.0.0/24. Daemon option --bip can be used to set a preferred IP address"                                                                                                                                                                               
Feb 04 04:01:51 openshift-master-0 oci-umount[14886]: umounthook <debug>: prestart container_id:9237ecfa8341 rootfs:/var/lib/docker/overlay2/cd2092479b3df841fe082c0b632fc0ff78d042445d657267f1a8ea5fdaad3dec/merged 
Feb 04 04:01:51 openshift-master-0 oci-umount[14887]: umounthook <debug>: prestart container_id:2559d51999b2 rootfs:/var/lib/docker/overlay2/423a7edd1e0fab34950a0562320acfe8b3f015557de5624f0d0008f396ecfd87/merged 
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: + sudo -E kolla_set_configs                                                                                                                               
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: + sudo -E kolla_set_configs

Comment 3 Daniel Walsh 2019-02-05 15:48:48 UTC
Antonio any idea?

Comment 6 Antonio Murdaca 2019-02-07 14:56:39 UTC
Can you all provide a core dump of both dockerd and containerd for inspection please?

Comment 10 Antonio Murdaca 2019-02-08 17:38:36 UTC
We've identified the broken commit which introduced this regression and reverted it https://github.com/projectatomic/containerd/commit/7989550b83317f799af20ab4df3a5b6487767fc9

Comment 12 Alan Pevec 2019-02-13 22:45:53 UTC
Should this be closed as a duplicate of bug 1673692 (or vice-versa) ?

Comment 15 Robb Manes 2019-02-14 22:46:20 UTC
*** Bug 1673692 has been marked as a duplicate of this bug. ***

Comment 18 Radomir Ludva 2019-02-18 08:56:13 UTC
*** Bug 1672327 has been marked as a duplicate of this bug. ***

Comment 26 Derrick Ornelas 2019-03-06 20:24:43 UTC
*** Bug 1678096 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2019-03-13 01:52:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0487

Comment 31 Daniel Walsh 2019-03-20 12:51:31 UTC
*** Bug 1690640 has been marked as a duplicate of this bug. ***

Comment 33 Lukas Bezdicka 2019-05-03 08:55:13 UTC
*** Bug 1687944 has been marked as a duplicate of this bug. ***

Comment 34 Jose Luis Franco 2019-06-12 13:00:53 UTC
*** Bug 1679858 has been marked as a duplicate of this bug. ***