RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1671861 - Director deployed OCP 3.11 deployment fails with openshift-ansible getting stuck when restarting docker service on master nodes
Summary: Director deployed OCP 3.11 deployment fails with openshift-ansible getting st...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker
Version: 7.6
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: rc
: ---
Assignee: Valentin Rothberg
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
: 1672327 1673692 1678096 1679858 1687944 (view as bug list)
Depends On:
Blocks: 1186913 1647729 1661804 1663922
TreeView+ depends on / blocked
 
Reported: 2019-02-01 22:19 UTC by Marius Cornea
Modified: 2023-10-06 18:07 UTC (History)
47 users (show)

Fixed In Version: docker-1.13.1-93.gitb2f74b2.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-13 01:52:23 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3895221 0 Troubleshoot None Docker hangs after daemon using live-restore is restarted and containers are deleted 2019-02-14 22:46:20 UTC
Red Hat Product Errata RHSA-2019:0487 0 None None None 2019-03-13 01:52:49 UTC

Internal Links: 1673692

Description Marius Cornea 2019-02-01 22:19:01 UTC
Description of problem:

Director deployed OCP 3.11 deployment fails with openshift-ansible getting stuck when restarting docker on master nodes.

Snippet from /var/lib/mistral/openshift/openshift/playbook.log:

TASK [container_runtime : Fix SELinux Permissions on /var/lib/containers] ******
ok: [openshift-infra-2]
ok: [openshift-infra-1]
ok: [openshift-infra-0]
ok: [openshift-master-2]
ok: [openshift-master-0]
ok: [openshift-master-1]
ok: [openshift-worker-2]
ok: [openshift-worker-0]
ok: [openshift-worker-1]

RUNNING HANDLER [container_runtime : restart container runtime] ****************
changed: [openshift-infra-2]
changed: [openshift-infra-1]
changed: [openshift-worker-2]
changed: [openshift-infra-0]
changed: [openshift-worker-0]
changed: [openshift-worker-1]

We can see that the task ran fine on non-master nodes. Checking the docker processes on one of the master node we can see:

[root@openshift-master-0 heat-admin]# ps axu | grep docker
root     17174  0.0  0.0 136600 12472 ?        Ssl  16:24   0:00 /usr/libexec/docker/rhel-push-plugin
root     27861  0.0  0.0 134820  1440 ?        S    16:38   0:00 /bin/systemctl restart docker
root     27899  0.0  0.1 677516 24084 ?        Ssl  16:38   0:02 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json --selinux-enabled --signature-verification=False -s overlay2 --mtu=1450 --add-registry registry.redhat.io --insecure-registry 192.168.24.1:8787 --add-registry registry.access.redhat.com --add-registry docker.io --add-registry registry.fedoraproject.org --add-registry quay.io --add-registry registry.centos.org
root     27908  0.0  0.0 394508 13568 ?        Ssl  16:38   0:01 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc --runtime-args --systemd-cgroup=true
root     27997  0.0  0.0 115300  1464 ?        Ss   16:59   0:00 /usr/bin/sh -c DEAD=`docker ps -aq -f status=dead` && [ -n "$DEAD" ] && docker rm $DEAD; exit 0
root     28001  0.0  0.0 115300   652 ?        S    16:59   0:00 /usr/bin/sh -c DEAD=`docker ps -aq -f status=dead` && [ -n "$DEAD" ] && docker rm $DEAD; exit 0
root     28002  0.0  0.0 166116  9248 ?        Sl   16:59   0:00 /usr/bin/docker-current ps -aq -f status=dead


The docker service is stuck in activating state:

[root@openshift-master-0 heat-admin]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─99-unset-mountflags.conf, custom.conf
   Active: activating (start) since Fri 2019-02-01 16:38:48 EST; 36min ago
     Docs: http://docs.docker.com
 Main PID: 27899 (dockerd-current)
    Tasks: 27
   Memory: 879.7M
   CGroup: /system.slice/docker.service
           ├─27899 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy...
           └─27908 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-...

Feb 01 16:38:48 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...
Feb 01 16:38:49 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:49.173839784-05:00" level=info msg="libcontainerd: new containerd process, pid: 27908"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.284350795-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.285523377-05:00" level=info msg="Loading containers: start."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.303415591-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412...6b2943e34278)."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f...1ba97f1c1462)."
Hint: Some lines were ellipsized, use -l to show in full.


Version-Release number of selected component (if applicable):
docker-1.13.1-90.git07f3374.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 3.11 via OpenStack Director

Actual results:
Deployment cannot complete because docker gets stuck.

Expected results:
Deployment doesn't get stuck.

Additional info:

I can see in the job history that deployment passed with a lower version of docker so this could potentially be a regression introduced by the newer docker package.

Working version:
docker-1.13.1-88.git07f3374.el7.x86_64

Comment 1 Martin André 2019-02-04 08:42:14 UTC
The infra and worker nodes do not have any containers running at this step, this could be the reason why they had no issue when restarting docker.

The full logs from the master node provide more context:

Feb 01 16:30:58 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:30:58.875673621-05:00" level=warning msg="d8e8ffa50cb6850aa67e7c3ca064d33ad84f397a53814d0566013f6bc0a52965 cleanup: failed to unmount secrets: invalid argument"
Feb 01 16:38:47 openshift-master-0 systemd[1]: Stopping Docker Application Container Engine...
Feb 01 16:38:47 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:38:47.679732066-05:00" level=info msg="Processing signal 'terminated'"
Feb 01 16:38:47 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:38:47.682139079-05:00" level=info msg="stopping containerd after receiving terminated"
Feb 01 16:38:47 openshift-master-0 dockerd-current[17126]: time="2019-02-01T16:38:47.68246551-05:00" level=fatal msg="containerd: serve grpc" error="accept unix /var/run/docker/libcontainerd/docker-containerd.sock: use of closed network connection"
Feb 01 16:38:48 openshift-master-0 systemd[1]: Stopped Docker Application Container Engine.
Feb 01 16:38:48 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...
Feb 01 16:38:49 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:49.173839784-05:00" level=info msg="libcontainerd: new containerd process, pid: 27908"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.284350795-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.285523377-05:00" level=info msg="Loading containers: start."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.303415591-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278)."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)."

Seems like an issue in containerd.

Comment 2 Martin André 2019-02-04 09:10:45 UTC
Additional note, ran "sudo systemctl stop docker" twice and docker finally came up after some time:

Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)."
Feb 04 03:57:41 openshift-master-0 dockerd-current[27899]: time="2019-02-04T03:57:41.136471082-05:00" level=info msg="Processing signal 'terminated'"                                                                
Feb 04 03:59:41 openshift-master-0 systemd[1]: docker.service stop-sigterm timed out. Killing.                                                                                                                       
Feb 04 03:59:41 openshift-master-0 systemd[1]: docker.service: main process exited, code=killed, status=9/KILL                                                                                                       
Feb 04 03:59:41 openshift-master-0 systemd[1]: Unit docker.service entered failed state.                                                                                                                             
Feb 04 03:59:41 openshift-master-0 systemd[1]: docker.service failed.                                                                                                                                                
Feb 04 03:59:41 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...                                                                                                                       
Feb 04 03:59:42 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:42.494812650-05:00" level=info msg="libcontainerd: new containerd process, pid: 14499"                                             
Feb 04 03:59:42 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:42.515315761-05:00" level=error msg="containerd: notify OOM events" error="open /proc/21179/cgroup: no such file or directory"     
Feb 04 03:59:43 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:43.570918238-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"                                   
Feb 04 03:59:43 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:43.571784432-05:00" level=info msg="Loading containers: start."                                                                    
Feb 04 03:59:49 openshift-master-0 dockerd-current[14489]: time="2019-02-04T03:59:49.348632137-05:00" level=info msg="Processing signal 'terminated'"                                                                
Feb 04 04:01:49 openshift-master-0 systemd[1]: docker.service stop-sigterm timed out. Killing.                                                                                                                       
Feb 04 04:01:49 openshift-master-0 systemd[1]: docker.service: main process exited, code=killed, status=9/KILL                                                                                                       
Feb 04 04:01:49 openshift-master-0 systemd[1]: Unit docker.service entered failed state.                                                                                                                             
Feb 04 04:01:49 openshift-master-0 systemd[1]: docker.service failed.                                                                                                                                                
Feb 04 04:01:49 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...                                                                                                                       
Feb 04 04:01:50 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:50.202769479-05:00" level=info msg="libcontainerd: new containerd process, pid: 14681"                                             
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.267182649-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"                                   
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.268272443-05:00" level=info msg="Loading containers: start."                                                                    
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.271485461-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)."                                                                                                                                                        
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.271762013-05:00" level=warning msg="libcontainerd: failed to retrieve container 2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462 state: rpc error: code = 2 desc = containerd: container not found"                                                                                                                                  
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.282410560-05:00" level=warning msg="2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462 cleanup: failed to unmount secrets: invalid argument"                                                                                                                                                                                           
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.286743847-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278)."                                                                                                                                                        
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.286900077-05:00" level=warning msg="libcontainerd: failed to retrieve container 9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278 state: rpc error: code = 2 desc = containerd: container not found"                                                                                                                                  
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.300409683-05:00" level=warning msg="9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278 cleanup: failed to unmount secrets: invalid argument"                                                                                                                                                                                           
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.326403986-05:00" level=info msg="Firewalld running: false"                                                                      
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.414573477-05:00" level=info msg="Removing stale sandbox 4398493a2f2c78b6d76f811fec87decf3a838e177252a2b3e0bfb7f8bada9006 (9237ecfa83412005b786c07f51911478fe6786bca720b978ee8a6b2943e34278)"                                                                                                                                                         
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.546565500-05:00" level=info msg="Removing stale sandbox fad8d18437d6522d8bf7ae954044f56725b2e400f3d93efba287789efe7a090e (2559d51999b2f905915807ee0501a81df04f028ebecab057db1a1ba97f1c1462)"                                                                                                                                                         
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: time="2019-02-04T04:01:51.572361662-05:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.31.0.0/24. Daemon option --bip can be used to set a preferred IP address"                                                                                                                                                                               
Feb 04 04:01:51 openshift-master-0 oci-umount[14886]: umounthook <debug>: prestart container_id:9237ecfa8341 rootfs:/var/lib/docker/overlay2/cd2092479b3df841fe082c0b632fc0ff78d042445d657267f1a8ea5fdaad3dec/merged 
Feb 04 04:01:51 openshift-master-0 oci-umount[14887]: umounthook <debug>: prestart container_id:2559d51999b2 rootfs:/var/lib/docker/overlay2/423a7edd1e0fab34950a0562320acfe8b3f015557de5624f0d0008f396ecfd87/merged 
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: + sudo -E kolla_set_configs                                                                                                                               
Feb 04 04:01:51 openshift-master-0 dockerd-current[14671]: + sudo -E kolla_set_configs

Comment 3 Daniel Walsh 2019-02-05 15:48:48 UTC
Antonio any idea?

Comment 6 Antonio Murdaca 2019-02-07 14:56:39 UTC
Can you all provide a core dump of both dockerd and containerd for inspection please?

Comment 10 Antonio Murdaca 2019-02-08 17:38:36 UTC
We've identified the broken commit which introduced this regression and reverted it https://github.com/projectatomic/containerd/commit/7989550b83317f799af20ab4df3a5b6487767fc9

Comment 12 Alan Pevec 2019-02-13 22:45:53 UTC
Should this be closed as a duplicate of bug 1673692 (or vice-versa) ?

Comment 15 Robb Manes 2019-02-14 22:46:20 UTC
*** Bug 1673692 has been marked as a duplicate of this bug. ***

Comment 18 Radomir Ludva 2019-02-18 08:56:13 UTC
*** Bug 1672327 has been marked as a duplicate of this bug. ***

Comment 26 Derrick Ornelas 2019-03-06 20:24:43 UTC
*** Bug 1678096 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2019-03-13 01:52:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0487

Comment 31 Daniel Walsh 2019-03-20 12:51:31 UTC
*** Bug 1690640 has been marked as a duplicate of this bug. ***

Comment 33 Lukas Bezdicka 2019-05-03 08:55:13 UTC
*** Bug 1687944 has been marked as a duplicate of this bug. ***

Comment 34 Jose Luis Franco 2019-06-12 13:00:53 UTC
*** Bug 1679858 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.