Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1691449

Summary: Director deployed OCP 3.11 deployment fails with openshift-ansible getting stuck when restarting docker service on master nodes
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: rhosp-director
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED WONTFIX
QA Contact: Sasha Smolyak <ssmolyak>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 14.0 (Rocky)
CC: dbecker, eduen, m.andre, mburns, morazi, slinaber
Target Milestone: z2
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
There is currently a known issue where director can hang while deploying OCP. This occurs because the fix described in https://bugzilla.redhat.com/show_bug.cgi?id=1671861 is not part of the `overcloud-full` image for the Red Hat OpenStack Platform 14 z1 release. As a workaround, before deploying the overcloud, run the following commands to update the docker package in the `overcloud-full` image:

$ sudo yum install -y libguestfs-tools
$ virt-customize --selinux-relabel -a overcloud-full.qcow2 --install docker
$ source stackrc
$ openstack overcloud image upload --update-existing

For more information on this procedure, see https://access.redhat.com/articles/1556833. After completing these steps, you can expect the director to successfully deploy OCP.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-10-18 16:48:15 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Marius Cornea 2019-03-21 15:50:42 UTC
This bug was initially created as a copy of Bug #1671861

I am copying this bug because: 

The issue is still present when deploying OCP via Director because the images provided by rhosp-director-images-14.0-20190304.2.el7ost.noarch contain the broken docker version:

[root@openshift-master-0 heat-admin]# rpm -q docker
docker-1.13.1-91.git07f3374.el7.x86_64

The deployment gets stuck even though an updated docker package is available in the rhel-7-server-extras-rpms repo:

[root@openshift-master-0 heat-admin]# yum check-updates docker
Loaded plugins: product-id, search-disabled-repos, subscription-manager

docker.x86_64                                                                                       2:1.13.1-94.gitb2f74b2.el7                                                                                       rhel-7-server-extras-rpms
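
A quick way to compare a node's docker build against the builds named in this report is sketched below. The helper name `classify_docker_build` and its labels are hypothetical. The mapping only encodes what this bug states (release 88 was seen working, releases 90 and 91 were seen stuck), and every other build, including the -94 update, is reported as unknown rather than guessed.

```shell
#!/bin/sh
# classify_docker_build: hypothetical helper, not part of any Red Hat tool.
# Input:  a package string as printed by `rpm -q docker`.
# Output: a label based only on the builds mentioned in this bug report.
classify_docker_build() {
    case "$1" in
        docker-1.13.1-*) ;;                # expected name-version prefix
        *) echo "unknown"; return 0 ;;
    esac
    release=${1#docker-1.13.1-}            # e.g. "91.git07f3374.el7.x86_64"
    release=${release%%.*}                 # e.g. "91"
    case "$release" in
        88)    echo "reported-working" ;;  # "Working version" in this report
        90|91) echo "reported-stuck" ;;    # builds seen hanging in this report
        *)     echo "unknown" ;;           # no statement in this report
    esac
}
```

On a node this could be called as `classify_docker_build "$(rpm -q docker)"`.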


Description of problem:

An OCP 3.11 deployment via Director fails because openshift-ansible gets stuck when restarting docker on the master nodes.

Snippet from /var/lib/mistral/openshift/openshift/playbook.log:

TASK [container_runtime : Fix SELinux Permissions on /var/lib/containers] ******
ok: [openshift-infra-2]
ok: [openshift-infra-1]
ok: [openshift-infra-0]
ok: [openshift-master-2]
ok: [openshift-master-0]
ok: [openshift-master-1]
ok: [openshift-worker-2]
ok: [openshift-worker-0]
ok: [openshift-worker-1]

RUNNING HANDLER [container_runtime : restart container runtime] ****************
changed: [openshift-infra-2]
changed: [openshift-infra-1]
changed: [openshift-worker-2]
changed: [openshift-infra-0]
changed: [openshift-worker-0]
changed: [openshift-worker-1]
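
The hosts that never returned from the handler can also be listed mechanically by comparing the two sections of the log above. The following is a minimal sketch, assuming the log keeps the `TASK [...]` / `RUNNING HANDLER [...]` headers and `status: [host]` result lines shown in the snippet; `hosts_missing_from_handler` is a hypothetical helper name.

```shell
#!/bin/sh
# hosts_missing_from_handler: hypothetical helper.
# Reads an openshift-ansible log on stdin and prints, sorted, every host that
# reported a result for task "$1" but no result for handler "$2".
hosts_missing_from_handler() {
    awk -v task="$1" -v handler="$2" '
        # A new section header resets the current section.
        index($0, "TASK [") == 1 || index($0, "RUNNING HANDLER [") == 1 {
            section = ""
            if (index($0, "TASK [" task "]") == 1) section = "task"
            if (index($0, "RUNNING HANDLER [" handler "]") == 1) section = "handler"
            next
        }
        # Result lines look like "ok: [host]" or "changed: [host]".
        section != "" && /^[a-z]+: \[/ {
            lb = index($0, "[")
            rb = index($0, "]")
            host = substr($0, lb + 1, rb - lb - 1)
            if (section == "task") seen[host] = 1
            else handled[host] = 1
        }
        END {
            for (h in seen) if (!(h in handled)) print h
        }
    ' | sort
}
```

For the snippet above, it would be fed `/var/lib/mistral/openshift/openshift/playbook.log` on stdin with the task name `container_runtime : Fix SELinux Permissions on /var/lib/containers` and the handler name `container_runtime : restart container runtime`.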

The handler completed on the non-master nodes, but none of the master nodes reported back. Checking the docker processes on one of the master nodes shows:

[root@openshift-master-0 heat-admin]# ps axu | grep docker
root     17174  0.0  0.0 136600 12472 ?        Ssl  16:24   0:00 /usr/libexec/docker/rhel-push-plugin
root     27861  0.0  0.0 134820  1440 ?        S    16:38   0:00 /bin/systemctl restart docker
root     27899  0.0  0.1 677516 24084 ?        Ssl  16:38   0:02 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy-path=/usr/libexec/docker/docker-proxy-current --init-path=/usr/libexec/docker/docker-init-current --seccomp-profile=/etc/docker/seccomp.json --selinux-enabled --signature-verification=False -s overlay2 --mtu=1450 --add-registry registry.redhat.io --insecure-registry 192.168.24.1:8787 --add-registry registry.access.redhat.com --add-registry docker.io --add-registry registry.fedoraproject.org --add-registry quay.io --add-registry registry.centos.org
root     27908  0.0  0.0 394508 13568 ?        Ssl  16:38   0:01 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc --runtime-args --systemd-cgroup=true
root     27997  0.0  0.0 115300  1464 ?        Ss   16:59   0:00 /usr/bin/sh -c DEAD=`docker ps -aq -f status=dead` && [ -n "$DEAD" ] && docker rm $DEAD; exit 0
root     28001  0.0  0.0 115300   652 ?        S    16:59   0:00 /usr/bin/sh -c DEAD=`docker ps -aq -f status=dead` && [ -n "$DEAD" ] && docker rm $DEAD; exit 0
root     28002  0.0  0.0 166116  9248 ?        Sl   16:59   0:00 /usr/bin/docker-current ps -aq -f status=dead


The docker service is stuck in the activating state:

[root@openshift-master-0 heat-admin]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─99-unset-mountflags.conf, custom.conf
   Active: activating (start) since Fri 2019-02-01 16:38:48 EST; 36min ago
     Docs: http://docs.docker.com
 Main PID: 27899 (dockerd-current)
    Tasks: 27
   Memory: 879.7M
   CGroup: /system.slice/docker.service
           ├─27899 /usr/bin/dockerd-current --add-runtime docker-runc=/usr/libexec/docker/docker-runc-current --default-runtime=docker-runc --authorization-plugin=rhel-push-plugin --exec-opt native.cgroupdriver=systemd --userland-proxy...
           └─27908 /usr/bin/docker-containerd-current -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-...

Feb 01 16:38:48 openshift-master-0 systemd[1]: Starting Docker Application Container Engine...
Feb 01 16:38:49 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:49.173839784-05:00" level=info msg="libcontainerd: new containerd process, pid: 27908"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.284350795-05:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.285523377-05:00" level=info msg="Loading containers: start."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.303415591-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (9237ecfa83412...6b2943e34278)."
Feb 01 16:38:50 openshift-master-0 dockerd-current[27899]: time="2019-02-01T16:38:50.317157074-05:00" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced container (2559d51999b2f...1ba97f1c1462)."
Hint: Some lines were ellipsized, use -l to show in full.
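
The stuck state shown above can be detected from a script by polling `systemctl is-active docker`, which prints the unit's active state (for example `active`, `activating` or `failed`). This is only a sketch: `docker_start_state` is a hypothetical helper, and the default 120-second timeout is an assumed threshold, not a value taken from this report.

```shell
#!/bin/sh
# docker_start_state: hypothetical helper.
# Polls the docker unit and prints its state as soon as it is no longer
# "activating". Prints "stuck" and returns 1 if the unit is still activating
# once the timeout has elapsed.
#   $1: timeout in seconds (assumed default: 120)
#   $2: polling interval in seconds (assumed default: 5)
docker_start_state() {
    timeout_s=${1:-120}
    interval_s=${2:-5}
    elapsed=0
    while :; do
        state=$(systemctl is-active docker 2>/dev/null)
        if [ "$state" != "activating" ]; then
            echo "$state"
            return 0
        fi
        if [ "$elapsed" -ge "$timeout_s" ]; then
            echo "stuck"
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
}
```

On the master node in this report, where the unit had been activating for 36 minutes, such a check would be expected to end in `stuck`.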


Version-Release number of selected component (if applicable):
docker-1.13.1-90.git07f3374.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 3.11 via OpenStack Director

Actual results:
Deployment cannot complete because docker gets stuck.

Expected results:
Deployment doesn't get stuck.

Additional info:

I can see in the job history that the deployment passed with a lower version of docker, so this could be a regression introduced by the newer docker package.

Working version:
docker-1.13.1-88.git07f3374.el7.x86_64

Comment 2 Martin André 2019-04-01 17:18:03 UTC
It is possible to work around the issue by installing the updated docker package in the overcloud-full image. Run the following from the undercloud before deploying the overcloud:

$ sudo yum install -y libguestfs-tools
$ virt-customize --selinux-relabel -a overcloud-full.qcow2 --install docker
$ source stackrc
$ openstack overcloud image upload --update-existing
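
The four commands above could be wrapped in a small script that stops at the first failure. This is a sketch under assumptions: `update_overcloud_docker`, `run` and the `DRY_RUN` variable are hypothetical names, the image and `stackrc` are assumed to be in the current directory, and the image registration step described in https://access.redhat.com/articles/1556833 is not covered here.

```shell
#!/bin/sh
# run: hypothetical helper. Prints the command when DRY_RUN=1, otherwise runs it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# update_overcloud_docker: hypothetical wrapper around the workaround commands.
#   $1: path to the image (assumed default: overcloud-full.qcow2 in the cwd)
update_overcloud_docker() {
    image=${1:-overcloud-full.qcow2}
    if [ ! -f "$image" ]; then
        echo "image not found: $image" >&2
        return 1
    fi
    run sudo yum install -y libguestfs-tools || return 1
    run virt-customize --selinux-relabel -a "$image" --install docker || return 1
    # stackrc must be sourced in this shell so the openstack client sees the
    # undercloud credentials; its location in the cwd is an assumption.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ . ./stackrc"
    else
        . ./stackrc || return 1
    fi
    run openstack overcloud image upload --update-existing
}
```

Setting `DRY_RUN=1` before calling `update_overcloud_docker` prints the commands without changing the image, which is a cheap way to review them first.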

Comment 3 Martin André 2019-04-01 17:27:14 UTC
It will also be necessary to register the image. See the whole process at https://access.redhat.com/articles/1556833.