Bug 1479035 - Installation using ansible fails during restart docker after updating ca trust
Summary: Installation using ansible fails during restart docker after updating ca trust
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 3.6.z
Assignee: Jhon Honce
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks: 1724792
 
Reported: 2017-08-07 19:13 UTC by mdunn
Modified: 2019-06-28 16:04 UTC
CC: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-08-23 10:38:36 UTC
Target Upstream Version:
Embargoed:



Description mdunn 2017-08-07 19:13:49 UTC
Description of problem:
During the advanced installation of OCP 3.5, all nodes in the cluster failed during the task "restart docker after updating ca trust". The output of systemctl status docker shows the docker service in a failed state, and every attempt so far to restart the service has failed.

Prior to running the ansible-playbook for the advanced installation of OpenShift, Docker was running on every node.

The following is the installation command used:
ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml
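(For reference, the same run with Ansible's verbose flag, the -vvv mentioned under Additional info below, would look like the following; this is an illustrative variant, not a command that was actually run:)

ansible-playbook -vvv -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/config.yml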

Version-Release number of the following components:
[root@dhcp-41-56 /]# oc version
oc v3.5.5.31
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp-41-56.bos.redhat.com:8443
openshift v3.5.5.31
kubernetes v1.5.2+43a9be4
[root@dhcp-41-56 /]# rpm -q openshift-ansible
openshift-ansible-3.5.101-1.git.0.0107544.el7.noarch
[root@dhcp-41-56 /]# rpm -q ansible
ansible-2.3.1.0-3.el7.noarch
[root@dhcp-41-56 /]# ansible --version
ansible 2.3.1.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]


How reproducible:
Reproduction has occurred in approximately 50% of cases so far. After initially hitting the error when using newly deployed VMs on one host, a repro attempt was made using newly deployed VMs on another host. That installation hit the same error as the first. Two subsequent installation attempts (each time on new VMs) have both met with success.

Steps to Reproduce:
1. Deploy new virtual machines
2. Follow installation guide here: https://access.redhat.com/documentation/en-us/openshift_container_platform/3.5/html-single/installation_and_configuration/
3. When running the advanced installation playbook for OpenShift, an error occurs while attempting to restart docker after updating ca trust.

Actual results:
RUNNING HANDLER [openshift_node_certificates : restart docker after updating ca trust]
fatal: [dhcp-41-56.bos.redhat.com]: FAILED! => {
"changed":false,
"failed": true
}
MSG:
Unable to restart service docker: Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
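
For reference, the diagnostics the message points at can be run directly on an affected node (the journalctl -u variant in the last line is an added suggestion to scope the journal to the docker unit):

systemctl status docker.service
journalctl -xe
journalctl -u docker.service --no-pager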

The same error was repeated for each of the nodes. The following is the output of journalctl -xe:
Aug 07 14:39:46 dhcp-41-56.bos.redhat.com systemd[1]: Starting Docker Storage Setup...
-- Subject: Unit docker-storage-setup.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker-storage-setup.service has begun starting up.
Aug 07 14:39:46 dhcp-41-56.bos.redhat.com systemd[1]: Starting SystemWide Container Registries...
-- Subject: Unit registries.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit registries.service has begun starting up.
Aug 07 14:39:46 dhcp-41-56.bos.redhat.com registries[32135]: Registry information should be stored in /etc/containers/registries.conf and not in /etc/sysconfig/docker
Aug 07 14:39:46 dhcp-41-56.bos.redhat.com systemd[1]: Started SystemWide Container Registries.
-- Subject: Unit registries.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit registries.service has finished starting up.
-- 
-- The start-up result is done.
Aug 07 14:39:46 dhcp-41-56.bos.redhat.com container-storage-setup[32134]: INFO: Device /dev/vdb is already partitioned and is part of volume group docker-vg
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com container-storage-setup[32134]: INFO: Found an already configured thin pool /dev/mapper/docker--vg-docker--pool in /etc/sysconfig/docker-storage
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com container-storage-setup[32134]: INFO: Device node /dev/mapper/docker--vg-docker--pool exists.
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com container-storage-setup[32134]: Logical volume docker-vg/docker-pool changed.
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com systemd[1]: Started Docker Storage Setup.
-- Subject: Unit docker-storage-setup.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker-storage-setup.service has finished starting up.
-- 
-- The start-up result is done.
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com systemd[1]: Starting Docker Application Container Engine...
-- Subject: Unit docker.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has begun starting up.
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com systemd[32214]: Failed at step MEMORY spawning /usr/bin/dockerd-current: Cannot allocate memory
-- Subject: Process /usr/bin/dockerd-current could not be executed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- The process /usr/bin/dockerd-current could not be executed and failed.
-- 
-- The error number returned by this process is 12.
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com systemd[1]: docker.service: main process exited, code=exited, status=204/MEMORY
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com systemd[1]: Failed to start Docker Application Container Engine.
-- Subject: Unit docker.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit docker.service has failed.
-- 
-- The result is failed.
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com systemd[1]: Unit docker.service entered failed state.
Aug 07 14:39:47 dhcp-41-56.bos.redhat.com systemd[1]: docker.service failed.

Each of the VMs has 32 GB of memory, 4 vCPUs, and a 75 GB filesystem containing /var.
After seeing the "Cannot allocate memory" error, I reviewed /proc/meminfo, which showed a significant amount of memory available at the time of the failure. For example:

MemTotal:       32781876 kB
MemFree:        30184860 kB
MemAvailable:   32011332 kB

Expected results:
The installation should complete successfully.

Additional info:
It has not yet been possible to obtain logs with the -vvv flag, since I have not been able to get the docker service to start again on the VMs that hit the failure. I have two separate clusters of different sizes that are still in this failed state, and I can provide connection and login info upon request.

Comment 1 Jhon Honce 2017-08-14 16:11:40 UTC
Are there any denied entries in the audit.log during this period?

Comment 3 Jhon Honce 2017-08-15 16:47:11 UTC
There was an error in the /etc/sysconfig/docker file:

OPTIONS=' "--insecure-registry 172.30.0.0/16 --log-driver=journald --signature-verification=false'

The error reported by systemd and dockerd was not clear about this.

I removed the stray double quote, and systemd was able to start dockerd without issue:

OPTIONS=' --insecure-registry 172.30.0.0/16 --log-driver=journald --signature-verification=false'
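
A minimal sketch (not from the original report) of a check that would catch this class of mistake, assuming /etc/sysconfig/docker stays a plain shell-sourceable file: source it in a subshell and warn if a literal quote character is left inside the OPTIONS value.

#!/bin/bash
# Sketch only: source the sysconfig file in a subshell so nothing leaks
# into the calling shell, then flag any literal quote left in OPTIONS,
# as in the broken line quoted in this comment.
(
  source /etc/sysconfig/docker
  case "$OPTIONS" in
    *\"*|*\'*) echo "WARNING: stray quote in OPTIONS: $OPTIONS" >&2; exit 1 ;;
    *)         echo "OPTIONS looks clean: $OPTIONS" ;;
  esac
)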

Comment 5 Johnny Liu 2017-08-23 08:58:16 UTC
Based on comment 3, no code change is involved; this looks like a user mistake, so the verification work is left to the reporter.

@mdunn, if you can confirm that comment 3 is the root cause, please close this bug as NOTABUG yourself.

Comment 6 mdunn 2017-08-23 10:38:36 UTC
Comment 3 is confirmed as the resolution to this problem. Closing as NOTABUG.

