Bug 1573307

Summary: FFU: ceph upgrade fails because Docker service is not running on the Ceph OSD nodes
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: rhosp-director
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED NOTABUG
QA Contact: Amit Ugol <augol>
Severity: urgent
Priority: unspecified
Version: 13.0 (Queens)
CC: dbecker, lbezdick, mburns, morazi
Target Milestone: beta
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-05-02 15:24:40 UTC
Type: Bug

Description Marius Cornea 2018-04-30 19:58:48 UTC
Description of problem:

FFU: the Ceph upgrade fails because the Docker service is not running on the Ceph OSD nodes. Snippet from /var/log/mistral/ceph-install-workflow.log:

[...]
2018-04-30 15:53:22,353 p=11902 u=mistral |  task path: /usr/share/ceph-ansible/roles/ceph-docker-common/tasks/fetch_image.yml:179
2018-04-30 15:53:22,353 p=11902 u=mistral |  Monday 30 April 2018  15:53:22 -0400 (0:00:00.036)       0:06:32.099 ********** 
2018-04-30 15:53:22,788 p=11902 u=mistral |  FAILED - RETRYING: pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image (3 retries left).
2018-04-30 15:53:33,033 p=11902 u=mistral |  FAILED - RETRYING: pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image (2 retries left).
2018-04-30 15:53:43,266 p=11902 u=mistral |  FAILED - RETRYING: pulling registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest image (1 retries left).
2018-04-30 15:53:53,508 p=11902 u=mistral |  fatal: [192.168.24.10]: FAILED! => {"attempts": 3, "changed": false, "cmd": ["timeout", "300s", "docker", "pull", "registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest"], "delta": "0:00:00.025122", "end": "2018-04-30 19:53:52.221277", "msg": "non-zero return code", "rc": 1, "start": "2018-04-30 19:53:52.196155", "stderr": "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?", "stderr_lines": ["Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"], "stdout": "", "stdout_lines": []}
2018-04-30 15:53:53,510 p=11902 u=mistral |  PLAY RECAP *********************************************************************
2018-04-30 15:53:53,511 p=11902 u=mistral |  192.168.24.10              : ok=42   changed=4    unreachable=0    failed=1   
2018-04-30 15:53:53,511 p=11902 u=mistral |  192.168.24.12              : ok=121  changed=26   unreachable=0    failed=0   
2018-04-30 15:53:53,511 p=11902 u=mistral |  192.168.24.13              : ok=111  changed=21   unreachable=0    failed=0   
2018-04-30 15:53:53,511 p=11902 u=mistral |  192.168.24.18              : ok=2    changed=0    unreachable=0    failed=0   
2018-04-30 15:53:53,511 p=11902 u=mistral |  192.168.24.19              : ok=110  changed=22   unreachable=0    failed=0   
2018-04-30 15:53:53,511 p=11902 u=mistral |  192.168.24.23              : ok=2    changed=0    unreachable=0    failed=0   
2018-04-30 15:53:53,511 p=11902 u=mistral |  localhost                  : ok=0    changed=0    unreachable=0    failed=0   
2018-04-30 15:53:53,512 p=11902 u=mistral |  Monday 30 April 2018  15:53:53 -0400 (0:00:31.158)       0:07:03.257 ********** 
2018-04-30 15:53:53,512 p=11902 u=mistral |  =============================================================================== 
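
To find the affected node quickly in a long workflow log, the PLAY RECAP can be filtered for hosts with a non-zero failed count. The following is a minimal sketch, not part of the original report; the function name failed_hosts is hypothetical, and it assumes the log keeps the "host : ok=N ... failed=N" recap layout shown above.

```shell
#!/bin/sh
# Print every host whose PLAY RECAP line reports failed > 0.
# Reads an Ansible log (such as ceph-install-workflow.log) on stdin.
# failed_hosts is a hypothetical helper name, not an existing tool.
failed_hosts() {
    awk '
        # Only lines after the PLAY RECAP banner are host summaries.
        /PLAY RECAP/ { in_recap = 1; next }
        in_recap && match($0, /failed=[0-9]+/) {
            # Skip the 7 characters of "failed=" to get the number.
            count = substr($0, RSTART + 7, RLENGTH - 7)
            if (count + 0 > 0) {
                # Drop the "timestamp p=... u=... |" prefix if present,
                # then take the first whitespace-separated field (the host).
                n = split($0, parts, "|")
                split(parts[n], fields, " ")
                print fields[1]
            }
        }
    '
}
```

It could be used as `failed_hosts < /var/log/mistral/ceph-install-workflow.log`. Reading from stdin keeps the helper independent of where the log is stored.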


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-8.0.2-4.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. openstack overcloud ffwd-upgrade prepare 
2. openstack overcloud ffwd-upgrade run
3. openstack overcloud upgrade run --roles Controller --skip-tags validation
4. openstack overcloud upgrade run --roles Compute --skip-tags validation
5. openstack overcloud ffwd-upgrade converge
6. openstack overcloud ceph-upgrade run \
--timeout 100 \
--templates /usr/share/openstack-tripleo-heat-templates \
--stack overcloud \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /home/stack/virt/hostnames.yml \
-e /home/stack/virt/debug.yaml \
-e /home/stack/ffu_repos.yaml \
-e /home/stack/cli_opts_params.yaml \
-e /home/stack/ceph-ansible-env.yaml \
--ceph-ansible-playbook '/usr/share/ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml,/usr/share/ceph-ansible/infrastructure-playbooks/rolling_update.yml'

Actual results:
The switch-from-non-containerized-to-containerized-ceph-daemons.yml playbook fails because the Docker service is not running on the Ceph OSD nodes.

Expected results:
The Ceph upgrade playbooks finish without errors.

Additional info:

Comment 1 Lukas Bezdicka 2018-05-02 15:24:40 UTC
The upgrade step (openstack overcloud upgrade run) has to run on all nodes, including the Ceph nodes. The steps to reproduce ran it only for the Controller and Compute roles.
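
The missing step could be scripted roughly as follows, so that no role is skipped before the ceph-upgrade step. This is a sketch under assumptions: the helper upgrade_roles and the DRY_RUN variable are hypothetical, and CephStorage is only the usual default role name for Ceph OSD nodes, so substitute the role names from your own deployment. The openstack command line is the one already used in steps 3 and 4.

```shell
#!/bin/sh
# Run the per-role upgrade step for every role passed as an argument.
# upgrade_roles and DRY_RUN are hypothetical names for this sketch.
# With DRY_RUN=1 the commands are printed instead of executed.
upgrade_roles() {
    for role in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "openstack overcloud upgrade run --roles $role --skip-tags validation"
        else
            # Stop at the first failing role instead of continuing.
            openstack overcloud upgrade run --roles "$role" --skip-tags validation || return 1
        fi
    done
}

# Example (role names are assumptions; CephStorage is the usual default):
# DRY_RUN=1 upgrade_roles Controller Compute CephStorage
```

Printing the commands first makes it easy to confirm that the Ceph role is in the list before anything is run against the overcloud.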