Description of problem: Before running the overcloud upgrade prepare step we need to set ContainerCeph3DaemonImage to trigger the ceph-ansible docker-to-podman playbook using an RHCS 3 container image. On converge, however, we should unset it, or the condition at [1] will also trigger rolling_update with the RHCS 3 image.

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L340
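For the prepare step, such an environment file might look like the following sketch. The file name, registry host, and image path here are illustrative assumptions rather than values taken from a real deployment (the tag 3-40 matches the image pulled in the log below):

```shell
# Hypothetical prepare-time environment file; ceph3_image.yaml and the
# registry host are placeholders. Pointing ContainerCeph3DaemonImage at
# an RHCS 3 image is what triggers the docker-to-podman playbook during
# the prepare run.
cat > ceph3_image.yaml <<'EOF'
parameter_defaults:
  ContainerCeph3DaemonImage: undercloud.ctlplane.example.com:8787/rhceph/rhceph-3-rhel7:3-40
EOF
```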
This issue presents after the converge step: when you run `openstack overcloud external-upgrade run --stack $STACK --tags ceph`, it fails with the following ceph-ansible error:

2020-06-15 11:08:47,273 p=264551 u=root n=ansible | TASK [container | disallow pre-nautilus OSDs and enable all new nautilus-only functionality] ***
2020-06-15 11:08:47,274 p=264551 u=root n=ansible | Monday 15 June 2020  11:08:47 -0400 (0:00:00.485)       0:20:09.714 ***********
2020-06-15 11:08:49,215 p=264551 u=root n=ansible | fatal: [osp-test-octopi-zorillas-controller-0 -> 10.10.0.116]: FAILED! => changed=true
  cmd:
  - podman
  - exec
  - ceph-mon-osp-test-octopi-zorillas-controller-0
  - ceph
  - osd
  - require-osd-release
  - nautilus
  delta: '0:00:01.537954'
  end: '2020-06-15 15:08:49.184682'
  msg: non-zero return code
  rc: 22
  start: '2020-06-15 15:08:47.646728'
  stderr: |-
    Invalid command: nautilus not in luminous
    osd require-osd-release luminous {--yes-i-really-mean-it} : set the minimum allowed OSD release to participate in the cluster
    Error EINVAL: invalid command
    Error: non zero exit code: 22: OCI runtime error
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
2020-06-15 11:08:49,216 p=264551 u=root n=ansible | NO MORE HOSTS LEFT *************************************************************
2020-06-15 11:08:49,219 p=264551 u=root n=ansible | PLAY RECAP *********************************************************************
2020-06-15 11:08:49,219 p=264551 u=root n=ansible | localhost : ok=1 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
2020-06-15 11:08:49,219 p=264551 u=root n=ansible | osp-test-octopi-zorillas-cephstorage-0 : ok=160 changed=14 unreachable=0 failed=0 skipped=251 rescued=0 ignored=0
2020-06-15 11:08:49,219 p=264551 u=root n=ansible | osp-test-octopi-zorillas-cephstorage-1 : ok=160 changed=14 unreachable=0 failed=0 skipped=251 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-cephstorage-2 : ok=161 changed=13 unreachable=0 failed=0 skipped=250 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-controller-0 : ok=420 changed=47 unreachable=0 failed=1 skipped=612 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-controller-1 : ok=297 changed=29 unreachable=0 failed=0 skipped=494 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-controller-2 : ok=293 changed=27 unreachable=0 failed=0 skipped=484 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-novacompute-0 : ok=114 changed=8 unreachable=0 failed=0 skipped=236 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | osp-test-octopi-zorillas-novacompute-1 : ok=111 changed=7 unreachable=0 failed=0 skipped=225 rescued=0 ignored=0
2020-06-15 11:08:49,220 p=264551 u=root n=ansible | Monday 15 June 2020  11:08:49 -0400 (0:00:01.946)       0:20:11.661 ***********
2020-06-15 11:08:49,221 p=264551 u=root n=ansible | ===============================================================================
2020-06-15 11:08:49,225 p=264551 u=root n=ansible | waiting for clean pgs... ----------------------------------------------- 36.07s
2020-06-15 11:08:49,225 p=264551 u=root n=ansible | gather and delegate facts ---------------------------------------------- 28.58s
2020-06-15 11:08:49,225 p=264551 u=root n=ansible | stop standby ceph mds -------------------------------------------------- 26.17s
2020-06-15 11:08:49,225 p=264551 u=root n=ansible | ceph-container-common : pulling osp-test-octopi-zorillas-undercloud.ctlplane.hextupleo.lab:8787/rhceph/rhceph-3-rhel7:3-40 image -- 17.95s
As per comment #2, the rolling_update playbook was using RHCS 3 containers rather than RHCS 4 containers, which is why the following task fails: https://github.com/ceph/ceph-ansible/blob/v4.0.23/infrastructure-playbooks/rolling_update.yml#L945 So we need a way in THT for the person doing the upgrade to specify that Ceph 4 containers should be used.
HOW TO AVOID THIS ISSUE

1. Before running the converge step, create a file called no_ceph3.yaml (or something similar) containing the following:

   parameter_defaults:
     ContainerCeph3DaemonImage: ''

2. When you run the converge step, include the file as the last argument of your openstack overcloud deploy command, e.g. "openstack overcloud deploy ... -e no_ceph3.yaml". If you've already run the converge step and encountered this bug, you may re-run it with this file included.

3. Proceed with the Ceph upgrade as usual by running a command like: `openstack overcloud external-upgrade run --stack $STACK --tags ceph`
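The steps above can be sketched as a short shell snippet. The elided "..." deploy arguments are placeholders for your existing environment files; the override must stay last on the command line so Heat parameter precedence lets it win:

```shell
# Step 1: create the override file that unsets ContainerCeph3DaemonImage
# so converge does not pass an RHCS 3 image into rolling_update.
cat > no_ceph3.yaml <<'EOF'
parameter_defaults:
  ContainerCeph3DaemonImage: ''
EOF

# Step 2: converge with the override as the LAST -e argument, e.g.
#   openstack overcloud deploy --stack $STACK ... -e no_ceph3.yaml
# Step 3: run the Ceph upgrade as usual:
#   openstack overcloud external-upgrade run --stack $STACK --tags ceph
```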
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148