Description of problem:

When upgrading a Ceph environment from OSP13 to OSP16, we can see that the noout/norecover/nobackfill/norebalance/nodeep-scrub flags are being set:

2020-08-19 18:47:33 | TASK [Set noout flag] **********************************************************
2020-08-19 18:47:33 | Wednesday 19 August 2020 18:47:12 -0400 (0:00:01.299) 0:00:08.525 ******
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=noout) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set noout", "delta": "0:00:01.613765", "end": "2020-08-19 22:47:14.241117", "item": "noout", "rc": 0, "start": "2020-08-19 22:47:12.627352", "stderr": "noout is set", "stderr_lines": ["noout is set"], "stdout": "", "stdout_lines": []}
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=norecover) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set norecover", "delta": "0:00:01.742412", "end": "2020-08-19 22:47:16.287107", "item": "norecover", "rc": 0, "start": "2020-08-19 22:47:14.544695", "stderr": "norecover is set", "stderr_lines": ["norecover is set"], "stdout": "", "stdout_lines": []}
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=nobackfill) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set nobackfill", "delta": "0:00:01.782815", "end": "2020-08-19 22:47:18.384204", "item": "nobackfill", "rc": 0, "start": "2020-08-19 22:47:16.601389", "stderr": "nobackfill is set", "stderr_lines": ["nobackfill is set"], "stdout": "", "stdout_lines": []}
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=norebalance) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set norebalance", "delta": "0:00:01.736558", "end": "2020-08-19 22:47:20.399497", "item": "norebalance", "rc": 0, "start": "2020-08-19 22:47:18.662939", "stderr": "norebalance is set", "stderr_lines": ["norebalance is set"], "stdout": "", "stdout_lines": []}
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=nodeep-scrub) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set nodeep-scrub", "delta": "0:00:01.822504", "end": "2020-08-19 22:47:22.503267", "item": "nodeep-scrub", "rc": 0, "start": "2020-08-19 22:47:20.680763", "stderr": "nodeep-scrub is set", "stderr_lines": ["nodeep-scrub is set"], "stdout": "", "stdout_lines": []}

However, there is no sign of the corresponding Unset task. As a result, the CI job fails during the Ceph upgrade step.

This set/unset task pair was added in the following patch: https://review.opendev.org/#/c/744018

The set task runs before the leapp upgrade, while the unset task is supposed to run in the post_upgrade tasks. However, when we look at the post_upgrade_tasks_playbook generated for the CephStorage role, the unset tasks are missing:

[root@undercloud-0 a41bc081-6cbd-498e-bdec-d36326e683f6]# cat CephStorage/post_upgrade_tasks.yaml
- include_role:
    name: tripleo-podman
    tasks_from: tripleo_docker_purge.yml
  name: Purge everything about docker on the host
  when:
  - (step | int) == 3
- include_role:
    name: tripleo-podman
    tasks_from: tripleo_docker_stop.yml
  name: Stop docker
- include_role:
    name: tripleo-podman
    tasks_from: tripleo_podman_purge.yml
  name: Purge Podman
  when:
  - (step | int) == 3
  - container_cli == 'podman'

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/
2.
3.

Actual results:

Expected results:

Additional info:
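For reference, the set/unset pairing that the patch is expected to provide looks roughly like the following Ansible tasks. This is a minimal sketch, not the exact tasks from the patch: the container name pattern, the container_cli variable, and the loop form are assumptions based on the log output above.

```yaml
# Runs before the leapp/system upgrade (docker is still the runtime here).
- name: Set Ceph maintenance flags before the system upgrade
  command: >-
    {{ container_cli }} exec -u root ceph-mon-{{ ansible_facts['hostname'] }}
    ceph osd set {{ item }}
  loop: [noout, norecover, nobackfill, norebalance, nodeep-scrub]

# The matching unset must run after the upgrade, once the new container
# runtime (podman) is installed and the mon container is running again.
- name: Unset Ceph maintenance flags after the upgrade
  command: >-
    {{ container_cli }} exec -u root ceph-mon-{{ ansible_facts['hostname'] }}
    ceph osd unset {{ item }}
  loop: [noout, norecover, nobackfill, norebalance, nodeep-scrub]
```

Leaving these flags set indefinitely is what makes the cluster health (and hence the CI job) fail: recovery, backfill, rebalancing, and deep scrubbing all stay disabled cluster-wide until someone unsets them.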
The problem seems to be that the post_upgrade_tasks are not being imported from ceph-base into the specific ceph service templates. As you can see, in the upgrade_tasks the ceph-base upgrade_tasks are always imported: https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-mon.yaml#L72 However, the same does not happen for the post_upgrade_tasks, which is why they are not added to the CephStorage role.

There is also a second problem: the step at which the unsetting occurs. It runs right after the leapp upgrade, at a point where Podman is not yet available. Therefore, this step needs to run later in the process, e.g. in the post_upgrade tasks but without the system_upgrade tags.
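The fix would essentially mirror, for post_upgrade_tasks, the wiring that the service templates already have for upgrade_tasks. A rough sketch of what that could look like in a Heat template such as ceph-mon.yaml (the CephBase resource name and attribute paths are assumptions based on the usual tripleo-heat-templates layout, not a verified diff):

```yaml
outputs:
  role_data:
    description: Role data for the Ceph Monitor service.
    value:
      # Already present: ceph-base upgrade_tasks are always inherited,
      # which is why the "set flags" tasks do run.
      upgrade_tasks: {get_attr: [CephBase, role_data, upgrade_tasks]}
      # Missing piece: the same inheritance for post_upgrade_tasks, so the
      # "unset flags" tasks also reach roles such as CephStorage.
      post_upgrade_tasks: {get_attr: [CephBase, role_data, post_upgrade_tasks]}
```

With that in place, the unset tasks would land in the generated CephStorage/post_upgrade_tasks.yaml alongside the tripleo-podman cleanup tasks shown above, and would run at a point where Podman is available.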
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:4284