Bug 1870617 - [OSP13->OSP16.1] noout ceph flag not unset during the FFU workflow
Summary: [OSP13->OSP16.1] noout ceph flag not unset during the FFU workflow
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z2
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Ronnie Rasouli
QA Contact: Jose Luis Franco
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-20 13:27 UTC by Jose Luis Franco
Modified: 2020-10-28 15:39 UTC
CC: 10 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-1.20200914170155.29a02c1.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-28 15:39:08 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
OpenStack gerrit 747141 (MERGED): [FFWD Ceph] Fix ceph post_upgrade_tasks for osd options (last updated 2021-02-17 16:44:32 UTC)
Red Hat Product Errata RHEA-2020:4284 (last updated 2020-10-28 15:39:30 UTC)

Description Jose Luis Franco 2020-08-20 13:27:03 UTC
Description of problem:

When upgrading a Ceph environment from OSP13 to OSP16.1, we can see that the noout/norecover/nobackfill/norebalance/nodeep-scrub flags are being set:

2020-08-19 18:47:33 | TASK [Set noout flag] **********************************************************
2020-08-19 18:47:33 | Wednesday 19 August 2020  18:47:12 -0400 (0:00:01.299)       0:00:08.525 ******
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=noout) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set noout", "delta": "0:00:01.613765", "end": "2020-08-19 22:47:14.241117", "item": "noout", "rc": 0, "start": "2020-08-19 22:47:12.627352", "stderr": "noout is set", "stderr_lines": ["noout is set"], "stdout": "", "stdout_lines": []}
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=norecover) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set norecover", "delta": "0:00:01.742412", "end": "2020-08-19 22:47:16.287107", "item": "norecover", "rc": 0, "start": "2020-08-19 22:47:14.544695", "stderr": "norecover is set", "stderr_lines": ["norecover is set"], "stdout": "", "stdout_lines": []}
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=nobackfill) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set nobackfill", "delta": "0:00:01.782815", "end": "2020-08-19 22:47:18.384204", "item": "nobackfill", "rc": 0, "start": "2020-08-19 22:47:16.601389", "stderr": "nobackfill is set", "stderr_lines": ["nobackfill is set"], "stdout": "", "stdout_lines": []}
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=norebalance) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set norebalance", "delta": "0:00:01.736558", "end": "2020-08-19 22:47:20.399497", "item": "norebalance", "rc": 0, "start": "2020-08-19 22:47:18.662939", "stderr": "norebalance is set", "stderr_lines": ["norebalance is set"], "stdout": "", "stdout_lines": []}
2020-08-19 18:47:33 | changed: [ceph-0 -> 192.168.24.30] => (item=nodeep-scrub) => {"ansible_loop_var": "item", "changed": true, "cmd": "docker exec -u root ceph-mon-${HOSTNAME} ceph osd set nodeep-scrub", "delta": "0:00:01.822504", "end": "2020-08-19 22:47:22.503267", "item": "nodeep-scrub", "rc": 0, "start": "2020-08-19 22:47:20.680763", "stderr": "nodeep-scrub is set", "stderr_lines": ["nodeep-scrub is set"], "stdout": "", "stdout_lines": []}


But there is no sign of the corresponding unset task, and as a result the CI job fails during the Ceph upgrade step.
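
For reference, a minimal hand-written sketch of what the missing unset counterpart would look like, mirroring the "Set noout flag" task from the log above. This is only a sketch: it assumes the monitor container keeps the ceph-mon-${HOSTNAME} naming shown in the log and that the container_cli variable (defaulting to podman after the upgrade) is available in the play:

# Sketch only: the unset counterpart of the "Set noout flag" task above.
# Assumes the mon container is named ceph-mon-${HOSTNAME} and that podman
# (container_cli) is the runtime once the node has been upgraded.
- name: Unset osd flags after the upgrade
  shell: "{{ container_cli | default('podman') }} exec -u root ceph-mon-${HOSTNAME} ceph osd unset {{ item }}"
  become: true
  loop:
    - noout
    - norecover
    - nobackfill
    - norebalance
    - nodeep-scrub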

These set/unset tasks were added in the following patch:

https://review.opendev.org/#/c/744018

The set tasks run before the Leapp upgrade, while the unset tasks are supposed to run in the post_upgrade tasks. However, when we look at the rendered post_upgrade_tasks playbook for the CephStorage role, they are missing:

[root@undercloud-0 a41bc081-6cbd-498e-bdec-d36326e683f6]# cat CephStorage/post_upgrade_tasks.yaml                                                                           
- include_role:
    name: tripleo-podman
    tasks_from: tripleo_docker_purge.yml
  name: Purge everything about docker on the host
  when:
  - (step | int) == 3
- include_role:
    name: tripleo-podman
    tasks_from: tripleo_docker_stop.yml
  name: Stop docker
- include_role:
    name: tripleo-podman
    tasks_from: tripleo_podman_purge.yml
  name: Purge Podman
  when:
  - (step | int) == 3
  - container_cli == 'podman'



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Run CI job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/upgrades/view/ffu/job/DFG-upgrades-ffu-ffu-upgrade-13-16.1_director-rhel-virthost-3cont_2comp_3ceph-ipv4-vxlan-HA/
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Jose Luis Franco 2020-08-20 14:43:43 UTC
The problem seems to be that the post_upgrade_tasks from ceph-base aren't being imported into the specific Ceph service templates. As you can see, in the upgrade_tasks the ceph-base upgrade_tasks are always imported:

https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-mon.yaml#L72

However, this doesn't happen for the post_upgrade_tasks, which means the post_upgrade_tasks are never added for the CephStorage role.

Also, another problem is the step at which the unsetting occurs: it runs right after the Leapp upgrade, when Podman is not yet available. Therefore, this step needs to run at a later point in the process, such as in the post_upgrade tasks but without the system_upgrade tags.
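
For reference, the existing upgrade_tasks aggregation in the ceph-ansible service templates follows roughly the pattern below; a mirrored entry for post_upgrade_tasks would be needed so the ceph-base unset tasks reach the rendered CephStorage playbook. This is only a sketch of the pattern, assuming the CephBase resource and role_data output layout used by those templates:

# Sketch of the aggregation pattern (assuming the CephBase resource that the
# ceph-ansible service templates already use for upgrade_tasks):
outputs:
  role_data:
    value:
      upgrade_tasks:
        list_concat:
          - {get_attr: [CephBase, role_data, upgrade_tasks]}
          - []   # service-specific upgrade tasks go here
      # The analogous entry is missing today, so the ceph-base unset tasks
      # never make it into the rendered post_upgrade_tasks playbook:
      post_upgrade_tasks:
        list_concat:
          - {get_attr: [CephBase, role_data, post_upgrade_tasks]}
          - []   # service-specific post-upgrade tasks go here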

Comment 15 errata-xmlrpc 2020-10-28 15:39:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:4284

