
Bug 2254036

Summary: [FFU] The Host System upgrade of HCI nodes fails on setting noout flags
Product: Red Hat OpenStack
Reporter: Marian Krcmarik <mkrcmari>
Component: openstack-tripleo-heat-templates
Assignee: Manoj Katari <mkatari>
Status: CLOSED ERRATA
QA Contact: Marian Krcmarik <mkrcmari>
Severity: high
Docs Contact:
Priority: high
Version: 17.1 (Wallaby)
CC: bshephar, dhughes, erpeters, gbrinn, gregraka, jamsmith, johfulto, mariel, mburns, mkatari
Target Milestone: z3
Keywords: Triaged
Target Release: 17.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-14.3.1-17.1.20231103010833.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, during a DCN FFU system upgrade of nodes in a setup with multiple stacks, the Red Hat Ceph Storage task `Set noout flag` might fail to run the `ceph` command on the right host. After this update, a system upgrade on any node in a multi-stack setup delegates the Red Hat Ceph Storage task `Set noout flag` to the relevant host, and the `ceph` commands are run on the specific cluster.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2024-05-22 20:42:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1997638

Description Marian Krcmarik 2023-12-11 16:27:51 UTC
Description of problem:
The upgrade_tasks_step1.yaml playbook is executed during the Host System upgrade (from RHEL 8.4 to 9.2), and it fails on the first task, "Set noout flag":
        - name: Set noout flag
          shell: "cephadm shell -- ceph osd set {{ item }}"
          become: true
          with_items:
            - noout
            - norecover
            - nobackfill
            - norebalance
            - nodeep-scrub
          delegate_to: "{{ ceph_mon_short_bootstrap_node_name }}"
https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/wallaby/deployment/cephadm/ceph-osd.yaml#L109

The task is delegated to "ceph_mon_short_bootstrap_node_name", which points to one of the controllers. That controller is not included in the inventory used for the DCN stack. Even if the delegation worked, I assume it would set the flags on the central Ceph cluster, which is pointless when the DCN site has a different Ceph cluster.
Moreover, I assume the command cephadm shell -- ceph osd set {{ item }} would fail anyway because it would not find the Ceph cluster credentials.

So there are two problems that need to be fixed:
1. Select the right ceph_mon node in the delegation.
2. Select the right cluster, assuming we solve step 1.
   The command cephadm shell -- ceph osd set {{ item }} should be able to find the right Ceph cluster and look something like:
cephadm  --fsid {{ tripleo_cephadm_fsid }} -c /etc/ceph/{{ tripleo_cephadm_cluster }}.conf -k /etc/ceph/{{ tripleo_cephadm_cluster }}.client.{{ select_keyring| default('admin') }}.keyring shell -- ceph osd set <flag>
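
The two fixes listed above can be sketched together as a single Ansible task. This is only an illustration under stated assumptions, not the change that shipped in the erratum: `dcn_ceph_mon_node` is a hypothetical variable standing for a Ceph monitor host that exists in the DCN stack's own inventory, and the command line is copied from the suggestion in this report, so its exact option placement is unverified.

```yaml
# Sketch only, not the shipped fix.
- name: Set noout flag
  # Command form taken from the suggestion in this report (unverified):
  # pass the fsid, config file and keyring so the DCN site's cluster is used.
  shell: >-
    cephadm
    --fsid {{ tripleo_cephadm_fsid }}
    -c /etc/ceph/{{ tripleo_cephadm_cluster }}.conf
    -k /etc/ceph/{{ tripleo_cephadm_cluster }}.client.{{ select_keyring | default('admin') }}.keyring
    shell -- ceph osd set {{ item }}
  become: true
  with_items:
    - noout
    - norecover
    - nobackfill
    - norebalance
    - nodeep-scrub
  # Hypothetical variable: a monitor host from the DCN stack's own inventory,
  # used instead of the central ceph_mon_short_bootstrap_node_name.
  delegate_to: "{{ dcn_ceph_mon_node }}"
```

The folded scalar (`>-`) joins the command into one line at run time and avoids quoting conflicts with the single quotes inside `default('admin')`.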

Version-Release number of selected component (if applicable):
openstack-tripleo-common-containers-15.4.1-17.1.20230927010819.el9ost.noarch
puppet-tripleo-14.2.3-17.1.20231102190827.40278e1.el9ost.noarch
ansible-tripleo-ipsec-11.0.1-17.1.20230620172008.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.3.1-17.1.20230627190951.8d29d9e.el9ost.noarch
ansible-role-tripleo-modify-image-1.5.1-17.1.20230621064242.b6eedb6.el9ost.noarch
python3-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
openstack-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
tripleo-ansible-3.3.1-17.1.20231101230823.4d015bf.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-17.1.20231103010823.el9ost.noarch
openstack-tripleo-validations-14.3.2-17.1.20231026020815.2b526f8.el9ost.noarch
python3-tripleoclient-16.5.1-17.1.20230927000827.f3599d0.el9ost.noarch
openstack-tripleo-image-elements-13.1.3-17.1.20230621111410.a641940.el9ost.noarch
openstack-tripleo-puppet-elements-14.1.3-17.1.20230810141019.b4e0cbd.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Execute the Host system upgrade of HCI compute nodes of DCN env during the FFU procedure.

Comment 4 Manoj Katari 2024-01-11 10:26:51 UTC
Removed needinfo, as John answered it in comment 3.

Comment 12 Manoj Katari 2024-05-02 07:41:51 UTC
Thanks Erin for the Doc text update, it looks good.

Comment 19 errata-xmlrpc 2024-05-22 20:42:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: openstack-tripleo-heat-templates and tripleo-ansible update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:2736