Bug 2254036 - [FFU] The Host System upgrade of HCI nodes fails on setting noout flags
Summary: [FFU] The Host System upgrade of HCI nodes fails on setting noout flags
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 17.1 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z3
Target Release: 17.1
Assignee: Manoj Katari
QA Contact: Marian Krcmarik
URL:
Whiteboard:
Depends On:
Blocks: 1997638
Reported: 2023-12-11 16:27 UTC by Marian Krcmarik
Modified: 2024-05-22 20:42 UTC

Fixed In Version: openstack-tripleo-heat-templates-14.3.1-17.1.20231103010833.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, during a DCN FFU system upgrade of nodes in a deployment with multiple stacks, the Red Hat Ceph Storage task `Set noout flag` might fail to run the `ceph` command on the correct host. With this update, a system upgrade of any node in a multi-stack deployment delegates the `Set noout flag` task to the relevant host, and the `ceph` commands run on the relevant cluster.
Clone Of:
Environment:
Last Closed: 2024-05-22 20:42:28 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 903403 0 None MERGED Fix DCN system upgrade failure to set ceph noout 2023-12-19 07:05:51 UTC
Red Hat Issue Tracker OSP-30715 0 None None None 2023-12-11 16:28:11 UTC
Red Hat Product Errata RHSA-2024:2736 0 None None None 2024-05-22 20:42:33 UTC

Description Marian Krcmarik 2023-12-11 16:27:51 UTC
Description of problem:
The upgrade_tasks_step1.yaml playbook is executed during the Host System upgrade (from RHEL 8.4 to 9.2), and it fails on the first task, "Set noout flag":
        - name: Set noout flag
          shell: "cephadm shell -- ceph osd set {{ item }}"
          become: true
          with_items:
            - noout
            - norecover
            - nobackfill
            - norebalance
            - nodeep-scrub
          delegate_to: "{{ ceph_mon_short_bootstrap_node_name }}"
https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/wallaby/deployment/cephadm/ceph-osd.yaml#L109

The task fails because it is delegated to "ceph_mon_short_bootstrap_node_name", which points to one of the controllers, and that controller is not included in the inventory used for the DCN stack. Even if the host were reachable, I assume the task would set the flags on the central Ceph cluster, which is pointless when the DCN site has a different Ceph cluster.
Moreover, I assume the command cephadm shell -- ceph osd set {{ item }} would fail anyway, because it would not find the Ceph cluster credentials.

So there are two problems that need to be fixed:
1. Select the right ceph_mon node in the delegation.
2. Select the right cluster, assuming problem 1 is solved.
   The command cephadm shell -- ceph osd set {{ item }} should be able to find the right Ceph cluster, and should look something like:
cephadm --fsid {{ tripleo_cephadm_fsid }} -c /etc/ceph/{{ tripleo_cephadm_cluster }}.conf -k /etc/ceph/{{ tripleo_cephadm_cluster }}.client.{{ select_keyring| default('admin') }}.keyring shell -- ceph osd set <flag>
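
For illustration, the two points above could be combined into a single task along the following lines. This is only a sketch of the proposal, not the change merged in gerrit 903403. The variable `dcn_ceph_mon_node` is hypothetical and stands for a Ceph mon host that is present in the DCN stack's own inventory; the `tripleo_cephadm_*` and `select_keyring` variables are the ones named above and are assumed to be defined for the stack being upgraded. The sketch places `--fsid`, `--config` and `--keyring` after the `shell` subcommand, on the understanding that cephadm accepts them there.

```yaml
# Sketch only: not the merged fix.
# `dcn_ceph_mon_node` is a hypothetical variable naming a Ceph mon host
# that belongs to the inventory of the stack being upgraded.
- name: Set noout flag
  shell: >-
    cephadm shell
    --fsid {{ tripleo_cephadm_fsid }}
    --config /etc/ceph/{{ tripleo_cephadm_cluster }}.conf
    --keyring /etc/ceph/{{ tripleo_cephadm_cluster }}.client.{{ select_keyring | default('admin') }}.keyring
    -- ceph osd set {{ item }}
  become: true
  with_items:
    - noout
    - norecover
    - nobackfill
    - norebalance
    - nodeep-scrub
  # Delegate to a mon of the local (DCN) cluster instead of the
  # central bootstrap controller.
  delegate_to: "{{ dcn_ceph_mon_node }}"
```

Passing the fsid, config file and keyring explicitly means the command no longer depends on whichever cluster cephadm would pick by default on the delegated host.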

Version-Release number of selected component (if applicable):
openstack-tripleo-common-containers-15.4.1-17.1.20230927010819.el9ost.noarch
puppet-tripleo-14.2.3-17.1.20231102190827.40278e1.el9ost.noarch
ansible-tripleo-ipsec-11.0.1-17.1.20230620172008.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.3.1-17.1.20230627190951.8d29d9e.el9ost.noarch
ansible-role-tripleo-modify-image-1.5.1-17.1.20230621064242.b6eedb6.el9ost.noarch
python3-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
openstack-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
tripleo-ansible-3.3.1-17.1.20231101230823.4d015bf.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-17.1.20231103010823.el9ost.noarch
openstack-tripleo-validations-14.3.2-17.1.20231026020815.2b526f8.el9ost.noarch
python3-tripleoclient-16.5.1-17.1.20230927000827.f3599d0.el9ost.noarch
openstack-tripleo-image-elements-13.1.3-17.1.20230621111410.a641940.el9ost.noarch
openstack-tripleo-puppet-elements-14.1.3-17.1.20230810141019.b4e0cbd.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Execute the Host System upgrade of the HCI compute nodes of a DCN environment during the FFU procedure.

Comment 4 Manoj Katari 2024-01-11 10:26:51 UTC
Removed needinfo, as John answered it in comment 3.

Comment 12 Manoj Katari 2024-05-02 07:41:51 UTC
Thanks Erin for the Doc text update, it looks good.

Comment 19 errata-xmlrpc 2024-05-22 20:42:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: openstack-tripleo-heat-templates and tripleo-ansible update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:2736

