Bug 2272946 - tripleo_ceph_client role is not applied when allovercloud,undercloud is used as a limit
Summary: tripleo_ceph_client role is not applied when allovercloud,undercloud is used ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.1 (Wallaby)
Hardware: All
OS: All
Priority: high
Severity: medium
Target Milestone: z4
Target Release: 17.1
Assignee: Manoj Katari
QA Contact: Alfredo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2024-04-03 12:23 UTC by Alex Stupnikov
Modified: 2024-11-21 09:40 UTC (History)
8 users (show)

Fixed In Version: tripleo-ansible-3.3.1-17.1.20240502120759.8debef3.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-11-21 09:40:06 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-31801 0 None None None 2024-04-03 12:26:17 UTC
Red Hat Product Errata RHBA-2024:9974 0 None None None 2024-11-21 09:40:09 UTC

Description Alex Stupnikov 2024-04-03 12:23:51 UTC
Description of problem:

This was originally reported by a customer running the following command during the FFU procedure:

openstack overcloud upgrade run --yes --stack <stack> --debug --limit allovercloud,undercloud --playbook all

https://access.redhat.com/documentation/de-de/red_hat_openstack_platform/17.1/html-single/framework_for_upgrades_16.2_to_17.1/index#upgrading-rhosp-on-all-nodes-in-each-stack_overcloud-upgrade


The problem is that "allovercloud" is a group name, but it is treated as a plain string (host name) by the following tasks:
https://github.com/openstack-archive/tripleo-ansible/blob/stable/wallaby/tripleo_ansible/roles/tripleo_ceph_client/tasks/effective_clients_limit.yml#L20-L24

As a result, tripleo_ceph_client_include, which is built from tripleo_ceph_client_limit_list, is the following list:
tripleo_ceph_client_include: ['allovercloud', 'undercloud']

Because 'allovercloud' is treated as a literal string, Ansible cannot find a valid intersection between the list of Ceph client hosts and 'allovercloud' when building tripleo_ceph_client_effective_clients.
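The mismatch can be illustrated outside Ansible. The following is a minimal Python sketch, not the role's actual code: the inventory contents and the function name effective_clients_literal are hypothetical, and it only mimics the idea of intersecting the Ceph client hosts with the comma-separated limit entries taken as literal host names.

```python
# Hypothetical inventory, for illustration only (host names are made up).
groups = {
    "allovercloud": ["controller-0", "compute-0", "compute-1"],
    "undercloud": ["undercloud-0"],
    "ceph_client": ["compute-0", "compute-1"],
}


def effective_clients_literal(limit, ceph_clients):
    """Intersect ceph_clients with the limit entries taken as literal host names."""
    include = [item.strip() for item in limit.split(",") if item.strip()]
    return [host for host in ceph_clients if host in include]


# Group names never match a host name, so nothing is selected.
print(effective_clients_literal("allovercloud,undercloud", groups["ceph_client"]))
# prints []

# An explicit host list does intersect with the Ceph clients.
print(effective_clients_literal("compute-0,undercloud-0", groups["ceph_client"]))
# prints ['compute-0']
```

With group names in the limit the intersection is empty, which matches the behaviour described above.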


Version-Release number of selected component (if applicable):
RHOSP 17.1

How reproducible:
Steps and analysis are provided in the description above.


Actual results:
The Ceph client play is not applied to valid overcloud hosts.

Expected results:
From the python-tripleoclient perspective, the --limit argument should contain a list of hosts, so other recommendations in the document may also be incorrect. But if we decide to keep the documented usage, then the Ceph client configuration should be applied to all groups['ceph_client'] hosts.

Comment 3 Manoj Katari 2024-04-05 11:53:55 UTC
Thanks @fpantano for your inputs.

As the upgrade command uses `--limit allovercloud,undercloud`, I think ansible_limit is generated as 'allovercloud,undercloud', whereas the code in [1] expects it as 'listofnodes_in_overcloud, undercloud', so the tripleo_ceph_client_effective_clients generated in L55 will be an empty list.

@john, we need to review [1] and decide whether the fix is needed in the code or in the upgrade doc.

[1] https://github.com/openstack-archive/tripleo-ansible/blob/stable/wallaby/tripleo_ansible/roles/tripleo_ceph_client/tasks/effective_clients_limit.yml#L22

Comment 4 Eric Nothen 2024-04-08 11:06:52 UTC
The customer is currently working around this issue by passing an explicit list of hostnames to the --limit parameter. This works in the smaller test/dev clusters and should work in production as well, unless they hit a CLI limit preventing them from explicitly passing ~200 FQDNs.

FWIW, I have tested the openstack overcloud upgrade command with >2000 FQDNs as the value of --limit and the command runs (it fails down the road because I don't actually have 2k overcloud nodes, but it does seem to run so it's not an issue for bash).
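For larger clusters, the explicit host list used in this workaround could be generated rather than typed by hand. The following is a rough Python sketch under stated assumptions: the inventory dictionary, the FQDNs, and the helper name build_host_limit are all hypothetical, and how the inventory is actually obtained is left out.

```python
def build_host_limit(groups, group_names):
    """Join the member hosts of the given groups into a comma-separated limit."""
    hosts = []
    for name in group_names:
        for host in groups.get(name, []):
            if host not in hosts:  # keep first occurrence, preserve order
                hosts.append(host)
    return ",".join(hosts)


# Hypothetical inventory, for illustration only.
groups = {
    "allovercloud": ["controller-0.example.com", "compute-0.example.com"],
    "undercloud": ["undercloud-0.example.com"],
}

limit = build_host_limit(groups, ["allovercloud", "undercloud"])
print(limit)
# prints controller-0.example.com,compute-0.example.com,undercloud-0.example.com
```

The resulting string would then be passed as the value of --limit in place of the group names.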

Comment 5 John Fulton 2024-04-08 21:59:35 UTC
If the upgrade is run with "CephConfigPath: /etc/ceph", as suggested in the docs section 3.1, I expect the workaround from comment #4 will be unnecessary. See /etc/ceph/ on the compute nodes to confirm if the keys are already present.

This is what I think is happening:

1. The system has "CephConfigPath: /var/lib/tripleo-config/ceph" and that path is empty so the keys appear to be missing. The director is creating new versions of the keys and trying to copy them to compute nodes.

2. Per docs section 8.1, "--limit allovercloud,undercloud" is passed which effectively results in keys not getting copied to any computes.

I suspect this problem never presented itself as a bug in our testing because we used "CephConfigPath: /etc/ceph". I.e. the ceph client role computed an empty list of hosts but it didn't matter since the keys were already there and the upgrade could continue.

[3.1] https://access.redhat.com/documentation/de-de/red_hat_openstack_platform/17.1/html-single/framework_for_upgrades_16.2_to_17.1/index#updating-ceph-client-configuration-for-rhosp-171-external-ceph-deployments

[8.1] https://access.redhat.com/documentation/de-de/red_hat_openstack_platform/17.1/html-single/framework_for_upgrades_16.2_to_17.1/index#upgrading-rhosp-on-all-nodes-in-each-stack_overcloud-upgrade
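The check suggested above (looking in /etc/ceph/ on a compute node to see whether the keys are already present) could be scripted roughly as follows. This is only a sketch: the helper name find_ceph_client_files is hypothetical, and it assumes the Ceph client files can be recognised by the conventional .conf and .keyring suffixes.

```python
from pathlib import Path


def find_ceph_client_files(config_dir):
    """Return sorted names of *.conf and *.keyring files found in config_dir."""
    path = Path(config_dir)
    if not path.is_dir():
        return []
    return sorted(
        entry.name
        for entry in path.iterdir()
        # Assumption: client config and keys use these suffixes.
        if entry.is_file() and entry.suffix in {".conf", ".keyring"}
    )


if __name__ == "__main__":
    # The two CephConfigPath values discussed in this comment.
    for candidate in ("/etc/ceph", "/var/lib/tripleo-config/ceph"):
        files = find_ceph_client_files(candidate)
        print(candidate, "->", files if files else "no Ceph client files found")
```

Run on a compute node, an empty result for the configured CephConfigPath would be consistent with scenario 1 above.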

Comment 9 John Fulton 2024-04-09 13:04:54 UTC
This is a bug. The documentation tells you to limit by allovercloud,undercloud (two groups) and the sync.yml tasks in tripleo_ceph_client can only handle hostnames in the limit.

This bug shouldn't block an upgrade if "CephConfigPath: /etc/ceph" is set as documented. However, because it's producing unintentional behavior (a task limited to an empty set), we will update the tripleo_ceph_client role to handle the case when "--limit allovercloud,undercloud" is passed.
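One way to picture the intended behaviour is to expand group names in the limit into their member hosts before intersecting with the Ceph clients. The Python sketch below is an illustration only, not the change that was made to the role: the inventory, host names, and the functions expand_limit and effective_clients are hypothetical, and it handles only comma-separated names, not the full Ansible pattern syntax.

```python
def expand_limit(limit, groups):
    """Replace group names in a comma-separated limit with their member hosts."""
    hosts = []
    for item in (part.strip() for part in limit.split(",")):
        if not item:
            continue
        # Names that are not known groups are kept as plain host names.
        for host in groups.get(item, [item]):
            if host not in hosts:
                hosts.append(host)
    return hosts


def effective_clients(limit, groups):
    """Ceph client hosts that fall inside the (expanded) limit."""
    include = expand_limit(limit, groups)
    return [host for host in groups.get("ceph_client", []) if host in include]


# Hypothetical inventory, for illustration only.
groups = {
    "allovercloud": ["controller-0", "compute-0", "compute-1"],
    "undercloud": ["undercloud-0"],
    "ceph_client": ["compute-0", "compute-1"],
}

print(effective_clients("allovercloud,undercloud", groups))
# prints ['compute-0', 'compute-1']
print(effective_clients("compute-1,undercloud", groups))
# prints ['compute-1']
```

With this expansion, a limit made of group names and a limit made of host names both select the expected Ceph clients.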

Comment 23 errata-xmlrpc 2024-11-21 09:40:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHOSP 17.1.4 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:9974

