Description of problem:
This was originally reported by a customer running the following command during the FFU procedure:

    openstack overcloud upgrade run --yes --stack <stack> --debug --limit allovercloud,undercloud --playbook all

The command is documented here:
https://access.redhat.com/documentation/de-de/red_hat_openstack_platform/17.1/html-single/framework_for_upgrades_16.2_to_17.1/index#upgrading-rhosp-on-all-nodes-in-each-stack_overcloud-upgrade

The problem is that "allovercloud" is a group name, but it is treated as a plain string (host name) by the following play:
https://github.com/openstack-archive/tripleo-ansible/blob/stable/wallaby/tripleo_ansible/roles/tripleo_ceph_client/tasks/effective_clients_limit.yml#L20-L24

As a result, tripleo_ceph_client_include, which is built from tripleo_ceph_client_limit_list, is the following list:

    tripleo_ceph_client_include: ['allovercloud', 'undercloud']

Because 'allovercloud' is treated as a string, Ansible cannot find a valid intersection between the list of Ceph client hosts and 'allovercloud' when building tripleo_ceph_client_effective_clients.

Version-Release number of selected component (if applicable):
RHOSP 17.1

How reproducible:
Steps and analysis are provided in the description.

Actual results:
The Ceph client play is not applied to valid overcloud hosts.

Expected results:
From the python-tripleoclient perspective, the --limit argument should contain a list of hosts, so other recommendations in the document may be incorrect. But if we decide to preserve it, then the Ceph client configuration should be applied to all groups['ceph_client'] hosts.
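The empty intersection described in this report can be illustrated with a small Python sketch. This is not the role's actual code. It only mimics a plain string comparison between the --limit entries and the Ceph client host list. The inventory contents, host names, and the function name effective_clients are hypothetical.

```python
# Hypothetical inventory; group members and host names are illustrative only.
groups = {
    "ceph_client": ["compute-0", "compute-1", "controller-0"],
    "allovercloud": ["compute-0", "compute-1", "controller-0"],
    "undercloud": ["undercloud-0"],
}


def effective_clients(ansible_limit, ceph_clients):
    """Intersect the raw --limit entries with the Ceph client hosts.

    The entries are compared as plain strings, so a group name such as
    'allovercloud' never matches any host name.
    """
    include = [item.strip() for item in ansible_limit.split(",") if item.strip()]
    return [host for host in ceph_clients if host in include]


# Group names in --limit: nothing matches, so the list is empty.
by_group = effective_clients("allovercloud,undercloud", groups["ceph_client"])

# Explicit host names in --limit: the intersection works as intended.
by_host = effective_clients("compute-0,compute-1,undercloud-0", groups["ceph_client"])

print(by_group)
print(by_host)
```

With group names the result is empty, which matches the "Actual results" above. With explicit host names only the listed Ceph clients are selected.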
Thanks @fpantano for your input. Because the upgrade command uses `--limit allovercloud,undercloud`, I think ansible_limit is generated as 'allovercloud,undercloud', whereas the code in [1] expects it as 'listofnodes_in_overcloud, undercloud', so the tripleo_ceph_client_effective_clients generated in L55 will be an empty list. @john, we need to review [1] and decide whether the fix is needed in the code or in the upgrade doc.

[1] https://github.com/openstack-archive/tripleo-ansible/blob/stable/wallaby/tripleo_ansible/roles/tripleo_ceph_client/tasks/effective_clients_limit.yml#L22
The customer is currently working around this issue by passing an explicit list of hostnames to the --limit parameter. This works in the smaller test/dev clusters and should also work in production, unless they hit a CLI limit that prevents them from explicitly passing ~200 FQDNs. FWIW, I have tested the openstack overcloud upgrade command with >2000 FQDNs as the value of --limit and the command runs (it fails down the road because I don't actually have 2k overcloud nodes, but it does start, so the argument length is not an issue for bash).
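For anyone who wants to sanity-check the workaround before running it, here is a minimal sketch that builds the comma-separated --limit value from a host list and compares its size with the limit the OS reports for command-line arguments. The FQDN pattern, the function names, and the 4096-byte reserve are assumptions made for illustration. os.sysconf("SC_ARG_MAX") is only available on Unix-like systems, and the environment also counts against that limit, so treat the result as a rough estimate.

```python
import os


def build_limit(hosts):
    """Join host names into a --limit value, dropping duplicates but keeping order."""
    return ",".join(dict.fromkeys(hosts))


def fits_arg_max(limit, reserve=4096):
    """Rough check that the --limit value fits within the reported ARG_MAX.

    'reserve' is an arbitrary allowance for the rest of the command line.
    """
    arg_max = os.sysconf("SC_ARG_MAX")
    if arg_max <= 0:
        # The system did not report a definite limit.
        return True
    return len(limit.encode()) + reserve < arg_max


# Hypothetical FQDNs; a real list would come from the overcloud inventory.
hosts = [f"overcloud-compute-{i}.example.com" for i in range(2000)]
limit = build_limit(hosts)

print(len(limit.encode()), "bytes in --limit value")
print("fits within ARG_MAX:", fits_arg_max(limit))
```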
If the upgrade is run with "CephConfigPath: /etc/ceph", as suggested in docs section 3.1, I expect the workaround from comment #4 will be unnecessary. Check /etc/ceph/ on the compute nodes to confirm whether the keys are already present.

This is what I think is happening:

1. The system has "CephConfigPath: /var/lib/tripleo-config/ceph" and that path is empty, so the keys appear to be missing. The director is creating new versions of the keys and trying to copy them to the compute nodes.
2. Per docs section 8.1, "--limit allovercloud,undercloud" is passed, which effectively results in the keys not getting copied to any compute nodes.

I suspect this problem never presented itself as a bug in our testing because we used "CephConfigPath: /etc/ceph". That is, the ceph client role computed an empty list of hosts, but it didn't matter because the keys were already there and the upgrade could continue.

[3.1] https://access.redhat.com/documentation/de-de/red_hat_openstack_platform/17.1/html-single/framework_for_upgrades_16.2_to_17.1/index#updating-ceph-client-configuration-for-rhosp-171-external-ceph-deployments
[8.1] https://access.redhat.com/documentation/de-de/red_hat_openstack_platform/17.1/html-single/framework_for_upgrades_16.2_to_17.1/index#upgrading-rhosp-on-all-nodes-in-each-stack_overcloud-upgrade
This is a bug. The documentation tells you to limit by allovercloud,undercloud (two groups), and the sync.yml tasks in tripleo_ceph_client can only handle hostnames in the limit. This bug shouldn't block an upgrade if "CephConfigPath: /etc/ceph" is set as documented. However, because it's producing unintended behavior (a task limited to an empty set), we will update the tripleo_ceph_client role to handle the case where "--limit allovercloud,undercloud" is passed.
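One way to handle group names in the limit is to expand each entry into its member hosts before intersecting with the Ceph client list. The Python sketch below illustrates that idea only. It is not the change that was made to the tripleo_ceph_client role, and the inventory, host names, and function names are hypothetical.

```python
# Hypothetical inventory; group members and host names are illustrative only.
groups = {
    "ceph_client": ["compute-0", "compute-1", "controller-0"],
    "allovercloud": ["compute-0", "compute-1", "controller-0"],
    "undercloud": ["undercloud-0"],
}


def expand_limit(ansible_limit, groups):
    """Expand each --limit entry: a group name becomes its members,
    anything else is kept as a host name."""
    hosts = []
    for item in ansible_limit.split(","):
        item = item.strip()
        if not item:
            continue
        for host in groups.get(item, [item]):
            if host not in hosts:
                hosts.append(host)
    return hosts


def effective_clients_expanded(ansible_limit, groups):
    """Intersect the expanded limit with the Ceph client hosts."""
    allowed = set(expand_limit(ansible_limit, groups))
    return [host for host in groups.get("ceph_client", []) if host in allowed]


print(effective_clients_expanded("allovercloud,undercloud", groups))
print(effective_clients_expanded("compute-1,undercloud", groups))
```

In this sketch, passing the two group names selects every host in groups['ceph_client'], which is the expected result stated in the description, while mixing a host name with a group name still narrows the set. Pattern features such as exclusions or wildcards are not handled here.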
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHOSP 17.1.4 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2024:9974