Description of problem:

The procedure for upgrading the Compute node operating system includes executing the following command as the last step:

openstack overcloud upgrade run --yes --stack <stack> --limit <nodes>

which actually reruns the stack deployment to upgrade some containers on the compute nodes after the system upgrade. See step 6 of:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.1/html-single/framework_for_upgrades_16.2_to_17.1/index#upgrading-the-compute-node-operating-system_upgrading-Compute-nodes-to-a-multi-rhel-environment

The DCN compute node roles include the OS::TripleO::Services::NovaAZConfig service, which aggregates a site's compute nodes into an AZ based on the deployment file:

https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/wallaby/deployment/nova/nova-az-config.yaml

That is why the task "Nova: Manage aggregate and availability zone and add hosts to the zone" from that tht deployment file is executed during "openstack overcloud upgrade run".

"openstack overcloud upgrade run" can be executed on a subset of compute nodes (e.g. because of MultiRHEL, when some nodes are left on RHEL 8 and the system upgrade is not performed on them). The problem is that the end result is that all the other compute nodes (those not included in the --limit parameter) are removed from the availability zone aggregate, and only the upgraded node remains in the AZ.

I guess this is because the upgrade is executed only for the included compute nodes. Looking at:

https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/wallaby/deployment/nova/nova-az-config.yaml#L67

the nova_host fact is set for the included nodes, but the other nodes have nova_host unset, which is why they are not included in the hosts list for the task that sets the AZ; see:

https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/wallaby/deployment/nova/nova-az-config.yaml#L87

By default, the os_nova_host_aggregate ansible module purges the other hosts from the AZ.

I manually set purge_hosts to false for the task utilizing the os_nova_host_aggregate ansible module, and the other compute nodes remained in the AZ. (A paraphrased sketch of the two tasks follows the example below.)

For example, if I run:

openstack overcloud upgrade run --yes \
  --stack dcn2 \
  --limit dcn2-computehci2-0,undercloud --playbook all 2>&1

on a setup where:

$ openstack compute service list | grep az-dcn2
| 80 | nova-compute | dcn2-computehci2-0.redhat.local | az-dcn2 | enabled | up | 2023-12-18T17:50:20.000000 |
| 83 | nova-compute | dcn2-computehci2-2.redhat.local | az-dcn2 | enabled | up | 2023-12-18T17:50:20.000000 |
| 89 | nova-compute | dcn2-computehci2-1.redhat.local | az-dcn2 | enabled | up | 2023-12-18T17:50:22.000000 |

the result will be that only dcn2-computehci2-0 remains in az-dcn2.
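For context, here is a minimal paraphrased sketch of the two tasks involved. The task names match the template, but the variable names, group name, and cloud name are illustrative assumptions, not verbatim from nova-az-config.yaml:

- name: Set nova_host fact
  # Runs only on the compute nodes included in --limit; nodes outside
  # --limit never execute this play, so their nova_host stays undefined.
  set_fact:
    nova_host: "{{ fqdn_canonical }}"

- name: "Nova: Manage aggregate and availability zone and add hosts to the zone"
  # Runs on the undercloud and builds the host list from the nova_host
  # facts; hosts whose fact is undefined are filtered out of the list.
  os_nova_host_aggregate:
    cloud: overcloud                # illustrative cloud name
    state: present
    name: az-dcn2                   # illustrative aggregate/AZ name
    availability_zone: az-dcn2
    hosts: "{{ groups['compute'] | map('extract', hostvars, 'nova_host')
               | select('defined') | list }}"
    # os_nova_host_aggregate defaults to purge_hosts: true, so every host
    # missing from this (partial) list is removed from the aggregate.

With --limit, the hosts list contains only the nodes that ran the fact task, and the module's default purge then evicts everyone else from the aggregate.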
From the log:

$ egrep 'nova_host|aggregate' overcloud_upgrade_run-dcn2-computehci2-0.log
2023-12-16 19:10:06 | 2023-12-16 19:10:06.907601 | 52540040-e889-25f9-23ea-000000002155 | TASK | Set nova_host fact
2023-12-16 19:10:06 | 2023-12-16 19:10:06.971271 | 52540040-e889-25f9-23ea-000000002155 | OK | Set nova_host fact | dcn2-computehci2-0
2023-12-16 19:10:06 | 2023-12-16 19:10:06.973869 | 52540040-e889-25f9-23ea-000000002155 | TIMING | Set nova_host fact | dcn2-computehci2-0 | 0:12:24.574090 | 0.06s
2023-12-16 19:23:02 | 2023-12-16 19:23:02.540909 | 52540040-e889-25f9-23ea-000000000125 | TASK | Nova: Manage aggregate and availability zone and add hosts to the zone
2023-12-16 19:23:09 | 2023-12-16 19:23:09.135559 | 52540040-e889-25f9-23ea-000000000125 | CHANGED | Nova: Manage aggregate and availability zone and add hosts to the zone | undercloud
2023-12-16 19:23:09 | 2023-12-16 19:23:09.136990 | 52540040-e889-25f9-23ea-000000000125 | TIMING | Nova: Manage aggregate and availability zone and add hosts to the zone | undercloud | 0:25:26.737297 | 6.59s

Version-Release number of selected component (if applicable):

openstack-tripleo-common-containers-15.4.1-17.1.20230927010819.el9ost.noarch
puppet-tripleo-14.2.3-17.1.20231102190828.el9ost.noarch
ansible-tripleo-ipsec-11.0.1-17.1.20230620172008.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.3.1-17.1.20230627190951.8d29d9e.el9ost.noarch
ansible-role-tripleo-modify-image-1.5.1-17.1.20230621064242.b6eedb6.el9ost.noarch
python3-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
openstack-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
tripleo-ansible-3.3.1-17.1.20231101230823.4d015bf.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-17.1.20231103010825.el9ost.noarch
openstack-tripleo-validations-14.3.2-17.1.20231026020815.2b526f8.el9ost.noarch
python3-tripleoclient-16.5.1-17.1.20230927000827.f3599d0.el9ost.noarch
openstack-tripleo-image-elements-13.1.3-17.1.20230621111410.a641940.el9ost.noarch
openstack-tripleo-puppet-elements-14.1.3-17.1.20230810141019.b4e0cbd.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Perform the FFU OS upgrade of compute nodes in a DCN env.
2. Run the last step, openstack overcloud upgrade run --yes --stack <stack> --limit <nodes>, with a subset of nodes (i.e. the ones that are supposed to be upgraded in a MultiRHEL env).

Actual results:
The nodes not included in the command are removed from the AZ.

Expected results:
All the nodes should remain in the AZ.

Additional info:
Thank you for the clean description and the proposed fix (setting purge_hosts to false). This looks like the best solution to the problem to me.
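For reference, a minimal sketch of what the proposed fix amounts to, assuming the same paraphrased task and module interface as in the sketch above (names remain illustrative):

- name: "Nova: Manage aggregate and availability zone and add hosts to the zone"
  os_nova_host_aggregate:
    cloud: overcloud
    state: present
    name: az-dcn2
    availability_zone: az-dcn2
    hosts: "{{ groups['compute'] | map('extract', hostvars, 'nova_host')
               | select('defined') | list }}"
    purge_hosts: false   # only add the listed hosts; never evict absent ones

With purge_hosts: false the module only adds the hosts it is given and leaves existing aggregate members alone, so a --limit run can no longer evict the nodes it did not visit; openstack aggregate show az-dcn2 should still list all three hosts afterwards.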
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: openstack-tripleo-heat-templates and tripleo-ansible update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:2736