Description of problem:

The procedure for upgrading the Compute node operating system includes executing the following command as the last step:

openstack overcloud upgrade run --yes --stack <stack> --limit <nodes>

which actually reruns the stack deployment to upgrade some containers on the compute nodes after the system upgrade. See step 6 of:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.1/html-single/framework_for_upgrades_16.2_to_17.1/index#upgrading-the-compute-node-operating-system_upgrading-Compute-nodes-to-a-multi-rhel-environment

The DCN compute node roles include the OS::TripleO::Services::NovaAZConfig service, which aggregates a site's compute nodes into an AZ based on the deployment file:

https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/wallaby/deployment/nova/nova-az-config.yaml

That is why the task "Nova: Manage aggregate and availability zone and add hosts to the zone" from that tht deployment file is executed during "openstack overcloud upgrade run".

"openstack overcloud upgrade run" can be executed on a subset of compute nodes (e.g. because of MultiRHEL, when some nodes are left on RHEL 8 and the system upgrade is not performed on them). The problem is that the end result is that all the other compute nodes (those not included in the --limit parameter) are removed from the availability zone aggregate, and only the upgraded node remains in the AZ.

I guess this is because the upgrade is executed only for the included compute nodes. Looking at:

https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/wallaby/deployment/nova/nova-az-config.yaml#L67

the nova_host fact is set for the included nodes, but the other nodes have nova_host unset, which is why they are not included in the hosts list for the task that sets the AZ; see:

https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/wallaby/deployment/nova/nova-az-config.yaml#L87

By default, the os_nova_host_aggregate ansible module purges the other hosts from the AZ.

I manually set purge_hosts to false for the task utilizing the os_nova_host_aggregate ansible module, and the other compute nodes remained in the AZ. (A paraphrased sketch of the two tasks follows the example below.)

For example, if I run:

openstack overcloud upgrade run --yes \
  --stack dcn2 \
  --limit dcn2-computehci2-0,undercloud --playbook all 2>&1

on a setup where:

$ openstack compute service list | grep az-dcn2
| 80 | nova-compute | dcn2-computehci2-0.redhat.local | az-dcn2 | enabled | up | 2023-12-18T17:50:20.000000 |
| 83 | nova-compute | dcn2-computehci2-2.redhat.local | az-dcn2 | enabled | up | 2023-12-18T17:50:20.000000 |
| 89 | nova-compute | dcn2-computehci2-1.redhat.local | az-dcn2 | enabled | up | 2023-12-18T17:50:22.000000 |

the result will be that only dcn2-computehci2-0 remains in az-dcn2.
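For context, here is a minimal paraphrased sketch of the two tasks involved. The task names match the template, but the variable names, group name, and cloud name are illustrative assumptions, not verbatim from nova-az-config.yaml:

- name: Set nova_host fact
  # Runs only on the compute nodes included in --limit; nodes outside
  # --limit never execute this play, so their nova_host stays undefined.
  set_fact:
    nova_host: "{{ fqdn_canonical }}"

- name: "Nova: Manage aggregate and availability zone and add hosts to the zone"
  # Runs on the undercloud and builds the host list from the nova_host
  # facts; hosts whose fact is undefined are filtered out of the list.
  os_nova_host_aggregate:
    cloud: overcloud                # illustrative cloud name
    state: present
    name: az-dcn2                   # illustrative aggregate/AZ name
    availability_zone: az-dcn2
    hosts: "{{ groups['compute'] | map('extract', hostvars, 'nova_host')
               | select('defined') | list }}"
    # os_nova_host_aggregate defaults to purge_hosts: true, so every host
    # missing from this (partial) list is removed from the aggregate.

With --limit, the hosts list contains only the nodes that ran the fact task, and the module's default purge then evicts everyone else from the aggregate.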
From the log:

$ egrep 'nova_host|aggregate' overcloud_upgrade_run-dcn2-computehci2-0.log
2023-12-16 19:10:06 | 2023-12-16 19:10:06.907601 | 52540040-e889-25f9-23ea-000000002155 | TASK | Set nova_host fact
2023-12-16 19:10:06 | 2023-12-16 19:10:06.971271 | 52540040-e889-25f9-23ea-000000002155 | OK | Set nova_host fact | dcn2-computehci2-0
2023-12-16 19:10:06 | 2023-12-16 19:10:06.973869 | 52540040-e889-25f9-23ea-000000002155 | TIMING | Set nova_host fact | dcn2-computehci2-0 | 0:12:24.574090 | 0.06s
2023-12-16 19:23:02 | 2023-12-16 19:23:02.540909 | 52540040-e889-25f9-23ea-000000000125 | TASK | Nova: Manage aggregate and availability zone and add hosts to the zone
2023-12-16 19:23:09 | 2023-12-16 19:23:09.135559 | 52540040-e889-25f9-23ea-000000000125 | CHANGED | Nova: Manage aggregate and availability zone and add hosts to the zone | undercloud
2023-12-16 19:23:09 | 2023-12-16 19:23:09.136990 | 52540040-e889-25f9-23ea-000000000125 | TIMING | Nova: Manage aggregate and availability zone and add hosts to the zone | undercloud | 0:25:26.737297 | 6.59s

Version-Release number of selected component (if applicable):

openstack-tripleo-common-containers-15.4.1-17.1.20230927010819.el9ost.noarch
puppet-tripleo-14.2.3-17.1.20231102190828.el9ost.noarch
ansible-tripleo-ipsec-11.0.1-17.1.20230620172008.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.3.1-17.1.20230627190951.8d29d9e.el9ost.noarch
ansible-role-tripleo-modify-image-1.5.1-17.1.20230621064242.b6eedb6.el9ost.noarch
python3-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
openstack-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
tripleo-ansible-3.3.1-17.1.20231101230823.4d015bf.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-17.1.20231103010825.el9ost.noarch
openstack-tripleo-validations-14.3.2-17.1.20231026020815.2b526f8.el9ost.noarch
python3-tripleoclient-16.5.1-17.1.20230927000827.f3599d0.el9ost.noarch
openstack-tripleo-image-elements-13.1.3-17.1.20230621111410.a641940.el9ost.noarch
openstack-tripleo-puppet-elements-14.1.3-17.1.20230810141019.b4e0cbd.el9ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Perform the FFU OS upgrade of compute nodes in a DCN env.
2. Run the last step, openstack overcloud upgrade run --yes --stack <stack> --limit <nodes>, with a subset of nodes (i.e. the ones that are supposed to be upgraded in a MultiRHEL env).

Actual results:
The nodes not included in the command are removed from the AZ.

Expected results:
All the nodes should remain in the AZ.

Additional info:
Thank you for the clean description and the proposed fix (setting purge_hosts to false). This looks like the best solution to the problem to me.
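For reference, a minimal sketch of what the proposed fix amounts to, assuming the same paraphrased task and module interface as in the sketch above (names remain illustrative):

- name: "Nova: Manage aggregate and availability zone and add hosts to the zone"
  os_nova_host_aggregate:
    cloud: overcloud
    state: present
    name: az-dcn2
    availability_zone: az-dcn2
    hosts: "{{ groups['compute'] | map('extract', hostvars, 'nova_host')
               | select('defined') | list }}"
    purge_hosts: false   # only add the listed hosts; never evict absent ones

With purge_hosts: false the module only adds the hosts it is given and leaves existing aggregate members alone, so a --limit run can no longer evict the nodes it did not visit; openstack aggregate show az-dcn2 should still list all three hosts afterwards.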
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: openstack-tripleo-heat-templates and tripleo-ansible update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:2736