Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2253891

Summary:	[FFU] RHEL host upgrade of HCI nodes fails on inability to start a pacemaker cluster
Product:	Red Hat OpenStack	Reporter:	Marian Krcmarik <mkrcmari>
Component:	openstack-tripleo-heat-templates	Assignee:	Lukas Bezdicka <lbezdick>
Status:	CLOSED ERRATA	QA Contact:	Marian Krcmarik <mkrcmari>
Severity:	high	Docs Contact:
Priority:	urgent
Version:	17.1 (Wallaby)	CC:	lbezdick, mariel, mburns, mciecier, prgutier, ramishra, ushkalim, yatanaka
Target Milestone:	z2	Keywords:	Regression, Triaged
Target Release:	17.1
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-heat-templates-14.3.1-17.1.20231103010826.el9ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-01-16 14:31:50 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Marian Krcmarik 2023-12-10 17:10:24 UTC

Description of problem:
The RHEL host upgrade of HCI nodes fails on the following error:
FATAL | Start pacemaker cluster after reboot | dcn2-computehci2-0 | error={"changed": false, "msg": "Command execution failed.\nCommand: `pcs cluster start`\nError: Error: cluster is not currently configured on this node\n"}

Even though no pacemaker cluster is configured on the node (and none should be), the host upgrade procedure tries to start a pacemaker cluster.
It happens at step 5 of the upgrade, while executing the task from this commit:
https://opendev.org/openstack/tripleo-heat-templates/commit/a4185f80d2158560a546d0f46e8d4caab9ff6e43
The task is part of:
https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/wallaby/deployment/podman/podman-baremetal-ansible.yaml#L231

The problem seems to be that HCI nodes have the pcs package installed for some reason (I do not know why). While the package is (I assume) not needed on the HCI nodes, it got pulled in at some point, so it would probably be better to adjust the check: instead of only checking whether pcs is present, also check whether a cluster is running/configured on the node. Based on the 16.2 deployments with HCI nodes that I checked, pcs always seems to be present, so customers may already have it installed on their HCI nodes. The same goes for my env, where pcs was already installed on 16.2 before the FFU.
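
A minimal sketch of the stricter check suggested above, written in Python for illustration only. It assumes that a configured cluster can be recognized by the presence of /etc/corosync/corosync.conf; that path and the helper names are assumptions, not taken from this report, and this is not the actual change made in tripleo-heat-templates.

```python
import os
import shutil

# Assumed location of the cluster configuration; adjust if your nodes differ.
COROSYNC_CONF = "/etc/corosync/corosync.conf"


def pcs_installed():
    """Return True if a pcs executable is found on PATH."""
    return shutil.which("pcs") is not None


def cluster_configured(conf_path=COROSYNC_CONF):
    """Return True if the (assumed) cluster configuration file exists."""
    return os.path.isfile(conf_path)


def should_start_cluster(has_pcs, conf_path=COROSYNC_CONF):
    """Start the cluster only if pcs is present AND a cluster is configured."""
    return has_pcs and cluster_configured(conf_path)


if __name__ == "__main__":
    # On a node like dcn2-computehci2-0 (pcs installed, no cluster configured)
    # this is expected to report that the start should be skipped.
    if should_start_cluster(pcs_installed()):
        print("pcs present and cluster configured: run 'pcs cluster start'")
    else:
        print("no cluster configured on this node: skip 'pcs cluster start'")
```

With a check of this shape, the presence of the pcs package alone would no longer trigger `pcs cluster start` on nodes that were never part of a cluster.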

The host upgrade was executed with the following command:
openstack overcloud upgrade run --yes \
        --stack dcn2 \
        --tags system_upgrade \
        --limit dcn2-computehci2-0

And the log from step 5 of the procedure:
PLAY [Upgrade tasks for step 5] ************************************************
[WARNING]: conditional statements should not include jinja2 templating
delimiters such as {{ }} or {% %}. Found: '{{ playbook_dir }}/{{
_task_file_path }}' is exists
2023-12-09 01:20:45.501524 | 5254009f-6055-e05f-9966-000000000030 |     TIMING | include_tasks | dcn2-computehci2-0 | 0:31:03.476240 | 0.07s
2023-12-09 01:20:45.606261 | 64500b99-dc31-4b7a-9c1c-7e750878536d |   INCLUDED | /home/stack/overcloud-deploy/dcn2/config-download/dcn2/ComputeHCI2/upgrade_tasks_step5.yaml | dcn2-computehci2-0
2023-12-09 01:20:45.648112 | 5254009f-6055-e05f-9966-00000000034a |       TASK | Create cinder image conversion directory
2023-12-09 01:20:45.703463 | 5254009f-6055-e05f-9966-00000000034a |    SKIPPED | Create cinder image conversion directory | dcn2-computehci2-0
2023-12-09 01:20:45.704772 | 5254009f-6055-e05f-9966-00000000034a |     TIMING | Create cinder image conversion directory | dcn2-computehci2-0 | 0:31:03.679508 | 0.05s
2023-12-09 01:20:45.734265 | 5254009f-6055-e05f-9966-00000000034b |       TASK | Mount cinder's image conversion NFS share
2023-12-09 01:20:45.788972 | 5254009f-6055-e05f-9966-00000000034b |    SKIPPED | Mount cinder's image conversion NFS share | dcn2-computehci2-0
2023-12-09 01:20:45.790543 | 5254009f-6055-e05f-9966-00000000034b |     TIMING | Mount cinder's image conversion NFS share | dcn2-computehci2-0 | 0:31:03.765278 | 0.05s
2023-12-09 01:20:45.822249 | 5254009f-6055-e05f-9966-00000000034d |       TASK | Mount Nova NFS Share
2023-12-09 01:20:45.873551 | 5254009f-6055-e05f-9966-00000000034d |    SKIPPED | Mount Nova NFS Share | dcn2-computehci2-0
2023-12-09 01:20:45.874921 | 5254009f-6055-e05f-9966-00000000034d |     TIMING | Mount Nova NFS Share | dcn2-computehci2-0 | 0:31:03.849658 | 0.05s
2023-12-09 01:20:45.904007 | 5254009f-6055-e05f-9966-00000000034f |       TASK | Check if pcs is present
2023-12-09 01:20:46.958806 | 5254009f-6055-e05f-9966-00000000034f |         OK | Check if pcs is present | dcn2-computehci2-0
2023-12-09 01:20:46.960144 | 5254009f-6055-e05f-9966-00000000034f |     TIMING | Check if pcs is present | dcn2-computehci2-0 | 0:31:04.934880 | 1.05s
2023-12-09 01:20:46.993346 | 5254009f-6055-e05f-9966-000000000350 |       TASK | Start pacemaker cluster after reboot
2023-12-09 01:20:48.776939 | 5254009f-6055-e05f-9966-000000000350 |      FATAL | Start pacemaker cluster after reboot | dcn2-computehci2-0 | error={"changed": false, "msg": "Command execution failed.\nCommand: `pcs cluster start`\nError: Error: cluster is not currently configured on this node\n"}
2023-12-09 01:20:48.778363 | 5254009f-6055-e05f-9966-000000000350 |     TIMING | Start pacemaker cluster after reboot | dcn2-computehci2-0 | 0:31:06.753091 | 1.78s


Version-Release number of selected component (if applicable):
openstack-tripleo-common-containers-15.4.1-17.1.20230927010819.el9ost.noarch
puppet-tripleo-14.2.3-17.1.20231102190827.40278e1.el9ost.noarch
ansible-tripleo-ipsec-11.0.1-17.1.20230620172008.b5559c8.el9ost.noarch
ansible-tripleo-ipa-0.3.1-17.1.20230627190951.8d29d9e.el9ost.noarch
ansible-role-tripleo-modify-image-1.5.1-17.1.20230621064242.b6eedb6.el9ost.noarch
python3-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
openstack-tripleo-common-15.4.1-17.1.20230927010819.el9ost.noarch
tripleo-ansible-3.3.1-17.1.20231101230823.4d015bf.el9ost.noarch
openstack-tripleo-heat-templates-14.3.1-17.1.20231103010823.el9ost.noarch
openstack-tripleo-validations-14.3.2-17.1.20231026020815.2b526f8.el9ost.noarch
python3-tripleoclient-16.5.1-17.1.20230927000827.f3599d0.el9ost.noarch
openstack-tripleo-image-elements-13.1.3-17.1.20230621111410.a641940.el9ost.noarch
openstack-tripleo-puppet-elements-14.1.3-17.1.20230810141019.b4e0cbd.el9ost.noarch


How reproducible:
Always

Steps to Reproduce:
1. Perform the system host upgrade of HCI nodes from RHEL 8.4 to RHEL 9.2 as part of the FFU procedure from 16.2 to 17.1

Additional info:

Comment 1 Marian Krcmarik 2023-12-10 17:12:30 UTC

I've actually observed the same behavior on the networker nodes too:
2023-12-05 21:50:25 | 2023-12-05 21:50:25.225274 | 525400c3-b998-5b88-81bc-00000000086d |      FATAL | Start pacemaker cluster after reboot | networker-0 | error={"changed": false, "msg": "Command execution failed.\nCommand: `pcs cluster start`\nError: Error: cluster is not currently configured on this node\n"}

Comment 13 errata-xmlrpc 2024-01-16 14:31:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.2 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0209

Comment 14 Lukas Bezdicka 2024-01-23 11:00:38 UTC

*** Bug 2237659 has been marked as a duplicate of this bug. ***