Bug 2208237

Summary: [FFU] Controller Nodes in MAINTENANCE state after Overcloud Ctlplane System Upgrade
Product: Red Hat OpenStack
Reporter: Ricardo Diaz <rdiazcam>
Component: rhosp-release
Assignee: Juan Badia Payno <jbadiapa>
Status: CLOSED NOTABUG
QA Contact: Arik Chernetsky <achernet>
Severity: high
Priority: medium
Version: 17.1 (Wallaby)
CC: ekuris, jbadiapa, jjoyce, jpretori
Target Milestone: ga
Keywords: TestOnly, Triaged, UpgradeBlocker
Hardware: x86_64
OS: Linux
Doc Type: No Doc Update
Last Closed: 2023-06-07 12:29:15 UTC
Type: Bug

Description Ricardo Diaz 2023-05-18 10:41:46 UTC
Description of problem:
After the Overcloud Ctlplane System Upgrade stage of the FFU to OSP17 completes without errors, the controllers are in MAINTENANCE state:
~~~
(undercloud) [stack@undercloud-0 ~]$ metalsmith list
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
| UUID                                 | Node Name    | Allocation UUID                      | Hostname           | State       | IP Addresses         |
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
| 6dca4b5f-ac03-4a04-9516-cb88fc148012 | compute-0    | b897d9a0-0ef7-4a3f-9c39-71eff5b9673d | computedpdksriov-0 | ACTIVE      | ctlplane=192.0.70.17 |
| 32199f18-b156-4e07-9570-56f6e90eb64c | compute-1    | 94c7c4e7-8c8d-4ea9-9bbe-017f43c4d134 | computedpdksriov-1 | ACTIVE      | ctlplane=192.0.70.14 |
| 46c84e87-4549-4d96-beea-ababb43e2236 | controller-0 | 7ff671a4-f665-477a-b733-c9e9a827ffa1 | controller-0       | MAINTENANCE | ctlplane=192.0.70.15 |
| f90624be-e487-4ad7-8a47-4649dd545c81 | controller-1 | a129feb1-52d0-41d1-9c96-d85f7e4559a2 | controller-1       | MAINTENANCE | ctlplane=192.0.70.9  |
| e296f959-b31a-436f-a87d-fc5febaac5b0 | controller-2 | 623fe14e-2ffb-488e-a7ca-e1f2bc79e007 | controller-2       | MAINTENANCE | ctlplane=192.0.70.6  |
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
~~~


Version-Release number of selected component (if applicable):
FFU 16.2 -> 17.1

How reproducible:
100%

Steps to Reproduce:
1. Run the Overcloud Ctlplane System Upgrade FFU OSP17 stage

Actual results:
Controller nodes are in MAINTENANCE state

Expected results:
Controller nodes must be in ACTIVE state

Additional info:

Comment 1 Ricardo Diaz 2023-05-18 12:31:12 UTC
It looks like there is no problem when unsetting maintenance for a controller:

(undercloud) [stack@undercloud-0 ~]$ metalsmith list
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
| UUID                                 | Node Name    | Allocation UUID                      | Hostname           | State       | IP Addresses         |
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
| 6dca4b5f-ac03-4a04-9516-cb88fc148012 | compute-0    | b897d9a0-0ef7-4a3f-9c39-71eff5b9673d | computedpdksriov-0 | ACTIVE      | ctlplane=192.0.70.17 |
| 32199f18-b156-4e07-9570-56f6e90eb64c | compute-1    | 94c7c4e7-8c8d-4ea9-9bbe-017f43c4d134 | computedpdksriov-1 | ACTIVE      | ctlplane=192.0.70.14 |
| 46c84e87-4549-4d96-beea-ababb43e2236 | controller-0 | 7ff671a4-f665-477a-b733-c9e9a827ffa1 | controller-0       | MAINTENANCE | ctlplane=192.0.70.15 |
| f90624be-e487-4ad7-8a47-4649dd545c81 | controller-1 | a129feb1-52d0-41d1-9c96-d85f7e4559a2 | controller-1       | MAINTENANCE | ctlplane=192.0.70.9  |
| e296f959-b31a-436f-a87d-fc5febaac5b0 | controller-2 | 623fe14e-2ffb-488e-a7ca-e1f2bc79e007 | controller-2       | MAINTENANCE | ctlplane=192.0.70.6  |
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+

(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node maintenance unset controller-0

(undercloud) [stack@undercloud-0 ~]$ metalsmith list
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
| UUID                                 | Node Name    | Allocation UUID                      | Hostname           | State       | IP Addresses         |
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
| 6dca4b5f-ac03-4a04-9516-cb88fc148012 | compute-0    | b897d9a0-0ef7-4a3f-9c39-71eff5b9673d | computedpdksriov-0 | ACTIVE      | ctlplane=192.0.70.17 |
| 32199f18-b156-4e07-9570-56f6e90eb64c | compute-1    | 94c7c4e7-8c8d-4ea9-9bbe-017f43c4d134 | computedpdksriov-1 | ACTIVE      | ctlplane=192.0.70.14 |
| 46c84e87-4549-4d96-beea-ababb43e2236 | controller-0 | 7ff671a4-f665-477a-b733-c9e9a827ffa1 | controller-0       | ACTIVE      | ctlplane=192.0.70.15 |
| f90624be-e487-4ad7-8a47-4649dd545c81 | controller-1 | a129feb1-52d0-41d1-9c96-d85f7e4559a2 | controller-1       | MAINTENANCE | ctlplane=192.0.70.9  |
| e296f959-b31a-436f-a87d-fc5febaac5b0 | controller-2 | 623fe14e-2ffb-488e-a7ca-e1f2bc79e007 | controller-2       | MAINTENANCE | ctlplane=192.0.70.6  |
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
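
For checking this state from automation rather than by eye, the table printed by `metalsmith list` can be parsed directly. The following is a minimal sketch, assuming the ASCII table layout shown above; the function names `parse_metalsmith_table` and `nodes_in_state` are hypothetical and not part of metalsmith.

```python
def parse_metalsmith_table(text):
    """Parse the ASCII table printed by `metalsmith list` into a list of dicts.

    Assumes the layout shown in this report: border lines start with "+",
    and the first "|" row is the header.
    """
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, shell prompts and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def nodes_in_state(text, state="MAINTENANCE"):
    """Return the node names whose State column equals `state`."""
    return [
        row["Node Name"]
        for row in parse_metalsmith_table(text)
        if row.get("State") == state
    ]


# Two rows copied from the output above, used as sample input.
SAMPLE = """
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
| UUID                                 | Node Name    | Allocation UUID                      | Hostname           | State       | IP Addresses         |
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
| 46c84e87-4549-4d96-beea-ababb43e2236 | controller-0 | 7ff671a4-f665-477a-b733-c9e9a827ffa1 | controller-0       | ACTIVE      | ctlplane=192.0.70.15 |
| f90624be-e487-4ad7-8a47-4649dd545c81 | controller-1 | a129feb1-52d0-41d1-9c96-d85f7e4559a2 | controller-1       | MAINTENANCE | ctlplane=192.0.70.9  |
+--------------------------------------+--------------+--------------------------------------+--------------------+-------------+----------------------+
"""

if __name__ == "__main__":
    print(nodes_in_state(SAMPLE))
```

Calling `nodes_in_state(output)` on the captured command output after the upgrade stage gives the list of nodes that still need attention.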

Comment 2 Ricardo Diaz 2023-05-18 17:06:48 UTC
It looks like the controller goes back to MAINTENANCE state after a few minutes.

Comment 3 Juan Badia Payno 2023-05-19 09:48:20 UTC
The issue with metalsmith on VMs is that IPMI is simulated with vbmc, and everything is installed on RHEL 8.4 in a virtualenv (python3.6).
Once the undercloud OS is upgraded to RHEL 9.2, vbmc no longer works. It needs to be reinstalled and restarted.
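
One way to confirm that the simulated BMC is the cause is to look at why the node was put into maintenance. The following is a minimal sketch, assuming the `openstack baremetal node show` command is available on the undercloud with the usual `-f json` and `-c` output options; the helper names `build_show_command` and `maintenance_details` are hypothetical. If vbmc is the problem, the reported reason would likely mention a power or IPMI failure, but that wording is an assumption and is not taken from this report.

```python
import json
import subprocess

# Node fields that describe why a node is in maintenance.
FIELDS = ("maintenance", "maintenance_reason", "last_error", "power_state")


def build_show_command(node):
    """Build the CLI command that prints the selected node fields as JSON."""
    cmd = ["openstack", "baremetal", "node", "show", node, "-f", "json"]
    for field in FIELDS:
        cmd += ["-c", field]
    return cmd


def maintenance_details(node, run=subprocess.run):
    """Run the command and return the selected fields as a dict.

    `run` is injectable so the function can be exercised without a
    real undercloud; by default it calls the actual CLI.
    """
    result = run(
        build_show_command(node),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

On the undercloud, `maintenance_details("controller-0")` would return the four fields for that node after sourcing the usual credentials.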

Comment 5 Jesse Pretorius 2023-06-07 12:29:15 UTC
This is an issue in the CI automation and would need to be solved in Infrared or through other CI automation changes. The issue is not in OSP.