Description of problem:
Customer upgraded from 16.1.6 to 16.1.7. After the upgrade they noticed that all the baremetal nodes went into maintenance with the following last_error:

~~~
| last_error | During sync_power_state, max retries exceeded for node 9271e535-a7d0-4e60-849e-02b0e83f2769, node state None does not match expected state 'None'. Updating DB state to 'None' Switching node to maintenance mode. Error: An exclusive lock is required, but the current context has a shared lock. |
~~~

Version-Release number of selected component (if applicable):
OpenStack 16.1.7 (minor update from 16.1.6)

How reproducible:
Customer has not attempted to reproduce.

Steps to Reproduce:
1. Deploy OSP 16.1.6
2. Perform the minor update to OSP 16.1.7
3. Verify the baremetal node status

Actual results:
All nodes go into maintenance:

~~~
$ openstack baremetal node list
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name              | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| 9271e535-a7d0-4e60-849e-02b0e83f2769 | controller-0      | f85c29ae-c0a8-409c-b282-0ec85ae35852 | None        | active             | True        |
| 5821e92e-fc8e-453f-9241-cec86b461d25 | controller-1      | 9a390b75-a888-4b6d-80e8-25abcdb6fb09 | None        | active             | True        |
| 5b56aa9e-e7e5-43f7-b19d-2968ce2716e4 | controller-2      | d1b4a31c-3564-42ad-a91e-342adccd8dac | None        | active             | True        |
| 8a0e788e-ec84-43ab-9250-bbc052294e33 | computeDell6152-0 | 4da51add-42bf-4f12-a48f-1bbf2558d2e6 | None        | active             | True        |
| e6ee252c-b202-43f2-a888-a11edbf2c44b | computeDell6152-1 | None                                 | None        | available          | True        |
| ce6c0e9e-c7d8-4de1-8b28-ca973566c045 | storage-0         | 3350c31f-2f15-449e-8b40-b13db30e0e77 | None        | active             | True        |
| cad62674-65cd-4636-ac85-7d2661e32d6f | storage-1         | df425cce-3066-4e9c-a6b5-bf0a81d05aca | None        | active             | True        |
| 40060dad-4e9f-469d-b5cd-50937248bc7b | storage-2         | 49157863-ef71-4677-a3e3-5446b92edbad | None        | active             | True        |
| f2c99b78-feb7-4e9a-996f-a5325f7a5329 | computeSriov-0    | 6bf99fe0-4e93-4fc4-ac21-0cdc6fa70ff0 | None        | active             | True        |
| 879f2998-714c-4966-9de5-db07ba8f2073 | computeSriov-1    | None                                 | None        | available          | True        |
| 13bcf2a1-7d9b-4f1b-b8bd-020cf2dffcdc | computeDell6230-0 | 4d4866fa-7484-4cc9-930d-b8710f43fbcf | None        | active             | True        |
| cc56446d-1985-4ea3-b6a2-ea82dff38c3c | computeDell6230-1 | None                                 | None        | available          | True        |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
~~~

Expected results:
Nodes report their power state and stay out of maintenance.

Additional info:
We performed the following on the undercloud to see if the state could be restored:

~~~
# systemctl restart tripleo_ironic_conductor.service
# systemctl restart tripleo_ironic_inspector.service
~~~

This allowed the state to be reset:

~~~
$ openstack baremetal node list
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name              | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| 9271e535-a7d0-4e60-849e-02b0e83f2769 | controller-0      | f85c29ae-c0a8-409c-b282-0ec85ae35852 | None        | active             | False       |
| 5821e92e-fc8e-453f-9241-cec86b461d25 | controller-1      | 9a390b75-a888-4b6d-80e8-25abcdb6fb09 | None        | active             | False       |
| 5b56aa9e-e7e5-43f7-b19d-2968ce2716e4 | controller-2      | d1b4a31c-3564-42ad-a91e-342adccd8dac | None        | active             | False       |
| 8a0e788e-ec84-43ab-9250-bbc052294e33 | computeDell6152-0 | 4da51add-42bf-4f12-a48f-1bbf2558d2e6 | None        | active             | False       |
| e6ee252c-b202-43f2-a888-a11edbf2c44b | computeDell6152-1 | None                                 | None        | available          | False       |
| ce6c0e9e-c7d8-4de1-8b28-ca973566c045 | storage-0         | 3350c31f-2f15-449e-8b40-b13db30e0e77 | None        | active             | False       |
| cad62674-65cd-4636-ac85-7d2661e32d6f | storage-1         | df425cce-3066-4e9c-a6b5-bf0a81d05aca | None        | active             | False       |
| 40060dad-4e9f-469d-b5cd-50937248bc7b | storage-2         | 49157863-ef71-4677-a3e3-5446b92edbad | None        | active             | False       |
| f2c99b78-feb7-4e9a-996f-a5325f7a5329 | computeSriov-0    | 6bf99fe0-4e93-4fc4-ac21-0cdc6fa70ff0 | None        | active             | False       |
| 879f2998-714c-4966-9de5-db07ba8f2073 | computeSriov-1    | None                                 | None        | available          | False       |
| 13bcf2a1-7d9b-4f1b-b8bd-020cf2dffcdc | computeDell6230-0 | 4d4866fa-7484-4cc9-930d-b8710f43fbcf | None        | active             | False       |
| cc56446d-1985-4ea3-b6a2-ea82dff38c3c | computeDell6230-1 | None                                 | None        | available          | False       |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
~~~

But they all went into maintenance again. This is seen in the logs:

~~~
2021-12-15 14:45:06.467 8 DEBUG ironic.conductor.task_manager [req-d3c43a10-e156-4abe-8b90-450fbaf74d16 - - - - -] Successfully released shared lock for power failure recovery on node 9271e535-a7d0-4e60-849e-02b0e83f2769 (lock was held 0.03 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:360
2021-12-15 14:45:06.484 8 DEBUG ironic.conductor.task_manager [req-d3c43a10-e156-4abe-8b90-450fbaf74d16 - - - - -] Attempting to get shared lock on node 5821e92e-fc8e-453f-9241-cec86b461d25 (for power failure recovery) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2021-12-15 14:45:06.523 8 DEBUG ironic.conductor.manager [req-d3c43a10-e156-4abe-8b90-450fbaf74d16 - - - - -] During power_failure_recovery, could not get power state for node 5821e92e-fc8e-453f-9241-cec86b461d25, Error: An exclusive lock is required, but the current context has a shared lock.. _power_failure_recovery /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:1932
~~~
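To make the failure mode easier to follow, here is a minimal, self-contained Python sketch of what the log above describes: the power failure recovery periodic only holds a shared (read) lock, while persisting a recovered power state and clearing maintenance are writes that require an exclusive lock. All names in the sketch (NodeTask, set_power_state, and so on) are hypothetical stand-ins, not the real ironic.conductor.task_manager API; only the error wording is taken from the log.

~~~
# Hypothetical stand-ins only -- not the real ironic.conductor code.

class ExclusiveLockRequired(Exception):
    """Raised when a write is attempted while only a shared lock is held."""


class NodeTask:
    """Minimal stand-in for a conductor task holding a lock on one node."""

    def __init__(self, node, shared=True):
        self.node = node
        self.shared = shared  # shared=True means a read-only lock

    def set_power_state(self, new_state):
        # Persisting node state is a write, so it requires an exclusive lock.
        if self.shared:
            raise ExclusiveLockRequired(
                "An exclusive lock is required, but the current context "
                "has a shared lock.")
        self.node["power_state"] = new_state


def power_failure_recovery(task, observed_state):
    """Sketch of the failing periodic: it only took a shared lock, so the
    write below raises and the node never gets out of maintenance."""
    try:
        task.set_power_state(observed_state)
        task.node["maintenance"] = False
    except ExclusiveLockRequired as err:
        print("During power_failure_recovery, could not get power state for "
              "node %s, Error: %s" % (task.node["uuid"], err))


node = {"uuid": "5821e92e-fc8e-453f-9241-cec86b461d25",
        "power_state": None, "maintenance": True}
power_failure_recovery(NodeTask(node, shared=True), "power on")   # fails, node stays in maintenance
power_failure_recovery(NodeTask(node, shared=False), "power on")  # an exclusive lock would succeed
~~~

With only a shared lock the recovery loop hits this error on every pass, which is consistent with the nodes flipping straight back into maintenance after the conductor and inspector restarts above.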
Here is the proposed backport for 16.1.x; the same fix is included in 16.2.2.
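The backported change itself is not reproduced here. As a rough illustration only, the usual way out of this class of conflict is to hold, or upgrade to, an exclusive lock before the recovery path writes node state. Reusing the hypothetical NodeTask stand-in from the sketch above (again, not the actual patch):

~~~
# Hypothetical sketch of the general approach, not the backported change.

def power_failure_recovery_fixed(task, observed_state):
    # Reads are fine under a shared lock; before writing, switch to an
    # exclusive lock so set_power_state() no longer raises.
    if task.node["power_state"] != observed_state:
        task.shared = False  # stand-in for an upgrade-to-exclusive-lock step
        task.set_power_state(observed_state)
        task.node["maintenance"] = False
~~~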
The fix is in compose RHOS-16.1-RHEL-8-20220121.n.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986