Description of problem:
Customer upgraded from 16.1.6 to 16.1.7. After the upgrade they noticed that all the baremetal nodes went into maintenance with the following last_error:

~~~
| last_error | During sync_power_state, max retries exceeded for node 9271e535-a7d0-4e60-849e-02b0e83f2769, node state None does not match expected state 'None'. Updating DB state to 'None' Switching node to maintenance mode. Error: An exclusive lock is required, but the current context has a shared lock. |
~~~

Version-Release number of selected component (if applicable):
OpenStack 16.1.7 (minor update from 16.1.6)

How reproducible:
Customer has not attempted to reproduce.

Steps to Reproduce:
1. Deploy OSP 16.1.6
2. Perform the minor update to OSP 16.1.7
3. Verify the baremetal node status

Actual results:
All nodes go into maintenance:

~~~
$ openstack baremetal node list
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name              | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| 9271e535-a7d0-4e60-849e-02b0e83f2769 | controller-0      | f85c29ae-c0a8-409c-b282-0ec85ae35852 | None        | active             | True        |
| 5821e92e-fc8e-453f-9241-cec86b461d25 | controller-1      | 9a390b75-a888-4b6d-80e8-25abcdb6fb09 | None        | active             | True        |
| 5b56aa9e-e7e5-43f7-b19d-2968ce2716e4 | controller-2      | d1b4a31c-3564-42ad-a91e-342adccd8dac | None        | active             | True        |
| 8a0e788e-ec84-43ab-9250-bbc052294e33 | computeDell6152-0 | 4da51add-42bf-4f12-a48f-1bbf2558d2e6 | None        | active             | True        |
| e6ee252c-b202-43f2-a888-a11edbf2c44b | computeDell6152-1 | None                                 | None        | available          | True        |
| ce6c0e9e-c7d8-4de1-8b28-ca973566c045 | storage-0         | 3350c31f-2f15-449e-8b40-b13db30e0e77 | None        | active             | True        |
| cad62674-65cd-4636-ac85-7d2661e32d6f | storage-1         | df425cce-3066-4e9c-a6b5-bf0a81d05aca | None        | active             | True        |
| 40060dad-4e9f-469d-b5cd-50937248bc7b | storage-2         | 49157863-ef71-4677-a3e3-5446b92edbad | None        | active             | True        |
| f2c99b78-feb7-4e9a-996f-a5325f7a5329 | computeSriov-0    | 6bf99fe0-4e93-4fc4-ac21-0cdc6fa70ff0 | None        | active             | True        |
| 879f2998-714c-4966-9de5-db07ba8f2073 | computeSriov-1    | None                                 | None        | available          | True        |
| 13bcf2a1-7d9b-4f1b-b8bd-020cf2dffcdc | computeDell6230-0 | 4d4866fa-7484-4cc9-930d-b8710f43fbcf | None        | active             | True        |
| cc56446d-1985-4ea3-b6a2-ea82dff38c3c | computeDell6230-1 | None                                 | None        | available          | True        |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
~~~

Expected results:
Nodes report their power state and stay out of maintenance.

Additional info:
We performed the following on the undercloud to see if the state could be restored:

~~~
# systemctl restart tripleo_ironic_conductor.service
# systemctl restart tripleo_ironic_inspector.service
~~~

This allowed the state to be reset:

~~~
$ openstack baremetal node list
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name              | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
| 9271e535-a7d0-4e60-849e-02b0e83f2769 | controller-0      | f85c29ae-c0a8-409c-b282-0ec85ae35852 | None        | active             | False       |
| 5821e92e-fc8e-453f-9241-cec86b461d25 | controller-1      | 9a390b75-a888-4b6d-80e8-25abcdb6fb09 | None        | active             | False       |
| 5b56aa9e-e7e5-43f7-b19d-2968ce2716e4 | controller-2      | d1b4a31c-3564-42ad-a91e-342adccd8dac | None        | active             | False       |
| 8a0e788e-ec84-43ab-9250-bbc052294e33 | computeDell6152-0 | 4da51add-42bf-4f12-a48f-1bbf2558d2e6 | None        | active             | False       |
| e6ee252c-b202-43f2-a888-a11edbf2c44b | computeDell6152-1 | None                                 | None        | available          | False       |
| ce6c0e9e-c7d8-4de1-8b28-ca973566c045 | storage-0         | 3350c31f-2f15-449e-8b40-b13db30e0e77 | None        | active             | False       |
| cad62674-65cd-4636-ac85-7d2661e32d6f | storage-1         | df425cce-3066-4e9c-a6b5-bf0a81d05aca | None        | active             | False       |
| 40060dad-4e9f-469d-b5cd-50937248bc7b | storage-2         | 49157863-ef71-4677-a3e3-5446b92edbad | None        | active             | False       |
| f2c99b78-feb7-4e9a-996f-a5325f7a5329 | computeSriov-0    | 6bf99fe0-4e93-4fc4-ac21-0cdc6fa70ff0 | None        | active             | False       |
| 879f2998-714c-4966-9de5-db07ba8f2073 | computeSriov-1    | None                                 | None        | available          | False       |
| 13bcf2a1-7d9b-4f1b-b8bd-020cf2dffcdc | computeDell6230-0 | 4d4866fa-7484-4cc9-930d-b8710f43fbcf | None        | active             | False       |
| cc56446d-1985-4ea3-b6a2-ea82dff38c3c | computeDell6230-1 | None                                 | None        | available          | False       |
+--------------------------------------+-------------------+--------------------------------------+-------------+--------------------+-------------+
~~~

But they all went into maintenance again. This is seen in the logs:

~~~
2021-12-15 14:45:06.467 8 DEBUG ironic.conductor.task_manager [req-d3c43a10-e156-4abe-8b90-450fbaf74d16 - - - - -] Successfully released shared lock for power failure recovery on node 9271e535-a7d0-4e60-849e-02b0e83f2769 (lock was held 0.03 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:360
2021-12-15 14:45:06.484 8 DEBUG ironic.conductor.task_manager [req-d3c43a10-e156-4abe-8b90-450fbaf74d16 - - - - -] Attempting to get shared lock on node 5821e92e-fc8e-453f-9241-cec86b461d25 (for power failure recovery) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2021-12-15 14:45:06.523 8 DEBUG ironic.conductor.manager [req-d3c43a10-e156-4abe-8b90-450fbaf74d16 - - - - -] During power_failure_recovery, could not get power state for node 5821e92e-fc8e-453f-9241-cec86b461d25, Error: An exclusive lock is required, but the current context has a shared lock.. _power_failure_recovery /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:1932
~~~
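To make the failure mode easier to follow, here is a minimal, self-contained Python sketch of what the log above describes: the power failure recovery periodic only holds a shared (read) lock, while persisting a recovered power state and clearing maintenance are writes that require an exclusive lock. All names in the sketch (NodeTask, set_power_state, and so on) are hypothetical stand-ins, not the real ironic.conductor.task_manager API; only the error wording is taken from the log.

~~~
# Hypothetical stand-ins only -- not the real ironic.conductor code.

class ExclusiveLockRequired(Exception):
    """Raised when a write is attempted while only a shared lock is held."""


class NodeTask:
    """Minimal stand-in for a conductor task holding a lock on one node."""

    def __init__(self, node, shared=True):
        self.node = node
        self.shared = shared  # shared=True means a read-only lock

    def set_power_state(self, new_state):
        # Persisting node state is a write, so it requires an exclusive lock.
        if self.shared:
            raise ExclusiveLockRequired(
                "An exclusive lock is required, but the current context "
                "has a shared lock.")
        self.node["power_state"] = new_state


def power_failure_recovery(task, observed_state):
    """Sketch of the failing periodic: it only took a shared lock, so the
    write below raises and the node never gets out of maintenance."""
    try:
        task.set_power_state(observed_state)
        task.node["maintenance"] = False
    except ExclusiveLockRequired as err:
        print("During power_failure_recovery, could not get power state for "
              "node %s, Error: %s" % (task.node["uuid"], err))


node = {"uuid": "5821e92e-fc8e-453f-9241-cec86b461d25",
        "power_state": None, "maintenance": True}
power_failure_recovery(NodeTask(node, shared=True), "power on")   # fails, node stays in maintenance
power_failure_recovery(NodeTask(node, shared=False), "power on")  # an exclusive lock would succeed
~~~

With only a shared lock the recovery loop hits this error on every pass, which is consistent with the nodes flipping straight back into maintenance after the conductor and inspector restarts above.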
Here is the proposed backport for 16.1.x; the same fix is included in 16.2.2.
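The backported change itself is not reproduced here. As a rough illustration only, the usual way out of this class of conflict is to hold, or upgrade to, an exclusive lock before the recovery path writes node state. Reusing the hypothetical NodeTask stand-in from the sketch above (again, not the actual patch):

~~~
# Hypothetical sketch of the general approach, not the backported change.

def power_failure_recovery_fixed(task, observed_state):
    # Reads are fine under a shared lock; before writing, switch to an
    # exclusive lock so set_power_state() no longer raises.
    if task.node["power_state"] != observed_state:
        task.shared = False  # stand-in for an upgrade-to-exclusive-lock step
        task.set_power_state(observed_state)
        task.node["maintenance"] = False
~~~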
The fix is in compose RHOS-16.1-RHEL-8-20220121.n.1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.8 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0986