Bug 1908405 - Nova is out of sync with ironic as hypervisor list in nova does not agree with ironic list after reboot of the nodes
Summary: Nova is out of sync with ironic as hypervisor list in nova does not agree with ironic list after reboot of the nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z2
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: melanie witt
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On: 1908546 2002737
Blocks: 2002773
 
Reported: 2020-12-16 15:59 UTC by Paras Babbar
Modified: 2023-03-21 19:38 UTC
CC: 10 users

Fixed In Version: openstack-nova-20.6.2-2.20211120015017.8906552.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1908546 2002773 (view as bug list)
Environment:
Last Closed: 2022-03-23 22:10:09 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System                  ID              Private  Priority  Status  Summary  Last Updated
Launchpad               1853009         0        None      None    None     2020-12-17 01:05:55 UTC
Red Hat Issue Tracker   OSP-5246        0        None      None    None     2021-11-30 20:30:58 UTC
Red Hat Product Errata  RHBA-2022:1001  0        None      None    None     2022-03-23 22:10:36 UTC

Description Paras Babbar 2020-12-16 15:59:54 UTC
Description of problem:
In one of our phase3 16.2 jobs, after the reboot of the nodes, nova hypervisor-list does not show the ironic hypervisor that was created, and we see the inconsistencies below:

TASK [ensure nova is caught up with node status - get uuids] *******************
changed: [undercloud-0] => {
    "changed": true,
    "cmd": "source /home/stack/overcloudrc\nopenstack baremetal node list -f value -c UUID\n",
    "delta": "0:00:03.576614",
    "end": "2020-12-08 03:22:41.492741",
    "rc": 0,
    "start": "2020-12-08 03:22:37.916127"
}

STDOUT:

4a0dab7d-a5e9-4956-a5cb-c3714be5813f

TASK [get nova count of nodes] *************************************************
FAILED - RETRYING: get nova count of nodes (120 retries left).
FAILED - RETRYING: get nova count of nodes (119 retries left).
FAILED - RETRYING: get nova count of nodes (118 retries left).
FAILED - RETRYING: get nova count of nodes (117 retries left).
FAILED - RETRYING: get nova count of nodes (116 retries left).
FAILED - RETRYING: get nova count of nodes (115 retries left). 
...
FAILED - RETRYING: get nova count of nodes (1 retries left).
fatal: [undercloud-0]: FAILED! => {
    "attempts": 120,
    "changed": true,
    "cmd": "source /home/stack/overcloudrc\nnova hypervisor-list |grep -F 4a0dab7d-a5e9-4956-a5cb-c3714be5813f |wc -l\n",
    "delta": "0:00:02.199633",
    "end": "2020-12-08 03:48:55.366401",
    "rc": 0,
    "start": "2020-12-08 03:48:53.166768"
}

STDOUT:

0

The node was clearly listed in the hypervisor list before the reboot of the nodes:

$nova hypervisor-list
+--------------------------------------+--------------------------------------+-------+---------+
| ID                                   | Hypervisor hostname                  | State | Status  |
+--------------------------------------+--------------------------------------+-------+---------+
| d2365afd-f5e8-4842-ae49-101e9b261b0b | compute-1.redhat.local               | up    | enabled |
| 88f12987-e9d5-4d0d-966b-e24a16f294e7 | compute-0.redhat.local               | up    | enabled |
| 4a0dab7d-a5e9-4956-a5cb-c3714be5813f | 4a0dab7d-a5e9-4956-a5cb-c3714be5813f | up    | enabled |
| c2e8de2a-2d8c-4dcf-b0fa-ff3f70b4eecd | c2e8de2a-2d8c-4dcf-b0fa-ff3f70b4eecd | up    | enabled |
+--------------------------------------+--------------------------------------+-------+---------+

Version-Release number of selected component (if applicable):


How reproducible:
Every time

Steps to Reproduce:
1. Create ironic nodes.
2. Add them to the hypervisor list.
3. Reboot the nodes using virsh destroy/start.
4. Validate the hypervisor list (a sketch follows below).
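
A minimal reproduction/validation sketch, assuming a TripleO-style virtual environment where the overcloud nodes are libvirt guests on a virt host (the guest names below are illustrative assumptions):

  # On the virt host: hard-reboot the overcloud guests (names are an assumption)
  for vm in controller-0 controller-1 controller-2 compute-0 compute-1; do
      virsh destroy "$vm"
      virsh start "$vm"
  done

  # From the undercloud, once the overcloud is reachable again:
  source /home/stack/overcloudrc
  openstack baremetal node list -f value -c UUID     # ironic's view
  nova hypervisor-list                               # nova's view
  # Every ironic UUID should appear in the hypervisor list; in this bug the
  # count below stays at 0 for the affected node:
  nova hypervisor-list | grep -F 4a0dab7d-a5e9-4956-a5cb-c3714be5813f | wc -l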

Actual results:
The nova hypervisor list does not agree with the ironic node list.

Expected results:
The nova hypervisor list should match the ironic node list.

Additional info:

Comment 2 melanie witt 2020-12-16 18:13:59 UTC
I'm not sure what's going on here, but I see the following trace in the ironic app.log on all 3 controllers:

2020-12-08 03:02:20.303 12 DEBUG ironic.common.hash_ring [req-ee831d28-665a-4204-8122-dc7759fd70f6 ffa8c71de4b447a897b18f395da0754e 7226696dd35345bdaddabfe302594c9c - default default] Rebuilding cached hash rings ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:57
2020-12-08 03:02:20.321 12 DEBUG ironic.common.hash_ring [req-ee831d28-665a-4204-8122-dc7759fd70f6 ffa8c71de4b447a897b18f395da0754e 7226696dd35345bdaddabfe302594c9c - default default] Finished rebuilding hash rings, available drivers are :fake-hardware, :ipmi ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61
2020-12-08 03:02:22.385 12 DEBUG wsme.api [req-ee831d28-665a-4204-8122-dc7759fd70f6 ffa8c71de4b447a897b18f395da0754e 7226696dd35345bdaddabfe302594c9c - default default] Client-side error: Node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f is locked by host controller-2.redhat.local, please retry after the current operation is completed.
Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ironic/conductor/manager.py", line 3423, in vif_detach
    purpose='detach vif') as task:
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 161, in acquire
    return TaskManager(context, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 243, in __init__
    self.release_resources()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 224, in __init__
    self._lock()
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 284, in _lock
    reserve_node()
  File "/usr/lib/python3.6/site-packages/retrying.py", line 68, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python3.6/site-packages/retrying.py", line 229, in call
    raise attempt.get()
  File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 277, in reserve_node
    self.node_id)
  File "/usr/lib/python3.6/site-packages/ironic/objects/node.py", line 340, in reserve
    db_node = cls.dbapi.reserve_node(tag, node_id)
  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper
    ectxt.value = e.inner_exc
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ironic/db/sqlalchemy/api.py", line 450, in reserve_node
    host=node['reservation'])

ironic.common.exception.NodeLocked: Node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f is locked by host controller-2.redhat.local, please retry after the current operation is completed.
 format_exception /usr/lib/python3.6/site-packages/wsme/api.py:222


and there is no occurrence of the uuid 4a0dab7d-a5e9-4956-a5cb-c3714be5813f in either nova-compute.log.

Comment 3 melanie witt 2020-12-17 01:05:57 UTC
After spending more time digging into the logs I think this bug is a duplicate of https://bugs.launchpad.net/nova/+bug/1853009 upstream.

The tl;dr is that when there are multiple nova-compute services using ironic and they start at the same time and race, they can take ownership of one another's ironic nodes during a hash ring rebalance, resulting in compute nodes being deleted from nova and never restored. (One nova-compute deletes the compute node out from under another nova-compute.)

I think the only possible workaround for this bug is to avoid starting multiple nova-computes using ironic at the same time and pause between the start of each one.

Details from this job run:

At first, node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f was owned by controller-0:

2020-12-08 02:31:20.179 8 WARNING nova.compute.resource_tracker [req-867b657a-3c5e-409f-85e1-39b780d206c3 - - - - -] No compute node record for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-0.redhat.local could not be found.
2020-12-08 02:31:20.236 8 INFO nova.compute.resource_tracker [req-867b657a-3c5e-409f-85e1-39b780d206c3 - - - - -] Compute node record created for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f with uuid: 4a0dab7d-a5e9-4956-a5cb-c3714be5813f

Then it moved to controller-2:

2020-12-08 02:31:38.686 8 WARNING nova.compute.resource_tracker [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] No compute node record for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-2.redhat.local could not be found.
2020-12-08 02:31:38.702 8 INFO nova.compute.resource_tracker [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-0.redhat.local to controller-2.redhat.local
[...]
2020-12-08 02:31:38.820 8 DEBUG nova.virt.ironic.driver [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] Node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f is not ready for a deployment, reporting resources as reserved for it. Node's provision state is manageable, power state is power off and maintenance is False. update_provider_tree /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:944
[...]
2020-12-08 02:31:38.918 8 DEBUG nova.compute.resource_tracker [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] Compute_service record updated for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

Then it moved to controller-1:

2020-12-08 03:12:21.533 7 WARNING nova.compute.resource_tracker [req-c82c664d-0d99-46f3-9f7f-8491185ebff5 - - - - -] No compute node record for controller-1.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-1.redhat.local could not be found.
2020-12-08 03:12:21.562 7 INFO nova.compute.resource_tracker [req-c82c664d-0d99-46f3-9f7f-8491185ebff5 - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-2.redhat.local to controller-1.redhat.local
[...]
2020-12-08 03:12:22.883 7 DEBUG nova.compute.resource_tracker [req-c82c664d-0d99-46f3-9f7f-8491185ebff5 - - - - -] Compute_service record updated for controller-1.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

Then it moved to controller-2:

2020-12-08 03:12:21.700 7 WARNING nova.compute.resource_tracker [req-96bf0299-1048-4afb-9648-93b8432ec7ef - - - - -] No compute node record for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-2.redhat.local could not be found.
2020-12-08 03:12:21.729 7 INFO nova.compute.resource_tracker [req-96bf0299-1048-4afb-9648-93b8432ec7ef - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-1.redhat.local to controller-2.redhat.local
[...]
2020-12-08 03:12:22.118 7 DEBUG nova.compute.resource_tracker [req-96bf0299-1048-4afb-9648-93b8432ec7ef - - - - -] Compute_service record updated for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

Controller-2 is seen as down.

Then it moved to controller-0:

2020-12-08 03:12:23.881 7 WARNING nova.compute.resource_tracker [req-a7473231-a8ac-4cd3-80a8-026fe7fdb21a - - - - -] No compute node record for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-0.redhat.local could not be found.
2020-12-08 03:12:23.900 7 INFO nova.compute.resource_tracker [req-a7473231-a8ac-4cd3-80a8-026fe7fdb21a - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-2.redhat.local to controller-0.redhat.local
[...]
2020-12-08 03:12:24.769 7 DEBUG nova.compute.resource_tracker [req-a7473231-a8ac-4cd3-80a8-026fe7fdb21a - - - - -] Compute_service record updated for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

Then controller-0 decided the node was owned by controller-2 and deleted its own compute node record. controller-0 nova-compute.log:

2020-12-08 03:12:31.096 7 DEBUG nova.virt.ironic.driver [req-41833860-5ddc-45e5-aa62-e5c1c1e71ae0 - - - - -] Hash ring members are {'controller-0.redhat.local', 'controller-2.redhat.local'} _refresh_hash_ring /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:795
2020-12-08 03:12:31.174 7 DEBUG nova.virt.ironic.driver [req-41833860-5ddc-45e5-aa62-e5c1c1e71ae0 - - - - -] Returning 0 available node(s) get_available_nodes /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:885
2020-12-08 03:12:31.192 7 INFO nova.compute.manager [req-41833860-5ddc-45e5-aa62-e5c1c1e71ae0 - - - - -] Deleting orphan compute node 8 hypervisor host is 4a0dab7d-a5e9-4956-a5cb-c3714be5813f, nodes are set()

Finally, no nova-compute ever re-created the compute node, so it never appears in the hypervisor list again.
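
To spot this pattern in other deployments, one can grep the controllers' nova-compute logs for the resource tracker and ironic driver messages quoted above (the log path assumes a containerized TripleO deployment):

  # Run on each controller
  grep -E "moving from|Hash ring members|Deleting orphan compute node" \
      /var/log/containers/nova/nova-compute.log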

I will work on refreshing the upstream patches and clone this rhbz for OSP17 where the master changes will land.

Comment 5 melanie witt 2020-12-17 03:32:32 UTC
(In reply to melanie witt from comment #3)
> I think the only possible workaround for this bug is to avoid starting
> multiple nova-computes using ironic at the same time and pause between the
> start of each one.

Correction: it would be best not to do an arbitrary pause between starting nova-computes, but rather to start one, wait for its service to show as "up", then start the next, and repeat. That way the bug cannot re-emerge due to timing issues or slowness in CI.
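
A rough sketch of that procedure, assuming a TripleO overcloud where the ironic nova-compute service runs on the controllers as a systemd-managed container (the unit name, SSH user, and hostnames are assumptions; adjust for the environment):

  source /home/stack/overcloudrc
  for host in controller-0 controller-1 controller-2; do
      # Start the nova-compute service on this controller only
      ssh heat-admin@$host "sudo systemctl start tripleo_nova_compute.service"
      # Wait until nova reports this compute service as "up" before moving on
      until openstack compute service list --service nova-compute -f value -c Host -c State \
            | grep -q "^${host}\..* up$"; do
          sleep 10
      done
  done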

Comment 8 Jason Paroly 2021-10-08 20:51:00 UTC
@

Comment 9 Jason Paroly 2021-10-08 20:52:01 UTC
@

Comment 28 errata-xmlrpc 2022-03-23 22:10:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1001

