Bug 1908405
| Summary: | Nova is out of sync with ironic as hypervisor list in nova does not agree with ironic list after reboot of the nodes | | |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Paras Babbar <pbabbar> |
| Component: | openstack-nova | Assignee: | melanie witt <mwitt> |
| Status: | CLOSED ERRATA | QA Contact: | OSP DFG:Compute <osp-dfg-compute> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 16.2 (Train) | CC: | dasmith, eglynn, jhakimra, jparoly, kchamart, mwitt, sbauza, sgordon, stephenfin, vromanso |
| Target Milestone: | z2 | Keywords: | Patch, Triaged |
| Target Release: | 16.2 (Train on RHEL 8.4) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-nova-20.6.2-2.20211120015017.8906552.el8ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1908546 2002773 (view as bug list) | Environment: | |
| Last Closed: | 2022-03-23 22:10:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1908546, 2002737 | | |
| Bug Blocks: | 2002773 | | |
Description
Paras Babbar
2020-12-16 15:59:54 UTC
Not sure what's going on here, but I see the following trace in the ironic app.log on all 3 controllers:

2020-12-08 03:02:20.303 12 DEBUG ironic.common.hash_ring [req-ee831d28-665a-4204-8122-dc7759fd70f6 ffa8c71de4b447a897b18f395da0754e 7226696dd35345bdaddabfe302594c9c - default default] Rebuilding cached hash rings ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:57
2020-12-08 03:02:20.321 12 DEBUG ironic.common.hash_ring [req-ee831d28-665a-4204-8122-dc7759fd70f6 ffa8c71de4b447a897b18f395da0754e 7226696dd35345bdaddabfe302594c9c - default default] Finished rebuilding hash rings, available drivers are :fake-hardware, :ipmi ring /usr/lib/python3.6/site-packages/ironic/common/hash_ring.py:61
2020-12-08 03:02:22.385 12 DEBUG wsme.api [req-ee831d28-665a-4204-8122-dc7759fd70f6 ffa8c71de4b447a897b18f395da0754e 7226696dd35345bdaddabfe302594c9c - default default] Client-side error: Node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f is locked by host controller-2.redhat.local, please retry after the current operation is completed.
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 235, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ironic/conductor/manager.py", line 3423, in vif_detach
    purpose='detach vif') as task:
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 161, in acquire
    return TaskManager(context, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 243, in __init__
    self.release_resources()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 224, in __init__
    self._lock()
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 284, in _lock
    reserve_node()
  File "/usr/lib/python3.6/site-packages/retrying.py", line 68, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python3.6/site-packages/retrying.py", line 229, in call
    raise attempt.get()
  File "/usr/lib/python3.6/site-packages/retrying.py", line 261, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/retrying.py", line 217, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py", line 277, in reserve_node
    self.node_id)
  File "/usr/lib/python3.6/site-packages/ironic/objects/node.py", line 340, in reserve
    db_node = cls.dbapi.reserve_node(tag, node_id)
  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 154, in wrapper
    ectxt.value = e.inner_exc
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/oslo_db/api.py", line 142, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ironic/db/sqlalchemy/api.py", line 450, in reserve_node
    host=node['reservation'])
ironic.common.exception.NodeLocked: Node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f is locked by host controller-2.redhat.local, please retry after the current operation is completed. format_exception /usr/lib/python3.6/site-packages/wsme/api.py:222

There is no occurrence of the uuid 4a0dab7d-a5e9-4956-a5cb-c3714be5813f in any of the nova-compute logs.

After spending more time digging into the logs, I think this bug is a duplicate of https://bugs.launchpad.net/nova/+bug/1853009 upstream. The tl;dr is that when there are multiple nova-compute services using ironic and they start at the same time and race, it is possible for them to take ownership of one another's ironic nodes during a hash ring rebalance, resulting in the deletion of compute node records from nova that are never restored. (One nova-compute deletes the compute node out from under another nova-compute.)

I think the only possible workaround for this bug is to avoid starting multiple nova-computes using ironic at the same time, and to pause between the start of each one.

Details from this job run:

At first, node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f was owned by controller-0:

2020-12-08 02:31:20.179 8 WARNING nova.compute.resource_tracker [req-867b657a-3c5e-409f-85e1-39b780d206c3 - - - - -] No compute node record for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-0.redhat.local could not be found.
2020-12-08 02:31:20.236 8 INFO nova.compute.resource_tracker [req-867b657a-3c5e-409f-85e1-39b780d206c3 - - - - -] Compute node record created for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f with uuid: 4a0dab7d-a5e9-4956-a5cb-c3714be5813f

Then it moved to controller-2:

2020-12-08 02:31:38.686 8 WARNING nova.compute.resource_tracker [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] No compute node record for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-2.redhat.local could not be found.
2020-12-08 02:31:38.702 8 INFO nova.compute.resource_tracker [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-0.redhat.local to controller-2.redhat.local
[...]
2020-12-08 02:31:38.820 8 DEBUG nova.virt.ironic.driver [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] Node 4a0dab7d-a5e9-4956-a5cb-c3714be5813f is not ready for a deployment, reporting resources as reserved for it. Node's provision state is manageable, power state is power off and maintenance is False. update_provider_tree /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:944
[...]
2020-12-08 02:31:38.918 8 DEBUG nova.compute.resource_tracker [req-a8e3477f-4113-427d-9014-0dd15e3d705e - - - - -] Compute_service record updated for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

Then it moved to controller-1:

2020-12-08 03:12:21.533 7 WARNING nova.compute.resource_tracker [req-c82c664d-0d99-46f3-9f7f-8491185ebff5 - - - - -] No compute node record for controller-1.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-1.redhat.local could not be found.
2020-12-08 03:12:21.562 7 INFO nova.compute.resource_tracker [req-c82c664d-0d99-46f3-9f7f-8491185ebff5 - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-2.redhat.local to controller-1.redhat.local
[...]
2020-12-08 03:12:22.883 7 DEBUG nova.compute.resource_tracker [req-c82c664d-0d99-46f3-9f7f-8491185ebff5 - - - - -] Compute_service record updated for controller-1.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

Then it moved to controller-2:

2020-12-08 03:12:21.700 7 WARNING nova.compute.resource_tracker [req-96bf0299-1048-4afb-9648-93b8432ec7ef - - - - -] No compute node record for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-2.redhat.local could not be found.
2020-12-08 03:12:21.729 7 INFO nova.compute.resource_tracker [req-96bf0299-1048-4afb-9648-93b8432ec7ef - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-1.redhat.local to controller-2.redhat.local
[...]
2020-12-08 03:12:22.118 7 DEBUG nova.compute.resource_tracker [req-96bf0299-1048-4afb-9648-93b8432ec7ef - - - - -] Compute_service record updated for controller-2.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

Controller-2 is seen as down. Then the node moved to controller-0:

2020-12-08 03:12:23.881 7 WARNING nova.compute.resource_tracker [req-a7473231-a8ac-4cd3-80a8-026fe7fdb21a - - - - -] No compute node record for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f: nova.exception_Remote.ComputeHostNotFound_Remote: Compute host controller-0.redhat.local could not be found.
2020-12-08 03:12:23.900 7 INFO nova.compute.resource_tracker [req-a7473231-a8ac-4cd3-80a8-026fe7fdb21a - - - - -] ComputeNode 4a0dab7d-a5e9-4956-a5cb-c3714be5813f moving from controller-2.redhat.local to controller-0.redhat.local
[...]
2020-12-08 03:12:24.769 7 DEBUG nova.compute.resource_tracker [req-a7473231-a8ac-4cd3-80a8-026fe7fdb21a - - - - -] Compute_service record updated for controller-0.redhat.local:4a0dab7d-a5e9-4956-a5cb-c3714be5813f _update_available_resource /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:974

Then controller-0 decided the node was owned by controller-2 and deleted its own compute node record. From the controller-0 nova-compute.log:

2020-12-08 03:12:31.096 7 DEBUG nova.virt.ironic.driver [req-41833860-5ddc-45e5-aa62-e5c1c1e71ae0 - - - - -] Hash ring members are {'controller-0.redhat.local', 'controller-2.redhat.local'} _refresh_hash_ring /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:795
2020-12-08 03:12:31.174 7 DEBUG nova.virt.ironic.driver [req-41833860-5ddc-45e5-aa62-e5c1c1e71ae0 - - - - -] Returning 0 available node(s) get_available_nodes /usr/lib/python3.6/site-packages/nova/virt/ironic/driver.py:885
2020-12-08 03:12:31.192 7 INFO nova.compute.manager [req-41833860-5ddc-45e5-aa62-e5c1c1e71ae0 - - - - -] Deleting orphan compute node 8 hypervisor host is 4a0dab7d-a5e9-4956-a5cb-c3714be5813f, nodes are set()

Finally, no nova-compute ever re-created the compute node, so it never appears in the hypervisor list again.

I will work on refreshing the upstream patches and clone this rhbz for OSP17, where the master changes will land.
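To make the race easier to see: each nova-compute periodically rebuilds a hash ring from the set of compute services it believes is alive, and claims the ironic nodes that hash to itself. The sketch below is a toy rendezvous-hash model, not the actual ring implementation nova's ironic driver uses; the hostnames and node uuid are taken from the logs above. It shows how two services holding different membership snapshots can disagree about who owns a node, so one can delete a compute node record that another still expects to exist.

```python
import hashlib

def owner(node_uuid, members):
    # Rank every member against the node and pick the winner. All
    # services agree on the owner only if they see the same member list.
    return max(members,
               key=lambda m: hashlib.md5((m + node_uuid).encode()).hexdigest())

node = "4a0dab7d-a5e9-4956-a5cb-c3714be5813f"

# Two services racing at startup can hold different membership snapshots:
view_a = ["controller-0.redhat.local", "controller-2.redhat.local"]
view_b = ["controller-0.redhat.local", "controller-1.redhat.local",
          "controller-2.redhat.local"]

# The computed owner can differ between the two views; a service that
# concludes it no longer owns the node deletes its compute node record
# as an orphan, even though another service still tracks the node.
print(owner(node, view_a), owner(node, view_b))
```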
(In reply to melanie witt from comment #3)
> I think the only possible workaround for this bug is to avoid starting
> multiple nova-computes using ironic at the same time and pause between the
> start of each one.

Correction: it would be best not to use an arbitrary pause between starting nova-computes, but rather to start one, wait for it to show as service "up", then start the next one, and repeat. That way the bug cannot re-emerge because of timing issues or slowness in CI.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.2), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1001
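For deployments that cannot take the errata yet, a minimal sketch of the corrected workaround above (start one ironic nova-compute, wait until it reports "up", then start the next) could look like the following. It assumes python-openstackclient is installed with admin credentials loaded; start_compute() is a hypothetical placeholder for however the deployment actually starts the service (systemd unit, container, etc.), and the hostnames are the ones from this job run.

```python
import json
import subprocess
import time

def compute_service_state(host):
    """Return the nova-compute service state ("up"/"down") for `host`."""
    out = subprocess.check_output(
        ["openstack", "compute", "service", "list",
         "--service", "nova-compute", "-f", "json"])
    for svc in json.loads(out):
        if svc["Host"] == host:
            return svc["State"]
    return None

def start_compute(host):
    """Hypothetical placeholder: start nova-compute on `host`."""
    raise NotImplementedError

# Start the ironic nova-compute services strictly one at a time, waiting
# for each to report "up" before starting the next, so no two services
# rebuild the hash ring with conflicting views of the membership.
for host in ("controller-0.redhat.local",
             "controller-1.redhat.local",
             "controller-2.redhat.local"):
    start_compute(host)
    while compute_service_state(host) != "up":
        time.sleep(5)
```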