Thank you for providing so much detail in this bug report. With the data you pasted and the logs in the sosreport, I can see how the build requests are getting orphaned and how the instance_mappings.cell_id values end up NULL during periods when rabbitmq disconnects/failures are occurring.

The TL;DR is that there are points along the code path where we emit notifications via RPC onto rabbitmq and we're not gracefully handling failure to send those notification messages, so we miss carrying out database updates like deleting a build request record or updating an instance mapping record with the cell0 cell_id.

We can see this in the Queens code [1]: the nova_cell0.instances record is created, then we try to _set_vm_state_and_notify but that fails because of a rabbitmq messaging error, so we never reach the code that updates nova_api.instance_mappings.cell_id for the instance and we also never reach the code that destroys the nova_api.build_requests record for the instance:

...
        with obj_target_cell(instance, cell0) as cctxt:
            instance.create()

            # NOTE(mnaser): In order to properly clean-up volumes after
            # being buried in cell0, we need to store BDMs.
            if block_device_mapping:
                self._create_block_device_mapping(
                    cell0, instance.flavor, instance.uuid,
                    block_device_mapping)

            self._create_tags(cctxt, instance.uuid, tags)

            # Use the context targeted to cell0 here since the instance is
            # now in cell0.
            self._set_vm_state_and_notify(
                cctxt, instance.uuid, 'build_instances', updates, exc,
                request_spec)

        try:
            # We don't need the cell0-targeted context here because the
            # instance mapping is in the API DB.
            inst_mapping = \
                objects.InstanceMapping.get_by_instance_uuid(
                    context, instance.uuid)
            inst_mapping.cell_mapping = cell0
            inst_mapping.save()
        except exception.InstanceMappingNotFound:
            pass

        for build_request in build_requests:
            try:
                build_request.destroy()
...

There was a recent patch in Ussuri that handles the nova_api.instance_mappings.cell_id update part of this: https://review.opendev.org/683730, but it does not handle the orphaning of build requests in this scenario.

It seems like maybe we should just call _set_vm_state_and_notify last, after we've done all of the database update work. Note that this would not help us in the case of, say, failing database writes in a degraded cluster, but it would handle the message queue part of it. Graceful handling of failing database writes would involve additional hardening.

[1] https://github.com/openstack/nova/blob/bea91b8d58d909852949726296149d93f2c639d5/nova/conductor/manager.py#L1133
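To make that suggestion concrete, here is a rough sketch (not a tested patch) of what calling _set_vm_state_and_notify last could look like, mirroring the structure of the excerpt above and reusing its names. The except clause around the notify call is my assumption about how a notification failure would be tolerated; a real fix would need to decide exactly which oslo.messaging exceptions are safe to swallow and what to log (the sketch assumes the module-level LOG and the "import oslo_messaging as messaging" that conductor/manager.py already has):

...
        with obj_target_cell(instance, cell0) as cctxt:
            instance.create()

            if block_device_mapping:
                self._create_block_device_mapping(
                    cell0, instance.flavor, instance.uuid,
                    block_device_mapping)

            self._create_tags(cctxt, instance.uuid, tags)

        # Do the API DB bookkeeping first so that a rabbitmq failure can no
        # longer prevent these writes from happening.
        try:
            inst_mapping = \
                objects.InstanceMapping.get_by_instance_uuid(
                    context, instance.uuid)
            inst_mapping.cell_mapping = cell0
            inst_mapping.save()
        except exception.InstanceMappingNotFound:
            pass

        for build_request in build_requests:
            try:
                build_request.destroy()
            except exception.BuildRequestNotFound:
                # Already gone (e.g. the instance was deleted meanwhile),
                # nothing left to clean up.
                pass

        # Notify last: if rabbitmq is down we lose only the notification,
        # not the database updates above.
        with obj_target_cell(instance, cell0) as cctxt:
            try:
                self._set_vm_state_and_notify(
                    cctxt, instance.uuid, 'build_instances', updates, exc,
                    request_spec)
            except messaging.MessagingException:
                LOG.warning('Failed to send notification about instance %s '
                            'being buried in cell0', instance.uuid)
...

The important part is only the ordering: everything that writes to the nova_api database happens before anything that has to talk to rabbitmq, so a messaging failure can no longer orphan the build request or leave instance_mappings.cell_id NULL.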
Patch has been in upstream stable/wallaby for some time.
Just wanted to note, per comment 1, that this rhbz is only partially fixed at this point in time. That is, "vm creation interrupted by cluster degradation results in orphaned build requests or required nova cell information being unpopulated" describes two separate issues that require separate fixes.

Patch https://review.opendev.org/683730 fixed the "required nova cell information being unpopulated" part of this rhbz. The "orphaned build requests" part of this rhbz is not yet fixed. Because this rhbz is MODIFIED -> ON_QA now, I will open a new rhbz for the orphaned build requests bug.
(In reply to melanie witt from comment #8)
> Patch https://review.opendev.org/683730 fixed the "required nova cell
> information being unpopulated" part of this rhbz.

Update: it was discovered that NULL cell_id can still happen under certain circumstances. A follow-up fix is being worked on in https://bugzilla.redhat.com/show_bug.cgi?id=2112579.

> The "orphaned build requests" part of this rhbz is not yet fixed. Because
> this rhbz is MODIFIED -> ON_QA now, I will open a new rhbz for the orphaned
> build requests bug.

And it turns out that https://bugzilla.redhat.com/show_bug.cgi?id=1702048 fixed the "orphaned build requests" problem, so no further action is actually needed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543