Thank you for providing so much detail in this bug report. With the data you pasted and the logs in the sosreport, I can see how the build requests are getting orphaned and how the instance_mappings.cell_id values end up NULL during periods when rabbitmq disconnects/failures are occurring.

The TL;DR is that there are points along the code path where we emit notifications via RPC onto rabbitmq and we're not gracefully handling failure to send those notification messages, so we miss carrying out database updates like deleting a build request record or updating an instance mapping record with the cell0 cell_id.

We can see this in the Queens code [1]: the nova_cell0.instances record is created, then we try to _set_vm_state_and_notify but that fails because of a rabbitmq messaging error, so we never reach the code that updates nova_api.instance_mappings.cell_id for the instance and we also never reach the code that destroys the nova_api.build_requests record for the instance:

...
        with obj_target_cell(instance, cell0) as cctxt:
            instance.create()

            # NOTE(mnaser): In order to properly clean-up volumes after
            # being buried in cell0, we need to store BDMs.
            if block_device_mapping:
                self._create_block_device_mapping(
                    cell0, instance.flavor, instance.uuid,
                    block_device_mapping)

            self._create_tags(cctxt, instance.uuid, tags)

            # Use the context targeted to cell0 here since the instance is
            # now in cell0.
            self._set_vm_state_and_notify(
                cctxt, instance.uuid, 'build_instances', updates, exc,
                request_spec)

        try:
            # We don't need the cell0-targeted context here because the
            # instance mapping is in the API DB.
            inst_mapping = \
                objects.InstanceMapping.get_by_instance_uuid(
                    context, instance.uuid)
            inst_mapping.cell_mapping = cell0
            inst_mapping.save()
        except exception.InstanceMappingNotFound:
            pass

        for build_request in build_requests:
            try:
                build_request.destroy()
...

There was a recent patch in Ussuri that handles the nova_api.instance_mappings.cell_id update part of this: https://review.opendev.org/683730, but it does not handle the orphaning of build requests in this scenario.

It seems like maybe we should just call _set_vm_state_and_notify last, after we've done all of the database update work. Note that this would not help us in the case of, say, failing database writes in a degraded cluster, but it would handle the message queue part of it. Graceful handling of failing database writes would involve additional hardening.

[1] https://github.com/openstack/nova/blob/bea91b8d58d909852949726296149d93f2c639d5/nova/conductor/manager.py#L1133
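To make that suggestion concrete, here is a rough sketch (not a tested patch) of what calling _set_vm_state_and_notify last could look like, mirroring the structure of the excerpt above and reusing its names. The except clause around the notify call is my assumption about how a notification failure would be tolerated; a real fix would need to decide exactly which oslo.messaging exceptions are safe to swallow and what to log (the sketch assumes the module-level LOG and the "import oslo_messaging as messaging" that conductor/manager.py already has):

...
        with obj_target_cell(instance, cell0) as cctxt:
            instance.create()

            if block_device_mapping:
                self._create_block_device_mapping(
                    cell0, instance.flavor, instance.uuid,
                    block_device_mapping)

            self._create_tags(cctxt, instance.uuid, tags)

        # Do the API DB bookkeeping first so that a rabbitmq failure can no
        # longer prevent these writes from happening.
        try:
            inst_mapping = \
                objects.InstanceMapping.get_by_instance_uuid(
                    context, instance.uuid)
            inst_mapping.cell_mapping = cell0
            inst_mapping.save()
        except exception.InstanceMappingNotFound:
            pass

        for build_request in build_requests:
            try:
                build_request.destroy()
            except exception.BuildRequestNotFound:
                # Already gone (e.g. the instance was deleted meanwhile),
                # nothing left to clean up.
                pass

        # Notify last: if rabbitmq is down we lose only the notification,
        # not the database updates above.
        with obj_target_cell(instance, cell0) as cctxt:
            try:
                self._set_vm_state_and_notify(
                    cctxt, instance.uuid, 'build_instances', updates, exc,
                    request_spec)
            except messaging.MessagingException:
                LOG.warning('Failed to send notification about instance %s '
                            'being buried in cell0', instance.uuid)
...

The important part is only the ordering: everything that writes to the nova_api database happens before anything that has to talk to rabbitmq, so a messaging failure can no longer orphan the build request or leave instance_mappings.cell_id NULL.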
Patch has been in upstream stable/wallaby for some time.
Just wanted to note, per comment 1, that this rhbz is only partially fixed at this point in time. That is, "vm creation interrupted by cluster degradation results in orphaned build requests or required nova cell information being unpopulated" describes two separate issues that require separate fixes.

Patch https://review.opendev.org/683730 fixed the "required nova cell information being unpopulated" part of this rhbz. The "orphaned build requests" part of this rhbz is not yet fixed. Because this rhbz is MODIFIED -> ON_QA now, I will open a new rhbz for the orphaned build requests bug.
(In reply to melanie witt from comment #8)
> Patch https://review.opendev.org/683730 fixed the "required nova cell
> information being unpopulated" part of this rhbz.

Update: it was discovered that NULL cell_id can still happen under certain circumstances. A follow-up fix is being worked on in https://bugzilla.redhat.com/show_bug.cgi?id=2112579.

> The "orphaned build requests" part of this rhbz is not yet fixed. Because
> this rhbz is MODIFIED -> ON_QA now, I will open a new rhbz for the orphaned
> build requests bug.

And it turns out that https://bugzilla.redhat.com/show_bug.cgi?id=1702048 fixed the "orphaned build requests" problem, so no further action is actually needed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543