Bug 1848737 - vm creation interrupted by cluster degradation results in orphaned build requests or required nova cell information being unpopulated.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: Upstream M3
Target Release: 17.0
Assignee: melanie witt
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks: 2003258
 
Reported: 2020-06-18 20:43 UTC by coldford@redhat.com
Modified: 2023-10-06 20:42 UTC
CC List: 9 users

Fixed In Version: openstack-nova-23.2.1-0.20220606130355.68cad8f.el9ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2003258 (view as bug list)
Environment:
Last Closed: 2022-09-21 12:10:46 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1775934 0 None None None 2020-09-16 17:59:19 UTC
OpenStack gerrit 683730 0 None MERGED Sanity check instance mapping during scheduling 2020-11-16 17:46:42 UTC
Red Hat Issue Tracker OSP-9517 0 None None None 2022-06-17 15:44:54 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:11:51 UTC

Comment 1 melanie witt 2020-06-25 02:33:54 UTC
Thank you for providing so much detail in this bug report. With the data you pasted and the logs in the sosreport, I can see how the build requests are getting orphaned and how instance_mappings.cell_id ends up NULL during periods when rabbitmq disconnects/failures occur.
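
As a rough diagnostic sketch of how to spot both symptoms directly in the nova_api database (the connection URL is a placeholder and the "orphaned" query is only an approximation of what orphaned means here; the table and column names are the standard nova_api schema):

from sqlalchemy import create_engine, text

# Placeholder URL; point this at the real nova_api database.
engine = create_engine("mysql+pymysql://nova:secret@dbhost/nova_api")

with engine.connect() as conn:
    # Symptom 1: instance mappings whose cell was never populated.
    unmapped = conn.execute(text(
        "SELECT instance_uuid FROM instance_mappings WHERE cell_id IS NULL"
    )).fetchall()

    # Symptom 2 (assumed definition of "orphaned"): build requests that still
    # exist even though their instance already has a cell mapping, i.e. the
    # destroy step was skipped after the messaging failure.
    orphaned = conn.execute(text(
        "SELECT br.instance_uuid FROM build_requests br "
        "JOIN instance_mappings im ON im.instance_uuid = br.instance_uuid "
        "WHERE im.cell_id IS NOT NULL"
    )).fetchall()

    print("mappings with NULL cell_id:", [r[0] for r in unmapped])
    print("possibly orphaned build requests:", [r[0] for r in orphaned])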

The TL;DR is that there are points along the code path where we emit notifications via RPC onto rabbitmq, and we're not gracefully handling failures to send those notification messages, so we skip database updates such as deleting the build request record or updating the instance mapping record with the cell0 cell_id.

We can see this in the Queens code [1]: the nova_cell0.instances record is created, then we try _set_vm_state_and_notify, but that fails because of a rabbitmq messaging error, so we never reach the code that updates nova_api.instance_mappings.cell_id for the instance, nor the code that destroys the nova_api.build_requests record for the instance:

...
            with obj_target_cell(instance, cell0) as cctxt:
                instance.create()

                # NOTE(mnaser): In order to properly clean-up volumes after
                #               being buried in cell0, we need to store BDMs.
                if block_device_mapping:
                    self._create_block_device_mapping(
                        cell0, instance.flavor, instance.uuid,
                        block_device_mapping)

                self._create_tags(cctxt, instance.uuid, tags)

                # Use the context targeted to cell0 here since the instance is
                # now in cell0.
                self._set_vm_state_and_notify(
                    cctxt, instance.uuid, 'build_instances', updates,
                    exc, request_spec)
                try:
                    # We don't need the cell0-targeted context here because the
                    # instance mapping is in the API DB.
                    inst_mapping = \
                        objects.InstanceMapping.get_by_instance_uuid(
                            context, instance.uuid)
                    inst_mapping.cell_mapping = cell0
                    inst_mapping.save()
                except exception.InstanceMappingNotFound:
                    pass

        for build_request in build_requests:
            try:
                build_request.destroy()
...

There was a recent patch in Ussuri that would handle the nova_api.instance_mappings.cell_id update part of things:

https://review.opendev.org/683730

but it does not handle the issue of orphaning build requests in this scenario.
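
Roughly, the idea is to double-check the instance mapping during scheduling and populate the cell if it is missing. Sketched below with hypothetical, self-contained names (this is not the actual nova change):

def ensure_mapping_has_cell(instance_uuid, mappings, chosen_cell):
    """Sanity-check that the mapping for instance_uuid points at a cell.

    mappings: dict of instance_uuid -> cell name (None means unpopulated).
    chosen_cell: the cell just selected for this instance.
    """
    if mappings.get(instance_uuid) is None:
        # An earlier, interrupted attempt left cell_id NULL; repair it now
        # instead of carrying on with an unusable mapping.
        mappings[instance_uuid] = chosen_cell
    return mappings[instance_uuid]


# Example: a mapping left behind by an interrupted build gets repaired.
mappings = {"abc-123": None}
assert ensure_mapping_has_cell("abc-123", mappings, "cell1") == "cell1"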

It seems like maybe we should just try to _set_vm_state_and_notify last, after we've done all of the database update work. Note that this would not help us in the case of, say, failing database writes in a degraded cluster, but it would handle the message queue part of it. Graceful handling of failing database writes would involve additional hardening.
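
A minimal, self-contained sketch of that ordering (the names below are stand-ins for the conductor steps, not real nova APIs): do the database bookkeeping first, then send the notification on a best-effort basis.

class MessagingError(Exception):
    """Stand-in for an oslo.messaging delivery failure (e.g. rabbitmq down)."""


def bury_in_cell0(instance_uuid, cell0, mappings, build_requests, notify):
    # 1. Point the API-level instance mapping at cell0 first, so the record
    #    is consistent even if later steps fail.
    mappings[instance_uuid] = cell0

    # 2. Destroy the build request next, so it cannot be orphaned.
    del build_requests[instance_uuid]

    # 3. Only now emit the error notification; a messaging outage here no
    #    longer aborts the database cleanup above.
    try:
        notify(instance_uuid, "build_instances")
    except MessagingError:
        pass  # best effort: log-and-continue rather than bailing out


# Example: even with rabbitmq unreachable, the mapping and build request
# end up in the expected state.
mappings, requests = {}, {"abc-123": object()}

def flaky_notify(uuid, method):
    raise MessagingError("rabbitmq unreachable")

bury_in_cell0("abc-123", "cell0", mappings, requests, flaky_notify)
assert mappings["abc-123"] == "cell0" and "abc-123" not in requests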

[1] https://github.com/openstack/nova/blob/bea91b8d58d909852949726296149d93f2c639d5/nova/conductor/manager.py#L1133

Comment 5 Lon Hohberger 2022-06-17 15:28:02 UTC
Patch has been in upstream stable/wallaby for some time.

Comment 8 melanie witt 2022-06-28 18:19:02 UTC
Just wanted to note that, per comment 1, this rhbz is only partially fixed at this point in time. That is, "vm creation interrupted by cluster degradation results in orphaned build requests or required nova cell information being unpopulated" describes two separate issues that require separate fixes.

Patch https://review.opendev.org/683730 fixed the "required nova cell information being unpopulated" part of this rhbz.

The "orphaned build requests" part of this rhbz is not yet fixed. Because this rhbz is MODIFIED -> ON_QA now, I will open a new rhbz for the orphaned build requests bug.

Comment 9 melanie witt 2022-08-09 06:59:35 UTC
(In reply to melanie witt from comment #8)
> Patch https://review.opendev.org/683730 fixed the "required nova cell
> information being unpopulated" part of this rhbz.

Update: it was discovered that NULL cell_id can still happen under certain circumstances. A follow-up fix is being worked on in https://bugzilla.redhat.com/show_bug.cgi?id=2112579.

> The "orphaned build requests" part of this rhbz is not yet fixed. Because
> this rhbz is MODIFIED -> ON_QA now, I will open a new rhbz for the orphaned
> build requests bug.

And it turns out that https://bugzilla.redhat.com/show_bug.cgi?id=1702048 fixed the "orphaned build requests" problem, so no further action is actually needed.

Comment 14 errata-xmlrpc 2022-09-21 12:10:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

