Bug 2128568 - Evacuating instances which have vGPU fails
Summary: Evacuating instances which have vGPU fails
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-09-21 05:18 UTC by yatanaka
Modified: 2023-06-05 15:45 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-06-05 15:45:28 UTC
Target Upstream Version:
Embargoed:




Links
- OpenStack gerrit 845757 (status: NEW): "Support multiple allocations for vGPUs" (last updated 2023-06-05 15:45:27 UTC)
- Red Hat Issue Tracker OSP-18836 (last updated 2022-09-21 05:28:02 UTC)

Description yatanaka 2022-09-21 05:18:54 UTC
Description of problem:

Evacuating instances with vGPU always fails with an "Insufficient compute resources: vGPU resource is not available" error.
Creating a new instance with vGPU on the destination node succeeds.
The error occurs regardless of which destination node is chosen.

The error is raised at the points marked (*) in the code below, which does the following:
  - takes the first record of the "vgpu_allocations" variable, and
  - looks up the resource provider referenced by that record.
It is the resource provider lookup that fails.

https://github.com/openstack/nova/blob/stable/train/nova/virt/libvirt/driver.py#L7087-L7093
~~~
        vgpu_allocations = self._vgpu_allocations(allocations)
        if not vgpu_allocations:
            return
        # TODO(sbauza): Once we have nested resource providers, find which one
        # is having the related allocation for the specific VGPU type.
        # For the moment, we should only have one allocation for
        # ResourceProvider.
        # TODO(sbauza): Iterate over all the allocations once we have
        # nested Resource Providers. For the moment, just take the first.
        if len(vgpu_allocations) > 1:
            LOG.warning('More than one allocation was passed over to libvirt '
                        'while at the moment libvirt only supports one. Only '
                        'the first allocation will be looked up.')
        rp_uuid, alloc = six.next(six.iteritems(vgpu_allocations))   <===============(*) Get the first record of the "vgpu_allocations" variable.
        vgpus_asked = alloc['resources'][orc.VGPU]

        # Find if we allocated against a specific pGPU (and then the allocation
        # is made against a child RP) or any pGPU (in case the VGPU inventory
        # is still on the root RP)
        try:
            allocated_rp = self.provider_tree.data(rp_uuid)  <=======================(*)Try getting resource provider information, but this fails.
        except ValueError:
            # The provider doesn't exist, return a better understandable
            # exception
            raise exception.ComputeResourcesUnavailable(     <=======================(*)'vGPU resource is not available' exception is raised
                reason='vGPU resource is not available')     <=======================(*)
~~~
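
To make the failure mode concrete, here is a minimal, self-contained sketch of the situation the driver hits during an evacuation. The UUIDs are hypothetical and a plain dict stands in for the provider tree; this illustrates the logic above, it is not nova code:

~~~
# Minimal sketch, not nova code: hypothetical UUIDs, and a plain dict
# standing in for the destination host's provider tree.
destination_provider_tree = {
    '22222222-2222-2222-2222-222222222222': {'name': 'dest-compute-vgpu-rp'},
}

# While an instance is being evacuated, its allocations hold records against
# BOTH hosts (see the resource tracker comment quoted below).
vgpu_allocations = {
    # Old allocation against the powered-off source host; this provider UUID
    # does not exist in the destination host's provider tree.
    '11111111-1111-1111-1111-111111111111': {'resources': {'VGPU': 1}},
    # New allocation against the destination host.
    '22222222-2222-2222-2222-222222222222': {'resources': {'VGPU': 1}},
}

# The driver takes only the first record, as the code above does
# (six.next(six.iteritems(...)) on Train behaves like next(iter(...))):
rp_uuid, alloc = next(iter(vgpu_allocations.items()))

# Here the first record is the old allocation, so the provider lookup fails;
# in the driver this ValueError is turned into the
# "vGPU resource is not available" ComputeResourcesUnavailable error.
if rp_uuid not in destination_provider_tree:
    print('lookup of %s fails -> ComputeResourcesUnavailable' % rp_uuid)
~~~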

The reason is that, as the comment in the resource tracker below explains, an instance that is being evacuated holds both its new and its old allocations, and the resource provider UUID of the old allocation does not exist in the destination host's provider tree.

https://github.com/openstack/nova/blob/6786e9630b10c0c01c8797a4e2e0a1a35fd3ca94/nova/compute/resource_tracker.py#L431-L440
~~~
        for rp_uuid, alloc_dict in allocations.items():
            try:
                provider_data = self.provider_tree.data(rp_uuid)
            except ValueError:
                # If an instance is in evacuating, it will hold new and old
                # allocations, but the provider UUIDs in old allocations won't
                # exist in the current provider tree, so skip it.
                LOG.debug("Skip claiming resources of provider %(rp_uuid)s, "
                          "since the provider UUIDs are not in provider tree.",
                          {'rp_uuid': rp_uuid})
~~~

That's why evacuation fails.
I think this is a bug.
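
One possible defensive fix, sketched here under the assumption that it is safe for the driver to ignore allocation records whose provider is not in the local tree (the same approach the resource tracker above takes). This is only an illustration; the real fix is being worked on in the linked gerrit change 845757:

~~~
        # Sketch only, not the upstream patch: before picking an allocation
        # record, drop the ones whose resource provider is not in this host's
        # provider tree, mirroring the skip logic in the resource tracker.
        # Assumes ProviderTree.exists() semantics; variable names are
        # illustrative.
        local_vgpu_allocations = {
            rp_uuid: alloc
            for rp_uuid, alloc in vgpu_allocations.items()
            if self.provider_tree.exists(rp_uuid)
        }
        if not local_vgpu_allocations:
            raise exception.ComputeResourcesUnavailable(
                reason='vGPU resource is not available')
        rp_uuid, alloc = six.next(six.iteritems(local_vgpu_allocations))
~~~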



Version-Release number of selected component (if applicable):
RHOSP 16.2


How reproducible:
Always

Steps to Reproduce:
1. Create an instance with vGPU, following the documentation [1]
2. Power off the compute node where the instance is running
3. Evacuate the instance
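
For reference, a hedged command-line sketch of these steps; the flavor, image, and host names are hypothetical, and the authoritative procedure for RHOSP 16.2 is the one in the documentation [1]:

~~~
# Hypothetical names; assumes a compute node already configured for vGPU
# as per the documentation [1].
openstack flavor create --ram 4096 --disk 20 --vcpus 2 vgpu-flavor
openstack flavor set vgpu-flavor --property "resources:VGPU=1"
openstack server create --flavor vgpu-flavor --image rhel8 vgpu-instance

# Power off the compute node hosting the instance (e.g. through its power
# management interface), disable its service, then evacuate:
openstack compute service set --disable <source-compute> nova-compute
nova evacuate vgpu-instance
# The evacuation fails with "vGPU resource is not available".
~~~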


Actual results:
Evacuation fails


Expected results:
Evacuation succeeds


Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/configuring_the_compute_service_for_instance_creation/assembly_configuring-virtual-gpus-for-instances_vgpu

Comment 9 Artom Lifshitz 2023-06-05 15:45:28 UTC
Closing this as UPSTREAM since the workaround has satisfied the customer and the linked patch is in progress upstream.

