Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2128568

Summary: Evacuating instances which have vGPU fails
Product: Red Hat OpenStack
Component: openstack-nova
Version: 16.2 (Train)
Hardware: x86_64
OS: Linux
Status: CLOSED UPSTREAM
Severity: high
Priority: medium
Reporter: yatanaka
Assignee: OSP DFG:Compute <osp-dfg-compute>
QA Contact: OSP DFG:Compute <osp-dfg-compute>
CC: alifshit, chopark, dasmith, eglynn, igallagh, jhakimra, kchamart, sbauza, sgordon, smooney, vromanso
Keywords: Patch, Triaged
Target Milestone: ---
Target Release: ---
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2023-06-05 15:45:28 UTC

Description yatanaka 2022-09-21 05:18:54 UTC
Description of problem:

Evacuating an instance with a vGPU always fails with the error "Insufficient compute resources: vGPU resource is not available".
Creating a new instance with a vGPU on the destination node succeeds.
The error occurs regardless of which destination node is chosen.

The error is raised at the points marked (*) in the excerpt below. The code there:
  - takes the first record of the "vgpu_allocations" dict, and
  - tries to look up resource provider information using that record's provider UUID.
That resource provider lookup is what fails.

https://github.com/openstack/nova/blob/stable/train/nova/virt/libvirt/driver.py#L7087-L7093
~~~
        vgpu_allocations = self._vgpu_allocations(allocations)
        if not vgpu_allocations:
            return
        # TODO(sbauza): Once we have nested resource providers, find which one
        # is having the related allocation for the specific VGPU type.
        # For the moment, we should only have one allocation for
        # ResourceProvider.
        # TODO(sbauza): Iterate over all the allocations once we have
        # nested Resource Providers. For the moment, just take the first.
        if len(vgpu_allocations) > 1:
            LOG.warning('More than one allocation was passed over to libvirt '
                        'while at the moment libvirt only supports one. Only '
                        'the first allocation will be looked up.')
        rp_uuid, alloc = six.next(six.iteritems(vgpu_allocations))   <===============(*)Get the first record of the "vgpu_allocations" dict.
        vgpus_asked = alloc['resources'][orc.VGPU]

        # Find if we allocated against a specific pGPU (and then the allocation
        # is made against a child RP) or any pGPU (in case the VGPU inventory
        # is still on the root RP)
        try:
            allocated_rp = self.provider_tree.data(rp_uuid)  <=======================(*)Try getting resource provider information, but this fails.
        except ValueError:
            # The provider doesn't exist, return a better understandable
            # exception
            raise exception.ComputeResourcesUnavailable(     <=======================(*)'vGPU resource is not available' exception is raised
                reason='vGPU resource is not available')     <=======================(*)
~~~
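To make the failure mode concrete, here is a minimal sketch of the situation during an evacuation. The UUIDs and the provider set are invented for illustration; they are not values from a real deployment:

```python
# During an evacuation, the allocations passed to the driver contain
# entries for BOTH resource providers: the old (source) host's and the
# new (destination) host's. Only the destination provider exists in the
# local provider tree.
allocations = {
    "old-rp-uuid": {"resources": {"VGPU": 1}},  # stale source-host entry
    "new-rp-uuid": {"resources": {"VGPU": 1}},  # valid destination entry
}
known_providers = {"new-rp-uuid"}  # what the local provider tree knows

# The driver unconditionally takes the first record, just like the
# six.next(six.iteritems(...)) call in the excerpt above:
rp_uuid, alloc = next(iter(allocations.items()))

# When the stale entry happens to come first, the subsequent
# provider_tree.data(rp_uuid) lookup raises ValueError, which is turned
# into the ComputeResourcesUnavailable exception the customer sees.
lookup_would_fail = rp_uuid not in known_providers
```

With the stale entry first, `rp_uuid` is `"old-rp-uuid"` and the lookup fails, matching the reported traceback.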

The reason is that, as the following comment in the resource tracker notes, an instance being evacuated holds both its new and its old allocations, and the old allocation's resource provider UUID does not exist in the destination host's current provider tree.

https://github.com/openstack/nova/blob/6786e9630b10c0c01c8797a4e2e0a1a35fd3ca94/nova/compute/resource_tracker.py#L431-L440
~~~
        for rp_uuid, alloc_dict in allocations.items():
            try:
                provider_data = self.provider_tree.data(rp_uuid)
            except ValueError:
                # If an instance is in evacuating, it will hold new and old
                # allocations, but the provider UUIDs in old allocations won't
                # exist in the current provider tree, so skip it.
                LOG.debug("Skip claiming resources of provider %(rp_uuid)s, "
                          "since the provider UUIDs are not in provider tree.",
                          {'rp_uuid': rp_uuid})
~~~
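A plausible fix would mirror the skip logic the resource tracker already applies: ignore allocation records whose provider UUID is absent from the local provider tree. The helper below is only an illustrative sketch of that idea, not the actual upstream patch, and all names and UUIDs in it are invented:

```python
def pick_known_vgpu_allocation(vgpu_allocations, known_provider_uuids):
    """Return the first allocation whose resource provider exists locally.

    Illustrative sketch only: stale allocations left over from the source
    host of an evacuation are skipped instead of blindly taking the first
    record of the dict.
    """
    for rp_uuid, alloc in vgpu_allocations.items():
        if rp_uuid in known_provider_uuids:
            return rp_uuid, alloc
    return None, None


# Example with invented data: the stale source-host entry comes first,
# but it is skipped because its provider is not in the local tree.
allocations = {
    "old-rp-uuid": {"resources": {"VGPU": 1}},
    "new-rp-uuid": {"resources": {"VGPU": 1}},
}
rp_uuid, alloc = pick_known_vgpu_allocation(allocations, {"new-rp-uuid"})
```

Here the helper returns the destination-host allocation, so the provider tree lookup that follows would succeed.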

That's why evacuation fails.
I think this is a bug.



Version-Release number of selected component (if applicable):
RHOSP 16.2


How reproducible:

Steps to Reproduce:
1. Create an instance with a vGPU, following the documentation [1]
2. Power off the compute node where the instance is running
3. Evacuate the instance


Actual results:
Evacuation fails


Expected results:
Evacuation succeeds


Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/configuring_the_compute_service_for_instance_creation/assembly_configuring-virtual-gpus-for-instances_vgpu

Comment 9 Artom Lifshitz 2023-06-05 15:45:28 UTC
Closing this as UPSTREAM since the workaround has satisfied the customer and the linked patch is in progress upstream.