Description of problem:

Evacuating an instance with a vGPU always fails with the error "Insufficient compute resources: vGPU resource is not available". Creating a new instance with a vGPU on the destination node succeeds, and the error occurs regardless of the destination node.

The error is raised at the points marked (*) below. The code does the following:

- takes the first record of the "allocations" variable, and
- tries to look up resource provider information for that first record.

The resource provider lookup fails.

https://github.com/openstack/nova/blob/stable/train/nova/virt/libvirt/driver.py#L7087-L7093

~~~
        vgpu_allocations = self._vgpu_allocations(allocations)
        if not vgpu_allocations:
            return
        # TODO(sbauza): Once we have nested resource providers, find which one
        # is having the related allocation for the specific VGPU type.
        # For the moment, we should only have one allocation for
        # ResourceProvider.
        # TODO(sbauza): Iterate over all the allocations once we have
        # nested Resource Providers. For the moment, just take the first.
        if len(vgpu_allocations) > 1:
            LOG.warning('More than one allocation was passed over to libvirt '
                        'while at the moment libvirt only supports one. Only '
                        'the first allocation will be looked up.')
        rp_uuid, alloc = six.next(six.iteritems(vgpu_allocations))  # <=== (*) takes the first record of "allocations"
        vgpus_asked = alloc['resources'][orc.VGPU]

        # Find if we allocated against a specific pGPU (and then the allocation
        # is made against a child RP) or any pGPU (in case the VGPU inventory
        # is still on the root RP)
        try:
            allocated_rp = self.provider_tree.data(rp_uuid)  # <=== (*) resource provider lookup fails here
        except ValueError:
            # The provider doesn't exist, return a better understandable
            # exception
            raise exception.ComputeResourcesUnavailable(  # <=== (*) 'vGPU resource is not available' exception is raised
                reason='vGPU resource is not available')
~~~

As the comment in the following code explains, when an instance is being evacuated the "allocations" variable holds both the new and the old allocations, and the provider UUIDs in the old allocations no longer exist in the current provider tree.

https://github.com/openstack/nova/blob/6786e9630b10c0c01c8797a4e2e0a1a35fd3ca94/nova/compute/resource_tracker.py#L431-L440

~~~
        for rp_uuid, alloc_dict in allocations.items():
            try:
                provider_data = self.provider_tree.data(rp_uuid)
            except ValueError:
                # If an instance is in evacuating, it will hold new and old
                # allocations, but the provider UUIDs in old allocations won't
                # exist in the current provider tree, so skip it.
                LOG.debug("Skip claiming resources of provider %(rp_uuid)s, "
                          "since the provider UUIDs are not in provider tree.",
                          {'rp_uuid': rp_uuid})
~~~

That is why evacuation fails. I think this is a bug.

Version-Release number of selected component (if applicable):
RHOSP 16.2

How reproducible:
Always

Steps to Reproduce:
1. Create an instance with a vGPU according to the documentation [1]
2. Power off the compute node where the instance is running
3. Evacuate the instance

Actual results:
Evacuation fails

Expected results:
Evacuation succeeds

Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/configuring_the_compute_service_for_instance_creation/assembly_configuring-virtual-gpus-for-instances_vgpu
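The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not actual nova code: `ProviderTree` is a toy stand-in for nova's provider tree, and `pick_vgpu_allocation` is a hypothetical helper showing one way the driver could skip stale providers (as the resource tracker already does) instead of blindly taking the first allocation record.

~~~
# Minimal sketch of the bug, under simplified assumptions; all names
# here (ProviderTree, pick_vgpu_allocation) are illustrative stand-ins.

class ProviderTree:
    """Toy provider tree: only knows the destination host's providers."""
    def __init__(self, known_uuids):
        self._known = set(known_uuids)

    def data(self, rp_uuid):
        # Mirrors nova's behaviour: unknown provider UUIDs raise ValueError.
        if rp_uuid not in self._known:
            raise ValueError(rp_uuid)
        return {'uuid': rp_uuid}

def pick_vgpu_allocation(provider_tree, vgpu_allocations):
    """Hypothetical fix: skip allocation records whose provider is not in
    the local provider tree (e.g. the source host's providers during an
    evacuation) rather than taking the first record unconditionally."""
    for rp_uuid, alloc in vgpu_allocations.items():
        try:
            return provider_tree.data(rp_uuid), alloc
        except ValueError:
            continue  # stale provider left over from the source host
    raise RuntimeError('vGPU resource is not available')

# During evacuation the allocations dict holds both the old (source) and
# new (destination) providers, and the stale one may come first.
tree = ProviderTree(['dest-rp'])
allocs = {
    'source-rp': {'resources': {'VGPU': 1}},  # not in the dest tree
    'dest-rp': {'resources': {'VGPU': 1}},
}
rp, alloc = pick_vgpu_allocation(tree, allocs)
print(rp['uuid'])  # prints: dest-rp
~~~

Taking the first record, as the current driver code does, would hit `source-rp`, raise `ValueError`, and surface as the "vGPU resource is not available" error even though `dest-rp` has the resources.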
Closing this as UPSTREAM since the workaround has satisfied the customer and the linked patch is in progress upstream.