Description of problem:
Evacuating an instance with a vGPU always fails with the error "Insufficient compute resources: vGPU resource is not available".
Creating a new instance with vGPU on the destination node succeeds.
This error occurs regardless of the destination node.
The error is raised at the points marked (*) in the code below.
That code does the following:
- takes the first record of the vGPU allocations extracted from the "allocations" variable.
- tries to look up the resource provider for that first record in the provider tree.
However, the resource provider lookup fails.
https://github.com/openstack/nova/blob/stable/train/nova/virt/libvirt/driver.py#L7087-L7093
~~~
        vgpu_allocations = self._vgpu_allocations(allocations)
        if not vgpu_allocations:
            return
        # TODO(sbauza): Once we have nested resource providers, find which one
        # is having the related allocation for the specific VGPU type.
        # For the moment, we should only have one allocation for
        # ResourceProvider.
        # TODO(sbauza): Iterate over all the allocations once we have
        # nested Resource Providers. For the moment, just take the first.
        if len(vgpu_allocations) > 1:
            LOG.warning('More than one allocation was passed over to libvirt '
                        'while at the moment libvirt only supports one. Only '
                        'the first allocation will be looked up.')
        rp_uuid, alloc = six.next(six.iteritems(vgpu_allocations))  # <== (*) Get the first record of the "allocations" variable.
        vgpus_asked = alloc['resources'][orc.VGPU]
        # Find if we allocated against a specific pGPU (and then the allocation
        # is made against a child RP) or any pGPU (in case the VGPU inventory
        # is still on the root RP)
        try:
            allocated_rp = self.provider_tree.data(rp_uuid)  # <== (*) Try getting resource provider information, but this fails.
        except ValueError:
            # The provider doesn't exist, return a better understandable
            # exception
            raise exception.ComputeResourcesUnavailable(  # <== (*) 'vGPU resource is not available' exception is raised
                reason='vGPU resource is not available')  # <== (*)
~~~
The lookup fails because, as the comment in the following code mentions, an instance that is being evacuated holds both new and old allocations in the "allocations" variable, and the provider UUIDs in the old allocations do not exist in the current provider tree.
https://github.com/openstack/nova/blob/6786e9630b10c0c01c8797a4e2e0a1a35fd3ca94/nova/compute/resource_tracker.py#L431-L440
~~~
        for rp_uuid, alloc_dict in allocations.items():
            try:
                provider_data = self.provider_tree.data(rp_uuid)
            except ValueError:
                # If an instance is in evacuating, it will hold new and old
                # allocations, but the provider UUIDs in old allocations won't
                # exist in the current provider tree, so skip it.
                LOG.debug("Skip claiming resources of provider %(rp_uuid)s, "
                          "since the provider UUIDs are not in provider tree.",
                          {'rp_uuid': rp_uuid})
~~~
That's why evacuation fails.
I think this is a bug.
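The failure mode described above can be illustrated with a small standalone sketch. This is not nova code and not a proposed patch: `FakeProviderTree`, `first_vgpu_allocation`, `first_local_vgpu_allocation`, and the provider names are hypothetical stand-ins, and the sketch assumes the old (source node) allocation happens to be the first record. It only shows that looking up the first record fails, while skipping providers missing from the local tree (as the resource tracker comment describes) would find the new allocation.

```python
class FakeProviderTree:
    """Hypothetical stand-in for the provider tree: data() raises
    ValueError for a provider UUID it does not know about."""

    def __init__(self, known_uuids):
        self._known = set(known_uuids)

    def data(self, rp_uuid):
        if rp_uuid not in self._known:
            raise ValueError("No such provider %s" % rp_uuid)
        return {"uuid": rp_uuid}


def first_vgpu_allocation(vgpu_allocations, provider_tree):
    """Mimics the driver excerpt: only the first record is looked up."""
    rp_uuid, alloc = next(iter(vgpu_allocations.items()))
    try:
        provider_tree.data(rp_uuid)
    except ValueError:
        raise RuntimeError("vGPU resource is not available")
    return rp_uuid, alloc


def first_local_vgpu_allocation(vgpu_allocations, provider_tree):
    """Hypothetical variant: skip providers missing from the local tree."""
    for rp_uuid, alloc in vgpu_allocations.items():
        try:
            provider_tree.data(rp_uuid)
        except ValueError:
            continue  # old allocation against the source node
        return rp_uuid, alloc
    raise RuntimeError("vGPU resource is not available")


# During evacuation the instance holds old and new allocations
# (assumed here: the old one comes first).
allocations = {
    "old-source-rp": {"resources": {"VGPU": 1}},
    "new-dest-rp": {"resources": {"VGPU": 1}},
}
# The destination node's tree only knows its own provider.
tree = FakeProviderTree(["new-dest-rp"])
```

With these inputs, `first_vgpu_allocation(allocations, tree)` raises, while `first_local_vgpu_allocation(allocations, tree)` returns the record for `new-dest-rp`.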
Version-Release number of selected component (if applicable):
RHOSP 16.2
How reproducible:
Always
Steps to Reproduce:
1. Create an instance with a vGPU, following the documentation [1]
2. Power off the compute node where the instance is running
3. Evacuate the instance
Actual results:
Evacuation fails
Expected results:
Evacuation succeeds
Additional info:
[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/configuring_the_compute_service_for_instance_creation/assembly_configuring-virtual-gpus-for-instances_vgpu