Bug 2219598

Summary: compute with 20 NVIDIA A100 GPU allows only 19/20 GPU instances to be spawned
Product: Red Hat OpenStack Reporter: alisci <alisci>
Component: openstack-nova Assignee: OSP DFG:Compute <osp-dfg-compute>
Status: NEW --- QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 16.2 (Train) CC: dasmith, eglynn, jhakimra, kchamart, sbauza, sgordon, vromanso
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description alisci 2023-07-04 13:51:00 UTC
Description of problem:
It is not possible to allocate all 20 GPU instances on a compute node.
Currently 19 of the 20 are running. Spawning one more fails on the scheduled compute node with the error:

Instance failed to spawn: nova.exception.ComputeResourcesUnavailable: Insufficient compute resources: vGPU resource is not available

Checking the allocated mdev GPU devices shows that they seem to be the ones used by the currently running instances; there do not seem to be any that are allocated but unused.
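
For anyone repeating this check, the following is a minimal sketch of how the mdev state could be read on the compute node. It assumes the standard kernel VFIO mediated-device sysfs layout (/sys/bus/mdev/devices and /sys/class/mdev_bus/<parent>/mdev_supported_types/<type>/available_instances). The function names and the sysfs_root parameter are illustrative only and are not part of nova.

```python
from pathlib import Path


def list_mdev_uuids(sysfs_root="/sys"):
    """Return the names (UUIDs) of all mdev devices that currently exist."""
    devices_dir = Path(sysfs_root) / "bus" / "mdev" / "devices"
    if not devices_dir.is_dir():
        # No mdev bus present (or no mdev-capable driver loaded).
        return []
    return sorted(entry.name for entry in devices_dir.iterdir())


def available_instances(sysfs_root="/sys"):
    """Map (parent device, mdev type) to the number of mdevs still creatable."""
    result = {}
    bus_dir = Path(sysfs_root) / "class" / "mdev_bus"
    pattern = "*/mdev_supported_types/*/available_instances"
    for path in sorted(bus_dir.glob(pattern)):
        parent = path.parents[2].name   # parent device directory name
        mdev_type = path.parent.name    # mdev type directory name
        result[(parent, mdev_type)] = int(path.read_text().strip())
    return result
```

Comparing len(list_mdev_uuids()) with the number of running GPU instances, and looking for any entry in available_instances() that is still greater than zero, would show whether the host itself still has a free slot when nova reports that the vGPU resource is not available.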

Details are in the following private comment.

Version-Release number of selected component (if applicable):
OSP 16.2.3


How reproducible:
This is CU specific.

Steps to Reproduce:
Create instances with GPU.

Actual results:
Only 19/20 GPUs get allocated.

Expected results:
20/20 GPUs get allocated.