Bug 1482630

Summary: [RFE] NVIDIA vGPU support for Guests in RHOSP (AI/ML use case)
Product: Red Hat OpenStack
Reporter: Angela Soni <asoni>
Component: openstack-nova
Assignee: Sylvain Bauza <sbauza>
Status: CLOSED ERRATA
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: high
Priority: high
Docs Contact:
Version: 8.0 (Liberty)
CC: asoni, berrange, dasmith, eglynn, jhakimra, jliberma, kchamart, lyarwood, mbooth, sauchter, sbauza, sgordon, shwu, srevivo, vromanso
Target Milestone: Upstream M2
Keywords: FutureFeature, TechPreview, Triaged
Target Release: 14.0 (Rocky)
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: openstack-nova-18.0.0-0.20180710150340.8469fa7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1552268 1553832 1625235 1656292 1764341 (view as bug list)
Environment:
Last Closed: 2019-01-11 11:48:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1360442, 1552268, 1553832, 1624221, 1625235, 1626155, 1656291, 1656292, 1724085, 1761753, 1764341

Description Angela Soni 2017-08-17 18:37:14 UTC
Proposed title of this feature request

This feature request is for NVIDIA vGPU support for guests in RHOSP.

* With regard to vGPU device definition, do you expect an initial implementation to allocate devices once before activating the compute node, or to be able to make (re)allocations while the compute node is active?
	These GPUs will go on existing Compute nodes. If there is a benefit, we are open to rebuilding the nodes to enable additional capabilities within OSP 8.
    * Which guest operating systems do you intend to use with vGPU?
	Mostly RHEL 7
    * Which display protocol(s) do you intend to use with vGPU?
	Protocols used in the backend by CUDA or other use cases around Machine Learning
    * What are your scheduling expectations:
      - Do you require an initial implementation to handle NUMA locality, or can you live without it?
	NUMA locality is needed, but we may be OK without it. We need to know when we can enable it.
      - Do you expect homogenized environments with regard to the specific cards you will use and, more importantly, the types of vGPU slices, or will you be attempting to use different types across the environment?
	Will use the same type of GPUs across the environment.
       - Will you be using NVIDIA cards? Intel cards? Both? In each case, which models?
	NVIDIA cards are in scope right now, specifically the P100.
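For context, the upstream Nova vGPU support that this RFE tracks (delivered in the Queens/Rocky timeframe) is driven by a compute-node config option plus a flavor extra spec. A rough sketch follows; the vGPU type name "nvidia-35" and the flavor/image names are placeholders, not values from this bug — the types actually available depend on the card and driver and are listed under /sys/class/mdev_bus/*/mdev_supported_types.

```shell
# On each compute node, whitelist the vGPU type(s) in nova.conf
# (shown here as a comment; this is an INI fragment, not a command):
#   [devices]
#   enabled_vgpu_types = nvidia-35

# Create a flavor that requests one vGPU via the Placement resource class:
openstack flavor create --vcpus 4 --ram 8192 --disk 40 vgpu.medium
openstack flavor set vgpu.medium --property "resources:VGPU=1"

# Boot a guest with that flavor; Nova attaches a mediated device (mdev)
# carved from the physical GPU to the instance:
openstack server create --flavor vgpu.medium --image rhel7 vgpu-guest
```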

Comment 2 Stephen Gordon 2017-08-18 02:18:27 UTC
I'm updating the title to reflect the difference, nuanced as it may be, between this and Bug #1360442, which focuses more on technical-workstation virtualization use cases. The most immediately obvious differences visible here are:

* RHEL workloads (vs Windows workloads).
* Lesser importance of remote display protocols.
* More likely to require NUMA affinity.

Comment 8 Stephen Gordon 2018-03-07 15:28:50 UTC
NUMA locality will be treated as a stretch goal for vGPU MVP and tracked separately.

Comment 17 errata-xmlrpc 2019-01-11 11:48:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045