Bug 1482630 - [RFE] NVIDIA vGPU support for Guests in RHOSP (AI/ML use case)
Summary: [RFE] NVIDIA vGPU support for Guests in RHOSP (AI/ML use case)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: Upstream M2
Target Release: 14.0 (Rocky)
Assignee: Sylvain Bauza
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks: 1360442 1552268 1553832 1624221 1625235 1626155 1656291 1656292 1724085 1761753 1764341
 
Reported: 2017-08-17 18:37 UTC by Angela Soni
Modified: 2023-12-15 15:57 UTC (History)
15 users

Fixed In Version: openstack-nova-18.0.0-0.20180710150340.8469fa7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1552268 1553832 1625235 1656292 1764341 (view as bug list)
Environment:
Last Closed: 2019-01-11 11:48:07 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-23316 0 None None None 2023-03-21 18:44:42 UTC
Red Hat Product Errata RHEA-2019:0045 0 None None None 2019-01-11 11:48:38 UTC

Description Angela Soni 2017-08-17 18:37:14 UTC
Proposed title of this feature request

This feature request is for NVIDIA vGPU support for Guests in RHOSP. 

* With regards to vGPU device definition, do you expect an initial implementation to allocate them once before activating the compute node, or to be able to make (re)allocations while the compute node is active?
	These GPUs will go on existing Compute nodes. If there is benefit, we are open to rebuilding the node to enable additional capabilities within OSP 8.
* Which guest operating systems do you intend to use with vGPU?
	Mostly RHEL 7.
* Which display protocol(s) do you intend to use with vGPU?
	Protocols used in the backend by CUDA or other Machine Learning use cases.
* What are your scheduling expectations:
  - Do you require an initial implementation to handle NUMA locality, or can you live without it?
	NUMA locality is needed, but we may be OK without it. We need to know when it can be enabled.
  - Do you expect homogenized environments with regard to the specific cards you will use and, more importantly, the types of vGPU slices, or will you be attempting to use different types across the environment?
	We will use the same type of GPU across the environment.
  - Will you be using NVIDIA cards? Intel cards? Both? In each case, which models?
	NVIDIA cards are in scope right now, specifically the P100.
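For reference, the vGPU support that eventually shipped for this RFE (openstack-nova 18.0, Rocky / OSP 14) is driven by a compute-node nova.conf option plus a flavor resource request. A minimal sketch follows; the `nvidia-35` mdev type name is only an example and varies by card model and NVIDIA driver version:

```
# /etc/nova/nova.conf on the compute node -- advertise one mdev (vGPU) type
[devices]
enabled_vgpu_types = nvidia-35
```

A guest then requests a slice through a flavor extra spec, e.g. `openstack flavor set vgpu.small --property resources:VGPU=1` (the flavor name here is illustrative), and the scheduler places the instance on a host with free VGPU inventory in Placement.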

Comment 2 Stephen Gordon 2017-08-18 02:18:27 UTC
I'm updating the title to reflect the difference, nuanced as it may be, between this and Bug 1360442, which focuses more on technical workstation virtualization use cases. The most immediately obvious differences visible here are:

* RHEL workloads (vs Windows workloads).
* Lesser importance of remote display protocols.
* More likely to require NUMA affinity.

Comment 8 Stephen Gordon 2018-03-07 15:28:50 UTC
NUMA locality will be treated as a stretch goal for vGPU MVP and tracked separately.

Comment 17 errata-xmlrpc 2019-01-11 11:48:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

