1729855 – unable to access GPU inside the pod

Summary: unable to access GPU inside the pod

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Seth Jennings
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-07-15 06:54 UTC by Sudarshan Chaudhari
Modified:	2019-07-23 16:34 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-23 16:34:37 UTC
Target Upstream Version:
Embargoed:

Description Sudarshan Chaudhari 2019-07-15 06:54:42 UTC

Description of problem:

The app nodes are physical servers with NVIDIA V100 graphics cards. We used the following reference guide to install the required software:
https://blog.openshift.com/how-to-use-gpus-with-deviceplugin-in-openshift-3-10/

Running the "nvidia-smi" command from within the pod fails with the error:
~~~
failed to initialize nvml: insufficient permissions
~~~

To fix this, we need to explicitly run command [1], as described in https://github.com/NVIDIA/nvidia-container-runtime/issues/28

[1]  chcon -t container_file_t /dev/nvidia*
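
To confirm that SELinux labeling is what blocks access, the enforcement mode and the current labels of the device nodes can be inspected on the node before running [1]. The following is only a sketch: `context_type` and `report_nvidia_labels` are hypothetical helper names, and it assumes GNU `stat` and the standard SELinux `getenforce` utility are available on the node.

```shell
#!/bin/sh
# Hypothetical helper: print the SELinux type (third colon-separated field)
# of a context string such as system_u:object_r:container_file_t:s0.
context_type() {
    printf '%s\n' "$1" | cut -d: -f3
}

# Hypothetical helper: report the SELinux mode and the type of each
# /dev/nvidia* device node that exists on this host.
report_nvidia_labels() {
    if command -v getenforce >/dev/null 2>&1; then
        # getenforce prints Enforcing, Permissive, or Disabled
        echo "SELinux mode: $(getenforce)"
    fi
    for dev in /dev/nvidia*; do
        [ -e "$dev" ] || continue
        # stat -c %C prints the SELinux security context of the file
        ctx="$(stat -c %C "$dev" 2>/dev/null)" || ctx="unavailable"
        echo "$dev: $(context_type "$ctx")"
    done
}

report_nvidia_labels
```

If the mode is Enforcing and the reported type is anything other than container_file_t, that is consistent with the permission error above.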

As a workaround, we added an SELinux entry and are using the restorecond daemon to make the change persistent.
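
One possible way to make the label persistent, sketched below, is to record a local file-context rule with `semanage fcontext` and then apply it with `restorecon`. This is an assumption-laden sketch, not the exact entry used here: the regex `/dev/nvidia.*`, the `DRY_RUN` switch, and the helper names `run` and `label_nvidia_devices` are illustrative. It must be run as root on each GPU node and needs the `semanage` tool installed.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
NVIDIA_DEV_REGEX='/dev/nvidia.*'   # assumed regex covering the NVIDIA device nodes
SELINUX_TYPE='container_file_t'    # type used by the chcon command [1]

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: add the persistent rule, then relabel existing nodes.
label_nvidia_devices() {
    # Record a local file-context rule in the SELinux policy store.
    run semanage fcontext -a -t "$SELINUX_TYPE" "$NVIDIA_DEV_REGEX"
    # Apply the rule to every device node that currently exists.
    for dev in /dev/nvidia*; do
        [ -e "$dev" ] || continue
        run restorecon -v "$dev"
    done
}

label_nvidia_devices
```

Unlike a bare chcon, the rule added by semanage is stored in the local policy, so a later restorecon (or restorecond) reapplies container_file_t instead of reverting it.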

Version-Release number of selected component (if applicable):
OCP cluster 3.11.88

How reproducible:
Every time

We need help determining whether there is a proper way to apply this fix.

Expected results:
The command should have run without the need to make any changes.


Additional info:

Any other information or documentation that can help identify this issue would be appreciated.

Comment 1 Greg Blomquist 2019-07-23 14:01:12 UTC

Seth, is this something that could or would be fixed in 4.x?

For 3.x, I wonder if we could settle with a kbase solution.  I don't want to set expectations that this 3.x fix will bubble to the top of the list.

Comment 2 Seth Jennings 2019-07-23 16:34:37 UTC

The referenced document is a blog post, not part of our official documentation (supported procedures).

The blog post should have included the chcon/semanage command from the beginning, as it is a required step for this to work with SELinux in enforcing mode.

The reporter is already doing the correct thing to resolve that issue.
