Bug 1835145 - After an unexpected CoreOS node shutdown (electricity outage), CRI-O fails to start on a node with a GPU
Summary: After an unexpected CoreOS node shutdown (electricity outage), CRI-O fails to start on a node with a GPU
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.2.z
Hardware: x86_64
OS: All
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Zvonko Kosic
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-13 08:53 UTC by Anoel Yakoubov
Modified: 2021-09-07 16:57 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-07 16:57:51 UTC
Target Upstream Version:
Embargoed:




Links:
Github cri-o/cri-o pull 4052 (closed): config: create hooks dir if not present. Last updated 2020-12-17 17:40:25 UTC
Github openshift/machine-config-operator pull 1998 (closed): Bug 1835145: crio: add temporary hooks directory. Last updated 2020-12-17 17:40:24 UTC

Description Anoel Yakoubov 2020-05-13 08:53:50 UTC
Description of problem:
After an unexpected shutdown (electricity outage) of a physical CoreOS worker node running OpenShift 4.2 with a GPU, CRI-O fails to start. In our opinion this happens because CoreOS does not get a chance to delete the NVIDIA hook for the container runtime, so after the server is powered back on, CRI-O looks for the hook binary and fails with "nvidia-container-toolkit: no such file or directory". As a workaround we delete the NVIDIA hook that resides at /etc/containers/oci/hooks.d/oci-nvidia-hook.json; after that CRI-O starts normally. To check this we rebooted the server gracefully, and the issue does not happen on a graceful reboot, which means CoreOS succeeds in deleting all the files it is supposed to delete before the reboot.
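For reference, the workaround amounts to removing the stale hook definition and restarting CRI-O (crio is the standard service name on RHCOS; the hook path is the one our GPU Operator 1.0.0 deployment uses):

  # rm /etc/containers/oci/hooks.d/oci-nvidia-hook.json
  # systemctl restart crio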

Version-Release number of selected component (if applicable):
OCP 4.2.9, CoreOS 4.2, GPU Operator 1.0.0 (Nvidia)

How reproducible:
Easily reproducible: deploy all components on a physical node and disconnect the node from electricity without a graceful shutdown.

Steps to Reproduce:
1. Deploy OCP 4.2
2. Add GPU (NVidia)
3. Deploy GPU Operator
4. Disconnect Server from electricity
5. Power on Server.


Actual results:
CRI-O does not start after the server is powered on; the NVIDIA hook must be deleted manually to allow CRI-O to start.

Expected results:

CRI-O should start automatically.

Additional info:

https://github.com/NVIDIA/gpu-operator

Comment 1 Colin Walters 2020-05-13 14:10:20 UTC
I don't think this is a RHCOS bug. That said, this is exactly the type of issue that drives the design behind https://github.com/coreos/fedora-coreos-tracker/issues/354

In other words, CRI-O hooks whose lifecycle is tied to a container image should live in `/run`, not `/etc`. Moving to the Node component for any additional comments; this issue has to be worked out between CRI-O and third-party hooks.
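To illustrate the point: the hook the GPU operator deploys points at a binary that only exists while the operator's container content is in place, so a boot-scoped copy under /run would look roughly like this (contents modeled on a typical oci-nvidia-hook.json; illustrative, not taken from this bug):

  /run/containers/oci/hooks.d/oci-nvidia-hook.json:
  {
      "version": "1.0.0",
      "hook": {
          "path": "/usr/bin/nvidia-container-toolkit",
          "args": ["nvidia-container-toolkit", "prestart"]
      },
      "when": {
          "always": true
      },
      "stages": ["prestart"]
  }

Because /run is a tmpfs, this file disappears on every boot, clean or not, so a power loss can never leave a stale hook behind.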

Comment 2 Kir Kolyshkin 2020-06-18 09:11:30 UTC
Currently cri-o supports /etc/containers/oci/hooks.d/ and /usr/share/containers/oci/hooks.d/

I agree it makes sense to add support for /run.
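For context, CRI-O's hook directories are configurable via hooks_dir in crio.conf, so /run support would amount to adding a boot-scoped entry along these lines (a sketch, not a shipped default):

  [crio.runtime]
  hooks_dir = [
      "/usr/share/containers/oci/hooks.d",
      "/etc/containers/oci/hooks.d",
      "/run/containers/oci/hooks.d",
  ]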

Comment 3 Kir Kolyshkin 2020-06-18 09:12:53 UTC
Having said that, it's a long-term solution. I am not quite sure how to fix this short-term, except maybe to write a hook that does nothing (or even removes itself) in case the binary it needs is not found.
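A sketch of that short-term idea: point the hook JSON at a small wrapper instead of the toolkit binary itself, so a missing binary degrades to a clean no-op (the paths are the ones from this bug; the wrapper itself is hypothetical):

  #!/bin/bash
  # Hypothetical wrapper for the NVIDIA prestart hook: if the toolkit
  # binary did not survive the reboot, remove the stale hook definition
  # and exit cleanly instead of failing container creation.
  TOOLKIT=/usr/bin/nvidia-container-toolkit
  HOOK_JSON=/etc/containers/oci/hooks.d/oci-nvidia-hook.json
  if [ ! -x "$TOOLKIT" ]; then
      rm -f "$HOOK_JSON"
      exit 0
  fi
  exec "$TOOLKIT" "$@"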

Comment 4 Kir Kolyshkin 2020-07-29 23:02:13 UTC
Peter, can you please look into adding /run/containers/oci/hooks.d support? I see you recently touched that area (https://github.com/openshift/machine-config-operator/pull/1902 and https://github.com/openshift/machine-config-operator/pull/1305)

Comment 5 Peter Hunt 2020-07-31 21:27:50 UTC
I will work on this next sprint

Comment 6 Peter Hunt 2020-08-06 17:50:54 UTC
The first step is to create a hooks dir if it's not present on the host. We've gone back and forth on hooks dir handling in the past, but I think this is the cleanest solution.

Comment 7 Peter Hunt 2020-08-10 14:17:15 UTC
Hooks handling in CRI-O is merged; now for the MCO bits. After the MCO fixes are merged, it'll be up to the PSAP team to finish the integration.
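For reference, one way an MCO-delivered change can guarantee the boot-scoped directory exists on every boot is a systemd-tmpfiles entry; this illustrates the mechanism and is not necessarily the actual content of MCO pull 1998:

  # /etc/tmpfiles.d/crio-hooks.conf (illustrative)
  # Create the boot-scoped hooks directory at every boot; /run is a
  # tmpfs, so nothing in it can go stale across a power loss.
  d /run/containers/oci/hooks.d 0755 root root -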

Comment 10 Peter Hunt 2020-08-20 13:40:46 UTC
Oops, it's premature to have this ON_QA. We need the operators to support this directory.

Comment 12 Zvonko Kosic 2020-11-05 19:50:24 UTC
NVIDIA has started to implement this feature for its newest operator.
The current release, 1.3.1, supports OpenShift 4.4, 4.5, and 4.6. We added some functionality
for them to detect which OpenShift version they are running on, so the operator can decide when to deploy the hook to /run.
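A sketch of that decision logic (the version cutoff and detection below are hypothetical; the operator's real logic lives in the gpu-operator repo):

  #!/bin/bash
  # Hypothetical version gate: deploy the hook to /run only on OpenShift
  # versions whose CRI-O searches /run/containers/oci/hooks.d.
  OCP_VERSION="4.6"   # assumed to be detected from the cluster
  case "$OCP_VERSION" in
      4.6*) HOOKS_DIR=/run/containers/oci/hooks.d ;;  # tmpfs, cleared every boot
      *)    HOOKS_DIR=/etc/containers/oci/hooks.d ;;  # persistent, can go stale
  esac
  echo "deploying oci-nvidia-hook.json to $HOOKS_DIR"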

Comment 14 Zvonko Kosic 2021-03-09 18:59:50 UTC
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 will be released in v1.7, in about 4 weeks.


Comment 16 Carlos Eduardo Arango Gutierrez 2021-03-25 16:05:21 UTC
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 has been merged; I think we can move forward with this BZ.
Thoughts?

Comment 17 dagray 2021-09-07 16:57:51 UTC
Closing this BZ, as the fix has been merged in the GPU Operator.

For future issues related to the GPU Operator, please open a BZ against the "ISV Operators" BZ component instead of the Special Resource Operator.

