Bug 1835145

Summary:	After CoreOS Node unexpected shutdown (Electricity Outage) for Node with GPU CRI-O failed to start
Product:	OpenShift Container Platform	Reporter:	Anoel Yakoubov <ayakoubo>
Component:	Special Resource Operator	Assignee:	Zvonko Kosic <zkosic>
Status:	CLOSED NOTABUG	QA Contact:	Walid A. <wabouham>
Severity:	high	Docs Contact:
Priority:	medium
Version:	4.2.z	CC:	aos-bugs, bbreard, carangog, dagray, imcleod, jligon, jokerman, nstielau, rphillips, walters
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	All
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-09-07 16:57:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Anoel Yakoubov 2020-05-13 08:53:50 UTC

Description of problem:
A physical worker node with a GPU, running CoreOS with OpenShift 4.2, shut down unexpectedly (electricity outage). After the node was powered back on, CRI-O failed to start. In our opinion, CoreOS did not finish deleting the NVIDIA hook for the container runtime before the power loss, so on startup CRI-O looks for the hook binary and fails with "nvidia-container-toolkit: no such file or directory". As a workaround, we delete the NVIDIA hook at /etc/containers/oci/hooks.d/oci-nvidia-hook.json; after that, CRI-O starts normally. To check this, we rebooted the server gracefully and the issue did not occur, which suggests that on a graceful reboot CoreOS succeeds in deleting all the files it is supposed to delete.
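
The manual workaround above amounts to finding a hook configuration whose target binary no longer exists. A minimal Python sketch of that check follows. It assumes the hook files are JSON with the executable given either as `hook.path` (1.0.0-style schema) or as a plain `hook` string (older schema); the function names are hypothetical and are not part of CRI-O or the GPU Operator. It only reports candidates and deletes nothing.

```python
import json
import os
from pathlib import Path


def hook_binary(config):
    """Return the executable path named by a hook config, or None."""
    if not isinstance(config, dict):
        return None
    hook = config.get("hook")
    if isinstance(hook, dict):   # assumed 1.0.0-style: {"hook": {"path": ...}}
        return hook.get("path")
    if isinstance(hook, str):    # assumed older style: {"hook": "/some/path"}
        return hook
    return None


def find_dangling_hooks(hooks_dir):
    """List hook JSON files in hooks_dir whose target binary does not exist."""
    dangling = []
    for cfg in sorted(Path(hooks_dir).glob("*.json")):
        try:
            binary = hook_binary(json.loads(cfg.read_text()))
        except (OSError, ValueError):
            continue  # unreadable or malformed file; leave it for a human
        if binary and not os.path.exists(binary):
            dangling.append(cfg)
    return dangling
```

Calling `find_dangling_hooks("/etc/containers/oci/hooks.d")` on an affected node would be expected to list oci-nvidia-hook.json, which can then be reviewed and removed by hand as described above.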

Version-Release number of selected component (if applicable):
OCP 4.2.9, CoreOS 4.2, GPU Operator 1.0.0 (Nvidia)

How reproducible:
Very easily reproducible: all components must be deployed on a physical node, and the node must be disconnected suddenly from electricity.

Steps to Reproduce:
1. Deploy OCP 4.2
2. Add GPU (NVidia)
3. Deploy GPU Operator
4. Disconnect Server from electricity
5. Power on Server.


Actual results:
CRI-O does not start after the server is powered on; the NVIDIA hook must be deleted manually to allow CRI-O to start.

Expected results:

CRI-O should start automatically.

Additional info:

https://github.com/NVIDIA/gpu-operator

Comment 1 Colin Walters 2020-05-13 14:10:20 UTC

I don't think this is a RHCOS bug.  That said this is exactly the type of issue that drives the design behind https://github.com/coreos/fedora-coreos-tracker/issues/354

IOW, the crio hooks that are "lifecycled" to a container image should live in `/run`, not `/etc`.  Moving to Node for any additional comments - this issue has to be worked through between crio and 3rd party hooks.

Comment 2 Kir Kolyshkin 2020-06-18 09:11:30 UTC

Currently cri-o supports /etc/containers/oci/hooks.d/ and /usr/share/containers/oci/hooks.d/

I agree it makes sense to add support for /run.

Comment 3 Kir Kolyshkin 2020-06-18 09:12:53 UTC

Having said that, it's a long term solution. I am not quite sure how to fix this short-term, except maybe to write a hook that does nothing (or even removes itself) in case the binary it needs is not found.
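
The short-term idea of a hook that does nothing when its binary is missing could look roughly like the Python sketch below. It is an illustration under assumptions, not the NVIDIA hook or anything CRI-O ships: `run_guarded_hook` is a hypothetical helper, and the wrapper itself would have to live somewhere that survives a reboot and be the path the hook JSON points at.

```python
import os
import subprocess
import sys


def run_guarded_hook(binary, args):
    """Run binary with args, or succeed as a no-op if the binary is missing."""
    if not (os.path.isfile(binary) and os.access(binary, os.X_OK)):
        # The real hook binary is gone (for example after a power loss):
        # report success so the container runtime is not blocked.
        print(f"hook target {binary} not found, skipping", file=sys.stderr)
        return 0
    # stdin is inherited, so the container state passed to the hook
    # still reaches the real binary.
    return subprocess.call([binary, *args])
```

A wrapper script would call `sys.exit(run_guarded_hook(real_binary, extra_args))`, with `real_binary` set to the toolkit path the original hook used.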

Comment 4 Kir Kolyshkin 2020-07-29 23:02:13 UTC

Peter, can you please look into adding /run/containers/oci/hooks.d support? I see you recently touched that area (https://github.com/openshift/machine-config-operator/pull/1902 and https://github.com/openshift/machine-config-operator/pull/1305)

Comment 5 Peter Hunt 2020-07-31 21:27:50 UTC

I will work on this next sprint

Comment 6 Peter Hunt 2020-08-06 17:50:54 UTC

The first step is to create a hooks dir if it's not present on the host. We've gone back and forth on hooks dir handling in the past, but I think this is the cleanest solution.

Comment 7 Peter Hunt 2020-08-10 14:17:15 UTC

Hooks handling in CRI-O is merged; now for the MCO bits. After the MCO fixes are merged, it'll be up to PSAP to finish the integration.

Comment 10 Peter Hunt 2020-08-20 13:40:46 UTC

oops, it's premature to have this ON_QA. We need the operators to support this directory

Comment 12 Zvonko Kosic 2020-11-05 19:50:24 UTC

NVIDIA has started to implement this feature for its newest operator. 
Current release 1.3.1 supports 4.4, 4.5 and 4.6. We added some functionality for 
them to detect which OpenShift version they are running to make decisions when to deploy the hook to /run.
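
The version-based decision described here can be sketched in Python as follows. This thread does not state which OpenShift release is the first to support the /run directory, so `first_run_version` is a caller-supplied placeholder, and both function names are hypothetical rather than taken from the operator's code.

```python
def parse_version(version):
    """Turn a string like '4.6.1' into (4, 6); patch level is ignored."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def choose_hooks_dir(cluster_version, first_run_version):
    """Pick the hooks directory for the given cluster version.

    first_run_version is an assumption supplied by the caller: the first
    release in which the container runtime reads hooks from /run.
    """
    if parse_version(cluster_version) >= parse_version(first_run_version):
        # Lives on a runtime filesystem, so it does not outlive a reboot.
        return "/run/containers/oci/hooks.d"
    # Older clusters only read the persistent location.
    return "/etc/containers/oci/hooks.d"
```

Comparing integer tuples rather than raw strings keeps a release such as 4.10 ordered after 4.6.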

Comment 14 Zvonko Kosic 2021-03-09 18:59:50 UTC

https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 Will be released in v1.7 ~4 weeks.

Comment 16 Carlos Eduardo Arango Gutierrez 2021-03-25 16:05:21 UTC

https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 has been merged. I think we can move forward with this BZ.
Thoughts?

Comment 17 dagray 2021-09-07 16:57:51 UTC

Closing this BZ, as the fix has been merged in the GPU Operator.

For future issues related to the GPU Operator please open a BZ against the BZ component "ISV Operators", instead of the Special Resource Operator.