Bug 1835145 - After an unexpected CoreOS node shutdown (electricity outage), CRI-O fails to start on a node with a GPU
Summary: After an unexpected CoreOS node shutdown (electricity outage), CRI-O fails to start on a node with a GPU
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Special Resource Operator
Version: 4.2.z
Hardware: x86_64
OS: All
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Zvonko Kosic
QA Contact: Walid A.
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-13 08:53 UTC by Anoel Yakoubov
Modified: 2021-09-07 16:57 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-07 16:57:51 UTC
Target Upstream Version:
Embargoed:




Links:
Github cri-o/cri-o pull 4052 (closed): config: create hooks dir if not present. Last updated 2020-12-17 17:40:25 UTC
Github openshift/machine-config-operator pull 1998 (closed): Bug 1835145: crio: add temporary hooks directory. Last updated 2020-12-17 17:40:24 UTC

Description Anoel Yakoubov 2020-05-13 08:53:50 UTC
Description of problem:
After an unexpected shutdown (electricity outage) of a physical CoreOS worker node running OpenShift 4.2 with a GPU, CRI-O fails to start. In our opinion this happens because CoreOS does not get a chance to delete the NVIDIA hook for the container runtime, so after the server is powered back on, CRI-O looks for the hook binary and fails with "nvidia-container-toolkit: no such file or directory". As a workaround we delete the NVIDIA hook that resides at /etc/containers/oci/hooks.d/oci-nvidia-hook.json; after that CRI-O starts normally. To check this we rebooted the server gracefully, and the issue does not happen on a graceful reboot, which means CoreOS succeeds in deleting all the files it is supposed to delete before the reboot.
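For reference, the workaround amounts to removing the stale hook definition and restarting CRI-O (crio is the standard service name on RHCOS; the hook path is the one our GPU Operator 1.0.0 deployment uses):

  # rm /etc/containers/oci/hooks.d/oci-nvidia-hook.json
  # systemctl restart crio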

Version-Release number of selected component (if applicable):
OCP 4.2.9, CoreOS 4.2, GPU Operator 1.0.0 (Nvidia)

How reproducible:
Easily reproducible: deploy all components on a physical node and disconnect the node from electricity without a graceful shutdown.

Steps to Reproduce:
1. Deploy OCP 4.2
2. Add GPU (NVidia)
3. Deploy GPU Operator
4. Disconnect Server from electricity
5. Power on Server.


Actual results:
CRI-O does not start after the server is powered on; the NVIDIA hook must be deleted manually to allow CRI-O to start.

Expected results:

CRI-O should start automatically.

Additional info:

https://github.com/NVIDIA/gpu-operator

Comment 1 Colin Walters 2020-05-13 14:10:20 UTC
I don't think this is a RHCOS bug. That said, this is exactly the type of issue that drives the design behind https://github.com/coreos/fedora-coreos-tracker/issues/354

In other words, CRI-O hooks whose lifecycle is tied to a container image should live in `/run`, not `/etc`. Moving to the Node component for any additional comments; this issue has to be worked out between CRI-O and third-party hooks.
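To illustrate the point: the hook the GPU operator deploys points at a binary that only exists while the operator's container content is in place, so a boot-scoped copy under /run would look roughly like this (contents modeled on a typical oci-nvidia-hook.json; illustrative, not taken from this bug):

  /run/containers/oci/hooks.d/oci-nvidia-hook.json:
  {
      "version": "1.0.0",
      "hook": {
          "path": "/usr/bin/nvidia-container-toolkit",
          "args": ["nvidia-container-toolkit", "prestart"]
      },
      "when": {
          "always": true
      },
      "stages": ["prestart"]
  }

Because /run is a tmpfs, this file disappears on every boot, clean or not, so a power loss can never leave a stale hook behind.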

Comment 2 Kir Kolyshkin 2020-06-18 09:11:30 UTC
Currently cri-o supports /etc/containers/oci/hooks.d/ and /usr/share/containers/oci/hooks.d/

I agree it makes sense to add support for /run.
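For context, CRI-O's hook directories are configurable via hooks_dir in crio.conf, so /run support would amount to adding a boot-scoped entry along these lines (a sketch, not a shipped default):

  [crio.runtime]
  hooks_dir = [
      "/usr/share/containers/oci/hooks.d",
      "/etc/containers/oci/hooks.d",
      "/run/containers/oci/hooks.d",
  ]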

Comment 3 Kir Kolyshkin 2020-06-18 09:12:53 UTC
Having said that, it's a long-term solution. I am not quite sure how to fix this short-term, except maybe to write a hook that does nothing (or even removes itself) in case the binary it needs is not found.
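A sketch of that short-term idea: point the hook JSON at a small wrapper instead of the toolkit binary itself, so a missing binary degrades to a clean no-op (the paths are the ones from this bug; the wrapper itself is hypothetical):

  #!/bin/bash
  # Hypothetical wrapper for the NVIDIA prestart hook: if the toolkit
  # binary did not survive the reboot, remove the stale hook definition
  # and exit cleanly instead of failing container creation.
  TOOLKIT=/usr/bin/nvidia-container-toolkit
  HOOK_JSON=/etc/containers/oci/hooks.d/oci-nvidia-hook.json
  if [ ! -x "$TOOLKIT" ]; then
      rm -f "$HOOK_JSON"
      exit 0
  fi
  exec "$TOOLKIT" "$@"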

Comment 4 Kir Kolyshkin 2020-07-29 23:02:13 UTC
Peter, can you please look into adding /run/containers/oci/hooks.d support? I see you recently touched that area (https://github.com/openshift/machine-config-operator/pull/1902 and https://github.com/openshift/machine-config-operator/pull/1305)

Comment 5 Peter Hunt 2020-07-31 21:27:50 UTC
I will work on this next sprint

Comment 6 Peter Hunt 2020-08-06 17:50:54 UTC
The first step is to create a hooks dir if it's not present on the host. We've gone back and forth on hooks dir handling in the past, but I think this is the cleanest solution.

Comment 7 Peter Hunt 2020-08-10 14:17:15 UTC
Hooks handling in CRI-O is merged; now for the MCO bits. After the MCO fixes are merged, it'll be up to the PSAP team to finish the integration.
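For reference, one way an MCO-delivered change can guarantee the boot-scoped directory exists on every boot is a systemd-tmpfiles entry; this illustrates the mechanism and is not necessarily the actual content of MCO pull 1998:

  # /etc/tmpfiles.d/crio-hooks.conf (illustrative)
  # Create the boot-scoped hooks directory at every boot; /run is a
  # tmpfs, so nothing in it can go stale across a power loss.
  d /run/containers/oci/hooks.d 0755 root root -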

Comment 10 Peter Hunt 2020-08-20 13:40:46 UTC
Oops, it's premature to have this ON_QA. We need the operators to support this directory.

Comment 12 Zvonko Kosic 2020-11-05 19:50:24 UTC
NVIDIA has started to implement this feature for its newest operator.
The current release, 1.3.1, supports OpenShift 4.4, 4.5, and 4.6. We added some functionality
for them to detect which OpenShift version they are running on, so the operator can decide when to deploy the hook to /run.
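A sketch of that decision logic (the version cutoff and detection below are hypothetical; the operator's real logic lives in the gpu-operator repo):

  #!/bin/bash
  # Hypothetical version gate: deploy the hook to /run only on OpenShift
  # versions whose CRI-O searches /run/containers/oci/hooks.d.
  OCP_VERSION="4.6"   # assumed to be detected from the cluster
  case "$OCP_VERSION" in
      4.6*) HOOKS_DIR=/run/containers/oci/hooks.d ;;  # tmpfs, cleared every boot
      *)    HOOKS_DIR=/etc/containers/oci/hooks.d ;;  # persistent, can go stale
  esac
  echo "deploying oci-nvidia-hook.json to $HOOKS_DIR"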

Comment 14 Zvonko Kosic 2021-03-09 18:59:50 UTC
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 will be released in v1.7, in about 4 weeks.


Comment 16 Carlos Eduardo Arango Gutierrez 2021-03-25 16:05:21 UTC
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 has been merged; I think we can move forward with this BZ.
Thoughts?

Comment 17 dagray 2021-09-07 16:57:51 UTC
Closing this BZ, as the fix has been merged in the GPU Operator.

For future issues related to the GPU Operator, please open a BZ against the "ISV Operators" BZ component instead of the Special Resource Operator.

