Description of problem:
After an unexpected shutdown (electricity outage) of a physical OpenShift 4.2 worker node with a GPU, CRI-O fails to start. In our opinion, CoreOS did not get a chance to delete the NVIDIA container runtime hook, so after the server is powered back on, CRI-O fails with "nvidia-container-toolkit: no such file or directory".

As a workaround, we delete the NVIDIA hook at /etc/containers/oci/hooks.d/oci-nvidia-hook.json, after which CRI-O starts normally. To verify this, we rebooted the server gracefully and the issue did not occur; that is, on a graceful reboot CoreOS succeeds in deleting all the files it is supposed to delete.

Version-Release number of selected component (if applicable):
OCP 4.2.9, CoreOS 4.2, GPU Operator 1.0.0 (NVIDIA)

How reproducible:
Easily reproducible - all components must be deployed on a physical node, and the node must be suddenly disconnected from electricity.

Steps to Reproduce:
1. Deploy OCP 4.2
2. Add a GPU (NVIDIA)
3. Deploy the GPU Operator
4. Disconnect the server from electricity
5. Power the server back on

Actual results:
CRI-O does not start after the server is powered on; the NVIDIA hook must be deleted manually to allow CRI-O to start.

Expected results:
CRI-O should start automatically.

Additional info:
https://github.com/NVIDIA/gpu-operator
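For context, here is a minimal Go sketch of the failure mode as we understand it, assuming the runtime validates each hook JSON in the persistent hooks directory at startup and refuses to start when the hook's executable is missing. The schema subset, paths, and check below are illustrative assumptions, not CRI-O's actual code.

```go
// Illustrative sketch (not CRI-O's actual code): load each hook JSON
// from the persistent hooks directory and fail if the executable it
// points at no longer exists -- which is what happens after an unclean
// shutdown leaves a stale oci-nvidia-hook.json behind.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// minimal subset of the OCI hook config schema
type hookConfig struct {
	Hook struct {
		Path string `json:"path"` // e.g. the nvidia-container-toolkit binary
	} `json:"hook"`
}

func loadHooks(dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if filepath.Ext(e.Name()) != ".json" {
			continue
		}
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return err
		}
		var cfg hookConfig
		if err := json.Unmarshal(data, &cfg); err != nil {
			return err
		}
		// If the hook binary is gone, loading fails and the
		// runtime cannot start -- the symptom in this report.
		if _, err := os.Stat(cfg.Hook.Path); err != nil {
			return fmt.Errorf("hook %s: %w", e.Name(), err)
		}
	}
	return nil
}

func main() {
	if err := loadHooks("/etc/containers/oci/hooks.d"); err != nil {
		fmt.Fprintln(os.Stderr, "failed to load hooks:", err)
		os.Exit(1)
	}
}
```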
I don't think this is an RHCOS bug. That said, this is exactly the type of issue that drives the design behind https://github.com/coreos/fedora-coreos-tracker/issues/354. In other words, CRI-O hooks that are "lifecycled" with a container image should live in `/run`, not `/etc` - `/run` is a tmpfs that starts empty on every boot, so a stale hook file cannot survive a power loss the way one in `/etc` can. Moving to Node for any additional comments - this issue has to be worked through between CRI-O and third-party hooks.
Currently CRI-O supports /etc/containers/oci/hooks.d/ and /usr/share/containers/oci/hooks.d/. I agree it makes sense to add support for /run.
Having said that, it's a long-term solution. I am not quite sure how to fix this short-term, except maybe to write a hook that does nothing (or even removes itself) when the binary it needs is not found; see the sketch below.
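A minimal sketch of that short-term idea, assuming a hypothetical wrapper binary installed at a persistent path: the hook JSON would point at the wrapper instead of the toolkit itself, and the wrapper execs the real toolkit when it is present, otherwise exits cleanly. The toolkit path is an assumption for illustration.

```go
// Hypothetical wrapper hook sketch: installed at a persistent path and
// referenced from oci-nvidia-hook.json. If the real toolkit binary is
// missing (e.g. after an unclean shutdown), exit 0 instead of failing.
package main

import (
	"os"
	"syscall"
)

// assumed location of the real NVIDIA toolkit binary; illustrative only
const realToolkit = "/usr/local/nvidia/toolkit/nvidia-container-toolkit"

func main() {
	if _, err := os.Stat(realToolkit); err != nil {
		// Toolkit not present: do nothing, so container startup
		// is not blocked by a stale hook.
		os.Exit(0)
	}
	// Toolkit present: replace this process with the real hook,
	// forwarding the arguments the runtime passed to us.
	if err := syscall.Exec(realToolkit, os.Args, os.Environ()); err != nil {
		os.Exit(1)
	}
}
```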
Peter, can you please look into adding /run/containers/oci/hooks.d support? I see you recently touched that area (https://github.com/openshift/machine-config-operator/pull/1902 and https://github.com/openshift/machine-config-operator/pull/1305)
I will work on this next sprint
The first step is to create the hooks dir if it's not present on the host. We've gone back and forth on hooks dir handling in the past, but I think this is the cleanest solution; a sketch of the idea follows.
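A minimal sketch of that first step, under the assumption that the runtime (or the MCO-rendered config) simply ensures the directory exists before hook discovery; the path and permissions are illustrative.

```go
// Illustrative sketch: ensure the /run hooks directory exists before
// hook discovery. Since /run is a fresh tmpfs on every boot, the
// directory must be (re)created each time.
package main

import (
	"log"
	"os"
)

const runHooksDir = "/run/containers/oci/hooks.d" // proposed directory

func main() {
	// MkdirAll is a no-op if the directory already exists.
	if err := os.MkdirAll(runHooksDir, 0o755); err != nil {
		log.Fatalf("creating hooks dir %s: %v", runHooksDir, err)
	}
	log.Printf("hooks dir %s is ready", runHooksDir)
}
```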
Hooks handling in CRI-O is merged; now for the MCO bits. After the MCO fixes are merged, it'll be up to the PSAP team to finish the integration.
oops, it's premature to have this ON_QA. We need the operators to support this directory
NVIDIA has started to implement this feature for its newest operator. The current release, 1.3.1, supports 4.4, 4.5 and 4.6. We added some functionality for them to detect which OpenShift version they are running on, to decide when to deploy the hook to /run.
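A rough sketch of that version gate, with an assumed cutover at 4.6 and hypothetical helper names; the real logic lives in the GPU Operator repository and may differ.

```go
// Hypothetical sketch of choosing the hook destination based on the
// detected OpenShift version. The 4.6 cutover and helper names are
// assumptions for illustration, not the operator's actual code.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hookDirFor returns the hooks directory for a given OpenShift
// version string such as "4.5" or "4.6".
func hookDirFor(ocpVersion string) string {
	parts := strings.SplitN(ocpVersion, ".", 3)
	if len(parts) >= 2 {
		major, err1 := strconv.Atoi(parts[0])
		minor, err2 := strconv.Atoi(parts[1])
		// Assumed: 4.6+ supports /run-based hooks.
		if err1 == nil && err2 == nil && (major > 4 || (major == 4 && minor >= 6)) {
			return "/run/containers/oci/hooks.d"
		}
	}
	// Older releases keep the persistent directory.
	return "/etc/containers/oci/hooks.d"
}

func main() {
	for _, v := range []string{"4.4", "4.5", "4.6"} {
		fmt.Printf("OpenShift %s -> %s\n", v, hookDirFor(v))
	}
}
```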
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 will be released in v1.7, in ~4 weeks.
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 has been merged. I think we can move forward with this BZ - thoughts?
Closing this BZ, as the fix has been merged in the GPU Operator. For future issues related to the GPU Operator, please open a BZ against the "ISV Operators" component instead of the Special Resource Operator.