Bug 1835145
Summary: | After CoreOS Node unexpected shutdown (Electricity Outage) for Node with GPU CRI-O failed to start | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Anoel Yakoubov <ayakoubo> |
Component: | Special Resource Operator | Assignee: | Zvonko Kosic <zkosic> |
Status: | CLOSED NOTABUG | QA Contact: | Walid A. <wabouham> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 4.2.z | CC: | aos-bugs, bbreard, carangog, dagray, imcleod, jligon, jokerman, nstielau, rphillips, walters |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | All | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2021-09-07 16:57:51 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Anoel Yakoubov
2020-05-13 08:53:50 UTC
I don't think this is a RHCOS bug. That said, this is exactly the type of issue that drives the design behind https://github.com/coreos/fedora-coreos-tracker/issues/354. IOW, the CRI-O hooks that are "lifecycled" to a container image should live in `/run`, not `/etc`. Moving to Node for any additional comments - this issue has to be worked through between CRI-O and 3rd party hooks.

Currently CRI-O supports /etc/containers/oci/hooks.d/ and /usr/share/containers/oci/hooks.d/. I agree it makes sense to add support for /run. Having said that, it's a long-term solution. I am not quite sure how to fix this short-term, except maybe to write a hook that does nothing (or even removes itself) in case the binary it needs is not found.

Peter, can you please look into adding /run/containers/oci/hooks.d support? I see you recently touched that area (https://github.com/openshift/machine-config-operator/pull/1902 and https://github.com/openshift/machine-config-operator/pull/1305).

I will work on this next sprint.

The first step is to create a hooks dir if it's not present on the host. We've gone back and forth on hooks dir handling in the past, but I think this is the cleanest solution.

Hooks handling in CRI-O is merged, now for the MCO bits.

After the MCO fixes are merged, it'll be up to the PSAP to finish the integration.

Oops, it's premature to have this ON_QA. We need the operators to support this directory.

NVIDIA has started to implement this feature for its newest operator. Current release 1.3.1 supports 4.4, 4.5 and 4.6. We added some functionality for them to detect which OpenShift version they are running to make decisions when to deploy the hook to /run.

https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 will be released in v1.7 in ~4 weeks.

https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/192 has been merged, so I think we can move forward with this BZ. Thoughts?

Closing this BZ, as the fix has been merged in the GPU Operator. For future issues related to the GPU Operator, please open a BZ against the component "ISV Operators" instead of the Special Resource Operator.
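
For anyone who needs the short-term workaround suggested earlier in the thread (a hook that does nothing when the binary it needs is not found), here is a rough Python sketch of such a wrapper. It is not what the GPU Operator or CRI-O ship. The path in `REAL_HOOK` is a made-up placeholder, since the thread does not name the actual hook binary, and `resolve_hook` and `main` are illustrative names only.

```python
import os
import sys

# Hypothetical location of the real hook binary; the thread does not give one.
REAL_HOOK = "/opt/example/bin/real-hook"


def resolve_hook(path):
    """Return path if it is an executable regular file, otherwise None."""
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return path
    return None


def main(argv, real_hook=REAL_HOOK):
    """Run the real hook if it exists; otherwise succeed without doing anything."""
    target = resolve_hook(real_hook)
    if target is None:
        # The binary is missing (for example after an unclean shutdown),
        # so report success instead of failing the container start.
        return 0
    # Replace this process with the real hook. stdin is inherited, so the
    # container state JSON passed by the runtime still reaches the hook.
    os.execv(target, [target] + list(argv[1:]))
```

To use it as a hook, the script's entry point would call `sys.exit(main(sys.argv))`, and the hook JSON in hooks.d would point at this wrapper instead of the real binary. This only masks the stale `/etc` entry; the `/run` hooks directory discussed above remains the actual fix.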