Description of problem:
CRI-O fails to create Pods due to a CNI issue. The issue started with Pipeline Task Pods and affected some community Operators as well. Not observed prior to OCP 4.6.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Customer specific

Steps to Reproduce:
1. Create a Pipeline Task; the Pods fail to be created due to a CNI issue.
2.
3.

Actual results:
Pod creation fails; the Pod is only created after rebooting the node.

Expected results:
The Pod is created without issue.

Additional info:
Azure IPI deployment
What are the contents of /etc/containers/oci/hooks.d and /run/containers/oci/hooks.d?
Hello Hunt,

Please see the contents inside the directories.

---
# ls -lah /etc/containers/oci/hooks.d/
total 0
drwxr-xr-x. 2 root root  6 Feb 17 12:24 .
drwxr-xr-x. 3 root root 21 Feb 17 12:24 ..

# ls -lah /run/containers/oci/hooks.d/
total 0
drwxr-xr-x. 2 root root 40 Feb 17 12:25 .
drwxr-xr-x. 3 root root 60 Feb 17 12:25 ..
---

Thanks,
Vinu K
Hello Team,

Any update on this?

Thanks,
Vinu K
I'm frankly quite confused by this. If there are no hooks specified, I don't understand why we're trying to run one. I need to look deeper into the logs.
It looks like there is only one hooks location configured within /etc/crio/crio.conf:

hooks_dir = [
    "/usr/share/containers/oci/hooks.d",
]

Just to double check, because I'm not sure if the support files contain the full content of that directory: is the path /usr/share/containers/oci/hooks.d empty, too?

Thank you!
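For example, something like this on the node would confirm it (a rough sketch; the `crio config` subcommand prints the effective configuration and its output may differ slightly by version):

---
# Show which hook directories CRI-O actually uses:
crio config 2>/dev/null | grep -A 3 hooks_dir

# List the single configured hooks location:
ls -lah /usr/share/containers/oci/hooks.d/
---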
Hello Grunert,

Please see the below output. It is empty.

[root@azu-eng-1-cluster-h4frn-worker-weurope-standard-d-4as-v4-32bzt9 /]# ls -lha /usr/share/containers/oci/hooks.d/
total 0
drwxr-xr-x. 2 root root  6 Jan  1  1970 .
drwxr-xr-x. 3 root root 21 Jan  1  1970 ..

Thanks,
Vinu K
Hey Vinu, do you think it would be possible to access the customer node directly within a remote session? This way we can check directly why it looks like CRI-O picks up an OCI hook at runtime.
Hello Sascha,

Sure, we can arrange a remote session with the customer. Please allow me some time to confirm the availability of the customer. I will update you here.

Thanks,
Vinu K
We had a customer session which provided only partial insights into the issue. After a certain pipeline step the whole node becomes unusable, and the issue is reproducible. At that point it is no longer possible to create new sandboxes, and runc returns the `running hook: exit status 255` error.

I can investigate the issue further by looking at the pipeline pod which causes it. Vinu, do you think you could get this pod definition from the customer? (output of: kubectl get pod <breaking-pod> -o yaml)

Nevertheless, I would like to try a runc upgrade on the node to see if we can scope the issue to it. Replacing the runc binary with a latest static build should be sufficient. We're checking if the customer is running a test environment where we could try this.
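For reference, a rough sketch of the swap I have in mind (the download URL below is a placeholder, and on RHCOS /usr is read-only, so a transient overlay is needed first):

---
# Make /usr temporarily writable on RHCOS (reverts on reboot):
rpm-ostree usroverlay

# Keep a copy of the shipped binary, then drop in the static build:
cp /usr/bin/runc /root/runc.backup
curl -sSL -o /usr/bin/runc https://example.com/runc.amd64   # placeholder URL
chmod +x /usr/bin/runc

# Verify that the new version is in place:
runc --version
---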
Hello Sascha,

I have attached the Pod definition of the breaking Pod. I got confirmation from the customer that we can try the runc upgrade on the host. Let us schedule a meeting for the same.

Thanks,
Vinu K
Created attachment 1767953 [details]
Pod definition of the breaking Pod
We did the customer call and updated runc to the latest master (including a build fix which is not related to this issue):
https://github.com/opencontainers/runc/pull/2908/commits/2f1a3ed3087b7a23fbbcd98659cacfcb08ed6bd5

runc version 1.0.0-rc93+dev
spec: 1.0.2-dev
go: go1.15.11
libseccomp: 2.5.1

Now the issue seems to be gone. The customer will verify whether that is still the case when running more extensive tests. If this fixes the problem, then we may have to update runc in OpenShift.
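As a quick smoke test for the original failure mode, something like this can be run against the previously affected node (pod name, image, and node name below are placeholders):

---
# Schedule a throwaway pod on the node; sandbox creation should now
# succeed instead of failing with `running hook: exit status 255`:
kubectl run hook-smoke-test --image=registry.access.redhat.com/ubi8/ubi-minimal \
    --restart=Never --overrides='{"spec":{"nodeName":"<affected-node>"}}' -- sleep 30

kubectl get pod hook-smoke-test -o wide   # should reach Running on that node
---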
Hello Team,

The customer has confirmed that all issues were resolved by replacing runc with the latest version on the node. We are now waiting for the permanent update of the runc binary.

Thanks,
Vinu K
Hey Vinu, can you confirm that the customer is happy to upgrade their OpenShift instance, so that we can plan to update runc in one of the next releases of OpenShift?
Hey Jatan, we're currently evaluating whether we can upgrade runc for 4.7, so unfortunately it won't land in 4.6. I'll update this bug once the update is done.
I tagged the runc builds for 4.8 (RHEL 7/8) back to 4.7, which means they should be available within the next 4.7 release:

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1579139
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1579125
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
Hello Team,

Will this be backported to 4.7.x as well?

Thanks,
Vinu K
(In reply to Vinu K from comment #24)
> Hello Team,
>
> Will this be backported to 4.7.x as well?
>
> Thanks,
> Vinu K

Hey Vinu, yes, the fix should be part of 4.7, too.

Best,
Sascha