Bug 1932324 - CRIO fails to create a Pod in sandbox stage - starting container process caused: process_linux.go:472: container init caused: Running hook #0:: error running hook: exit status 255, stdout: , stderr:
Summary: CRIO fails to create a Pod in sandbox stage - starting container process cau...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Sascha Grunert
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-24 13:34 UTC by Vinu K
Modified: 2021-08-23 07:42 UTC (History)
5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:48:06 UTC
Target Upstream Version:
Embargoed:


Attachments
Pod definition of the breaking Pod (19.47 KB, text/plain)
2021-03-31 06:39 UTC, Vinu K


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:48:17 UTC

Description Vinu K 2021-02-24 13:34:11 UTC
Description of problem:
CRI-O fails to create Pods due to a CNI issue. The problem started with Pipeline Task Pods and affected some community Operators as well. It was not observed before OCP 4.6.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Customer-specific; not reproducible outside the customer environment.

Steps to Reproduce:
1. Create a Pipeline Task; the Pods fail to be created due to the CNI issue.

Actual results:
CRI-O fails to create the Pod sandbox with `running hook: exit status 255`; the Pod is only created after rebooting the node.

Expected results:
The Pod should be created without issue.

Additional info:
Azure IPI deployment

Comment 2 Peter Hunt 2021-02-24 16:53:13 UTC
what are the contents of /etc/containers/oci/hooks.d and /run/containers/oci/hooks.d

Comment 3 Vinu K 2021-03-02 14:34:54 UTC
Hello Hunt,

Please see the contents inside the directories.

---
# ls -lah /etc/containers/oci/hooks.d/
total 0
drwxr-xr-x. 2 root root  6 Feb 17 12:24 .
drwxr-xr-x. 3 root root 21 Feb 17 12:24 ..

# ls -lah /run/containers/oci/hooks.d/
total 0
drwxr-xr-x. 2 root root 40 Feb 17 12:25 .
drwxr-xr-x. 3 root root 60 Feb 17 12:25 ..
---

Thanks,
Vinu K

Comment 4 Vinu K 2021-03-04 11:28:11 UTC
Hello Team,

Any update on this?

Thanks,
Vinu K

Comment 5 Peter Hunt 2021-03-04 15:11:22 UTC
I'm frankly quite confused by this. If you have no hooks specified, I don't understand why we're trying to run one. I need to look deeper into the logs.

Comment 6 Sascha Grunert 2021-03-05 10:02:59 UTC
It looks like there is only one hooks location configured within /etc/crio/crio.conf:

hooks_dir = [
	"/usr/share/containers/oci/hooks.d",
]

Just to double check, because I'm not sure whether the support files contain the full content of that directory: is the path /usr/share/containers/oci/hooks.d empty, too?
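As a side note, all the candidate locations can be checked in one go. This is just a sketch: the directory list below combines the configured hooks_dir with the default paths already inspected in this bug, and may differ on other setups.

```shell
# check_hooks_dirs prints, for each directory given, how many *.json
# OCI hook files it contains (or "not present" if the directory is missing).
check_hooks_dirs() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      count=$(find "$dir" -maxdepth 1 -name '*.json' | wc -l | tr -d ' ')
      echo "$dir: $count hook file(s)"
    else
      echo "$dir: not present"
    fi
  done
}

# Paths assumed from this bug: the configured hooks_dir plus the defaults.
check_hooks_dirs \
  /usr/share/containers/oci/hooks.d \
  /etc/containers/oci/hooks.d \
  /run/containers/oci/hooks.d
```

Any non-zero count would point at the hook that runc is trying to execute.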

Thank you!

Comment 7 Vinu K 2021-03-09 06:38:50 UTC
Hello Grunert,

Please see the below output. It is empty.

[root@azu-eng-1-cluster-h4frn-worker-weurope-standard-d-4as-v4-32bzt9 /]# ls -lha /usr/share/containers/oci/hooks.d/
total 0
drwxr-xr-x. 2 root root  6 Jan  1  1970 .
drwxr-xr-x. 3 root root 21 Jan  1  1970 ..

Thanks,
Vinu K

Comment 8 Sascha Grunert 2021-03-16 15:07:04 UTC
Hey Vinu,

do you think it would be possible to access the customer node directly in a remote session? That way we can check directly why CRI-O appears to pick up an OCI hook at runtime.

Comment 9 Vinu K 2021-03-19 08:50:21 UTC
Hello Sascha,

Sure, we can arrange a remote session with the customer. Please allow me some time to confirm the availability of the customer. I will update you here.

Thanks,
Vinu K

Comment 10 Sascha Grunert 2021-03-26 10:57:42 UTC
We had a customer session which provided only partial insights into the issue. From a certain pipeline step onward the whole node seems unusable, and this is reproducible. At that point it is no longer possible to create new sandboxes, and runc returns the `running hook: exit status 255` error.

I can investigate the issue further by looking at the pipeline pod which causes the issue. Vinu, do you think you could get this pod definition from the customer? (output of: kubectl get pod <breaking-pod> -o yaml)

Nevertheless, I would like to try a runc upgrade on the node to see if we can scope the issue to it. Replacing the runc binary with the latest static build should be sufficient. We're checking whether the customer has a test environment where we could try this.

Comment 11 Vinu K 2021-03-31 06:38:38 UTC
Hello Sascha,

I have attached the Pod definition of the breaking Pod. I got the confirmation from the customer that we can try the runc upgrade on the host. Let us schedule a meeting for the same.

Thanks,
Vinu K

Comment 12 Vinu K 2021-03-31 06:39:30 UTC
Created attachment 1767953 [details]
Pod definition of the breaking Pod

Comment 13 Sascha Grunert 2021-04-14 13:38:46 UTC
We did the customer call and updated runc to the latest master (including a build fix which is not related to this issue):

https://github.com/opencontainers/runc/pull/2908/commits/2f1a3ed3087b7a23fbbcd98659cacfcb08ed6bd5

runc version 1.0.0-rc93+dev
spec: 1.0.2-dev
go: go1.15.11
libseccomp: 2.5.1

Now, the issue seems to be gone. The customer will verify if that is still the case when running more extensive tests.

If this fixes the problem then we may have to update runc in OpenShift.

Comment 14 Vinu K 2021-04-14 16:55:45 UTC
Hello Team,

The customer has confirmed that all issues were resolved by replacing runc with the latest version on the node. We are waiting for the permanent update of the runc binary.

Thanks,
Vinu K

Comment 15 Sascha Grunert 2021-04-15 07:19:15 UTC
Hey Vinu, can you confirm that the customer is happy to upgrade their OpenShift instance, so that we can plan to update runc in one of the next releases of OpenShift?

Comment 17 Sascha Grunert 2021-04-27 13:48:18 UTC
Hey Jatan, we're right now evaluating if we can upgrade runc for 4.7, so it won't land in 4.6 unfortunately. I'll update this bug once the update is done.

Comment 18 Sascha Grunert 2021-04-27 15:12:21 UTC
I tagged the runc builds for 4.8 (RHEL 7/8) back to 4.7, which means they should be available in the next 4.7 release:

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1579139
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1579125

Comment 23 errata-xmlrpc 2021-07-27 22:48:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 24 Vinu K 2021-08-08 05:17:52 UTC
Hello Team,

Will this be backported to 4.7.x as well?

Thanks,
Vinu K

Comment 25 Sascha Grunert 2021-08-09 07:16:46 UTC
(In reply to Vinu K from comment #24)
> Hello Team,
> 
> Will this be backported to 4.7.x as well?
> 
> Thanks,
> Vinu K

Hey Vinu,

yes, the fix should be part of 4.7, too.

Best,
Sascha
