1950536 – OpenShift pods stay in a "CreateContainerError" status and show the "executable file not found in $PATH" error.


Keywords:
Status:	CLOSED DUPLICATE of bug 1942536
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.6
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Peter Hunt
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-04-16 19:53 UTC by Alfredo Pizarro
Modified:	2022-03-15 07:28 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-04-16 20:08:40 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	5972661	0	None	None	None	2022-03-15 07:28:15 UTC

Description Alfredo Pizarro 2021-04-16 19:53:25 UTC

Description of problem:

After an issue with the master nodes during an upgrade from 4.5.16 to 4.6.21 (related to another component), the nodes were powered off multiple times in an attempt to stabilize them. After this, one node couldn't bring up part of its control plane:

$ oc get pods 

NAME                                          READY  STATUS                RESTARTS  AGE
kube-controller-manager-master1.example.com   3/4    CreateContainerError  0         18h
kube-controller-manager-master2.example.com   4/4    Running               0         12m
kube-controller-manager-master3.example.com   4/4    Running               0         18h


$ oc describe pod kube-controller-manager-master1.example.com

Events:
  Type     Reason  Age                     From     Message
  ----     ------  ----                    ----     -------
  Warning  Failed  59m (x8706 over 17h)    kubelet  (combined from similar events): Error: container create failed: time="2021-04-11T21:03:07Z" level=error msg="container_linux.go:366: starting container process caused: exec: \"cluster-kube-scheduler-operator\": executable file not found in $PATH"
  Normal   Pulled  4m10s (x4600 over 17h)  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6468c1dd1ca2d855e171dda54efcb56b8915ba65f9b915899d922c8720d8e7e1" already present on machine.

We removed and re-pulled the problematic image with podman rmi and podman pull, but that didn't resolve the issue. After some more troubleshooting, we tried to run the image directly with podman:

[root@master1 ~]# podman run 2810ace6e1fe

And obtained the following error:
readlink /var/lib/containers/storage/overlay: invalid argument

It seems the overlay storage somehow got corrupted, so as a workaround we:
- Cordoned and drained the node.
- Stopped all containers, cri-o and kubelet.
- Deleted /var/lib/containers/storage/overlay*
- Started the services again. All container images were downloaded correctly and the issue was fixed.
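The workaround above can be sketched as a dry run that only prints the commands, so they can be reviewed before anything is executed on a node. The function name print_overlay_reset_plan, the drain flags, the use of crictl to stop containers, and the order (kubelet stopped first so it does not restart containers) are assumptions for illustration and were not part of the report. The storage path is the one shown in the readlink error.

```shell
#!/bin/sh
# Dry run only: prints the workaround steps for one node and executes nothing.
# The function name, flags and ordering are assumptions, not taken from the report.
print_overlay_reset_plan() {
    node="$1"
    cat <<EOF
oc adm cordon $node
oc adm drain $node --ignore-daemonsets --delete-local-data
# on $node, as root:
systemctl stop kubelet
crictl ps -q | xargs -r crictl stop
systemctl stop crio
rm -rf /var/lib/containers/storage/overlay*
systemctl start crio
systemctl start kubelet
# back on the admin host:
oc adm uncordon $node
EOF
}

# Example: print the plan for the affected node from this report.
print_overlay_reset_plan master1.example.com
```

Note that newer oc/kubectl releases rename --delete-local-data to --delete-emptydir-data, so the drain line may need adjusting for the client version in use.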

Version-Release number of selected component (if applicable):
OCP 4.6.21
Node Image:  Red Hat Enterprise Linux CoreOS 46.82.202103050041-0 (Ootpa)


How reproducible:

I haven't been able to reproduce the exact issue, since it was likely caused by multiple ungraceful power-offs of the affected node.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Peter Hunt 2021-04-16 20:08:40 UTC

I believe this is due to the node rebooting before a pull had a chance to sync changes to disk. We've implemented a fix for this in 4.8 (basically automating what you've done manually: check if we synced, and if not wipe the storage directory).

There are still a few bugs with it, however, and we're doing some substantial rework of cri-o's boot flow anyway. I would say don't hold your breath on these bug fixes being backported this far; we need to be very deliberate and make sure nothing breaks.

For instance, we already opted not to backport to 4.7 here: https://bugzilla.redhat.com/show_bug.cgi?id=1922154 (not saying we won't fix, but we need to be deliberate).

Closing this because I found a very similar bug that's still open.

*** This bug has been marked as a duplicate of bug 1942536 ***
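
The idea described in comment 1 ("check if we synced, and if not wipe the storage directory") can be illustrated with a minimal sketch. This is not cri-o's actual implementation: the marker-file convention, the function names wipe_if_unclean and mark_clean_shutdown, and the paths passed in are all hypothetical.

```shell
#!/bin/sh
# Sketch of "check if we synced, and if not wipe the storage directory".
# The marker-file convention and both function names are hypothetical.

# Called at shutdown: flush pending writes, then record that the flush happened.
mark_clean_shutdown() {
    sync
    touch "$1"
}

# Called at startup with the storage directory and the marker path.
wipe_if_unclean() {
    storage_dir="$1"
    marker="$2"
    if [ -f "$marker" ]; then
        # Previous shutdown was clean. Consume the marker so that a crash
        # before the next clean shutdown is detected on the following start.
        rm -f "$marker"
        echo "clean"
    else
        # No marker: assume writes may not have reached disk and drop
        # everything inside the storage directory so it is pulled again.
        if [ -d "$storage_dir" ]; then
            find "$storage_dir" -mindepth 1 -delete
        fi
        echo "wiped"
    fi
}
```

A service wrapper would call wipe_if_unclean before starting the runtime and mark_clean_shutdown on a graceful stop. After an ungraceful power-off the marker is missing, which triggers the wipe automatically instead of requiring the manual cleanup described above.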
