Bug 1950536

Summary: OpenShift pods stay in a "CreateContainerError" status and show the "executable file not found in $PATH" error.
Product: OpenShift Container Platform
Component: Node
Sub component: CRI-O
Version: 4.6
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Reporter: Alfredo Pizarro <apizarro>
Assignee: Peter Hunt <pehunt>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, igreen
Type: Bug
Last Closed: 2021-04-16 20:08:40 UTC

Description Alfredo Pizarro 2021-04-16 19:53:25 UTC
Description of problem:

After an issue with the master nodes during an upgrade from 4.5.16 to 4.6.21 (related to another component), the nodes were powered off multiple times while we tried to stabilize them. Afterwards, one node could not bring up some of the control plane pods:

$ oc get pods 

NAME                                          READY   STATUS                 RESTARTS   AGE
kube-controller-manager-master1.example.com   3/4     CreateContainerError   0          18h
kube-controller-manager-master2.example.com   4/4     Running                0          12m
kube-controller-manager-master3.example.com   4/4     Running                0          18h


$ oc describe pod kube-controller-manager-master1.example.com

Events:
  Type     Reason  Age                     From     Message
  ----     ------  ----                    ----     -------
  Warning  Failed  59m (x8706 over 17h)    kubelet  (combined from similar events): Error: container create failed: time="2021-04-11T21:03:07Z" level=error msg="container_linux.go:366: starting container process caused: exec: \"cluster-kube-scheduler-operator\": executable file not found in $PATH"
  Normal   Pulled  4m10s (x4600 over 17h)  kubelet  Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6468c1dd1ca2d855e171dda54efcb56b8915ba65f9b915899d922c8720d8e7e1" already present on machine.

We tried podman rmi and podman pull on the problematic image (sketched below), but that did not resolve the issue.
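For reference, a sketch of the commands attempted, using the image digest from the pod events above:

podman rmi quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6468c1dd1ca2d855e171dda54efcb56b8915ba65f9b915899d922c8720d8e7e1
podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6468c1dd1ca2d855e171dda54efcb56b8915ba65f9b915899d922c8720d8e7e1

After some more troubleshooting we ran the image directly with podman: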

[root@master1 ~]# podman run 2810ace6e1fe

and obtained the following error:

readlink /var/lib/containers/storage/overlay: invalid argument

It seems the overlayfs storage somehow got corrupted, so as a workaround we:
- Cordoned and drained the node
- Stopped all containers, CRI-O, and the kubelet
- Deleted /var/lib/containers/storage/overlay*
- Started the services again; all images were pulled correctly and the issue was fixed (see the sketch below).
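A minimal sketch of that workaround, assuming the node name master1.example.com and 4.6-era command flags; exact flags and service unit names may differ on your version:

# From a machine with cluster-admin access:
oc adm cordon master1.example.com
oc adm drain master1.example.com --ignore-daemonsets --force

# On the affected node, as root:
systemctl stop kubelet
systemctl stop crio
rm -rf /var/lib/containers/storage/overlay*
systemctl start crio
systemctl start kubelet

# Once the node is healthy again:
oc adm uncordon master1.example.com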

Version-Release number of selected component (if applicable):
OCP 4.6.21
Node Image:  Red Hat Enterprise Linux CoreOS 46.82.202103050041-0 (Ootpa)


How reproducible:

I haven't been able to reproduce the exact issue, since it was likely caused by multiple ungraceful power-offs of the affected node.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Peter Hunt 2021-04-16 20:08:40 UTC
I believe this is due to the node rebooting before a pull had a chance to sync its changes to disk. We've implemented a fix for this in 4.8 (basically automating what you did manually: check whether we synced, and if not, wipe the storage directory).
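For illustration only, the general shape of that check-and-wipe logic as a boot-time shell step; the marker path here is hypothetical, not CRI-O's actual implementation:

# Written during a clean shutdown, after storage writes are synced:
MARKER=/var/lib/crio/clean.shutdown    # hypothetical marker path
touch "$MARKER" && sync

# Checked at boot, before the container runtime starts:
if [ -e "$MARKER" ]; then
    rm -f "$MARKER"    # last shutdown was clean; storage is consistent
else
    echo "unclean shutdown detected, wiping container storage"
    rm -rf /var/lib/containers/storage/overlay*
fi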

There are still a few bugs with it, however, and we're doing some substantial reworks of CRI-O's boot flow anyway. I would say don't hold your breath on these bug fixes being backported this far; we need to be very deliberate and make sure nothing could break.

For instance, we already opted not to backport it to 4.7 here: https://bugzilla.redhat.com/show_bug.cgi?id=1922154 (not saying we won't fix it, but we need to be deliberate).

Closing this because I found a very similar bug that's still open.

*** This bug has been marked as a duplicate of bug 1942536 ***