Bug 1993243 - Error when starting a container "Stat /var/lib/containers/storage/overlay/XXX: no such file or directory"
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-12 15:41 UTC by Mat Kowalski
Modified: 2023-03-09 01:05 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-09 01:05:24 UTC
Target Upstream Version:
Embargoed:
mko: needinfo-
pehunt: needinfo? (gscrivan)
pehunt: needinfo-
pehunt: needinfo-
tasano: needinfo-



Description Mat Kowalski 2021-08-12 15:41:23 UTC
+++ Description of problem:

After a reboot, the following error appears when starting a container:

```
Stat /var/lib/containers/storage/overlay/9d5eaa43e868265191761d09f2aabfbacba7965313a4f84cfb6aff933979ac17:
          no such file or directory'
```

```
  containerStatuses:
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1294e206f477da38cdf954e101f504ea95d4996b2cf679eaea83a02c8ef350d8
    imageID: ""
    lastState: {}
    name: webhook
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: 'error creating read-write layer with ID "4132d03a41d0653b15af1b40e03cf3d9ee82d8a6e5288e63ecf1a85a39877844":
          Stat /var/lib/containers/storage/overlay/9d5eaa43e868265191761d09f2aabfbacba7965313a4f84cfb6aff933979ac17:
          no such file or directory'
        reason: CreateContainerError
```
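The 64-character hex digest in the message is the overlay layer directory that CRI-O could not stat. For triage it can be pulled out of the pod status mechanically; the sketch below is illustrative only (the `msg` variable is just the sample message from the status above — on a live cluster you would feed in the value from `oc get pod <name> -o jsonpath='{.status.containerStatuses[0].state.waiting.message}'`):

```shell
#!/usr/bin/env bash
# Illustrative helper: extract the missing overlay-layer digest from a
# CreateContainerError message. msg is the sample status text from above.
msg='error creating read-write layer with ID "4132d03a41d0653b15af1b40e03cf3d9ee82d8a6e5288e63ecf1a85a39877844": Stat /var/lib/containers/storage/overlay/9d5eaa43e868265191761d09f2aabfbacba7965313a4f84cfb6aff933979ac17: no such file or directory'

# Capture the 64-hex-char component that follows the overlay path.
layer=$(printf '%s\n' "$msg" | sed -n 's|.*/var/lib/containers/storage/overlay/\([0-9a-f]\{64\}\).*|\1|p')
echo "missing layer: $layer"
# On the affected node, `ls /var/lib/containers/storage/overlay/$layer`
# should then confirm whether the directory is really gone.
```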

+++ Version-Release number of selected component (if applicable):

OCP 4.8.2
RHCOS 48.84.202107202156-0
CRI-O 1.21.2-5.rhaos4.8.gitb27d974.el8


+++ How reproducible:

The issue occurs at random.

+++ Additional info:

Related issues from Podman
* https://bugzilla.redhat.com/show_bug.cgi?id=1966872
* https://bugzilla.redhat.com/show_bug.cgi?id=1921128

Comment 2 Peter Hunt 2021-08-12 20:24:38 UTC
this is weird, I would not expect that directory to not be there... is there a cluster showing this problem I could hop onto and debug?

Comment 3 Mat Kowalski 2021-08-16 07:29:56 UTC
I don't currently have a cluster with this issue (it has only been observed when customers tried to install OCP), but I will try to get one available for us as soon as we see the issue again. Maybe @sgrunert can give some insight into what was happening (given that a fix in CRI-O based on the Podman bugfix has already been created in the past)?

Comment 4 Peter Hunt 2021-09-09 17:57:49 UTC
Is this still an issue?

Comment 5 Weinan Liu 2021-09-13 02:38:47 UTC
(In reply to Peter Hunt from comment #4)
> Is this still an issue?

@

Comment 6 Weinan Liu 2021-09-13 02:41:34 UTC
Hi Peter, 
I think so, yes: @jiwei hit it twice on a 4.8.10 nightly. Would you mind letting us know which logs to collect?

Thanks,

Comment 7 Peter Hunt 2021-09-13 13:57:42 UTC
can I have the CRI-O logs from an affected node?

Comment 12 Peter Hunt 2021-09-16 18:01:44 UTC
Steps I have done:
- found a way to identify the image that is messed up: `for image in $(podman images -q); do podman inspect $image >/dev/null || echo $image; done`
- found a way to find which image pull failed: `podman rmi $image` (will print the sha value)

Steps I have not done:
- figured out what went wrong. I have the log snippet between pulling the container, apparently successfully finishing that pull, and attempting to create the container (and failing). It does not illuminate what went wrong though. I suspect we are incompletely pulling the image, as there is no discernible reboot that could be causing this inconsistency
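The scan in the first step can be wrapped into a small script. Since podman itself can't be assumed here, `list_images` and `inspect_image` below are stand-ins for `podman images -q` and `podman inspect "$id" >/dev/null`; this is a sketch of the pattern, not a drop-in tool:

```shell
#!/usr/bin/env bash
# Sketch of the corruption scan from the steps above. On a real node,
# replace list_images with `podman images -q` and inspect_image with
# `podman inspect "$1" >/dev/null 2>&1`.
list_images()   { printf '%s\n' 1a2b3c 4d5e6f 7a8b9c; }  # fake image IDs
inspect_image() { [ "$1" != "4d5e6f" ]; }                # pretend one is corrupt

bad=""
for image in $(list_images); do
  inspect_image "$image" || bad="$bad $image"   # collect IDs whose inspect fails
done
echo "corrupted:$bad"
```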

It stinks to have to ask again, but can we recreate this situation, this time with the CRI-O log level set to debug? One could create a ContainerRuntimeConfig that enables CRI-O debug logs:
```
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-log-level
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug
```

and then pass this file to the installation manifests like in https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-customizing.html
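For the day-1 variant, the usual flow (as I understand the linked doc; the directory layout matters, while `INSTALL_DIR` and the file name below are placeholders) is to run `openshift-install create manifests`, drop the CR into the `openshift/` subdirectory, and then run `create cluster`. Rehearsed here against a scratch directory so the layout is visible:

```shell
#!/usr/bin/env bash
# Day-1 placement sketch. INSTALL_DIR and the file name are placeholders;
# on a real install, `openshift-install create manifests --dir $INSTALL_DIR`
# creates the openshift/ subdirectory instead of the mkdir below.
INSTALL_DIR=$(mktemp -d)
mkdir -p "$INSTALL_DIR/openshift"
cat > "$INSTALL_DIR/openshift/99-worker-crio-loglevel.yaml" <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-log-level
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug
EOF
ls "$INSTALL_DIR/openshift"
# then: openshift-install create cluster --dir "$INSTALL_DIR"
```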

Comment 13 Jianli Wei 2021-09-17 09:15:50 UTC
@pehunt 

>According to your suggestion above, we tried 2 scenarios today: 

(1) We created the CRs (.yaml files) as manifests, then ran the OCP installation: the installation failed with "Cluster operator machine-config Available is False with : Cluster not available for 4.8.10", and the error was 'machineconfig.machineconfiguration.openshift.io \"rendered-master-0d5d14b6abb8058b77276c05dc729097\" not found'. Do you know whether such CRs are supported during bootstrap/installation?

(2) We did the OCP installation, then created the CRs (.yaml files) and applied them with "oc create -f <the yaml file>": we did get one pod with CreateContainerError; please check the CRI-O debug logs on the node "jiwei-vv-master-2".

ssh -i openshift-qe.pem root.162.99
cd working-dir/
export KUBECONFIG=/root/working-dir/install-dir/auth/kubeconfig 
export PATH=$PATH:/root/working-dir/

[root@jiwei-vv-rhel7-bastion working-dir]# ls 99-* -l
-rw-r--r--. 1 root root 274 Sep 17 15:10 99-master-set-container-log.yaml
-rw-r--r--. 1 root root 281 Sep 17 15:28 99-worker-set-container-log.yaml
[root@jiwei-vv-rhel7-bastion working-dir]# oc create -f 99-master-set-container-log.yaml
[root@jiwei-vv-rhel7-bastion working-dir]# oc create -f 99-worker-set-container-log.yaml
>...<wait until UPDATED True>...
[root@jiwei-vv-rhel7-bastion working-dir]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-11482d77778b0ab83816982fea896ec1   True      False      False      3              3                   3                     0                      118m
worker   rendered-worker-05dc70e7c14b52184d7baad99fa238b6   True      False      False      2              2                   2                     0                      118m
[root@jiwei-vv-rhel7-bastion working-dir]# 

>then "reboot" jiwei-vv-master-2

[root@jiwei-vv-rhel7-bastion working-dir]# oc get pods -o wide -n openshift-multus | grep -v Running
NAME                                  READY   STATUS                      RESTARTS   AGE    IP             NODE                NOMINATED NODE   READINESS GATES
multus-additional-cni-plugins-nbqsk   0/1     Init:CreateContainerError   3          123m   172.16.1.103   jiwei-vv-master-2   <none>           <none>
[root@jiwei-vv-rhel7-bastion working-dir]# oc logs multus-additional-cni-plugins-nbqsk -n openshift-multus
Error from server (BadRequest): container "kube-multus-additional-cni-plugins" in pod "multus-additional-cni-plugins-nbqsk" is waiting to start: PodInitializing
[root@jiwei-vv-rhel7-bastion working-dir]# 


[root@jiwei-vv-master-2 core]# journalctl -b -f -u kubelet.service 
...
Sep 17 08:35:23 jiwei-vv-master-2 hyperkube[1723]: E0917 08:35:23.834875    1723 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cni-plugins\" with CreateContainerError: \"error creating read-write layer with ID \\\"a6b8252492da72fccb69e53f3efe114a5892987b3853b2c6d8ce69ac0419a80b\\\": Stat /var/lib/containers/storage/overlay/cdaf9ce9b7f812cb489f883d17778e35bff417931221d8b803a449a60f503067: no such file or directory\"" pod="openshift-multus/multus-additional-cni-plugins-nbqsk" podUID=2810ee09-943a-4db9-b4d9-4384d3337876
^C
[root@jiwei-vv-master-2 core]#

Comment 14 Peter Hunt 2021-09-22 19:30:49 UTC
Sorry didn't get a chance to look at this. It seems the node you used to use as a jump host is down. Are you able to recreate the situation for me again?

Comment 15 Jianli Wei 2021-09-24 08:39:19 UTC
(In reply to Peter Hunt from comment #14)
> Sorry didn't get a chance to look at this. It seems the node you used to use
> as a jump host is down. Are you able to recreate the situation for me again?

@pehunt We tried again today: the first attempt was 4.8.13, which succeeded, and the second was 4.8.10, which hit the issue again.

(1) 4.8.13 flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/44159/ 

(2) 4.8.10 flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/44204/

>The cluster is still there; please ssh to the bastion first, e.g.
ssh -i openshift-qe.pem root.133.49
cd working-dir/
export KUBECONFIG=/root/working-dir/install-dir/auth/kubeconfig 
export PATH=$PATH:/root/working-dir/

[root@jiwei-bug1993243-rhel7-bastion working-dir]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          70m     Unable to apply 4.8.10: the cluster operator operator-lifecycle-manager has not yet successfully rolled out
[root@jiwei-bug1993243-rhel7-bastion working-dir]# oc get pods --all-namespaces -o wide | grep -Ev 'Running|Completed'
NAMESPACE                                          NAME                                                      READY   STATUS                 RESTARTS   AGE   IP            NODE                        NOMINATED NODE   READINESS GATES
openshift-operator-lifecycle-manager               catalog-operator-6b799568b4-4f9m6                         0/1     CreateContainerError   0          70m   10.128.0.30   jiwei-bug1993243-master-0   <none>           <none>
openshift-operator-lifecycle-manager               olm-operator-6fcb8fb89b-8nn94                             0/1     CreateContainerError   0          70m   10.128.0.28   jiwei-bug1993243-master-0   <none>           <none>
[root@jiwei-bug1993243-rhel7-bastion working-dir]#

Comment 16 Peter Hunt 2021-09-24 16:58:55 UTC
I still unfortunately need debug logs set in cri-o when the node is bootstrapping so I can see what has happened to the image. It doesn't seem this node has such logs :(

> Do you know, if such CRs is supported by bootstrap or during installation?  

I think so, as they're managed by the same controller that handles day-1 kernel args (MCO) https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-customizing.html#installation-special-config-kargs_installing-customizing

Comment 17 Jianli Wei 2021-09-26 09:03:14 UTC
(In reply to Peter Hunt from comment #16)
> I still unfortunately need debug logs set in cri-o when the node is
> bootstrapping so I can see what has happened to the image. It doesn't seem
> this node has such logs :(
> 
> > Do you know, if such CRs is supported by bootstrap or during installation?  
> 
> I think so, as they're managed by the same controller than handles day 1
> kernel args (MCO)
> https://docs.openshift.com/container-platform/4.3/installing/install_config/
> installing-customizing.html#installation-special-config-kargs_installing-
> customizing

@pehunt We retried on GCP today to double-confirm whether CRs for CRI-O are supported during bootstrap/installation; the installation eventually failed with the info below. If possible, please advise, thanks!

INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator machine-config Progressing is True with : Working towards 4.8.10
ERROR Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.8.10: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
INFO Cluster operator machine-config Available is False with : Cluster not available for 4.8.10
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Cluster operator machine-config is not available

>FYI some additional info: 

>(1) the yaml files of these CRS we used:

[jiwei@jiwei ocp_lab]$ more bak/openshift/*-crio-*
::::::::::::::
bak/openshift/99_openshift-machineconfig-master-crio-args.yaml
::::::::::::::
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
 name: master-crio-args
spec:
 machineConfigPoolSelector:
   matchLabels:
     pools.operator.machineconfiguration.openshift.io/master: ""
 containerRuntimeConfig:
   logLevel: debug

::::::::::::::
bak/openshift/99_openshift-machineconfig-worker-crio-args.yaml
::::::::::::::
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
 name: worker-crio-args
spec:
 machineConfigPoolSelector:
   matchLabels:
     pools.operator.machineconfiguration.openshift.io/worker: ""
 containerRuntimeConfig:
   logLevel: debug

[jiwei@jiwei ocp_lab]$ 

>(2) master nodes don't enable log_level debug, although worker nodes do: 

[jiwei@jiwei ocp_lab]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          48m     Unable to apply 4.8.10: the cluster operator machine-config has not yet successfully rolled out
[jiwei@jiwei ocp_lab]$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       3              0                   0                     3                      42m
worker   rendered-worker-1887b09844b695abe33af7e3cfb4b3d4   True      False      False      3              3                   3                     0                      42m
[jiwei@jiwei ocp_lab]$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
00-worker                                          a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
01-master-container-runtime                        a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
01-master-kubelet                                  a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
01-worker-container-runtime                        a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
01-worker-kubelet                                  a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-master-generated-containerruntime-1             a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-master-generated-registries                     a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-master-ssh                                                                                 3.2.0             48m
99-worker-generated-containerruntime-1             a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-worker-generated-registries                     a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-worker-ssh                                                                                 3.2.0             48m
rendered-master-707c105444a18763672eb9d878846cbe   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
rendered-worker-1887b09844b695abe33af7e3cfb4b3d4   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
[jiwei@jiwei ocp_lab]$ 
[jiwei@jiwei ocp_lab]$ oc get nodes
NAME                                                            STATUS   ROLES    AGE   VERSION
jiwei-bug1993243-m5jj7-master-0.c.openshift-qe.internal         Ready    master   34m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-master-1.c.openshift-qe.internal         Ready    master   34m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-master-2.c.openshift-qe.internal         Ready    master   34m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-worker-a-chxlm.c.openshift-qe.internal   Ready    worker   25m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-worker-b-gtcjx.c.openshift-qe.internal   Ready    worker   28m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-worker-c-8bqpv.c.openshift-qe.internal   Ready    worker   27m   v1.21.1+9807387
[jiwei@jiwei ocp_lab]$ oc debug node/jiwei-bug1993243-m5jj7-master-0.c.openshift-qe.internal
Starting pod/jiwei-bug1993243-m5jj7-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.3
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls /etc/crio/crio.conf.d/ -l
total 4
-rw-r--r--. 1 root root 1033 Sep 26 08:04 00-default
sh-4.4# cat /etc/crio/crio.conf.d/00-default | grep log_level
log_level = "info"
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
[jiwei@jiwei ocp_lab]$ 
[jiwei@jiwei ocp_lab]$ oc debug node/jiwei-bug1993243-m5jj7-worker-a-chxlm.c.openshift-qe.internal
Starting pod/jiwei-bug1993243-m5jj7-worker-a-chxlmcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls /etc/crio/crio.conf.d/ -l
total 8
-rw-r--r--. 1 root root 1033 Sep 26 08:12 00-default
-rw-r--r--. 1 root root   48 Sep 26 08:12 01-ctrcfg-logLevel
sh-4.4# cat /etc/crio/crio.conf.d/01-ctrcfg-logLevel | grep log_level
    log_level = "debug"
sh-4.4# exit
exit
sh-4.4# exit
exit

Removing debug pod ...
[jiwei@jiwei ocp_lab]$ 
[jiwei@jiwei ocp_lab]$ oc describe co/machine-config
Name:         machine-config
Namespace:
Labels:       <none>   
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-09-26T08:02:51Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations: 
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:extension:   
          .:
          f:master:
        f:relatedObjects:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-09-26T08:02:51Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:  
        f:extension:   
          .:
          f:master:
          f:worker:
        f:relatedObjects:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2021-09-26T08:53:50Z
  Resource Version:  39415
  UID:               ca5ce37c-5d27-4fea-97ab-9aa62102164c
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-09-26T08:08:44Z
    Message:               Working towards 4.8.10
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-09-26T08:28:13Z
    Message:               Unable to apply 4.8.10: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-09-26T08:08:45Z
    Message:               Cluster not available for 4.8.10
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-09-26T08:10:10Z
    Message:               One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
  Extension:
    Master:  pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node jiwei-bug1993243-m5jj7-master-0.c.openshift-qe.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-26cce7d7b513ad7139c466b52617e589\\\" not found\", Node jiwei-bug1993243-m5jj7-master-1.c.openshift-qe.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-26cce7d7b513ad7139c466b52617e589\\\" not found\", Node jiwei-bug1993243-m5jj7-master-2.c.openshift-qe.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-26cce7d7b513ad7139c466b52617e589\\\" not found\""
    Worker:  all 3 nodes are at latest configuration rendered-worker-1887b09844b695abe33af7e3cfb4b3d4
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes   
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
Events:        <none>  
[jiwei@jiwei ocp_lab]$

Comment 18 Jianli Wei 2021-10-13 06:35:36 UTC
@pehunt We hit the issue once more, this time with a 4.9.0-0.nightly build. If possible, please investigate, thanks!

FYI The flexy-install job which is with 4.9.0-0.nightly-2021-10-13-035504: 
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/47816/

Comment 19 Mat Kowalski 2021-10-21 10:59:48 UTC
We have observed the issue once again with the following message (http://pastebin.test.redhat.com/1003075)

```
  - image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:272d2259952b888e4a1c3777a5d8a5a7d5d5f102b6eb80085fc7e139a79ee151
    imageID: ""
    lastState: {}
    name: cni-plugins
    ready: false
    restartCount: 0
    state:
      waiting:
        message: 'error creating read-write layer with ID "8308dc299dfe1377c5c010c558b16230a7fc1bc91e7bbffc4f30be7192c66162": Stat /var/lib/containers/storage/overlay/4dfe7fd33fb8275620237ee7c7340e3eca04caddb4bc85a037c1ff45b63f9e90: no such file or directory'
        reason: CreateContainerError
```

@tsohlber can be contacted in case access to the environment is helpful in debugging this issue. The OCP version used is 4.8.12. From the AI side we have collected a bit of logs (must-gather is amongst them) in https://issues.redhat.com/browse/AITRIAGE-1758.

Comment 20 Peter Hunt 2021-11-03 18:48:29 UTC
sorry for the delay, is this cluster still available? I'm not sure I'll be able to figure out what happened but it's worth a shot

Comment 21 Mat Kowalski 2021-11-12 12:34:26 UTC
Sorry, we only had this setup for a week, and it has already been quite some time.

Comment 22 Peter Hunt 2021-12-17 21:22:09 UTC
Yeah sorry about that. Unfortunately this is both a hard bug to reproduce and hard to tell what happened when it does reproduce. Possibly in the next sprint I can investigate adding additional logging by default so we can possibly catch the situation and get more information.

Comment 34 Santhiya R 2022-09-23 11:24:52 UTC
Hello @pehunt

Comment 35 Santhiya R 2022-09-23 11:39:44 UTC
Hello @pehunt, I have a customer hitting a similar issue to the one discussed in this Bugzilla while trying to install OCP 4.10.26 on VMware using UPI. Unfortunately, we don't have the cluster at the moment, but we do have the logs that were captured during installation.

~~~
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b224ee3992a37d95ee59d051551a8e2a0471a5af7706264fb7aacd2ebfa0410f
imageID: ""
lastState: {}
name: kube-rbac-proxy
ready: false
restartCount: 0
started: false
state:
  waiting:


message: 'error creating read-write layer with ID "d236321fa75dfec9b28fca5f79100aa7dda16de8b8ab70e2d971bdba77358170":
  Stat /var/lib/containers/storage/overlay/329bad5ee6aed2dde088f825dab9fa65334c951359c459c3e4e37d0e1dcc1514:
  no such file or directory'
reason: CreateContainerError
~~~

Could you please let me know which OpenShift version is targeted to carry the fix for this issue? Also, the customer wants an automated workaround for this issue when it is hit during cluster installation. Please let me know if any further details are required.

cc: arajendr


Comment 37 Santhiya R 2022-09-27 13:00:26 UTC
Hello @pehunt Could you please update on the above query?

Comment 38 Peter Hunt 2022-09-27 13:37:38 UTC
we have neither a target for a fix nor a workaround at the moment. we haven't been able to reproduce reliably enough to figure out the situation in which this arises

Comment 39 Santhiya R 2022-09-28 07:58:03 UTC
Hello @pehunt, please find the below concern from the Customer. 

Is there anything additional we can collect (along with the must-gather, sosreport, namespace/project inspects, node data, Prometheus data dump, and audit logs) from such a failure that would allow you to plan a resolution or workaround? We want to know what went wrong and how it can be worked around so the issue stops appearing.

Comment 40 Peter Hunt 2022-09-30 17:30:06 UTC
I don't know of anything else. The trouble here is multi-fold
1: some entity is removing this directory when they shouldn't, and they're not announcing when it's happening.
2: If we were to instrument every call to remove the directory in the crio binary, and let folks try, there's no saying we'll reproduce it.

This is the worst kind of bug IMO: sporadic, probably due to a race, with no clear reproducer. It may help me to have the must-gather/sos_report that you currently have; I can try digging around in it.

One lead we have is that it has happened more often than expected on Assisted Installer installations. Was the one you found such an installation?

Comment 41 Peter Hunt 2022-09-30 17:45:18 UTC
for tracking, there is some work to make the container/storage library locking better optimized for cri-o: https://github.com/containers/storage/issues/1332
It's my hope that that work ties up the issues here

Comment 43 Santhiya R 2022-10-03 06:08:13 UTC
Hello @pehunt, Please find the below link containing the cluster Must-Gather and Sosreport.

https://attachments.access.redhat.com/hydra/rest/cases/03316257/attachments/93c5c6f4-09fb-4d6f-b02b-4526a55de384?usePresignedUrl=true

> One lead we have is that it has happened more often than expected on Assisted Installer installations. Was the one you found such an installation?

Here, the customer tried a cluster installation of OCP 4.10.26 on VMware using the UPI method.

Comment 44 Santhiya R 2022-10-05 05:37:06 UTC
Hello @pehunt, Could you please update?

Comment 45 Peter Hunt 2022-10-05 15:53:32 UTC
@nalin is working on a patch to the storage library to detect situations where the storage is corrupted like this. Once that's done, I'll work on a PR to be able to catch situations like this and repair the image (if possible). In the meantime, @mitr is working on his set of PRs that will help the storage library be more robust. I'm afraid there isn't going to be an easy fix, and it will take a while to propagate into openshift.
everyone's patience is appreciated

Comment 46 Santhiya R 2022-10-11 05:16:06 UTC
Hello @pehunt, I understand it's a hard bug to reproduce and to find the exact cause of. The customer mentioned that they really need to understand the expected timeline for this. Is there any way we could make an estimate here?

Comment 47 Peter Hunt 2022-10-11 13:44:17 UTC
No, because there's no guarantee it'd be correct. I am not comfortable telling a customer a timeline without certainty that the timeline will be respected.

Comment 48 Santhiya R 2022-12-01 09:43:04 UTC
Hello @pehunt Is there any update on this Bugzilla? Please let me know if there is any additional input I can provide to weigh this Bugzilla with higher priority.

Comment 53 Santhiya R 2022-12-14 04:57:57 UTC
Hello @pehunt Is there any update on this Bugzilla? The customer is concerned about the Bugzilla progress.

Comment 61 mbudnick 2023-01-06 10:01:12 UTC
After a power outage, one of my 4.8.0-0.okd-2021-11-14-052418 masters cannot start 17 of its 29 pods, with a similar message:

Failed to pull image "quay.io/openshift/okd-content@sha256:459f15f0e457edaf04fa1a44be6858044d9af4de276620df46dc91a565ddb4ec": rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:8523d6fd474185cca7ea077e7df87aca17a30c041cd4a02f379c7774a20a3dd1": error creating layer with ID "8ece14562d413a6f861625e8bb22ffa8e1ce933941ce23229da056697f08ba4e": Stat /var/lib/containers/storage/overlay/304a849950b39c8f546f2a914fa3d3c1c6425ea03d15baa53701ef68a95e0f33/diff: no such file or directory

If I delete the pod, the new one has the same problem.

@pehunt Can I help find the problem by collecting logs?

Comment 62 Peter Hunt 2023-01-09 18:18:12 UTC
that one is conceptually a bit simpler: I bet an image pull was in progress when the node unexpectedly shut down. In newer OpenShift releases, CRI-O automatically removes `/var/lib/containers` on startup after an unexpected node shutdown to prevent this. I don't know if there are logs that could be gathered to help the case where a node is just running and this happens.
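The wipe behavior described here is driven by `crio wipe` together with a clean-shutdown marker file. The exact knob below is my recollection of the CRI-O configuration option, so treat the name and default as assumptions to verify against the installed crio.conf:

```
# sketch of an /etc/crio/crio.conf.d/ drop-in; option name assumed
[crio]
# CRI-O touches this file on clean shutdown; if it is absent at the next
# startup, the shutdown is treated as unclean and storage may be wiped.
clean_shutdown_file = "/var/lib/crio/clean.shutdown"
```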

Comment 64 mbudnick 2023-01-10 10:43:03 UTC
Thanks, removing `/var/lib/containers` helped.
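For anyone else landing here, the manual recovery mbudnick used can be rehearsed safely against a scratch directory; `STORAGE_ROOT` below is a stand-in for `/var/lib/containers`, and the commented `systemctl` steps are what would bracket the removal on a real node:

```shell
#!/usr/bin/env bash
# Rehearsal of the manual recovery against a scratch directory.
# STORAGE_ROOT stands in for /var/lib/containers on a real node.
STORAGE_ROOT=$(mktemp -d)
mkdir -p "$STORAGE_ROOT/storage/overlay/0123abcd"   # fake corrupted layer dir

# real node: systemctl stop kubelet crio   (stop consumers first)
rm -rf "$STORAGE_ROOT/storage"
# real node: systemctl start crio kubelet  (images are re-pulled on demand)

[ -d "$STORAGE_ROOT/storage" ] || echo "storage cleared"
```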

Comment 67 Shiftzilla 2023-03-09 01:05:24 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-8939

