+++ Description of problem:

After the reboot, when starting a container the following error appears:

```
Stat /var/lib/containers/storage/overlay/9d5eaa43e868265191761d09f2aabfbacba7965313a4f84cfb6aff933979ac17: no such file or directory
```

```
containerStatuses:
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1294e206f477da38cdf954e101f504ea95d4996b2cf679eaea83a02c8ef350d8
  imageID: ""
  lastState: {}
  name: webhook
  ready: false
  restartCount: 0
  started: false
  state:
    waiting:
      message: 'error creating read-write layer with ID "4132d03a41d0653b15af1b40e03cf3d9ee82d8a6e5288e63ecf1a85a39877844": Stat /var/lib/containers/storage/overlay/9d5eaa43e868265191761d09f2aabfbacba7965313a4f84cfb6aff933979ac17: no such file or directory'
      reason: CreateContainerError
```

+++ Version-Release number of selected component (if applicable):

OCP 4.8.2
RHCOS 48.84.202107202156-0
CRI-O 1.21.2-5.rhaos4.8.gitb27d974.el8

+++ How reproducible:

The issue happens at random.

+++ Additional info:

Related issues from Podman:
* https://bugzilla.redhat.com/show_bug.cgi?id=1966872
* https://bugzilla.redhat.com/show_bug.cgi?id=1921128
this is weird, I would not expect that directory to not be there... is there a cluster showing this problem I could hop onto and debug?
I don't have a cluster with this issue at the moment (it's only been observed when customers tried to install OCP), but I will try to get one available for us as soon as we see the issue again. Maybe @sgrunert can give some insights on what was happening (given that a fix in CRI-O based on the Podman bugfix has already been created in the past)?
Is this still an issue?
(In reply to Peter Hunt from comment #4)
> Is this still an issue?

Hi Peter, I think so, yes. @jiwei hit it twice on a 4.8.10 nightly. Would you mind letting us know what logs to collect? Thanks.
can I have the CRI-O logs from an affected node?
Steps I have done:
- found a way to identify the image that is messed up:
  `for image in $(podman images -q); do podman inspect $image >/dev/null || echo $image; done`
- found a way to find which image pull failed: `podman rmi $image` (will print the sha value)

Steps I have not done:
- figured out what went wrong.

I have the log snippet between pulling the container, apparently successfully finishing that pull, and attempting to create the container (and failing). It does not illuminate what went wrong, though. I suspect we are incompletely pulling the image, as there is no discernible reboot that could be causing this inconsistency.

It stinks to have to ask again, but can we recreate this situation, this time with the cri-o log level at debug? One could create a machine config that enables cri-o debug logs:

```
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug
```

and then pass this file to the installation manifests as in https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-customizing.html
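For convenience, the two detection steps above can be wrapped into one pass. This is only a sketch: the `PODMAN` variable is an assumption added so the loop can be pointed at a stub for testing; the real default is the `podman` binary.

```shell
# Sketch: walk all local images and report the ones whose storage is
# corrupted (podman inspect fails on them). PODMAN defaults to the real
# podman binary; it is parameterized only so the loop can be exercised
# against a stub.
find_corrupted_images() {
    PODMAN="${PODMAN:-podman}"
    for image in $($PODMAN images -q); do
        if ! $PODMAN inspect "$image" >/dev/null 2>&1; then
            echo "corrupted: $image"
        fi
    done
}

# For each reported image, `podman rmi <image>` would then print the
# sha of the failing layer, as described above.
```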
@pehunt According to your suggestion above, we tried 2 scenarios today:

(1) Create the CRs (.yaml files) as manifests, then do the OCP installation: the installation failed due to "Cluster operator machine-config Available is False with : Cluster not available for 4.8.10", and the error is 'machineconfig.machineconfiguration.openshift.io "rendered-master-0d5d14b6abb8058b77276c05dc729097" not found'. Do you know if such CRs are supported by bootstrap or during installation?

(2) Do the OCP installation, then create the CRs (.yaml files) and run "oc create -f <the yaml file>" to apply them: we did get one pod with CreateContainerError; please check the crio debug logs on the node "jiwei-vv-master-2".

```
ssh -i openshift-qe.pem root.162.99
cd working-dir/
export KUBECONFIG=/root/working-dir/install-dir/auth/kubeconfig
export PATH=$PATH:/root/working-dir/

[root@jiwei-vv-rhel7-bastion working-dir]# ls 99-* -l
-rw-r--r--. 1 root root 274 Sep 17 15:10 99-master-set-container-log.yaml
-rw-r--r--. 1 root root 281 Sep 17 15:28 99-worker-set-container-log.yaml
[root@jiwei-vv-rhel7-bastion working-dir]# oc create -f 99-master-set-container-log.yaml
[root@jiwei-vv-rhel7-bastion working-dir]# oc create -f 99-worker-set-container-log.yaml
```

...then wait until UPDATED is True:
```
[root@jiwei-vv-rhel7-bastion working-dir]# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-11482d77778b0ab83816982fea896ec1   True      False      False      3              3                   3                     0                      118m
worker   rendered-worker-05dc70e7c14b52184d7baad99fa238b6   True      False      False      2              2                   2                     0                      118m
```

Then "reboot" jiwei-vv-master-2:

```
[root@jiwei-vv-rhel7-bastion working-dir]# oc get pods -o wide -n openshift-multus | grep -v Running
NAME                                  READY   STATUS                      RESTARTS   AGE    IP             NODE                NOMINATED NODE   READINESS GATES
multus-additional-cni-plugins-nbqsk   0/1     Init:CreateContainerError   3          123m   172.16.1.103   jiwei-vv-master-2   <none>           <none>
[root@jiwei-vv-rhel7-bastion working-dir]# oc logs multus-additional-cni-plugins-nbqsk -n openshift-multus
Error from server (BadRequest): container "kube-multus-additional-cni-plugins" in pod "multus-additional-cni-plugins-nbqsk" is waiting to start: PodInitializing
```

```
[root@jiwei-vv-master-2 core]# journalctl -b -f -u kubelet.service
...
Sep 17 08:35:23 jiwei-vv-master-2 hyperkube[1723]: E0917 08:35:23.834875 1723 pod_workers.go:190] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cni-plugins\" with CreateContainerError: \"error creating read-write layer with ID \\\"a6b8252492da72fccb69e53f3efe114a5892987b3853b2c6d8ce69ac0419a80b\\\": Stat /var/lib/containers/storage/overlay/cdaf9ce9b7f812cb489f883d17778e35bff417931221d8b803a449a60f503067: no such file or directory\"" pod="openshift-multus/multus-additional-cni-plugins-nbqsk" podUID=2810ee09-943a-4db9-b4d9-4384d3337876
^C
```
Sorry, I didn't get a chance to look at this. It seems the node you used as a jump host is down. Are you able to recreate the situation for me again?
(In reply to Peter Hunt from comment #14)
> Sorry didn't get a chance to look at this. It seems the node you used to use
> as a jump host is down. Are you able to recreate the situation for me again?

@pehunt We tried again today: the first attempt was on 4.8.13, which succeeded, and the second was on 4.8.10, which hit the issue again.

(1) 4.8.13 flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/44159/
(2) 4.8.10 flexy-install job: https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/44204/

The cluster is still there; please ssh to the bastion first, e.g.:

```
ssh -i openshift-qe.pem root.133.49
cd working-dir/
export KUBECONFIG=/root/working-dir/install-dir/auth/kubeconfig
export PATH=$PATH:/root/working-dir/

[root@jiwei-bug1993243-rhel7-bastion working-dir]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          70m     Unable to apply 4.8.10: the cluster operator operator-lifecycle-manager has not yet successfully rolled out
[root@jiwei-bug1993243-rhel7-bastion working-dir]# oc get pods --all-namespaces -o wide | grep -Ev 'Running|Completed'
NAMESPACE                              NAME                                READY   STATUS                 RESTARTS   AGE   IP            NODE                        NOMINATED NODE   READINESS GATES
openshift-operator-lifecycle-manager   catalog-operator-6b799568b4-4f9m6   0/1     CreateContainerError   0          70m   10.128.0.30   jiwei-bug1993243-master-0   <none>           <none>
openshift-operator-lifecycle-manager   olm-operator-6fcb8fb89b-8nn94       0/1     CreateContainerError   0          70m   10.128.0.28   jiwei-bug1993243-master-0   <none>           <none>
```
I still unfortunately need debug logs set in cri-o when the node is bootstrapping, so I can see what has happened to the image. It doesn't seem this node has such logs :(

> Do you know, if such CRs is supported by bootstrap or during installation?

I think so, as they're managed by the same controller that handles day-1 kernel args (the MCO):
https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-customizing.html#installation-special-config-kargs_installing-customizing
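The day-1 flow from that doc can be sketched as below. This is a dry-run sketch only: `99-master-crio-loglevel.yaml` is a hypothetical filename standing in for the ContainerRuntimeConfig manifest shown earlier in this thread, and `RUN=echo` (the default here) only prints the commands instead of executing them.

```shell
# Sketch of dropping a ContainerRuntimeConfig into the installer's
# day-1 manifests. RUN=echo (the default) prints the commands instead
# of executing them; clear RUN to actually run openshift-install.
apply_day1_crio_config() {
    RUN="${RUN:-echo}"
    dir="${1:-install-dir}"
    $RUN openshift-install create manifests --dir "$dir"
    # hypothetical filename for the ContainerRuntimeConfig CR
    $RUN cp 99-master-crio-loglevel.yaml "$dir/openshift/"
    $RUN openshift-install create cluster --dir "$dir"
}
```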
(In reply to Peter Hunt from comment #16)
> I still unfortunately need debug logs set in cri-o when the node is
> bootstrapping so I can see what has happened to the image. It doesn't seem
> this node has such logs :(
>
> > Do you know, if such CRs is supported by bootstrap or during installation?
>
> I think so, as they're managed by the same controller than handles day 1
> kernel args (MCO)
> https://docs.openshift.com/container-platform/4.3/installing/install_config/installing-customizing.html#installation-special-config-kargs_installing-customizing

@pehunt We retried on GCP today, to double-confirm whether CRs for CRI-O are supported by bootstrap or during installation. The installation finally failed with the info below. If possible, please advise, thanks!

```
INFO Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform
INFO Cluster operator insights Disabled is False with AsExpected:
INFO Cluster operator machine-config Progressing is True with : Working towards 4.8.10
ERROR Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.8.10: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
INFO Cluster operator machine-config Available is False with : Cluster not available for 4.8.10
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
```
```
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
FATAL failed to initialize the cluster: Cluster operator machine-config is not available
```

FYI some additional info:

(1) the yaml files of the CRs we used:

```
[jiwei@jiwei ocp_lab]$ more bak/openshift/*-crio-*
::::::::::::::
bak/openshift/99_openshift-machineconfig-master-crio-args.yaml
::::::::::::::
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: master-crio-args
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/master: ""
  containerRuntimeConfig:
    logLevel: debug
::::::::::::::
bak/openshift/99_openshift-machineconfig-worker-crio-args.yaml
::::::::::::::
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: worker-crio-args
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug
```

(2) master nodes don't enable log_level debug, although worker nodes do:

```
[jiwei@jiwei ocp_lab]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          48m     Unable to apply 4.8.10: the cluster operator machine-config has not yet successfully rolled out
[jiwei@jiwei ocp_lab]$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       3              0                   0                     3                      42m
worker   rendered-worker-1887b09844b695abe33af7e3cfb4b3d4   True      False      False      3              3                   3                     0                      42m
[jiwei@jiwei ocp_lab]$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
00-worker                                          a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
01-master-container-runtime                        a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
01-master-kubelet                                  a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
01-worker-container-runtime                        a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
01-worker-kubelet                                  a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-master-generated-containerruntime-1             a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-master-generated-registries                     a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-master-ssh                                                                                 3.2.0             48m
99-worker-generated-containerruntime-1             a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-worker-generated-registries                     a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
99-worker-ssh                                                                                 3.2.0             48m
rendered-master-707c105444a18763672eb9d878846cbe   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
rendered-worker-1887b09844b695abe33af7e3cfb4b3d4   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             41m
[jiwei@jiwei ocp_lab]$ oc get nodes
NAME                                                            STATUS   ROLES    AGE   VERSION
jiwei-bug1993243-m5jj7-master-0.c.openshift-qe.internal         Ready    master   34m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-master-1.c.openshift-qe.internal         Ready    master   34m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-master-2.c.openshift-qe.internal         Ready    master   34m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-worker-a-chxlm.c.openshift-qe.internal   Ready    worker   25m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-worker-b-gtcjx.c.openshift-qe.internal   Ready    worker   28m   v1.21.1+9807387
jiwei-bug1993243-m5jj7-worker-c-8bqpv.c.openshift-qe.internal   Ready    worker   27m   v1.21.1+9807387
[jiwei@jiwei ocp_lab]$ oc debug node/jiwei-bug1993243-m5jj7-master-0.c.openshift-qe.internal
Starting pod/jiwei-bug1993243-m5jj7-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.3
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls /etc/crio/crio.conf.d/ -l
total 4
-rw-r--r--. 1 root root 1033 Sep 26 08:04 00-default
sh-4.4# cat /etc/crio/crio.conf.d/00-default | grep log_level
log_level = "info"
sh-4.4# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
[jiwei@jiwei ocp_lab]$ oc debug node/jiwei-bug1993243-m5jj7-worker-a-chxlm.c.openshift-qe.internal
Starting pod/jiwei-bug1993243-m5jj7-worker-a-chxlmcopenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.4
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ls /etc/crio/crio.conf.d/ -l
total 8
-rw-r--r--. 1 root root 1033 Sep 26 08:12 00-default
-rw-r--r--. 1 root root 48 Sep 26 08:12 01-ctrcfg-logLevel
sh-4.4# cat /etc/crio/crio.conf.d/01-ctrcfg-logLevel | grep log_level
log_level = "debug"
sh-4.4# exit
exit
sh-4.4# exit
exit
Removing debug pod ...
[jiwei@jiwei ocp_lab]$ oc describe co/machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-09-26T08:02:51Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:extension:
          .:
          f:master:
        f:relatedObjects:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-09-26T08:02:51Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
    Manager:      machine-config-operator
    Operation:    Update
    Time:         2021-09-26T08:53:50Z
  Resource Version:  39415
  UID:               ca5ce37c-5d27-4fea-97ab-9aa62102164c
Spec:
```
```
Status:
  Conditions:
    Last Transition Time:  2021-09-26T08:08:44Z
    Message:               Working towards 4.8.10
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-09-26T08:28:13Z
    Message:               Unable to apply 4.8.10: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 3)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-09-26T08:08:45Z
    Message:               Cluster not available for 4.8.10
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-09-26T08:10:10Z
    Message:               One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
  Extension:
    Master:  pool is degraded because nodes fail with "3 nodes are reporting degraded status on sync": "Node jiwei-bug1993243-m5jj7-master-0.c.openshift-qe.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-26cce7d7b513ad7139c466b52617e589\\\" not found\", Node jiwei-bug1993243-m5jj7-master-1.c.openshift-qe.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-26cce7d7b513ad7139c466b52617e589\\\" not found\", Node jiwei-bug1993243-m5jj7-master-2.c.openshift-qe.internal is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-26cce7d7b513ad7139c466b52617e589\\\" not found\""
    Worker:  all 3 nodes are at latest configuration rendered-worker-1887b09844b695abe33af7e3cfb4b3d4
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
Events:  <none>
[jiwei@jiwei ocp_lab]$
```
@pehunt We got the issue once more, this time with 4.9.0-0.nightly. If possible, please investigate, thanks!

FYI the flexy-install job, which used 4.9.0-0.nightly-2021-10-13-035504:
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/47816/
We have observed the issue once again, with the following message (http://pastebin.test.redhat.com/1003075):

```
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:272d2259952b888e4a1c3777a5d8a5a7d5d5f102b6eb80085fc7e139a79ee151
  imageID: ""
  lastState: {}
  name: cni-plugins
  ready: false
  restartCount: 0
  state:
    waiting:
      message: 'error creating read-write layer with ID "8308dc299dfe1377c5c010c558b16230a7fc1bc91e7bbffc4f30be7192c66162": Stat /var/lib/containers/storage/overlay/4dfe7fd33fb8275620237ee7c7340e3eca04caddb4bc85a037c1ff45b63f9e90: no such file or directory'
      reason: CreateContainerError
```

@tsohlber can be contacted in case access to the environment is helpful in debugging this issue. The OCP version used is 4.8.12. From the AI side we have collected some logs (a must-gather is amongst them) in https://issues.redhat.com/browse/AITRIAGE-1758.
sorry for the delay, is this cluster still available? I'm not sure I'll be able to figure out what happened but it's worth a shot
Sorry, we only had this setup for a week, and it's already been quite some time.
Yeah, sorry about that. Unfortunately this is both a hard bug to reproduce and hard to diagnose when it does reproduce. Possibly in the next sprint I can investigate adding additional logging by default so we can catch the situation and get more information.
Hello @pehunt,

I have a customer hitting a similar issue to the one discussed in this Bugzilla, while trying to install OCP 4.10.26 on VMware using UPI. Unfortunately, we don't have the cluster at the moment, yet we have the required logs that were captured during installation.

~~~
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b224ee3992a37d95ee59d051551a8e2a0471a5af7706264fb7aacd2ebfa0410f
imageID: ""
lastState: {}
name: kube-rbac-proxy
ready: false
restartCount: 0
started: false
state:
  waiting:
    message: 'error creating read-write layer with ID "d236321fa75dfec9b28fca5f79100aa7dda16de8b8ab70e2d971bdba77358170": Stat /var/lib/containers/storage/overlay/329bad5ee6aed2dde088f825dab9fa65334c951359c459c3e4e37d0e1dcc1514: no such file or directory'
    reason: CreateContainerError
~~~

Could you please let me know which OpenShift version is targeted to have a fix for this issue? Also, the customer wants an automated workaround to be provided for this kind of issue when it is faced during cluster installation.

Please do let me know if any details are required.

cc: arajendr
Hello @pehunt Could you please update on the above query?
We have neither a target for a fix nor a workaround at the moment. We haven't been able to reproduce this reliably enough to figure out the situation in which it arises.
Hello @pehunt, please find the below concern from the customer:

Is there anything additional we can collect (along with the must-gather, sosreport, inspects of namespaces/projects/nodes, Prometheus data dump, and audit logs) from such a failure that would allow you to plan a resolution or a workaround? We want to know what went wrong, and how it can be worked around to avoid the issue appearing.
I don't know of anything else. The trouble here is multi-fold:

1: some entity is removing this directory when it shouldn't, and it's not announcing when that happens.
2: even if we were to instrument every call that removes the directory in the crio binary, and let folks try, there's no saying we'll reproduce it.

This is the worst kind of bug IMO: sporadic, probably due to a race, with no clear reproducer.

It may help me to have the must-gather/sosreport that you currently have. I can try digging around in it.

One lead we have is that it has happened more than expected on assisted installer installations. Was the one you found such an installation?
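One low-tech check that could complement such instrumentation, assuming the usual containers/storage on-disk layout (layer IDs recorded in `overlay-layers/layers.json`, one directory per layer under `overlay/`), is to cross-check the metadata against the directories actually on disk:

```shell
# Sketch: print layer IDs that are recorded in the storage metadata but
# whose overlay directory is missing -- the inconsistency behind the
# "no such file or directory" errors in this bug. The layers.json
# format ("id" fields of 64 hex chars) is an assumption.
check_layers() {
    root="$1"   # e.g. /var/lib/containers/storage
    grep -o '"id":"[0-9a-f]\{64\}"' "$root/overlay-layers/layers.json" |
        cut -d'"' -f4 |
        while read -r id; do
            [ -d "$root/overlay/$id" ] || echo "missing layer dir: $id"
        done
}
```

On an affected node this would be run as `check_layers /var/lib/containers/storage` (as root, ideally with CRI-O stopped so the metadata is not changing underneath).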
For tracking, there is some work to make the containers/storage library locking better optimized for CRI-O: https://github.com/containers/storage/issues/1332. It's my hope that that work ties up the issues here.
Hello @pehunt,

Please find the below link containing the cluster must-gather and sosreport:
https://attachments.access.redhat.com/hydra/rest/cases/03316257/attachments/93c5c6f4-09fb-4d6f-b02b-4526a55de384?usePresignedUrl=true

> One lead we have is it has happened more than expected on assisted installer installations. Was the one you found such an installation?

Here, the customer tried a cluster installation of OCP 4.10.26 on VMware using the UPI method.
Hello @pehunt, Could you please update?
@nalin is working on a patch to the storage library to detect situations where the storage is corrupted like this. Once that's done, I'll work on a PR to be able to catch situations like this and repair the image (if possible). In the meantime, @mitr is working on his set of PRs that will help the storage library be more robust. I'm afraid there isn't going to be an easy fix, and it will take a while to propagate into openshift. everyone's patience is appreciated
Hello @pehunt, I understand it's a hard bug to reproduce and to find the exact cause of the issue. CU mentioned that they really need to understand the expected timeline of this. Is there any way we could make an assumption here?
No, because there's no guarantee it'd be correct. I am not comfortable telling a customer a timeline without certainty that the timeline will be respected.
Hello @pehunt Is there any update on this Bugzilla? Please let me know if there is any additional input I can provide to weigh this Bugzilla with higher priority.
Hello @pehunt Is there any update on this Bugzilla? The customer is concerned about the Bugzilla progress.
After a power breakdown, one of my 4.8.0-0.okd-2021-11-14-052418 masters can not start 17 of its 29 pods, with a similar message:

```
Failed to pull image "quay.io/openshift/okd-content@sha256:459f15f0e457edaf04fa1a44be6858044d9af4de276620df46dc91a565ddb4ec": rpc error: code = Unknown desc = Error committing the finished image: error adding layer with blob "sha256:8523d6fd474185cca7ea077e7df87aca17a30c041cd4a02f379c7774a20a3dd1": error creating layer with ID "8ece14562d413a6f861625e8bb22ffa8e1ce933941ce23229da056697f08ba4e": Stat /var/lib/containers/storage/overlay/304a849950b39c8f546f2a914fa3d3c1c6425ea03d15baa53701ef68a95e0f33/diff: no such file or directory
```

If I delete the pod, the new one has the same problem.

@pehunt Can I help to find the problem by collecting logs?
That one is conceptually a bit simpler: I bet there was an image pull in progress when the node unexpectedly shut down. In newer OpenShift versions, CRI-O automatically removes `/var/lib/containers` on startup to prevent an unexpected node shutdown from causing this. I don't know if there are logs that could be gathered to help the case where a node is just running and this happens.
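The corresponding manual recovery (wiping local storage so everything is re-pulled) can be sketched as follows. This is destructive and only a sketch: `RUN=echo` (the default here) makes it a dry run, and the service names assume an OpenShift/RHCOS node running kubelet and CRI-O.

```shell
# Sketch of the manual recovery for a corrupted local image store.
# RUN=echo (the default) only prints the commands; clear RUN to really
# execute them. Wiping /var/lib/containers forces re-pulling all images.
reset_container_storage() {
    RUN="${RUN:-echo}"
    $RUN systemctl stop kubelet crio
    $RUN rm -rf /var/lib/containers
    $RUN systemctl start crio kubelet
}
```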
Thanks, removing `/var/lib/containers` helped.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-8939