Description of problem:
After an unexpected power outage, CRI-O is not able to start.

Version-Release number of selected component (if applicable):
OCP 4.6

How reproducible:
Frequently

Steps to Reproduce:
1. Force a node shutdown on a bare-metal node: `echo b > /proc/sysrq-trigger`
2. CRI-O is not able to start (sometimes several tries are needed)

Actual results:
CRI-O is not able to start. `crictl` commands show the following errors:
~~~
# crictl ps -a
time="2021-08-01T01:50:58Z" level=fatal msg="connect: connect endpoint 'unix:///var/run/crio/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded"
# crictl pods
time="2021-08-01T01:51:00Z" level=fatal msg="connect: connect endpoint 'unix:///var/run/crio/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded"
~~~
Podman also fails with errors like:
~~~
Error: readlink /var/lib/containers/storage/overlay/l/DM... : no such file or directory
~~~
After cleaning the CRI-O ephemeral storage [1], the node is able to start.

Expected results:
CRI-O to auto-recover

Additional info:
[1] https://access.redhat.com/solutions/5350721
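For anyone triaging a similar node, the Podman `readlink` error can be checked for without modifying anything. The sketch below is not from the linked solution. It assumes the usual overlay driver layout, where each layer directory under the overlay root contains a `link` file holding a short ID, and `l/<short-id>` is a symlink back to that layer's `diff` directory. The function name `find_missing_overlay_links` is made up for illustration.

```shell
#!/bin/sh
# Sketch: report overlay layers whose short-name symlink under l/ is missing.
# Assumed layout (not taken from this report): <root>/<layer-id>/link holds a
# short ID, and <root>/l/<short-id> is a symlink to ../<layer-id>/diff.
find_missing_overlay_links() {
    root=$1
    for link_file in "$root"/*/link; do
        # Skip the literal pattern when the glob matches nothing.
        [ -f "$link_file" ] || continue
        layer_dir=${link_file%/link}
        short_id=$(cat "$link_file")
        # -L is true for a symlink even when its target is gone, so this
        # only flags links that do not exist at all.
        if [ ! -L "$root/l/$short_id" ]; then
            printf '%s\n' "${layer_dir##*/}"
        fi
    done
}
```

Run as root on the affected node, for example `find_missing_overlay_links /var/lib/containers/storage/overlay`. Any layer ID it prints has no entry under `l/`, which would be consistent with the error above. It is read-only and does not replace the cleanup in [1].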
Is there any chance the user can upgrade to 4.8? It includes an enhancement that lets CRI-O remove the container storage directory if it wasn't cleanly shut down, which makes restarts safer and more hands-off.
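The idea behind that enhancement can be sketched roughly as follows. This is an illustration of a clean-shutdown marker in general, not CRI-O's actual code: the function names and both paths are hypothetical and supplied by the caller. (Upstream CRI-O exposes a `clean_shutdown_file` setting for this, as far as I know, but check the configuration docs for your version.)

```shell
#!/bin/sh
# Illustration only: a clean-shutdown marker, not CRI-O's implementation.
# The marker and storage paths are hypothetical arguments.

# Called on graceful shutdown: leave a marker file behind.
mark_clean_shutdown() {
    : > "$1"
}

# Called on startup: wipe storage if the marker is missing, then consume the
# marker so that a later crash is detected on the next start.
recover_on_start() {
    marker=$1
    storage=$2
    if [ -e "$marker" ]; then
        echo "clean shutdown detected, keeping storage"
    else
        echo "unclean shutdown detected, wiping storage"
        # ${storage:?} aborts rather than expanding to "/*" if storage is empty.
        rm -rf "${storage:?}"/*
    fi
    rm -f "$marker"
}
```

A forced reboot such as `echo b > /proc/sysrq-trigger` never reaches `mark_clean_shutdown`, so the next start takes the wipe branch and the runtime comes up on empty storage instead of tripping over half-written layers.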
Fixed in 4.7 now that the attached PR has merged.
Checked on a bare-metal cluster with 4.7.0-0.nightly-2021-09-01-110541. Forced a shutdown a few times.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-09-01-110541   True        False         4h23m   Cluster version is 4.7.0-0.nightly-2021-09-01-110541

$ oc get nodes
NAME                                                 STATUS     ROLES    AGE     VERSION
master-00.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h44m   v1.20.0+9689d22
master-01.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h46m   v1.20.0+9689d22
master-02.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h45m   v1.20.0+9689d22
worker-00.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h35m   v1.20.0+9689d22
worker-01.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h36m   v1.20.0+9689d22
worker-02.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h35m   v1.20.0+9689d22

$ oc debug node/worker-00.sunilc020947.qe.devcluster.openshift.com
Starting pod/worker-00sunilc020947qedevclusteropenshiftcom-debug ...
...
sh-4.4# chroot /host
sh-4.4# echo b > /proc/sysrq-trigger

Removing debug pod ...
$ oc get nodes
NAME                                                 STATUS     ROLES    AGE     VERSION
master-00.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h48m   v1.20.0+9689d22
master-01.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h49m   v1.20.0+9689d22
master-02.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h49m   v1.20.0+9689d22
worker-00.sunilc020947.qe.devcluster.openshift.com   NotReady   worker   4h39m   v1.20.0+9689d22
worker-01.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h40m   v1.20.0+9689d22
worker-02.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h39m   v1.20.0+9689d22

$ oc get nodes
NAME                                                 STATUS     ROLES    AGE     VERSION
master-00.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h59m   v1.20.0+9689d22
master-01.sunilc020947.qe.devcluster.openshift.com   Ready      master   5h      v1.20.0+9689d22
master-02.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h59m   v1.20.0+9689d22
worker-00.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h49m   v1.20.0+9689d22
worker-01.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h50m   v1.20.0+9689d22
worker-02.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h49m   v1.20.0+9689d22

$ oc debug node/worker-00.sunilc020947.qe.devcluster.openshift.com
Starting pod/worker-00sunilc020947qedevclusteropenshiftcom-debug ...
...
sh-4.4# systemctl status crio
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf
   Active: active (running) since Thu 2021-09-02 16:39:06 UTC; 7min ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 2869 (crio)
    Tasks: 49
   Memory: 2.7G
      CPU: 1min 51.829s
   CGroup: /system.slice/crio.service
           └─2869 /usr/bin/crio --enable-metrics=true --metrics-port=9537
...
sh-4.4# crictl pods
POD ID          CREATED          STATE      NAME                                                  NAMESPACE                                ATTEMPT   RUNTIME
9f68c39a30f87   28 seconds ago   Ready      worker-00sunilc020947qedevclusteropenshiftcom-debug   default                                  0         (default)
4a6a66db421b1   4 minutes ago    NotReady   community-operators-gn2c9                             openshift-marketplace                    0         (default)
370998fa67eaa   4 minutes ago    NotReady   community-operators-72lzd                             openshift-marketplace                    0         (default)
85183fcab710a   7 minutes ago    Ready      tuned-8dzdl                                           openshift-cluster-node-tuning-operator   0         (default)
b945ec336c835   7 minutes ago    Ready      node-ca-5rvwv                                         openshift-image-registry                 0         (default)
d9890b5439648   7 minutes ago    Ready      network-check-target-lk8fn                            openshift-network-diagnostics            0         (default)
847a64a273bcb   7 minutes ago    Ready      network-metrics-daemon-gc6qg                          openshift-multus                         0         (default)
7c04ad0538605   7 minutes ago    Ready      redhat-marketplace-v7vrv                              openshift-marketplace                    0         (default)
1a37f052e8fad   7 minutes ago    Ready      machine-config-daemon-vwwqc                           openshift-machine-config-operator        0         (default)
7092983a9cf33   7 minutes ago    Ready      community-operators-gmcc9                             openshift-marketplace                    0         (default)
2e2e850d0d469   7 minutes ago    Ready      multus-jnq2l                                          openshift-multus                         0         (default)
2efa4c2748458   7 minutes ago    Ready      node-exporter-657sw                                   openshift-monitoring                     0         (default)
81929f275b8b1   7 minutes ago    Ready      sdn-hr2qd                                             openshift-sdn                            0         (default)
1c6c6f9c5a6ac   7 minutes ago    Ready      dns-default-46gqg                                     openshift-dns                            0         (default)
4cc58a720a731   7 minutes ago    Ready      ingress-canary-k7l7t                                  openshift-ingress-canary                 0         (default)
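When repeating this check across several forced reboots, a small filter over the `oc get nodes` output saves reading the table each time. This is only a convenience sketch and was not part of the verification above. The function name `not_ready_nodes` is made up, and it assumes the default column layout with STATUS in the second column.

```shell
#!/bin/sh
# Sketch: print the names of nodes whose STATUS is not Ready, reading
# `oc get nodes` output on stdin. Assumes STATUS is the second column.
not_ready_nodes() {
    # Skip the header row; accept "Ready" and "Ready,<something>" as ready.
    awk 'NR > 1 && $2 !~ /^Ready(,|$)/ { print $1 }'
}
```

Usage: `oc get nodes | not_ready_nodes`. Empty output means every node reports Ready. Polling it in a loop after the sysrq reboot shows when the node has come back.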
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.29 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3303