Bug 1995199 - After unexpected power outage some nodes could not start anything via crio
Summary: After unexpected power outage some nodes could not start anything via crio
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.7.z
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1942536
Blocks:
Reported: 2021-08-18 15:52 UTC by oarribas
Modified: 2024-12-20 20:45 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-09-08 13:17:53 UTC
Target Upstream Version:
Embargoed:
pehunt: needinfo-


Links
System ID Private Priority Status Summary Last Updated
Github cri-o cri-o pull 5229 0 None None None 2021-08-23 20:12:50 UTC
Red Hat Product Errata RHSA-2021:3303 0 None None None 2021-09-08 13:18:17 UTC

Description oarribas 2021-08-18 15:52:36 UTC
Description of problem:

After an unexpected power outage, CRI-O is not able to start.


Version-Release number of selected component (if applicable):

OCP 4.6


How reproducible:

Frequently


Steps to Reproduce:
1. Force a node shutdown on bare metal: `echo b > /proc/sysrq-trigger`
2. CRI-O is not able to start (sometimes several tries are needed)


Actual results:

CRI-O is not able to start. `crictl` commands show the following errors:
~~~
# crictl ps -a
time="2021-08-01T01:50:58Z" level=fatal msg="connect: connect endpoint 'unix:///var/run/crio/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded"

# crictl pods
time="2021-08-01T01:51:00Z" level=fatal msg="connect: connect endpoint 'unix:///var/run/crio/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded"
~~~

Podman also fails with errors like:
~~~
Error: readlink /var/lib/containers/storage/overlay/l/DM... : no such file or directory
~~~

After cleaning the CRI-O ephemeral storage [1], the node is able to start.
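
The Podman `readlink` error above points at an entry under `overlay/l/` that no longer exists. The following is a minimal, read-only sketch for listing layers in that state. It assumes each layer directory under the overlay root holds a `link` file naming its entry in `l/`; the function name `missing_overlay_links` is hypothetical, and whether missing links are the only damage left by the outage is not established by this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every layer directory whose short-name
# symlink under <overlay_root>/l/ is missing.
# Assumption: each layer directory contains a "link" file holding the
# name of its entry in l/. Nothing on disk is modified.
missing_overlay_links() {
    local root="$1" link_file layer name
    for link_file in "$root"/*/link; do
        # Skip the literal pattern when the glob matched nothing.
        [ -f "$link_file" ] || continue
        layer="$(basename "$(dirname "$link_file")")"
        # The l/ directory itself is not a layer.
        if [ "$layer" = "l" ]; then
            continue
        fi
        name="$(cat "$link_file")"
        # An empty link file or an absent symlink both count as missing.
        if [ -z "$name" ] || [ ! -L "$root/l/$name" ]; then
            printf '%s\n' "$layer"
        fi
    done
}
```

It could be called as `missing_overlay_links /var/lib/containers/storage/overlay` on an affected node; any output names the layers whose link entry is gone.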


Expected results:

CRI-O to auto-recover



Additional info:




[1] https://access.redhat.com/solutions/5350721

Comment 1 Peter Hunt 2021-08-18 15:56:58 UTC
Is there any chance the user can upgrade to 4.8? It includes an enhancement that allows CRI-O to remove the container storage directory if it was not shut down cleanly, which makes restarts safer and reduces the need for manual intervention.
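
The idea behind the enhancement mentioned above can be illustrated with a marker file: the service writes a marker when it stops normally, and on startup a missing marker is treated as an unclean shutdown. This is only a sketch of that general technique, not CRI-O's actual implementation; the function names `startup_check` and `clean_shutdown` and both path arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a clean-shutdown marker; not CRI-O's code.

# startup_check MARKER STORAGE_DIR
# If MARKER is absent, the previous run did not stop cleanly, so the
# contents of STORAGE_DIR are removed (the directory itself is kept).
# The marker is then consumed so that a crash during this run is
# detected on the next start.
startup_check() {
    local marker="$1" storage="$2"
    # Refuse to run against an empty or non-existent storage path.
    if [ -z "$storage" ] || [ ! -d "$storage" ]; then
        return 1
    fi
    if [ ! -e "$marker" ]; then
        echo "unclean shutdown detected; clearing $storage" >&2
        find "$storage" -mindepth 1 -delete
    fi
    rm -f "$marker"
}

# clean_shutdown MARKER
# Record that the service stopped normally.
clean_shutdown() {
    : > "$1"
}
```

With this scheme a forced reboot never reaches `clean_shutdown`, so the next `startup_check` finds no marker and clears the storage directory. A first-ever start also finds no marker, which is harmless because the storage is still empty.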

Comment 8 Peter Hunt 2021-08-23 20:12:50 UTC
Fixed in 4.7 now that the attached PR has merged.

Comment 13 Sunil Choudhary 2021-09-02 16:49:51 UTC
Checked on a bare-metal cluster with 4.7.0-0.nightly-2021-09-01-110541. Forced a shutdown a few times.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-09-01-110541   True        False         4h23m   Cluster version is 4.7.0-0.nightly-2021-09-01-110541

$ oc get nodes
NAME                                                 STATUS   ROLES    AGE     VERSION
master-00.sunilc020947.qe.devcluster.openshift.com   Ready    master   4h44m   v1.20.0+9689d22
master-01.sunilc020947.qe.devcluster.openshift.com   Ready    master   4h46m   v1.20.0+9689d22
master-02.sunilc020947.qe.devcluster.openshift.com   Ready    master   4h45m   v1.20.0+9689d22
worker-00.sunilc020947.qe.devcluster.openshift.com   Ready    worker   4h35m   v1.20.0+9689d22
worker-01.sunilc020947.qe.devcluster.openshift.com   Ready    worker   4h36m   v1.20.0+9689d22
worker-02.sunilc020947.qe.devcluster.openshift.com   Ready    worker   4h35m   v1.20.0+9689d22

$ oc debug node/worker-00.sunilc020947.qe.devcluster.openshift.com
Starting pod/worker-00sunilc020947qedevclusteropenshiftcom-debug ...
...

sh-4.4# chroot /host
sh-4.4# echo b > /proc/sysrq-trigger

Removing debug pod ...

$ oc get nodes
NAME                                                 STATUS     ROLES    AGE     VERSION
master-00.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h48m   v1.20.0+9689d22
master-01.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h49m   v1.20.0+9689d22
master-02.sunilc020947.qe.devcluster.openshift.com   Ready      master   4h49m   v1.20.0+9689d22
worker-00.sunilc020947.qe.devcluster.openshift.com   NotReady   worker   4h39m   v1.20.0+9689d22
worker-01.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h40m   v1.20.0+9689d22
worker-02.sunilc020947.qe.devcluster.openshift.com   Ready      worker   4h39m   v1.20.0+9689d22

$ oc get nodes
NAME                                                 STATUS   ROLES    AGE     VERSION
master-00.sunilc020947.qe.devcluster.openshift.com   Ready    master   4h59m   v1.20.0+9689d22
master-01.sunilc020947.qe.devcluster.openshift.com   Ready    master   5h      v1.20.0+9689d22
master-02.sunilc020947.qe.devcluster.openshift.com   Ready    master   4h59m   v1.20.0+9689d22
worker-00.sunilc020947.qe.devcluster.openshift.com   Ready    worker   4h49m   v1.20.0+9689d22
worker-01.sunilc020947.qe.devcluster.openshift.com   Ready    worker   4h50m   v1.20.0+9689d22
worker-02.sunilc020947.qe.devcluster.openshift.com   Ready    worker   4h49m   v1.20.0+9689d22


$ oc debug node/worker-00.sunilc020947.qe.devcluster.openshift.com
Starting pod/worker-00sunilc020947qedevclusteropenshiftcom-debug ...
...

sh-4.4# systemctl status crio
● crio.service - Open Container Initiative Daemon
   Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/crio.service.d
           └─10-mco-default-env.conf, 10-mco-default-madv.conf, 10-mco-profile-unix-socket.conf, 20-nodenet.conf
   Active: active (running) since Thu 2021-09-02 16:39:06 UTC; 7min ago
     Docs: https://github.com/cri-o/cri-o
 Main PID: 2869 (crio)
    Tasks: 49
   Memory: 2.7G
      CPU: 1min 51.829s
   CGroup: /system.slice/crio.service
           └─2869 /usr/bin/crio --enable-metrics=true --metrics-port=9537
...


sh-4.4# crictl pods 
POD ID              CREATED             STATE               NAME                                                  NAMESPACE                                ATTEMPT             RUNTIME
9f68c39a30f87       28 seconds ago      Ready               worker-00sunilc020947qedevclusteropenshiftcom-debug   default                                  0                   (default)
4a6a66db421b1       4 minutes ago       NotReady            community-operators-gn2c9                             openshift-marketplace                    0                   (default)
370998fa67eaa       4 minutes ago       NotReady            community-operators-72lzd                             openshift-marketplace                    0                   (default)
85183fcab710a       7 minutes ago       Ready               tuned-8dzdl                                           openshift-cluster-node-tuning-operator   0                   (default)
b945ec336c835       7 minutes ago       Ready               node-ca-5rvwv                                         openshift-image-registry                 0                   (default)
d9890b5439648       7 minutes ago       Ready               network-check-target-lk8fn                            openshift-network-diagnostics            0                   (default)
847a64a273bcb       7 minutes ago       Ready               network-metrics-daemon-gc6qg                          openshift-multus                         0                   (default)
7c04ad0538605       7 minutes ago       Ready               redhat-marketplace-v7vrv                              openshift-marketplace                    0                   (default)
1a37f052e8fad       7 minutes ago       Ready               machine-config-daemon-vwwqc                           openshift-machine-config-operator        0                   (default)
7092983a9cf33       7 minutes ago       Ready               community-operators-gmcc9                             openshift-marketplace                    0                   (default)
2e2e850d0d469       7 minutes ago       Ready               multus-jnq2l                                          openshift-multus                         0                   (default)
2efa4c2748458       7 minutes ago       Ready               node-exporter-657sw                                   openshift-monitoring                     0                   (default)
81929f275b8b1       7 minutes ago       Ready               sdn-hr2qd                                             openshift-sdn                            0                   (default)
1c6c6f9c5a6ac       7 minutes ago       Ready               dns-default-46gqg                                     openshift-dns                            0                   (default)
4cc58a720a731       7 minutes ago       Ready               ingress-canary-k7l7t                                  openshift-ingress-canary                 0                   (default)
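
The node-status part of the verification above (watching for the rebooted worker to go from NotReady back to Ready) could be scripted roughly as follows. This is a sketch that only parses the tabular output of `oc get nodes` as shown in this comment; the function name `not_ready_nodes` is hypothetical, and it reports any status other than exactly `Ready`, so a compound status such as a cordoned node would also be listed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `oc get nodes` output on stdin and print
# the name of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
    # NR > 1 skips the header row; NF >= 2 skips blank lines.
    awk 'NR > 1 && NF >= 2 && $2 != "Ready" { print $1 }'
}
```

Typical use would be `oc get nodes | not_ready_nodes`, repeated until it prints nothing.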

Comment 15 errata-xmlrpc 2021-09-08 13:17:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.29 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3303

