Bug 1943539 - crio-wipe is failing to start "Failed to shutdown storage before wiping: A layer is mounted: layer is in use by a container"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1930248
 
Reported: 2021-03-26 11:43 UTC by Stephen Benjamin
Modified: 2022-01-26 16:52 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:55:47 UTC
Target Upstream Version:
Embargoed:
Flags: smiron: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Github cri-o cri-o pull 4691 0 None closed crio wipe: only completely wipe storage after a reboot 2021-05-14 17:06:55 UTC
Github cri-o cri-o pull 4697 0 None closed crio-wipe: only clear storage if CleanShutdownFile is supported 2021-05-14 17:06:59 UTC
Github openshift machine-config-operator pull 2551 0 None open Bug 1943539: templates: order service files that use podman around crio-wipe 2021-05-14 17:07:03 UTC
Red Hat Knowledge Base (Solution) 6675291 0 None None None 2022-01-26 16:52:52 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:56:22 UTC

Description Stephen Benjamin 2021-03-26 11:43:17 UTC
Description of problem:

In the last 24 hours or so, e2e-metal-ipi jobs have started failing with errors like this:

 level=info msg=Waiting up to 20m0s for the Kubernetes API at https://api.ostest.test.metalkube.org:6443...
level=info msg=API v1.20.0+93f62ca up
level=info msg=Waiting up to 30m0s for bootstrapping to complete...
level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.ostest.test.metalkube.org:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 192.168.111.5:6443: connect: connection refused 


When I log in to my control plane hosts, I see that the crio-wipe service has failed on all of them:

[core@master-1 ~]$ sudo systemctl status crio-wipe --no-pager -l
● crio-wipe.service - CRI-O Auto Update Script
   Loaded: loaded (/usr/lib/systemd/system/crio-wipe.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2021-03-26 11:22:09 UTC; 13min ago
  Process: 2166 ExecStart=/usr/bin/crio $CRIO_CONFIG_OPTIONS $CRIO_RUNTIME_OPTIONS $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS wipe (code=exited, status=1/FAILURE)
 Main PID: 2166 (code=exited, status=1/FAILURE)
      CPU: 109ms

Mar 26 11:22:09 master-1.ostest.test.metalkube.org systemd[1]: Starting CRI-O Auto Update Script...
Mar 26 11:22:09 master-1.ostest.test.metalkube.org crio[2166]: time="2021-03-26T11:22:09Z" level=info msg="Starting CRI-O, version: 1.21.0-38.rhaos4.8.git50b6b5c.el8-dev, git: ()"
Mar 26 11:22:09 master-1.ostest.test.metalkube.org crio[2166]: time="2021-03-26 11:22:09.179025635Z" level=info msg="File /var/lib/crio/clean.shutdown not found. Wiping storage directory /var/lib/containers/storage because of suspected dirty shutdown"
Mar 26 11:22:09 master-1.ostest.test.metalkube.org crio[2166]: time="2021-03-26 11:22:09.180674068Z" level=info msg="Failed to wipe storage cleanly: unlinkat /var/lib/containers/storage/overlay-containers/b0cdfa54bcb5e3eedb16b72891c11d52c56b9948dcd4a3a1164a125c85d595bb/userdata/shm: device or resource busy"
Mar 26 11:22:09 master-1.ostest.test.metalkube.org crio[2166]: time="2021-03-26 11:22:09.180764564Z" level=fatal msg="Failed to shutdown storage before wiping: A layer is mounted: layer is in use by a container"
Mar 26 11:22:09 master-1.ostest.test.metalkube.org systemd[1]: crio-wipe.service: Main process exited, code=exited, status=1/FAILURE
Mar 26 11:22:09 master-1.ostest.test.metalkube.org systemd[1]: crio-wipe.service: Failed with result 'exit-code'.
Mar 26 11:22:09 master-1.ostest.test.metalkube.org systemd[1]: Failed to start CRI-O Auto Update Script.


cri-o is not running, and my cluster never bootstraps.
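
To spell out what the journal shows: crio-wipe treats the missing /var/lib/crio/clean.shutdown marker as a dirty shutdown and then tries to remove the whole storage directory, which cannot succeed while a running container still has an overlay layer mounted. Roughly, as a shell sketch (illustration only, not the actual cri-o code):

```
# Illustration only -- not the real cri-o implementation.
# No clean-shutdown marker => assume dirty shutdown => wipe all container storage.
if [ ! -f /var/lib/crio/clean.shutdown ]; then
    echo "suspected dirty shutdown, wiping /var/lib/containers/storage"
    # This is where it fails: an overlay layer is still mounted by a
    # running container, so the recursive removal hits "device or resource busy".
    rm -rf /var/lib/containers/storage/*
fi
```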


Version-Release number of selected component (if applicable):

Red Hat Enterprise Linux CoreOS 48.83.202103251458-0
CRI-O, version: 1.21.0-38.rhaos4.8.git50b6b5c.el8-dev


How reproducible:
Appears to be always

Steps to Reproduce:
1. Install metal IPI


Actual results:
Install fails

Expected results:
Install succeeds

Additional info:

Some logs available in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack/1375387422078013440/artifacts/e2e-metal-ipi-ovn-dualstack/baremetalds-devscripts-gather/artifacts/, but I also have a live reproducer environment if anyone would like to look.

Comment 1 Peter Hunt 2021-03-26 13:40:50 UTC
this will be fixed in the attached PR

Comment 3 Peter Hunt 2021-03-29 13:31:28 UTC
fix is in 48.83.202103262224

Comment 5 Peter Hunt 2021-03-29 17:10:05 UTC
oops, we'll also need this PR

Comment 6 Peter Hunt 2021-03-30 13:41:08 UTC
new fix is in 48.83.202103301119-0

Comment 15 Peter Hunt 2021-04-28 17:34:48 UTC
The aforementioned fixes don't cover all cases: there is still a race between crio-wipe and podman containers run on system boot. To fix this, we'll need a new PR in the MCO.
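
For illustration, the MCO-side fix is an ordering change: boot-time services that launch podman containers have to wait for crio-wipe, so none of their container layers can be mounted while the wipe runs. A hedged sketch of that kind of ordering drop-in (the unit name below is a made-up example; the actual units and templates changed are in the linked MCO PR):

```
# Hypothetical example unit name -- see the linked MCO PR for the real changes.
sudo mkdir -p /etc/systemd/system/example-podman.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/example-podman.service.d/10-after-crio-wipe.conf
[Unit]
# Don't start podman containers (and mount their layers) until crio-wipe
# has finished, so the wipe can no longer race with them.
After=crio-wipe.service
EOF
```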

Comment 18 Nir 2021-05-18 14:34:20 UTC
Was this bug introduced in 4.8, or does it exist in previous versions?
If it's the latter, do we have backport plans?

Comment 19 Peter Hunt 2021-05-18 18:12:41 UTC
it was introduced in 4.8, no backports needed!

Comment 24 Shelly Miron 2021-06-07 12:49:43 UTC
Hi, I was able to reproduce the issue, and this time I also attached the must-gather collected with --node-name=worker-0-0:


must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1943539-new-must-gather.tar.gz


[kni@provisionhost-0-0 ~]$ ssh core@worker-0-0
Red Hat Enterprise Linux CoreOS 48.84.202106032218-0
  Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).
WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html
---
Last login: Mon Jun  7 12:00:40 2021 from 192.168.123.145
[systemd]
Failed Units: 10
  crio-wipe.service
  kubelet-auto-node-size.service
  node-valid-hostname.service
  sssd.service
  systemd-coredump
  systemd-coredump
  systemd-coredump
  systemd-coredump
  systemd-coredump
  systemd-hostnamed.service


[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS                        ROLES    AGE     VERSION
master-0-0   Ready                         master   5h31m   v1.21.0-rc.0+2dfc46b
master-0-1   Ready                         master   5h31m   v1.21.0-rc.0+2dfc46b
master-0-2   Ready                         master   5h31m   v1.21.0-rc.0+2dfc46b
worker-0-0   NotReady,SchedulingDisabled   worker   5h1m    v1.21.0-rc.0+2dfc46b
worker-0-1   Ready                         worker   5h1m    v1.21.0-rc.0+2dfc46b
worker-0-2   Ready                         worker   5h1m    v1.21.0-rc.0+2dfc46b
worker-0-3   Ready                         worker   5h1m    v1.21.0-rc.0+2dfc46b


[core@localhost ~]$ sudo systemctl status crio-wipe.service
● crio-wipe.service - CRI-O Auto Update Script
   Loaded: loaded (/usr/lib/systemd/system/crio-wipe.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2021-06-07 12:20:54 UTC; 28min ago
 Main PID: 1102 (code=exited, status=1/FAILURE)
      CPU: 116ms
Jun 07 12:20:54 localhost systemd[1]: Starting CRI-O Auto Update Script...
Jun 07 12:20:54 localhost crio[1102]: time="2021-06-07 12:20:54.822651586Z" level=info msg="Starting CRI-O, version: 1.21.1-2.rhaos4.8.gitf635341.el8, git: ()"
Jun 07 12:20:54 localhost crio[1102]: time="2021-06-07 12:20:54.831052663Z" level=fatal msg="Mkdir /var/lib/containers/storage/overlay/compat738689383: no space lef>
Jun 07 12:20:54 localhost systemd[1]: crio-wipe.service: Main process exited, code=exited, status=1/FAILURE
Jun 07 12:20:54 localhost systemd[1]: crio-wipe.service: Failed with result 'exit-code'.
Jun 07 12:20:54 localhost systemd[1]: Failed to start CRI-O Auto Update Script.
Jun 07 12:20:54 localhost systemd[1]: crio-wipe.service: Consumed 116ms CPU time

Comment 25 Peter Hunt 2021-06-09 19:58:07 UTC
```
Jun 07 12:20:54 localhost crio[1102]: time="2021-06-07 12:20:54.831052663Z" level=fatal msg="Mkdir /var/lib/containers/storage/overlay/compat738689383: no space lef>
```
IDK, but this looks unrelated to the bug here; it just looks like you've run out of disk space.

Comment 29 Shelly Miron 2021-06-13 05:59:55 UTC
ok, so moving to verified.

Comment 32 errata-xmlrpc 2021-07-27 22:55:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

