Description of problem:

In the last 24 hours or so, e2e-metal-ipi jobs have started failing with errors like this:

```
level=info msg=Waiting up to 20m0s for the Kubernetes API at https://api.ostest.test.metalkube.org:6443...
level=info msg=API v1.20.0+93f62ca up
level=info msg=Waiting up to 30m0s for bootstrapping to complete...
level=error msg=Attempted to gather ClusterOperator status after installation failure: listing ClusterOperator objects: Get "https://api.ostest.test.metalkube.org:6443/apis/config.openshift.io/v1/clusteroperators": dial tcp 192.168.111.5:6443: connect: connection refused
```

When I log in to my control plane hosts, I see that the crio-wipe service has failed on all of them:

```
[core@master-1 ~]$ sudo systemctl status crio-wipe --no-pager -l
● crio-wipe.service - CRI-O Auto Update Script
   Loaded: loaded (/usr/lib/systemd/system/crio-wipe.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2021-03-26 11:22:09 UTC; 13min ago
  Process: 2166 ExecStart=/usr/bin/crio $CRIO_CONFIG_OPTIONS $CRIO_RUNTIME_OPTIONS $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS wipe (code=exited, status=1/FAILURE)
 Main PID: 2166 (code=exited, status=1/FAILURE)
      CPU: 109ms

Mar 26 11:22:09 master-1.ostest.test.metalkube.org systemd[1]: Starting CRI-O Auto Update Script...
Mar 26 11:22:09 master-1.ostest.test.metalkube.org crio[2166]: time="2021-03-26T11:22:09Z" level=info msg="Starting CRI-O, version: 1.21.0-38.rhaos4.8.git50b6b5c.el8-dev, git: ()"
Mar 26 11:22:09 master-1.ostest.test.metalkube.org crio[2166]: time="2021-03-26 11:22:09.179025635Z" level=info msg="File /var/lib/crio/clean.shutdown not found. Wiping storage directory /var/lib/containers/storage because of suspected dirty shutdown"
Mar 26 11:22:09 master-1.ostest.test.metalkube.org crio[2166]: time="2021-03-26 11:22:09.180674068Z" level=info msg="Failed to wipe storage cleanly: unlinkat /var/lib/containers/storage/overlay-containers/b0cdfa54bcb5e3eedb16b72891c11d52c56b9948dcd4a3a1164a125c85d595bb/userdata/shm: device or resource busy"
Mar 26 11:22:09 master-1.ostest.test.metalkube.org crio[2166]: time="2021-03-26 11:22:09.180764564Z" level=fatal msg="Failed to shutdown storage before wiping: A layer is mounted: layer is in use by a container"
Mar 26 11:22:09 master-1.ostest.test.metalkube.org systemd[1]: crio-wipe.service: Main process exited, code=exited, status=1/FAILURE
Mar 26 11:22:09 master-1.ostest.test.metalkube.org systemd[1]: crio-wipe.service: Failed with result 'exit-code'.
Mar 26 11:22:09 master-1.ostest.test.metalkube.org systemd[1]: Failed to start CRI-O Auto Update Script.
```

CRI-O is not running, and my cluster never bootstraps.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux CoreOS 48.83.202103251458-0
CRI-O, version: 1.21.0-38.rhaos4.8.git50b6b5c.el8-dev

How reproducible:
Appears to be always

Steps to Reproduce:
1. Install metal IPI

Actual results:
Install fails

Expected results:
Install succeeds

Additional info:
Some logs are available at https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-ovn-dualstack/1375387422078013440/artifacts/e2e-metal-ipi-ovn-dualstack/baremetalds-devscripts-gather/artifacts/, but I also have a live reproducer environment if anyone would like to look.
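For anyone unfamiliar with the service, the decision crio-wipe makes at boot can be sketched roughly as follows. This is a hedged shell sketch, not CRI-O's actual code (the real logic lives in CRI-O's Go `wipe` subcommand); the function name `crio_wipe_check` is made up, and only the two paths come from the log above:

```shell
#!/bin/sh
# Sketch of crio-wipe's boot-time decision: if the clean-shutdown marker
# file is absent, assume a dirty shutdown and wipe container storage.
crio_wipe_check() {
  # $1: clean-shutdown marker file, $2: container storage directory
  if [ -f "$1" ]; then
    echo "clean shutdown detected; nothing to wipe"
  else
    # The real wipe is where this bug bites: it aborts with
    # "layer is in use by a container" if an overlay layer is still mounted.
    echo "suspected dirty shutdown; would wipe $2"
  fi
}

# On a real node the paths would be:
#   crio_wipe_check /var/lib/crio/clean.shutdown /var/lib/containers/storage
```

The race in this bug is that something (here, a podman container) already has a layer mounted under the storage directory by the time the wipe runs, so the wipe cannot complete and the unit fails.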
This will be fixed in the attached PR.
The fix is in RHCOS 48.83.202103262224.
Oops, we'll also need this PR.
The new fix is in RHCOS 48.83.202103301119-0.
The aforementioned fixes don't cover all cases: there is still a race between crio-wipe and podman containers run at system boot. To fix this, we'll need a new PR in the MCO.
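For context, the usual shape of a fix like this is a systemd ordering drop-in shipped via MachineConfig, so that any boot-time podman container unit waits for crio-wipe to finish before mounting storage layers. A hypothetical sketch (the unit name `example-podman-container.service` is invented for illustration; the actual MCO PR may take a different approach):

```ini
# Hypothetical drop-in:
# /etc/systemd/system/example-podman-container.service.d/10-after-crio-wipe.conf
# Orders a boot-time podman container unit after crio-wipe, so the wipe
# never finds a storage layer already mounted by that container.
[Unit]
After=crio-wipe.service
Wants=crio-wipe.service
```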
Was this bug introduced in 4.8, or does it exist in previous versions? If it's the latter, do we have backport plans?
It was introduced in 4.8; no backports needed!
Hi, I was able to reproduce the issue, and this time I also attached the must-gather taken with `--node-name=worker-0-0`:

must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ1943539-new-must-gather.tar.gz

```
[kni@provisionhost-0-0 ~]$ ssh core@worker-0-0
Red Hat Enterprise Linux CoreOS 48.84.202106032218-0
  Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html

---
Last login: Mon Jun  7 12:00:40 2021 from 192.168.123.145
[systemd]
Failed Units: 10
  crio-wipe.service
  kubelet-auto-node-size.service
  node-valid-hostname.service
  sssd.service
  systemd-coredump
  systemd-coredump
  systemd-coredump
  systemd-coredump
  systemd-coredump
  systemd-hostnamed.service
```

```
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS                        ROLES    AGE     VERSION
master-0-0   Ready                         master   5h31m   v1.21.0-rc.0+2dfc46b
master-0-1   Ready                         master   5h31m   v1.21.0-rc.0+2dfc46b
master-0-2   Ready                         master   5h31m   v1.21.0-rc.0+2dfc46b
worker-0-0   NotReady,SchedulingDisabled   worker   5h1m    v1.21.0-rc.0+2dfc46b
worker-0-1   Ready                         worker   5h1m    v1.21.0-rc.0+2dfc46b
worker-0-2   Ready                         worker   5h1m    v1.21.0-rc.0+2dfc46b
worker-0-3   Ready                         worker   5h1m    v1.21.0-rc.0+2dfc46b
```

```
[core@localhost ~]$ sudo systemctl status crio-wipe.service
● crio-wipe.service - CRI-O Auto Update Script
   Loaded: loaded (/usr/lib/systemd/system/crio-wipe.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2021-06-07 12:20:54 UTC; 28min ago
 Main PID: 1102 (code=exited, status=1/FAILURE)
      CPU: 116ms

Jun 07 12:20:54 localhost systemd[1]: Starting CRI-O Auto Update Script...
Jun 07 12:20:54 localhost crio[1102]: time="2021-06-07 12:20:54.822651586Z" level=info msg="Starting CRI-O, version: 1.21.1-2.rhaos4.8.gitf635341.el8, git: ()"
Jun 07 12:20:54 localhost crio[1102]: time="2021-06-07 12:20:54.831052663Z" level=fatal msg="Mkdir /var/lib/containers/storage/overlay/compat738689383: no space lef>
Jun 07 12:20:54 localhost systemd[1]: crio-wipe.service: Main process exited, code=exited, status=1/FAILURE
Jun 07 12:20:54 localhost systemd[1]: crio-wipe.service: Failed with result 'exit-code'.
Jun 07 12:20:54 localhost systemd[1]: Failed to start CRI-O Auto Update Script.
Jun 07 12:20:54 localhost systemd[1]: crio-wipe.service: Consumed 116ms CPU time
```
```
Jun 07 12:20:54 localhost crio[1102]: time="2021-06-07 12:20:54.831052663Z" level=fatal msg="Mkdir /var/lib/containers/storage/overlay/compat738689383: no space lef>
```

This looks unrelated to the bug here; it just looks like you've run out of disk space.
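If that's the case, standard tooling on the node can confirm it. Nothing here is specific to this bug; note that `Mkdir ... no space left on device` can mean either bytes or inodes are exhausted, so it's worth checking both:

```shell
# Check byte usage and inode usage on the filesystem backing container
# storage; "no space left on device" can be triggered by either one.
df -h /var/lib/containers/storage
df -i /var/lib/containers/storage
```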
OK, so moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438