Created attachment 1817838
master-1 failure to boot

OCP Version at Install Time: 4.9.0-0.ci.test-2021-08-26-081628-ci-op-btm7f0rp-latest
RHCOS Version at Install Time: rhcos-49.84.202108221651-0-openstack.x86_64.qcow2.gz
Platform: OpenStack
Architecture: x86_64

A CI job failed:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/21117/rehearse-21117-pull-ci-openshift-cloud-provider-openstack-master-e2e-openstack-ccm-install/1430804969698627584

We can see from the openstack console logs that it failed because one of the 3 masters failed to boot:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_release/21117/rehearse-21117-pull-ci-openshift-cloud-provider-openstack-master-e2e-openstack-ccm-install/1430804969698627584/artifacts/e2e-openstack-ccm-install/openstack-gather/artifacts/nodes/console_btm7f0rp-13339-tjj6m-master-1.log

I have attached this log to the BZ in case it is reaped by prow. Note that the other 2 masters both booted.

The boot ends in:

You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" or "exit"
to boot into default mode.
Press Enter for maintenance
(or press Control-D to continue):

I can see earlier in the logs:

         Starting OSTree Remount OS/ Bind Mounts...
[ 91.471884] ostree-remount[1270]: ostree-remount: failed to remount(ro) /sysroot: Device or resource busy
         Mounting Mount etcd as a ramdisk...
[FAILED] Failed to start OSTree Remount OS/ Bind Mounts.

which may be relevant. I assume this is a startup race?

I am not able to reproduce this issue. I am reporting this in case it is of interest to you. As long as it doesn't become a regular flake it is not significantly impacting me.

On the face of it it appears to be similar to: https://github.com/coreos/fedora-coreos-tracker/issues/746
Indeed, it seems https://github.com/coreos/fedora-coreos-tracker/issues/746 was fixed by https://github.com/ostreedev/ostree/pull/2387 which is not currently in RHCOS. Thanks for the report.
Possible duplicate of bug 1992618.
Setting medium Pri/Sev and targeting for 4.9. It may not be possible to land that patch in RHCOS in time for 4.9 GA, but it seems like something we should pursue for future releases. @Luca could you see that the ostree change in comment #1 makes its way into RHCOS?
Thanks for the report and the attached logs! The symptoms are the same as in the Fedora CoreOS issue, but I think the root cause is different, so I wouldn't start by approaching it with a backport.

My gut feeling is that whatever "Mount etcd as a ramdisk" is, it is missing some After/Before relationship and thus racing with 'ostree-remount.service'. That would indeed be similar to https://bugzilla.redhat.com/show_bug.cgi?id=1992618 as Benjamin noted.

Matthew, do you maybe know which unit is doing that etcd-ramdisk setup and where it is coming from? I did a quick search around but I couldn't locate it, and I believe it is not part of OCP proper.
(In reply to Luca BRUNO from comment #4)
> Thanks for the report and the attached logs! The symptoms are the same as
> the Fedora CoreOS but I think the root cause is different, so I wouldn't
> start approaching it with a backport upfront.
>
> My gut feeling is that whatever "Mount etcd as a ramdisk" is, it is missing
> some After/Before relationship and thus racing with 'ostree-remount.service'.
> That would indeed be similar to
> https://bugzilla.redhat.com/show_bug.cgi?id=1992618 as Benjamin noted.
>
> Matthew, do you maybe know which unit is doing that etcd-ramdisk setup and
> where it is coming from? I did a quick search around but I couldn't locate
> it, and I believe it is not part of OCP proper.

It's a hack used in CI for when a cloud's underlying storage isn't fast enough for etcd. We're currently using it by default for OpenStack, but starting to phase it out as we're no longer running everything on the one cloud that really needs it. It's defined via ignition here:

https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/conf/etcd/on-ramfs/ipi-conf-etcd-on-ramfs-commands.sh

Sounds like we need to fix the dependencies of that unit?
Should have said: if the answer's yes please reassign it to Cloud Compute -> OpenStack Provider with my thanks and we'll fix it.
Yes, I think that an 'After=ostree-remount.service var.mount' on that mount unit should fix this race. Re-assigning so that the CI can be directly fixed. I'll keep working on the other ticket to see if we can make this less cumbersome for the users.
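For illustration, the suggested ordering could look roughly like the mount unit below. This is only a sketch, not the actual unit from the CI step: the Description is taken from the console log and the After= line from the suggestion above, while the unit name (var-lib-etcd.mount), the mount point /var/lib/etcd, the tmpfs type and size, and the [Install] section are assumptions made for the example.

```ini
# var-lib-etcd.mount (hypothetical name; systemd requires a mount unit's
# name to match its Where= path, so /var/lib/etcd gives var-lib-etcd.mount)
[Unit]
Description=Mount etcd as a ramdisk
# Ordering proposed in this bug: do not start this mount until /sysroot has
# been remounted by ostree-remount.service and /var is in place.
After=ostree-remount.service var.mount

[Mount]
# Assumed values for the example; the real CI unit may differ.
What=none
Where=/var/lib/etcd
Type=tmpfs
Options=size=2G

[Install]
# Assumed: pull the mount in with the other local filesystems.
WantedBy=local-fs.target
```

On a booted node, "systemctl show -p After var-lib-etcd.mount" (with whatever the real unit name is) should list ostree-remount.service if the ordering was picked up.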
Changing the severity to LOW because it doesn't happen often.
Removing the Triaged keyword because: * the QE automation assessment (flag qe_test_coverage) is missing
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056