Description of problem:

When applying changes through MachineConfigs, the nodes were stuck in the `SchedulingDisabled` status because the `/boot` partition was not mounted.

~~~
Mar 10 19:52:32 node1.example.com rpm-ostree[28008]: error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
~~~

The workaround is to manually mount `/dev/disk/by-label/boot` on `/boot` and reboot the nodes.

Version-Release number of selected component (if applicable):
OpenShift Container Platform version 4.7
RHCOS version 47.83.202102090044-0
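The manual workaround can be sketched as a small script. This is a hedged sketch, not from the bug report: it assumes the boot filesystem still carries the `boot` label, and it only echoes the commands unless `RUN=1` is set.

```shell
#!/bin/bash
# Sketch of the manual workaround: remount /boot from its filesystem
# label, then reboot the node. Prints commands by default; set RUN=1
# to actually execute them.
set -euo pipefail

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

run mount /dev/disk/by-label/boot /boot
run systemctl reboot
```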
This may also be fixed by https://github.com/coreos/rpm-ostree/pull/2386
OK, looking more closely, this also came up in the original https://bugzilla.redhat.com/show_bug.cgi?id=1886229

How reproducible is this? Does it happen every time? Can you get a bit more information, like the output of:

systemctl --failed
journalctl -b -u boot.mount -u rpm-ostreed -u multipathd

OK, actually testing this out locally, there seems to be some sort of race condition with multipathd; the above PR might help paper over it a bit, but we need to get to the bottom of this:

```
[root@cosa-devsh ~]# journalctl -b -u boot.mount -u rpm-ostreed -u multipathd
-- Logs begin at Wed 2021-03-31 13:09:50 UTC, end at Wed 2021-03-31 13:11:33 UTC. --
Mar 31 13:11:00 localhost systemd[1]: Starting Device-Mapper Multipath Device Controller...
Mar 31 13:11:00 localhost systemd[1]: Started Device-Mapper Multipath Device Controller.
Mar 31 13:11:00 localhost multipathd[783]: --------start up--------
Mar 31 13:11:00 localhost multipathd[783]: read /etc/multipath.conf
Mar 31 13:11:00 localhost multipathd[783]: path checkers start up
Mar 31 13:11:00 localhost multipathd[783]: mpatha: load table [0 33554432 multipath 0 0 2 1 service-time 0 1 1 8:0 1 service-time 0 1 1 8:16 1]
Mar 31 13:11:01 localhost systemd[1]: Stopping Device-Mapper Multipath Device Controller...
Mar 31 13:11:01 localhost multipathd[1000]: ok
Mar 31 13:11:01 localhost multipathd[783]: --------shut down-------
Mar 31 13:11:01 localhost systemd[1]: multipathd.service: Succeeded.
Mar 31 13:11:01 localhost systemd[1]: Stopped Device-Mapper Multipath Device Controller.
Mar 31 13:11:03 localhost systemd[1]: Starting Device-Mapper Multipath Device Controller...
Mar 31 13:11:03 localhost multipathd[1198]: --------start up--------
Mar 31 13:11:03 localhost multipathd[1198]: read /etc/multipath.conf
Mar 31 13:11:03 localhost multipathd[1198]: path checkers start up
Mar 31 13:11:03 localhost multipathd[1198]: Allocated existing binding [mpatha] for WWID [0xb18fdcfcecbeab1a]
Mar 31 13:11:03 localhost systemd[1]: Started Device-Mapper Multipath Device Controller.
Mar 31 13:11:03 localhost systemd[1]: Mounting CoreOS Dynamic Mount for /boot...
Mar 31 13:11:03 localhost systemd[1]: boot.mount: Mount process exited, code=killed status=15
Mar 31 13:11:03 localhost systemd[1]: boot.mount: Failed with result 'signal'.
Mar 31 13:11:03 localhost systemd[1]: Unmounted CoreOS Dynamic Mount for /boot.
Mar 31 13:11:03 localhost systemd[1]: Mounting CoreOS Dynamic Mount for /boot...
Mar 31 13:11:03 localhost systemd[1]: Mounted CoreOS Dynamic Mount for /boot.
```

Something is causing multipathd to restart immediately after starting - this in turn transiently kills boot.mount and is presumably what's causing the mount to not appear.
OK, of course: we start multipathd in both the initramfs and the real root, so that's what's going on. This journal query makes the flow a bit more obvious:

journalctl -b UNIT=local-fs-pre.target UNIT=local-fs.target UNIT=boot.mount + UNIT=multipathd.service + _SYSTEMD_UNIT=multipathd.service + "MESSAGE=Switching root."

So far I've only seen this boot.mount unit being killed once; AFAICS multipathd is started in local-fs-pre.target and our generated boot.mount has After=local-fs-pre.target; the journal logs confirm this. Trying to reproduce and narrow down the race.

I think it may be related to the code in https://github.com/coreos/fedora-coreos-config/blob/testing-devel/overlay.d/05core/usr/lib/systemd/system-generators/coreos-boot-mount-generator#L31
A likely candidate is that there's no elegant way to wait for multipathd.service to take ownership of the underlying block devices (and for udev to have regenerated the by-* links). Ideally, multipathd.service would block on that before notifying readiness.

Hmm, maybe what we should do is have the generator key off of whether the rootfs is backed by multipath, and if so, use `What=/dev/disk/by-label/dm-mpath-boot` instead. This isn't bulletproof though, because in theory you could have reprovisioned just the rootfs on a multipath target but left `/boot` on the original non-multipathed disk. I'm not sure how likely that scenario is though (why not just put everything on multipath?).
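That generator idea can be sketched roughly as below. This is a hypothetical sketch, not the actual coreos-boot-mount-generator: the helper names are invented, and a real implementation would inspect udev properties (e.g. `DM_UUID=mpath-*`) rather than matching on device names.

```shell
#!/bin/bash
# Hypothetical sketch: choose the What= for the generated boot.mount
# based on whether the root device is a device-mapper multipath node,
# so /boot is mounted via the multipath-specific label when root is
# on multipath and we never race against multipathd claiming the disk.

# root_is_multipath DEVICE - crude name-based check (illustrative only)
root_is_multipath() {
    case "$1" in
        /dev/mapper/mpath*|/dev/dm-*) return 0 ;;
        *) return 1 ;;
    esac
}

# boot_mount_what DEVICE - echo the What= to put in the boot.mount unit
boot_mount_what() {
    if root_is_multipath "$1"; then
        echo /dev/disk/by-label/dm-mpath-boot
    else
        echo /dev/disk/by-label/boot
    fi
}
```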
OK, there's a much larger rework of multipath going on in https://bugzilla.redhat.com/show_bug.cgi?id=1954025 and https://github.com/coreos/fedora-coreos-config/pull/1011

That said, I think there is likely a short-term fix possible here around changing the generator. I added some notes in https://github.com/coreos/fedora-coreos-tracker/pull/832

I think the core race condition here is that we're looking at /dev/disk/by-label/boot in our mount generator, which absolutely can change. We need to use the information we have that the real root is multipath to require that /boot be multipath too.

One thing I've confirmed is that the udev rules we ship that create /dev/disk/by-label/dm-mpath-boot only run in the initramfs. I *think* this is because `DM_ACTIVATION=0` outside of the initramfs? But this needs udev debugging.

A related topic: since we always mount `/boot` in the initramfs for Ignition anyway, and we need to handle FIPS, why not just pass through the `/boot` mount from the initramfs too? That would close down a whole bunch of races.
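For the udev debugging, a tiny helper (hypothetical, not from this bug) can make the `DM_ACTIVATION` check explicit; it parses `udevadm info --query=property`-style KEY=VALUE output on stdin.

```shell
#!/bin/bash
# Hypothetical debug helper for the DM_ACTIVATION theory above: reads
# udev KEY=VALUE properties on stdin and reports whether device-mapper
# udev rules would treat the device as activated (DM_ACTIVATION=1).
dm_activation_state() {
    if grep -q '^DM_ACTIVATION=1$'; then
        echo "activated"
    else
        echo "suppressed"
    fi
}

# Example use on a real system:
#   udevadm info --query=property /dev/dm-3 | dm_activation_state
```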
The patch for this was backported as part of https://github.com/coreos/fedora-coreos-config/pull/1036 and is slated to be part of the next 4.8 bootimage bump.
Shouldn't have moved this to MODIFIED. Moving back to POST. Bot will pick it up once it's referenced in the bootimage bump PR.
The associated boot image bump BZ has code merged and marked VERIFIED; moving this to MODIFIED
Sanity verification on RHCOS 48.84.202106091622-0, which is the boot image for registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-07-02-143616

```
[root@ibm-p8-kvm-03-guest-02 ~]# findmnt /
TARGET SOURCE                                                                                                              FSTYPE OPTIONS
/      /dev/mapper/mpatha4[/ostree/deploy/rhcos/deploy/457db8ff03dda5b3ce1a8e242fd91ddbe6a82f838d1b0047c3d4aeaf6c53f572.0] xfs    rw,rela
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/disk/by-label/dm-mpath-root
/dev/dm-4
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/mapper/mpatha4
/dev/dm-4
[root@ibm-p8-kvm-03-guest-02 ~]# findmnt /boot
TARGET SOURCE              FSTYPE OPTIONS
/boot  /dev/mapper/mpatha3 ext4   ro,nosuid,nodev,relatime,seclabel
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/disk/by-label/dm-mpath-boot
/dev/dm-3
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/mapper/mpatha3
/dev/dm-3
[root@ibm-p8-kvm-03-guest-02 ~]# rpm-ostree status
State: idle
Deployments:
* ostree://457db8ff03dda5b3ce1a8e242fd91ddbe6a82f838d1b0047c3d4aeaf6c53f572
              Version: 48.84.202106091622-0 (2021-06-09T16:25:42Z)
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438