Bug 1944660 - dm-multipath race condition on bare metal causing /boot partition mount failures
Summary: dm-multipath race condition on bare metal causing /boot partition mount failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: Unspecified
OS: Other
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1962850
Blocks:
 
Reported: 2021-03-30 12:33 UTC by Ashwini M. Khaire
Modified: 2021-07-27 22:56 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:56:34 UTC
Target Upstream Version:
Embargoed:


Links
System ID | Status | Summary | Last Updated
Github coreos fedora-coreos-config pull 1022 | open | coreos-boot-mount-generator: Always use mpath for /boot if for root | 2021-05-18 20:13:42 UTC
Github openshift os pull 559 | closed | Bump fedora-coreos-config to latest rhcos-4.8 commit | 2021-06-03 20:26:24 UTC
Red Hat Product Errata RHSA-2021:2438 | None | None | 2021-07-27 22:56:56 UTC

Internal Links: 1944760 1946649

Description Ashwini M. Khaire 2021-03-30 12:33:55 UTC
Description of problem:

When applying changes through Machine Configs, the nodes were stuck in the `SchedulingDisabled` status because the `/boot` partition was not mounted.

~~~
Mar 10 19:52:32 node1.example.com rpm-ostree[28008]: error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
~~~

The workaround is to manually mount the `/dev/disk/by-label/boot` device on `/boot` and reboot the nodes.
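
A minimal sketch of the workaround, assuming the boot filesystem on the affected node still carries the `boot` label:

~~~
# Illustrative only: remount the boot filesystem by label so rpm-ostreed can
# find /boot/loader again, then reboot the node.
mount /dev/disk/by-label/boot /boot
systemctl reboot
~~~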


Version-Release number of selected component (if applicable):

OpenShift Container Platform version 4.7
RHCOS version 47.83.202102090044-0

Comment 2 Colin Walters 2021-03-30 13:29:17 UTC
This may also be fixed by https://github.com/coreos/rpm-ostree/pull/2386

Comment 3 Colin Walters 2021-03-31 13:19:26 UTC
OK looking more closely this also came up in the original https://bugzilla.redhat.com/show_bug.cgi?id=1886229

How reproducible is this?  Does it happen every time?

Can you get a bit more information like the output of:

systemctl --failed
journalctl -b -u boot.mount -u rpm-ostreed -u multipathd


OK, actually testing this out locally, there seems to be some sort of race condition with multipathd; the above PR might help paper over it a bit, but we need to get to the bottom of this:

```
[root@cosa-devsh ~]# journalctl -b -u boot.mount -u rpm-ostreed -u multipathd
-- Logs begin at Wed 2021-03-31 13:09:50 UTC, end at Wed 2021-03-31 13:11:33 UTC. --
Mar 31 13:11:00 localhost systemd[1]: Starting Device-Mapper Multipath Device Controller...
Mar 31 13:11:00 localhost systemd[1]: Started Device-Mapper Multipath Device Controller.
Mar 31 13:11:00 localhost multipathd[783]: --------start up--------
Mar 31 13:11:00 localhost multipathd[783]: read /etc/multipath.conf
Mar 31 13:11:00 localhost multipathd[783]: path checkers start up
Mar 31 13:11:00 localhost multipathd[783]: mpatha: load table [0 33554432 multipath 0 0 2 1 service-time 0 1 1 8:0 1 service-time 0 1 1 8:16 1]
Mar 31 13:11:01 localhost systemd[1]: Stopping Device-Mapper Multipath Device Controller...
Mar 31 13:11:01 localhost multipathd[1000]: ok
Mar 31 13:11:01 localhost multipathd[783]: --------shut down-------
Mar 31 13:11:01 localhost systemd[1]: multipathd.service: Succeeded.
Mar 31 13:11:01 localhost systemd[1]: Stopped Device-Mapper Multipath Device Controller.
Mar 31 13:11:03 localhost systemd[1]: Starting Device-Mapper Multipath Device Controller...
Mar 31 13:11:03 localhost multipathd[1198]: --------start up--------
Mar 31 13:11:03 localhost multipathd[1198]: read /etc/multipath.conf
Mar 31 13:11:03 localhost multipathd[1198]: path checkers start up
Mar 31 13:11:03 localhost multipathd[1198]: Allocated existing binding [mpatha] for WWID [0xb18fdcfcecbeab1a]
Mar 31 13:11:03 localhost systemd[1]: Started Device-Mapper Multipath Device Controller.
Mar 31 13:11:03 localhost systemd[1]: Mounting CoreOS Dynamic Mount for /boot...
Mar 31 13:11:03 localhost systemd[1]: boot.mount: Mount process exited, code=killed status=15
Mar 31 13:11:03 localhost systemd[1]: boot.mount: Failed with result 'signal'.
Mar 31 13:11:03 localhost systemd[1]: Unmounted CoreOS Dynamic Mount for /boot.
Mar 31 13:11:03 localhost systemd[1]: Mounting CoreOS Dynamic Mount for /boot...
Mar 31 13:11:03 localhost systemd[1]: Mounted CoreOS Dynamic Mount for /boot.
```

Something is causing multipathd to restart immediately after starting - this in turn kills boot.mount transiently and is presumably what's causing the mount to not appear.

Comment 4 Colin Walters 2021-03-31 13:49:29 UTC
OK, of course: we start multipathd in both the initramfs and the real root, so that's what's going on; this journal query makes the flow a bit more obvious:

journalctl -b UNIT=local-fs-pre.target UNIT=local-fs.target UNIT=boot.mount + UNIT=multipathd.service + _SYSTEMD_UNIT=multipathd.service + "MESSAGE=Switching root."

So far I've only seen this "boot.mount" service being killed once; AFAICS multipathd is started in local-fs-pre.target and our generated boot.mount has After=local-fs-pre.target; the journal logs confirm this.

Trying to reproduce and narrow down the race.  I think it may be related to the code in https://github.com/coreos/fedora-coreos-config/blob/testing-devel/overlay.d/05core/usr/lib/systemd/system-generators/coreos-boot-mount-generator#L31
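
For context, a rough sketch of the kind of mount unit such a generator emits, assuming it resolves `/boot` purely by filesystem label; the paths and ordering here are illustrative, not copied from the linked script:

```
# Illustrative generator-style shell, not the actual coreos-boot-mount-generator.
# systemd passes the output directory for generated units as the first argument.
unit_dir="$1"
cat > "${unit_dir}/boot.mount" <<'EOF'
[Unit]
Description=CoreOS Dynamic Mount for /boot
After=local-fs-pre.target

[Mount]
What=/dev/disk/by-label/boot
Where=/boot
EOF
```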

Comment 5 Jonathan Lebon 2021-03-31 14:34:17 UTC
A likely candidate is that there's no elegant way to wait for multipathd.service to take ownership of the underlying block devices (and that udev has regenerated the by-* links). Ideally, multipathd.service would block on that before notifying ready.

Hmm, maybe what we should do is have the generator key off of whether the rootfs is backed by multipath, and if so, use `What=/dev/disk/by-label/dm-mpath-boot` instead. This isn't bulletproof though, because in theory you could have reprovisioned just the rootfs on a multipath target, but left `/boot` on the original non-multipathed disk. I'm not sure how likely that scenario is though (why not just put everything on multipath?).
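
A hypothetical sketch of that detection, assuming the generator can read the device-mapper UUID of the root source from sysfs (the names and exact checks are illustrative, not taken from the eventual patch):

```
# Illustrative only: pick the by-label link for /boot based on whether the
# root filesystem is backed by dm-multipath (dm UUIDs for multipath maps and
# their partitions contain "mpath-").
rootsrc=$(findmnt -nvr -o SOURCE /)            # e.g. /dev/mapper/mpatha4
dmname=$(basename "$(realpath "$rootsrc")")    # e.g. dm-4
label=boot
if [ -r "/sys/class/block/${dmname}/dm/uuid" ] &&
   grep -q 'mpath-' "/sys/class/block/${dmname}/dm/uuid"; then
    label=dm-mpath-boot
fi
echo "boot mount should use /dev/disk/by-label/${label}"
```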

Comment 6 Colin Walters 2021-05-18 13:06:23 UTC
OK there's a much larger rework of multipath going on in https://bugzilla.redhat.com/show_bug.cgi?id=1954025 and https://github.com/coreos/fedora-coreos-config/pull/1011

But that said, I think there is likely a short-term fix possible here around changing the generator.  I added some notes in https://github.com/coreos/fedora-coreos-tracker/pull/832

I think the core race condition here is that we're looking at /dev/disk/by-label/boot in our mount generator, which absolutely can change. We need to use the information we have (that the real root is multipath) to require that /boot be multipath too.

One thing I've confirmed is that the udev rules we ship that create /dev/disk/by-label/dm-mpath-boot only run in the initramfs.  I *think* this is because `DM_ACTIVATION=0` outside of the initramfs?  But needs udev debugging.
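
For whoever picks up the udev side, an illustrative way to poke at it (the `/dev/dm-3` path is just an example; substitute the multipathed /boot partition on the affected host):

```
# Illustrative debugging only: check whether DM_ACTIVATION is set and whether
# the dm-mpath-boot symlink survives a udev change event outside the initramfs.
udevadm info /dev/dm-3 | grep -E 'DM_ACTIVATION|DEVLINKS|ID_FS_LABEL'
udevadm trigger --action=change /dev/dm-3 && udevadm settle
ls -l /dev/disk/by-label/
```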

A related topic: since we always mount `/boot` in the initramfs for Ignition, and we need to handle FIPS, why not just pass the `/boot` mount through from the initramfs too? That would close down a whole bunch of races.
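
A minimal sketch of that idea, assuming the initramfs mounts under `/sysroot/boot` so the mount is carried across switch-root instead of being re-established (and raced) in the real root; illustrative only, not what RHCOS does today:

```
# Illustrative only: a mount made under /sysroot in the initramfs is moved
# into the real root by systemd's switch-root, so it shows up as /boot.
mount /dev/disk/by-label/dm-mpath-boot /sysroot/boot
```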

Comment 7 Jonathan Lebon 2021-06-02 19:34:16 UTC
The patch for this was backported as part of https://github.com/coreos/fedora-coreos-config/pull/1036 and is slated to be part of the next 4.8 bootimage bump.

Comment 9 Jonathan Lebon 2021-06-07 14:19:09 UTC
Shouldn't have moved this to MODIFIED. Moving back to POST. Bot will pick it up once it's referenced in the bootimage bump PR.

Comment 10 Micah Abbott 2021-06-28 16:59:04 UTC
The associated boot image bump BZ has code merged and is marked VERIFIED; moving this to MODIFIED.

Comment 13 Michael Nguyen 2021-07-03 03:55:41 UTC
Sanity verification on RHCOS 48.84.202106091622-0, which is the boot image for registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-07-02-143616.


[root@ibm-p8-kvm-03-guest-02 ~]# findmnt /
TARGET SOURCE                                                     FSTYPE OPTIONS
/      /dev/mapper/mpatha4[/ostree/deploy/rhcos/deploy/457db8ff03dda5b3ce1a8e242fd91ddbe6a82f838d1b0047c3d4aeaf6c53f572.0]
                                                                  xfs    rw,rela
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/disk/by-label/dm-mpath-root
/dev/dm-4
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/mapper/mpatha4
/dev/dm-4
[root@ibm-p8-kvm-03-guest-02 ~]# findmnt /boot
TARGET SOURCE              FSTYPE OPTIONS
/boot  /dev/mapper/mpatha3 ext4   ro,nosuid,nodev,relatime,seclabel
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/disk/by-label/dm-mpath-boot
/dev/dm-3
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/mapper/mpatha3             
/dev/dm-3
[root@ibm-p8-kvm-03-guest-02 ~]# rpm-ostree status
State: idle
Deployments:
* ostree://457db8ff03dda5b3ce1a8e242fd91ddbe6a82f838d1b0047c3d4aeaf6c53f572
                   Version: 48.84.202106091622-0 (2021-06-09T16:25:42Z)

Comment 15 errata-xmlrpc 2021-07-27 22:56:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

