Bug 1944660 - dm-multipath race condition on bare metal causing /boot partition mount failures
Summary: dm-multipath race condition on bare metal causing /boot partition mount failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.7
Hardware: Unspecified
OS: Other
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Colin Walters
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1962850
Blocks:
 
Reported: 2021-03-30 12:33 UTC by Ashwini M. Khaire
Modified: 2021-07-27 22:56 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:56:34 UTC
Target Upstream Version:
Embargoed:


Links
System ID | Status | Summary | Last Updated
Github coreos fedora-coreos-config pull 1022 | open | coreos-boot-mount-generator: Always use mpath for /boot if for root | 2021-05-18 20:13:42 UTC
Github openshift os pull 559 | closed | Bump fedora-coreos-config to latest rhcos-4.8 commit | 2021-06-03 20:26:24 UTC
Red Hat Product Errata RHSA-2021:2438 | None | None | 2021-07-27 22:56:56 UTC

Internal Links: 1944760 1946649

Description Ashwini M. Khaire 2021-03-30 12:33:55 UTC
Description of problem:

When applying changes through Machine Configs, the nodes were stuck in the `SchedulingDisabled` status because the `/boot` partition was not mounted.

~~~
Mar 10 19:52:32 node1.example.com rpm-ostree[28008]: error: Couldn't start daemon: Error setting up sysroot: loading sysroot: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
~~~

The workaround is to manually mount the `/dev/disk/by-label/boot` device on `/boot` and reboot the nodes.
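
A minimal sketch of the workaround, assuming the boot filesystem on the affected node still carries the `boot` label:

~~~
# Illustrative only: remount the boot filesystem by label so rpm-ostreed can
# find /boot/loader again, then reboot the node.
mount /dev/disk/by-label/boot /boot
systemctl reboot
~~~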


Version-Release number of selected component (if applicable):

OpenShift Container Platform version 4.7
RHCOS version 47.83.202102090044-0

Comment 2 Colin Walters 2021-03-30 13:29:17 UTC
This may also be fixed by https://github.com/coreos/rpm-ostree/pull/2386

Comment 3 Colin Walters 2021-03-31 13:19:26 UTC
OK looking more closely this also came up in the original https://bugzilla.redhat.com/show_bug.cgi?id=1886229

How reproducible is this?  Does it happen every time?

Can you get a bit more information like the output of:

systemctl --failed
journalctl -b -u boot.mount -u rpm-ostreed -u multipathd


OK, actually testing this out locally, there seems to be some sort of race condition with multipathd; the above PR might help paper over it a bit, but we need to get to the bottom of this:

```
[root@cosa-devsh ~]# journalctl -b -u boot.mount -u rpm-ostreed -u multipathd
-- Logs begin at Wed 2021-03-31 13:09:50 UTC, end at Wed 2021-03-31 13:11:33 UTC. --
Mar 31 13:11:00 localhost systemd[1]: Starting Device-Mapper Multipath Device Controller...
Mar 31 13:11:00 localhost systemd[1]: Started Device-Mapper Multipath Device Controller.
Mar 31 13:11:00 localhost multipathd[783]: --------start up--------
Mar 31 13:11:00 localhost multipathd[783]: read /etc/multipath.conf
Mar 31 13:11:00 localhost multipathd[783]: path checkers start up
Mar 31 13:11:00 localhost multipathd[783]: mpatha: load table [0 33554432 multipath 0 0 2 1 service-time 0 1 1 8:0 1 service-time 0 1 1 8:16 1]
Mar 31 13:11:01 localhost systemd[1]: Stopping Device-Mapper Multipath Device Controller...
Mar 31 13:11:01 localhost multipathd[1000]: ok
Mar 31 13:11:01 localhost multipathd[783]: --------shut down-------
Mar 31 13:11:01 localhost systemd[1]: multipathd.service: Succeeded.
Mar 31 13:11:01 localhost systemd[1]: Stopped Device-Mapper Multipath Device Controller.
Mar 31 13:11:03 localhost systemd[1]: Starting Device-Mapper Multipath Device Controller...
Mar 31 13:11:03 localhost multipathd[1198]: --------start up--------
Mar 31 13:11:03 localhost multipathd[1198]: read /etc/multipath.conf
Mar 31 13:11:03 localhost multipathd[1198]: path checkers start up
Mar 31 13:11:03 localhost multipathd[1198]: Allocated existing binding [mpatha] for WWID [0xb18fdcfcecbeab1a]
Mar 31 13:11:03 localhost systemd[1]: Started Device-Mapper Multipath Device Controller.
Mar 31 13:11:03 localhost systemd[1]: Mounting CoreOS Dynamic Mount for /boot...
Mar 31 13:11:03 localhost systemd[1]: boot.mount: Mount process exited, code=killed status=15
Mar 31 13:11:03 localhost systemd[1]: boot.mount: Failed with result 'signal'.
Mar 31 13:11:03 localhost systemd[1]: Unmounted CoreOS Dynamic Mount for /boot.
Mar 31 13:11:03 localhost systemd[1]: Mounting CoreOS Dynamic Mount for /boot...
Mar 31 13:11:03 localhost systemd[1]: Mounted CoreOS Dynamic Mount for /boot.
```

Something is causing multipathd to restart immediately after starting - this in turn kills boot.mount transiently and is presumably what's causing the mount to not appear.

Comment 4 Colin Walters 2021-03-31 13:49:29 UTC
OK, of course: we start multipathd in both the initramfs and the real root, so that's what's going on; this journal query makes the flow a bit more obvious:

journalctl -b UNIT=local-fs-pre.target UNIT=local-fs.target UNIT=boot.mount + UNIT=multipathd.service + _SYSTEMD_UNIT=multipathd.service + "MESSAGE=Switching root."

So far I've only seen this "boot.mount" service being killed once; AFAICS multipathd is started in local-fs-pre.target and our generated boot.mount has After=local-fs-pre.target; the journal logs confirm this.

Trying to reproduce and narrow down the race.  I think it may be related to the code in https://github.com/coreos/fedora-coreos-config/blob/testing-devel/overlay.d/05core/usr/lib/systemd/system-generators/coreos-boot-mount-generator#L31
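
For context, a rough sketch of the kind of mount unit such a generator emits, assuming it resolves `/boot` purely by filesystem label; the paths and ordering here are illustrative, not copied from the linked script:

```
# Illustrative generator-style shell, not the actual coreos-boot-mount-generator.
# systemd passes the output directory for generated units as the first argument.
unit_dir="$1"
cat > "${unit_dir}/boot.mount" <<'EOF'
[Unit]
Description=CoreOS Dynamic Mount for /boot
After=local-fs-pre.target

[Mount]
What=/dev/disk/by-label/boot
Where=/boot
EOF
```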

Comment 5 Jonathan Lebon 2021-03-31 14:34:17 UTC
A likely candidate is that there's no elegant way to wait for multipathd.service to take ownership of the underlying block devices (and that udev has regenerated the by-* links). Ideally, multipathd.service would block on that before notifying ready.

Hmm, maybe what we should do is have the generator key off of whether the rootfs is backed by multipath, and if so, use `What=/dev/disk/by-label/dm-mpath-boot` instead. This isn't bulletproof though, because in theory you could have reprovisioned just the rootfs on a multipath target, but left `/boot` on the original non-multipathed disk. I'm not sure how likely that scenario is though (why not just put everything on multipath?).
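
A hypothetical sketch of that detection, assuming the generator can read the device-mapper UUID of the root source from sysfs (the names and exact checks are illustrative, not taken from the eventual patch):

```
# Illustrative only: pick the by-label link for /boot based on whether the
# root filesystem is backed by dm-multipath (dm UUIDs for multipath maps and
# their partitions contain "mpath-").
rootsrc=$(findmnt -nvr -o SOURCE /)            # e.g. /dev/mapper/mpatha4
dmname=$(basename "$(realpath "$rootsrc")")    # e.g. dm-4
label=boot
if [ -r "/sys/class/block/${dmname}/dm/uuid" ] &&
   grep -q 'mpath-' "/sys/class/block/${dmname}/dm/uuid"; then
    label=dm-mpath-boot
fi
echo "boot mount should use /dev/disk/by-label/${label}"
```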

Comment 6 Colin Walters 2021-05-18 13:06:23 UTC
OK there's a much larger rework of multipath going on in https://bugzilla.redhat.com/show_bug.cgi?id=1954025 and https://github.com/coreos/fedora-coreos-config/pull/1011

But that said, I think there is likely a short-term fix possible here around changing the generator.  I added some notes in https://github.com/coreos/fedora-coreos-tracker/pull/832

I think the core race condition here is that we're looking at /dev/disk/by-label/boot in our mount generator, which absolutely can change. We need to use the information we have (that the real root is multipath) to require that /boot be multipath too.

One thing I've confirmed is that the udev rules we ship that create /dev/disk/by-label/dm-mpath-boot only run in the initramfs.  I *think* this is because `DM_ACTIVATION=0` outside of the initramfs?  But needs udev debugging.
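
For whoever picks up the udev side, an illustrative way to poke at it (the `/dev/dm-3` path is just an example; substitute the multipathed /boot partition on the affected host):

```
# Illustrative debugging only: check whether DM_ACTIVATION is set and whether
# the dm-mpath-boot symlink survives a udev change event outside the initramfs.
udevadm info /dev/dm-3 | grep -E 'DM_ACTIVATION|DEVLINKS|ID_FS_LABEL'
udevadm trigger --action=change /dev/dm-3 && udevadm settle
ls -l /dev/disk/by-label/
```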

A related topic: since we always mount `/boot` in the initramfs for Ignition, and we need to handle FIPS, why not just pass the `/boot` mount through from the initramfs too? That would close down a whole bunch of races.
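
A minimal sketch of that idea, assuming the initramfs mounts under `/sysroot/boot` so the mount is carried across switch-root instead of being re-established (and raced) in the real root; illustrative only, not what RHCOS does today:

```
# Illustrative only: a mount made under /sysroot in the initramfs is moved
# into the real root by systemd's switch-root, so it shows up as /boot.
mount /dev/disk/by-label/dm-mpath-boot /sysroot/boot
```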

Comment 7 Jonathan Lebon 2021-06-02 19:34:16 UTC
The patch for this was backported as part of https://github.com/coreos/fedora-coreos-config/pull/1036 and is slated to be part of the next 4.8 bootimage bump.

Comment 9 Jonathan Lebon 2021-06-07 14:19:09 UTC
Shouldn't have moved this to MODIFIED. Moving back to POST. Bot will pick it up once it's referenced in the bootimage bump PR.

Comment 10 Micah Abbott 2021-06-28 16:59:04 UTC
The associated boot image bump BZ has code merged and is marked VERIFIED; moving this to MODIFIED.

Comment 13 Michael Nguyen 2021-07-03 03:55:41 UTC
Sanity verification on RHCOS 48.84.202106091622-0, which is the boot image for registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-07-02-143616.


[root@ibm-p8-kvm-03-guest-02 ~]# findmnt /
TARGET SOURCE                                                     FSTYPE OPTIONS
/      /dev/mapper/mpatha4[/ostree/deploy/rhcos/deploy/457db8ff03dda5b3ce1a8e242fd91ddbe6a82f838d1b0047c3d4aeaf6c53f572.0]
                                                                  xfs    rw,rela
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/disk/by-label/dm-mpath-root
/dev/dm-4
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/mapper/mpatha4
/dev/dm-4
[root@ibm-p8-kvm-03-guest-02 ~]# findmnt /boot
TARGET SOURCE              FSTYPE OPTIONS
/boot  /dev/mapper/mpatha3 ext4   ro,nosuid,nodev,relatime,seclabel
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/disk/by-label/dm-mpath-boot
/dev/dm-3
[root@ibm-p8-kvm-03-guest-02 ~]# readlink -f /dev/mapper/mpatha3             
/dev/dm-3
[root@ibm-p8-kvm-03-guest-02 ~]# rpm-ostree status
State: idle
Deployments:
* ostree://457db8ff03dda5b3ce1a8e242fd91ddbe6a82f838d1b0047c3d4aeaf6c53f572
                   Version: 48.84.202106091622-0 (2021-06-09T16:25:42Z)

Comment 15 errata-xmlrpc 2021-07-27 22:56:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

