Bug 2172269 - dracut 059 breaks installer image boot
Summary: dracut 059 breaks installer image boot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: dracut
Version: 38
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: dracut-maint-list
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker openqa
Depends On:
Blocks: F39BetaBlocker
Reported: 2023-02-21 19:18 UTC by Adam Williamson
Modified: 2023-03-27 11:04 UTC (History)

Fixed In Version: dracut-059-2.fc38
Clone Of:
Environment:
Last Closed: 2023-03-26 00:20:14 UTC
Type: Bug
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | dracutdevs dracut issues 2232 | 0 | None | open | New 90overlayfs module does not set up the overlay in three cases where it happened before (breaks Fedora/RHEL installer... | 2023-02-23 22:45:16 UTC

Description Adam Williamson 2023-02-21 19:18:12 UTC
With the new dracut 059 (well, the diff also includes anaconda 39.2, but dracut seems the most likely suspect), installer images no longer boot. They loop on an error "mount: /sysroot: special device LiveOS_rootfs does not exist." This is obviously an automatic blocker as it prevents the installer image booting at all.

Comment 1 Adam Williamson 2023-02-21 19:23:00 UTC
Note, previous build was 057, for some reason we're a couple of months behind getting 058 or 059.

Comment 2 Adam Williamson 2023-02-21 19:31:31 UTC
Checking the journal messages shows:

overlayfs: failed to resolve '/run/overlayfs': -2

Comment 3 Adam Williamson 2023-02-21 20:23:59 UTC
OK, so I think I see what's going on here, more or less, but fixing it seems a bit tricky, at least right now (maybe I'll figure it out shortly).

I believe the trouble starts with this commit: https://github.com/dracutdevs/dracut/commit/8caaad4fc2d75982eb87f5ebc72a4c276986f756

it moves the setup of the overlayfs on the `dmsquash-live-root.sh` path out from being inline in that file to a separate script. Before that change, we had this block:

if [ -n "$overlayfs" ]; then
...
    mkdir -m 0755 -p /run/overlayfs
    mkdir -m 0755 -p /run/ovlwork
    if [ -n "$reset_overlay" ] && [ -h /run/overlayfs ]; then
        ovlfs=$(readlink /run/overlayfs)
        info "Resetting the OverlayFS overlay directory."
        rm -r -- "${ovlfs:?}"/* "${ovlfs:?}"/.* > /dev/null 2>&1
    fi
    if [ -n "$readonly_overlay" ] && [ -h /run/overlayfs-r ]; then
        ovlfs=lowerdir=/run/overlayfs-r:/run/rootfsbase
    else
        ovlfs=lowerdir=/run/rootfsbase
    fi
...

There were/are various ways `$overlayfs` can get set in that script. It can get set to "yes" if a certain arg is on the cmdline:

getargbool 0 rd.live.overlay.overlayfs && overlayfs="yes"

but it can get set to "required" in three other cases. Two are in this block (edited and condensed):

===

if [ -z "$setup" -a -n "$devspec" -a -n "$pathspec" -a -n "$overlay" ]; then
...
    if [ -f /run/initramfs/overlayfs$pathspec -a -w /run/initramfs/overlayfs$pathspec ]; then
...
        if [ -z "$oltype" ] || [ "$oltype" = DM_snapshot_cow ]; then
...
        else
...
            if [ -d /run/initramfs/overlayfs/overlayfs ] && [ -d /run/initramfs/overlayfs/ovlwork ]; then
...
                overlayfs="required"
...
    elif [ -d /run/initramfs/overlayfs$pathspec ] && [ -d /run/initramfs/overlayfs$pathspec/../ovlwork ]; then
...
        overlayfs="required"

===

One is in this block (again edited and condensed):

===

# we might have an embedded fs image on squashfs (compressed live)
if [ -e /run/initramfs/live/${live_dir}/${squash_image} ]; then
    SQUASHED="/run/initramfs/live/${live_dir}/${squash_image}"
fi
if [ -e "$SQUASHED" ]; then
...
    if [ -d /run/initramfs/squashfs/LiveOS ]; then
...
    elif [ -d /run/initramfs/squashfs/proc ]; then
...
        overlayfs="required"

===

so if the cmdline arg is set, *or* if we go down any of those three paths, we wound up with $overlayfs as a non-zero length string and did the stuff to set up /run/overlayfs.

*After* that change, we now have the file `mount-overlayfs.sh` set up as a hook (I think), which does more or less the same stuff - but *only* if the cmdline arg is set. That's the only conditional that was 'ported' to that file:

===

getargbool 0 rd.live.overlay.overlayfs && overlayfs="yes"
...
if [ -n "$overlayfs" ]; then
    (do the stuff)
fi

===

i.e. if $overlayfs isn't set it doesn't do the stuff, and that `getargbool` is the *only* way $overlayfs can be set in that file.
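
The gap described above can be reproduced in isolation. What follows is a minimal sketch, not dracut code: `old_gate` and `new_gate` are hypothetical names, and their two parameters merely stand in for the result of the `getargbool` call and for having gone down one of the three "required" paths.

```shell
#!/bin/sh
# Hypothetical stand-ins, not dracut functions.
# $1: "1" if rd.live.overlay.overlayfs is on the cmdline, else "0"
# $2: "1" if one of the three "required" paths was taken, else "0"

# Old behaviour: either source sets $overlayfs, so either triggers setup.
old_gate() {
    overlayfs=""
    [ "$1" = 1 ] && overlayfs="yes"
    [ "$2" = 1 ] && overlayfs="required"
    if [ -n "$overlayfs" ]; then echo setup; else echo skip; fi
}

# New behaviour: only the cmdline arg is consulted.
new_gate() {
    overlayfs=""
    [ "$1" = 1 ] && overlayfs="yes"
    if [ -n "$overlayfs" ]; then echo setup; else echo skip; fi
}

# Installer image case: no cmdline arg, but the squashfs path was taken.
echo "old: $(old_gate 0 1)"
echo "new: $(new_gate 0 1)"
```

With no cmdline arg and a "required" path taken, the old gate reports `setup` and the new one reports `skip`, which matches the behaviour seen on the installer images.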

Things were complicated a bit further by a later commit: https://github.com/dracutdevs/dracut/commit/40dd5c90e0efcb9ebaa9abb42a38c7316e9706bd

that basically tweaks the approach by making the script part of a new module called 90overlayfs. But it doesn't ultimately change the logic much: the script only *does* stuff if the cmdline arg is set.

I booted an affected image with `rd.debug` so we get `sh -x` output from all the scripts, and from that we can see that our images are indeed hitting one of the paths where overlayfs gets set to "required", specifically, the third (squashfs-y) one, because we see:

+ reloadsysrootmountunit=':>/xor_overlayfs;'
+ overlayfs=required

that "reloadsysrootmountunit" line is a dead giveaway we're in that third block.

So the problem is that the new 90overlayfs module only sets up the overlayfs when the cmdline arg is present; it was never set up to do it on the paths where $overlayfs is set to "required" in dmsquash-live-root.sh.

The obvious thing to do is just port all those cases over. The first two are, uh, rather complicated, but the third at least *seems* easy, so I was going to do that. But there's a problem even with that: dmsquash-live-root.sh unmounts the squashfs at the end!

[ -e "$SQUASHED" ] && umount -l /run/initramfs/squashfs

so we can't even port *that* check into the 90overlayfs module because `/run/initramfs/squashfs/proc` won't be there any more.

We probably need to have dmsquash-live-root.sh just 'signal' to the module somehow when it actually needs to set up the overlayfs, because recreating all these checks at the point where the module runs seems impractical. But I'm not sure off the top of my head what's the canonical way to do that in dracut, or if there might be a better choice. I'll look into it in a bit.
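
The ordering problem can be illustrated with a small simulation. This is only a sketch under stated assumptions: a temporary directory stands in for the real `/run/initramfs/squashfs` mount, `rm -r` stands in for the `umount -l`, and `squashfs_has_rootfs` is a hypothetical helper name. It shows why the decision has to be recorded while the squashfs is still mounted rather than re-derived later.

```shell
#!/bin/sh
# Simulation in a temp dir; paths mirror the real ones but nothing is mounted.
root=$(mktemp -d)
mkdir -p "$root/run/initramfs/squashfs/proc"

# Hypothetical helper: the directory check the live-root script performs.
squashfs_has_rootfs() {
    [ -d "$root/run/initramfs/squashfs/proc" ]
}

# While the squashfs is still "mounted", the check passes, so record it.
need_overlay=""
squashfs_has_rootfs && need_overlay="required"

# Stand-in for: umount -l /run/initramfs/squashfs
rm -r "$root/run/initramfs/squashfs"

# A later script re-running the same check now gets the wrong answer...
if squashfs_has_rootfs; then late_check="required"; else late_check=""; fi

# ...while the decision recorded earlier is still correct.
echo "late check: '${late_check}'"
echo "recorded:   '${need_overlay}'"
rm -rf "$root"
```

A shell variable is used here only for brevity. Whether the two real scripts share a process is not assumed, so in practice the recorded value would have to be passed along by some other means.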

Comment 4 Adam Williamson 2023-02-22 08:08:42 UTC
For now pvalena has reverted the entire PR downstream, but not sure if that will be the long-term fix.

Comment 5 Lukáš Nykrýn 2023-02-23 13:59:55 UTC
I don't have a reproducer at hand, but what about extending the kernel cmdline in those other scripts?
I mean replace 

overlayfs="required"
with
echo "rd.live.overlay.overlayfs=1" > /etc/cmdline.d/dracut-need-overlay.conf

Adam, can you try that?
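
A rough sketch of the idea, runnable outside an initramfs: a temporary directory stands in for `/etc/cmdline.d`, and `arg_is_set` is a hypothetical lookup written for this example, not dracut's `getargbool` (which may accept other spellings and read other sources). It only shows that a drop-in written by one script becomes visible to a later lookup.

```shell
#!/bin/sh
# Sketch only: a temp dir stands in for /etc/cmdline.d, and arg_is_set is a
# hypothetical lookup, not dracut's getargbool.
cmdline_d=$(mktemp -d)

# Returns 0 if any drop-in file contains "<name>=1" as a whole word.
arg_is_set() {
    for f in "$cmdline_d"/*.conf; do
        [ -f "$f" ] || continue
        for word in $(cat "$f"); do
            [ "$word" = "$1=1" ] && return 0
        done
    done
    return 1
}

# Before the drop-in exists, the lookup fails.
arg_is_set rd.live.overlay.overlayfs && before=yes || before=no

# What the live-root script would do instead of overlayfs="required".
echo "rd.live.overlay.overlayfs=1" > "$cmdline_d/dracut-need-overlay.conf"

# After the drop-in is written, the same lookup succeeds.
arg_is_set rd.live.overlay.overlayfs && after=yes || after=no
echo "before=$before after=$after"
rm -rf "$cmdline_d"
```

This says nothing about whether the real lookup caches its results between calls; that would need checking against the dracut sources.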

Comment 6 Adam Williamson 2023-02-23 16:59:26 UTC
Wow, uh, yikes. I mean, that could work (if there's no caching involved in how `getargbool` works, at least?) but it seems very hacky. Surely there's a better way? I was assuming there must be existing cases where different parts of dracut need to 'signal' to each other like this and there would be an existing canonical way to do it, I just don't happen to know what that is so I couldn't write a PR.

I guess I'll file an upstream issue for this? That way we can get some input from the folks who wrote and reviewed the change...

Comment 7 Lukáš Nykrýn 2023-02-24 06:23:52 UTC
We are doing something like this in several places:

modules.d/35connman/cm-config.sh:    echo rd.neednet >> /etc/cmdline.d/connman.conf
modules.d/35network-legacy/parse-ip-opts.sh:    echo "rd.neednet=1" > /etc/cmdline.d/dracut-neednet.conf
modules.d/35network-legacy/parse-ip-opts.sh:            >> /etc/cmdline.d/80-enx.conf
modules.d/35network-manager/nm-config.sh:    echo rd.neednet >> /etc/cmdline.d/35-neednet.conf
modules.d/40network/net-lib.sh:    echo "ifname=$name$num:$mac" >> /etc/cmdline.d/45-ifname.conf
modules.d/40network/net-lib.sh:    ) >> /etc/cmdline.d/40-ibft.conf
modules.d/80cms/cmsifup.sh:} > /etc/cmdline.d/80-cms.conf
modules.d/95fcoe-uefi/parse-uefifcoe.sh:    print_fcoe_uefi_conf "$i" > /etc/cmdline.d/40-fcoe-uefi.conf && break
modules.d/95nvmf/parse-nvmf-boot-connections.sh:    echo "rd.neednet=1" > /etc/cmdline.d/nvmf-neednet.conf
modules.d/98dracut-systemd/dracut-cmdline-ask.sh:    [ -n "$line" ] && printf -- "%s\n" "$line" >> /etc/cmdline.d/99-cmdline-ask.conf
modules.d/99base/init.sh:        echo "$line" >> /etc/cmdline.d/99-cmdline-ask.conf

Comment 8 Laszlo 2023-02-24 14:30:01 UTC
Upstream fix https://github.com/dracutdevs/dracut/pull/2233 .

If someone could help to confirm the upstream fix, that would be appreciated.

In addition, it seems the installer already sets rd.live.overlay.overlayfs in https://github.com/livecd-tools/livecd-tools/blob/main/imgcreate/live.py#L127 under certain conditions. Why only under certain conditions?

Comment 9 Adam Williamson 2023-02-24 17:31:43 UTC
I don't know the reason for the conditional there off the top of my head, but that code is used in building live images, which aren't affected by the bug. Installer images (that is, the network installer, Server DVD installer, and Silverblue DVD installer - images that boot to a dedicated installer environment, not to some kind of live desktop) are the ones affected by this bug. Those are built by lorax - https://github.com/weldr/lorax/ - which doesn't use the imgcreate library.

(FWIW, our current official lives don't seem to have rd.live.overlay.overlayfs on the cmdline either, they have rd.live.image . I think this may be changing as part of https://fedoraproject.org/wiki/Changes/ModernizeLiveMedia , but we had to revert the persistent overlay part of that for now as the initial attempt broke stuff. From a quick look at the boot logs, it doesn't look like the current official lives use overlayfs at all).

Comment 10 Adam Williamson 2023-02-24 17:32:40 UTC
Oh, and I'll test the proposed fix today, thanks.

Comment 11 Adam Williamson 2023-02-24 21:58:51 UTC
Tested, it seems to work.

Comment 12 Pavel Valena 2023-02-27 09:32:13 UTC
(In reply to Adam Williamson from comment #9)
> I don't know the reason for the conditional there off the top of my head,
> but that code is used in building live images, which aren't affected by the
> bug. Installer images (that is, the network installer, Server DVD installer,
> and Silverblue DVD installer - images that boot to a dedicated installer
> environment, not to some kind of live desktop) are the ones affected by this
> bug. Those are built by lorax - https://github.com/weldr/lorax/ - which
> doesn't use the imgcreate library.
> 
> (FWIW, our current official lives don't seem to have
> rd.live.overlay.overlayfs on the cmdline either, they have rd.live.image . I
> think this may be changing as part of
> https://fedoraproject.org/wiki/Changes/ModernizeLiveMedia , but we had to
> revert the persistent overlay part of that for now as the initial attempt
> broke stuff. From a quick look at the boot logs, it doesn't look like the
> current official lives use overlayfs at all).

There's a new module (triggered by dmsquash-live): https://github.com/dracutdevs/dracut/tree/master/modules.d/90dmsquash-live-autooverlay

Also, I've reverted the change in https://src.fedoraproject.org/rpms/dracut/pull-request/30 (F39 only).

Comment 13 Laszlo 2023-02-28 12:34:40 UTC
@pvalena given the upstream discussion, would you be open to revert https://src.fedoraproject.org/rpms/dracut/c/05988c6a16621c75d2fe3ed0cfddfb6ce2d18f93?branch=rawhide and pull in https://github.com/dracutdevs/dracut/commit/0e780720efe6488c4e07af39926575ee12f40339 .

I hope to understand if there is any remaining issue, and I hope to see Fedora releasing the new overlayfs dracut module to match other distros. Thanks!

Comment 14 Adam Williamson 2023-02-28 22:52:42 UTC
We can possibly do that for F38 *after* Beta is released. Hard to justify doing it during Beta freeze when we know the reversion is working fine.

Comment 15 Laszlo 2023-03-01 02:02:30 UTC
> We can possibly do that for F38 *after* Beta is released.

That would be great, thanks !

Reversion reintroduced a bug where overlay does not work with NFS (which also breaks the test suite that ships with dracut). Other distros shipping this version of dracut do not have this bug. Not trying to pressure anybody to make anything happen, but this is a trade-off that is being made.

Comment 16 Pavel Valena 2023-03-09 13:12:21 UTC
I can push the fix to rawhide (and drop the revert), and later even to F38, if that works. But I'm still unsure about F37, as I wanted to push the updated 059 there also (possibly with the patch). Depends on which of those is more reliable :).

Comment 17 Pavel Valena 2023-03-09 13:14:16 UTC
FYI, I've not done the revert-fix-build for F38, and the first build was simply untagged: https://koji.fedoraproject.org/koji/buildinfo?buildID=2156534

Comment 18 Pavel Valena 2023-03-09 16:03:10 UTC
Can anyone test the functionality specifically? I do not know how to create my own boot-media. But might learn to....

https://src.fedoraproject.org/rpms/dracut/pull-request/32

 - Scratch-builds:
(copr) https://copr.fedorainfracloud.org/coprs/build/5618842
(rawhide) https://koji.fedoraproject.org/koji/taskinfo?taskID=98491273
(f38) https://koji.fedoraproject.org/koji/taskinfo?taskID=98494231
(f37) https://koji.fedoraproject.org/koji/taskinfo?taskID=98491269

Comment 20 Adam Williamson 2023-03-09 16:15:32 UTC
Anything you submit as an update will automatically be tested by openQA, and for anything besides Rawhide, if the test fails the update will be blocked from being pushed. That's how I found this bug in the first place. There is an openQA test that creates an installer ISO and checks if it works.

If you want a test before submitting an official update, just do a scratch build and ask me; I can manually trigger tests on scratch builds.

Comment 21 Pavel Valena 2023-03-21 18:26:39 UTC
Pull-request: https://src.fedoraproject.org/rpms/dracut/pull-request/34#

Comment 22 Fedora Update System 2023-03-23 11:05:14 UTC
FEDORA-2023-e8ca690ff3 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-e8ca690ff3

Comment 23 Geraldo Simião 2023-03-25 23:34:46 UTC
I downloaded the latest netinst-x86_64-38-20230325.n.0.iso image, booted it, and successfully installed a VM with it. I assume it has the latest dracut build?
Is this a valid test for this bug? Or must I test a Server DVD installer?

Comment 24 Geraldo Simião 2023-03-26 00:05:21 UTC
oops, this image uses dracut-057-6.fc38. 
Sorry for the noise, I'll wait for a new iso with dracut-059-2.fc38

Comment 25 Fedora Update System 2023-03-26 00:20:14 UTC
FEDORA-2023-e8ca690ff3 has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 26 Geraldo Simião 2023-03-27 11:04:38 UTC
Tested the new Fedora-Server-netinst-x86_64-38-20230326.n.1
It has dracut-059-2.fc38 and it boots and installs correctly.

