Bug 1989119 - Nfsroot system fails to boot when using systemd-networkd [NEEDINFO]
Summary: Nfsroot system fails to boot when using systemd-networkd
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: dracut
Version: 38
Hardware: Unspecified
OS: Unspecified
Assignee: dracut-maint-list
QA Contact: Fedora Extras Quality Assurance
Reported: 2021-08-02 13:30 UTC by Göran Uddeborg
Modified: 2024-05-21 14:12 UTC (History)

Last Closed: 2024-05-21 14:12:38 UTC
Type: Bug
goeran: needinfo? (tomek)


Attachments
rdsosreport (141.79 KB, application/x-xz)
2021-08-02 13:30 UTC, Göran Uddeborg

Description Göran Uddeborg 2021-08-02 13:30:44 UTC
Created attachment 1810130
rdsosreport

Description of problem:
The machine is using an NFS system as root. After upgrading the machine from Fedora 32 to 34, it no longer boots, but hangs while running the code in the initramfs. It finally writes a message "Could not boot" and drops to an emergency shell.


Version-Release number of selected component (if applicable):
dracut-055-3.fc34.x86_64


How reproducible:
Every time


Steps to Reproduce:
1. Boot latest kernel


Actual results:
Ends up in emergency shell.


Expected results:
Successful boot.


Additional info:
In the emergency shell, running the command to mount the root file system (mount 172.17.0.1:/remote/pluto /sysroot) works without any problems.

Going back to the last kernel from Fedora 32, 5.11.22-100.fc32.x86_64, the boot works fine and the machine comes up, using the newly installed userspace for Fedora 34.

My understanding is that the problem is not caused by differences in the kernel proper, but by some change in how dracut tries to boot the system in the initramfs, hence the assignment of this bug to dracut.

As a side note: I previously used the short name for the NFS server (mimmi) on the kernel command line. The system was able to add the search domain and look up the name. This still works for the Fedora 32 kernel, but in the emergency shell name resolution does not seem to work. This might very well be a separate problem or simply a consequence of the other error, but I wanted to mention it just in case it is relevant. I have changed to the IP address on the kernel command line to avoid this particular issue for the time being.

I've tried to debug it. I'm not sure if the following helps, but in case it does this is what I have figured out.

It seems to me dracut-initqueue fails because it is waiting for the file "$hookdir"/initqueue/finished/nfsroot.sh to succeed. This file contains code looking for the /proc directory in /sysroot. I fail to understand how that could show up during the initqueue phase. It depends on the (NFS) root directory being mounted, but that doesn't happen until sysroot.mount is done. sysroot.mount in turn is waiting for dracut-pre-mount.service, which in turn waits for dracut-initqueue.service. It looks like a circular dependency to me.
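The suspected loop can be written down as a small "waits for" table and walked mechanically. This is only a sketch of the chain as described in this comment, not output from systemd; the function name waits_for and the label "nfsroot.sh finished hook" are made up for the illustration.

```shell
#!/bin/sh
# Sketch of the dependency chain described above; the edges are taken from
# the comment text, not queried from systemd.

# Hypothetical helper: print what the given item is waiting for.
waits_for() {
    case "$1" in
        dracut-initqueue.service)   echo "nfsroot.sh finished hook" ;;
        "nfsroot.sh finished hook") echo "sysroot.mount" ;;
        sysroot.mount)              echo "dracut-pre-mount.service" ;;
        dracut-pre-mount.service)   echo "dracut-initqueue.service" ;;
    esac
}

start="dracut-initqueue.service"
node=$start
chain=$node
cycle=no
steps=0

# Follow the chain for a bounded number of steps; stop if we return to start.
while [ "$steps" -lt 10 ]; do
    node=$(waits_for "$node")
    [ -n "$node" ] || break
    chain="$chain -> $node"
    if [ "$node" = "$start" ]; then
        cycle=yes
        break
    fi
    steps=$((steps + 1))
done

echo "$chain"
echo "cycle: $cycle"
```

With the edges as listed, the walk returns to dracut-initqueue.service after four hops, which is the loop suspected above. On a real system the actual ordering would have to be confirmed from the units in the initramfs.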

But when I look at the Fedora 32 initramfs, it looks very much the same, so it should have been a loop there too. It is clearly something important I don't understand.

Comparing the dracut-initqueue scripts between the two versions (ignoring whitespace adjustments etc.) there is a difference in the loop inside the "if [ $main_loop -gt $((2 * RDRETRY / 3)) ]" conditional on line 61. An inner conditional wasn't there in the previous version, and it looks a little strange there is code checking in initqueue/finished but then actually running initqueue/timeout. But again, I don't see how this makes any difference; the conditional should succeed since there always is something in initqueue/finished, nfsroot.sh to be precise.

Comment 1 Göran Uddeborg 2021-08-10 15:40:11 UTC
I've tried to debug this and think I have a bit more information.

A relevant difference seems to be that while the F32 dracut used ifup-style scripting to bring the network up, the F34 version doesn't. I thought I had switched to networkd long ago, but although there were definitions in place in /etc/systemd/network when I made the last initramfs under 32, it still uses ifup-style scripting in the initramfs.

To my surprise, somewhere in those scripts, the NFS root gets mounted on /sysroot. That's why it doesn't hang the same way in the old dracut; the file system is mounted and thus /proc is there. Is this really how things are intended to work? I would have thought the system should be mounted later in sysroot.mount (from man dracut.bootup).

In the new system, I can't seem to use the old ifup-style activation of the network, even if I try. Instead, the networkd-managed network comes up.

As a workaround, I simply commented out the line "echo '[ -e $NEWROOT/proc ]' …" in /usr/lib/dracut/modules.d/95nfs/parse-nfsroot.sh and generated a new initramfs.
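To illustrate what the disabled check does, here is a simplified model of it. It is not dracut's code: NEWROOT points at a scratch directory instead of /sysroot, and the function name nfsroot_finished is invented for this sketch. The only part taken from the module is the test [ -e $NEWROOT/proc ].

```shell
#!/bin/sh
# Simplified model of the "finished" test; NOT dracut's actual script.
# NEWROOT is a scratch directory standing in for /sysroot.
NEWROOT=$(mktemp -d)

# Hypothetical wrapper around the test that parse-nfsroot.sh writes out.
nfsroot_finished() {
    [ -e "$NEWROOT/proc" ]
}

# Nothing is mounted yet, so the check fails and the initqueue keeps waiting.
if nfsroot_finished; then before=done; else before=waiting; fi

# Stand-in for the root file system being mounted: the proc directory appears.
mkdir "$NEWROOT/proc"

if nfsroot_finished; then after=done; else after=waiting; fi

echo "before mount: $before"
echo "after mount: $after"

rm -rf "$NEWROOT"
```

In this model the check only succeeds once something has populated the new root, which matches the observation that the initqueue stage cannot finish unless the NFS root is mounted during it.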

That got me past the initqueue stage, but the next problem was that the (generated) sysroot.mount entry had "What=nfs4:172.17.0.1:/remote/pluto", which will run the command "/usr/bin/mount nfs4:172.17.0.1:/remote/pluto /sysroot …". The mount command doesn't understand that syntax. I had to remove the "nfs4:" part from the kernel command line.
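The mismatch is between the dracut-style root= value and the source argument that mount expects. A minimal sketch of the conversion follows, assuming only a leading "nfs4:" or "nfs:" prefix needs to be dropped and ignoring any trailing NFS options; the function name nfsroot_to_mount_source is hypothetical and not part of dracut.

```shell
#!/bin/sh
# Hypothetical helper: strip a dracut-style "nfs4:" or "nfs:" prefix so the
# remainder can be passed to mount as server:/path.
# Assumption: no trailing ":options" part is present in the value.
nfsroot_to_mount_source() {
    case "$1" in
        nfs4:*) printf '%s\n' "${1#nfs4:}" ;;
        nfs:*)  printf '%s\n' "${1#nfs:}" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}

src=$(nfsroot_to_mount_source "nfs4:172.17.0.1:/remote/pluto")
echo "$src"
```

The result is the same string that works with mount in the emergency shell, 172.17.0.1:/remote/pluto.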

Then it successfully mounts the root filesystem. But when it tries to do the switch root step it hangs for a while and starts to complain that NFS server 172.17.0.1 is not responding. My impression is that the switch root somehow breaks networkd and the machine loses its configuration. I tried to add rd.break=cleanup and look around. Everything looked fine as far as I could tell. When I tried to manually do the switch (systemctl --no-block switch-root /sysroot) I can repeat the problem; it loses contact with the NFS server.

Comment 2 Göran Uddeborg 2021-09-07 19:49:09 UTC
Here are some further updates.

In order to switch from using systemd-networkd to NetworkManager I removed my configuration files from /etc/systemd/network, and generated a new initramfs. Booting with that one works! It works even with the original versions of the module files from dracut, and with the original "root=nfs4:mimmi:/remote/pluto" parameter on the kernel command line.

So this problem seems to be specific to trying to use systemd-networkd in combination with NFS root. I've updated the summary line accordingly.

Comment 3 Ian Dall 2021-12-30 10:56:16 UTC
I'm seeing exactly the same problem except with Fedora 33 (works) and Fedora 35 (doesn't work).

If I stop at rd.break=initqueue I can manually mount the NFS root volume but the systemd sysroot.mount unit never succeeds. Using systemctl list-jobs, it seems as though sysroot.mount is waiting for dracut-initqueue.service and dracut-initqueue.service is running, but not finished because it is waiting for /sysroot/proc to appear, which never happens until sysroot.mount succeeds.

There is also a problem not starting systemd-resolved early enough and with systemd-resolve and systemd-timesync missing from /etc/{passwd,group}, which are really different bugs. However:

 * if those users/groups are added by modifying the 00systemd dracut module and regenerating the initramfs; then
 * stop before dracut-initqueue using the command line option rd.break=initqueue; then
 * start systemd-resolved manually; then
 * mount the NFS volume on /sysroot; and then
 * continue the boot (^D);

the system proceeds to boot as expected.

The weird thing is that the Fedora 33 version seems to contain the same circular dependency but it works. In my case the working Fedora 33 version uses NetworkManager for network configuration (as in comment #2).

It seems to me that maybe dracut-initqueue.service should be constrained to run before initrd-root-fs.target rather than before dracut-pre-mount.service.

Comment 4 Ben Cotton 2022-05-12 15:05:11 UTC
This message is a reminder that Fedora Linux 34 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '34'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 34 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 5 Göran Uddeborg 2022-05-14 11:48:06 UTC
I can't easily check this on Fedora Linux 36 currently, but according to comment 3 the problem still remains at least in Fedora Linux 35.

Comment 6 Ian Dall 2022-11-21 05:54:36 UTC
See bug 2036214

Comment 7 Ben Cotton 2022-11-29 17:02:47 UTC
This message is a reminder that Fedora Linux 35 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 35 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 8 Ben Cotton 2023-02-07 14:52:23 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 38 development cycle.
Changing version to 38.

Comment 9 Aoife Moloney 2024-05-07 15:44:16 UTC
This message is a reminder that Fedora Linux 38 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 38 on 2024-05-21.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '38'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 38 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 10 Göran Uddeborg 2024-05-07 17:08:48 UTC
You switched this to "rawhide", Tomasz Torcz, and then it was switched to 38 by a mass change. Is there any interest in fixing this?

For my own case, the system where this was discovered is nowadays booting using NetworkManager. It is in use and I don't want to experiment with it to check if the problem is gone. (I doubt it, but I don't actually know.) Therefore I won't change the version.

Comment 11 Tomasz Torcz 2024-05-07 17:28:37 UTC
I will check if this is still the case shortly.
But I suppose this still doesn't work. I saw no development around nfsroot and dracut seems to be abandoned.

Comment 12 Aoife Moloney 2024-05-21 14:12:38 UTC
Fedora Linux 38 entered end-of-life (EOL) status on 2024-05-21.

Fedora Linux 38 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

