It seems that grub2 does not implement XFS journal recovery. If a reboot is triggered fairly shortly after an upgrade (with dnf in my case, but see bug 1416650 for a potentially related issue involving GNOME Software), then only the XFS journal contains the required data to locate critical files such as /boot/grub2/grub.cfg. GRUB then drops into a prompt, and the files appear as empty to GRUB.
In my case, the /boot/grub2/grub.cfg file and the initramfs for the latest kernel were affected. I could boot manually from the GRUB prompt, which triggered XFS log recovery on mount:
Apr 20 13:47:58 oldenburg kernel: XFS (sda1): Starting recovery (logdev: internal)
Apr 20 13:48:00 oldenburg kernel: XFS (sda1): Ending recovery (logdev: internal)
This system has an SSD, so a two-second delay indicates quite some recovery work.
/boot is on the same XFS file system that is mounted at /. I use legacy (non-UEFI) boot.
This is the second time the issue has occurred. It could be a relatively recent kernel regression that makes more shutdowns (with reboot/poweroff) unclean, even though the system shuts down on its own.
Some relevant package versions:
Something definitely has changed in the shutdown sequence. The last clean shutdown I had was with:
kernel: Linux version 4.15.12-301.fc27.x86_64
on 2018-04-04. After that, all shutdowns are unclean.
Of course, another package update could have introduced this regression. It might not be a kernel change after all.
Raised on the mailing list:
Sounds like this is a classic system? See also https://github.com/ostreedev/ostree/pull/1049
(In reply to Colin Walters from comment #3)
> Sounds like this is a classic system? See also
Yes, it's a typical RPM-based installation.
maybe plymouth? Reminds me of:
Not sure what changed recently, mind you.
(Potentially OT, should we consider switching /boot to ext2 for simplicity? There's really no xfs-specific advantage on a small /boot partition...)
Also, we've had a rash of complaints on the xfs list about issues which can be traced to the root filesystem no longer being cleanly unmounted on a reboot. I've seen them for ext4 too. Bugs have been filed, to no avail:
This issue /really/ needs attention.
Anyway, all that aside, we do need some way to force all data & relevant metadata all the way through the log & to disk, and freeze/thaw works but it's pretty heavy handed. I wonder if we should consider some sort of "flush the log" ioctl...
(In reply to Eric Sandeen from comment #6)
> Anyway, all that aside, we do need some way to force all data & relevant
> metadata all the way through the log & to disk, and freeze/thaw works but
> it's pretty heavy handed. I wonder if we should consider some sort of
> "flush the log" ioctl...
Do you really think grubby, grub-mkconfig, and systemd are not freeze/thawing as a fallback because it's heavy handed, and that they're all going to jump at implementing a new ioctl to solve this? I think that's specious but I would love to be wrong.
I suspect this is related, but is not affecting (re)boot:
And why hasn't https://cgit.freedesktop.org/plymouth/commit/?id=9e5a276f322cfce46b5b2ed2125cb9ec67df7e9f been reverted in the interim?
BTW bug 1227736 is a dup, but I've sufficiently mangled it with stream of consciousness debugging notes I'm fine with that one being marked as the dup.
(In reply to Eric Sandeen from comment #6)
> (Potentially OT, should we consider switching /boot to ext2 for simplicity?
> There's really no xfs-specific advantage on a small /boot partition...)
Firmware sizes are skyrocketing (entire Linux distributions can now be firmware), so sharing / and /boot actually simplifies things when it can be done.
(In reply to Chris Murphy from comment #7)
> And why hasn't
> ?id=9e5a276f322cfce46b5b2ed2125cb9ec67df7e9f been reverted in the interim?
Well it's not side-effect free. It means rather than seeing
2) shutdown splash
3) power off
users will now see
2) shutdown splash
3) flicker and debug messages
4) power off
I agree we should fix the bug though… There's no reason we can't have our cake and eat it too. I do wonder what changed recently.
Not recently, the problem was first reported three years ago in bug 1227736. But in that same time frame we have:
- systemd offline updates reboot way faster than dnf updates, which encourages this problem
- Some time ago, maybe about 3 years ago, a bug was fixed that caused reboot/shutdown to hang while waiting 1m30 on email@example.com (or maybe firstname.lastname@example.org), during which time the file system has plenty of time to flush completely on its own.
In my opinion the central design flaw is the initiation of reboot (which is a sequence of events, not one command) before /boot changes are fully committed to disk. The reboot sequence is full of potential traps where it can hang or crash for myriad reasons.
The thing that modifies /boot is obligated to ensure those changes are fully committed before it exits, because its direct consumer, the bootloader, requires it. This should happen before reboot is initiated. I think it's incredibly bad design that the thing that modifies /boot expects something else, in some other sequence, to ensure those changes are fully committed.
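To make "commit before exit" concrete, here is a minimal sketch of a writer that flushes its own output instead of relying on the shutdown sequence. The function name durable_replace is hypothetical and does not come from grubby or new-kernel-pkg. Note the limit of this pattern: fsync only guarantees the data is recoverable, possibly via the journal, so on XFS it may still not be enough for a bootloader that does not replay the log.

```python
import os
import tempfile


def durable_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data`, flushing the file and its directory.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the rename
    # stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # flush file contents
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would use it as durable_replace("/boot/grub2/grub.cfg", new_contents) and only then allow a reboot to be initiated.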
The simplest fix would be to modify new-kernel-pkg to remove the PPC64LE specific usage of freeze/thaw and just always do it on XFS, ext4, Btrfs.
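As a rough illustration of "just always do it", the sketch below freezes and immediately thaws a mount point when its file system type is one of the three named above. It assumes the util-linux fsfreeze command is available and that the caller is root; flush_by_freeze and its run parameter are hypothetical names used only here. It is not what new-kernel-pkg does today, and comments further down this report describe the freeze hanging on some setups, so treat it as illustrative only.

```python
import subprocess
from typing import Callable, Optional, Sequence

# File system types named in the comment above (assumption for this sketch).
FREEZABLE = {"xfs", "ext4", "btrfs"}


def _run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def flush_by_freeze(
    mountpoint: str,
    fstype: str,
    run: Optional[Callable[[Sequence[str]], None]] = None,
) -> bool:
    """Freeze then thaw `mountpoint` to push the log to disk.

    Returns False without running anything if `fstype` is not in FREEZABLE.
    `run` lets callers substitute the command runner (e.g. for a dry run).
    """
    if run is None:
        run = _run_checked
    if fstype not in FREEZABLE:
        return False
    # If the freeze itself fails, an exception propagates and no thaw is needed.
    run(["fsfreeze", "--freeze", mountpoint])
    # Thaw immediately; nothing else happens while the file system is frozen.
    run(["fsfreeze", "--unfreeze", mountpoint])
    return True
```

If /boot is not a separate mount point, the freeze applies to whatever file system contains it, which may be the root file system.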
A new ioctl sounds like it'll take longer to trickle down. First something generic all the fs devs can agree on, then it has to get to downstream kernels, and then getting grubby and GRUB to actually use them, and trickle down. I estimate 2 years for Fedora and 8 years for Debian.
are you guys sure that the underlying reason is not the broken systemd in F27?
Zbigniew Jędrzejewski-Szmek 2018-03-20 09:27:59 EDT
Yeah, it's certainly possible that this causes unclean fs shutdowns.
OK, I'll try to backport this to F27.
(In reply to Harald Reindl from comment #11)
> are you guys sure that the underlying reason is not the broken systemd in
Yes. Bug 1227736 goes back 3 years, and it's the same bug being discussed here.
This pull was merged upstream almost a year ago but is still not in Fedora's grubby. So even if we're using FAT or ext2 for the grub.cfg, if the file system does not unmount cleanly, good chance we won't boot.
I'm unable to make heads or tails out of the different versioning upstream and Fedora appear to use for this package, but it looks like Fedora's is 3 years behind upstream, minus a few patches Fedora adds during build.
As you noted, https://github.com/rhboot/grubby/pull/24 does not use FIFREEZE/FITHAW at all; it only covers the ext2 or FAT use-case.
BTW new-kernel-pkg is not the only route that leads to this bug.
Using "dracut -f" and having an unclean shutdown will trigger it as well, unless the system's dracut is newer than https://github.com/dracutdevs/dracut/commit/de576db3c225723542b48e2b470693bfe9f4dfb9
Using grubby to update kernel parameters followed by an unclean shutdown leads to the same scenario.
A fix for both new-kernel-pkg and grubby is in:
I'm changing this bug to grubby, the bug being that it's not properly committing grub.cfg, kernel, or initramfs no matter what the volume format is.
And bug 1227736 will remain as plymouth, since it's preventing rootfs from remount-ro, by violating the systemd guidelines for kill exemption.
dracut is doing fsfreeze if /boot is a mountpoint. Funny, because on Btrfs when fstab says to do 'mount -o subvol=boot /boot' basically it allows doing freeze on an entire root fs. So today I was stracing for other reasons not knowing any of this and it kept hanging the entire system.
So yeah, it is possible to do 'fsfreeze -f && fsfreeze -u' and the freeze somehow never exits, so the thaw never arrives, and hilariously at next boot the default GRUB entry doesn't work so the system isn't bootable without user intervention.
The bug report includes sysrq+t during the hang.
Anyway, other testing so far shows sync() alone makes kernel, initramfs, and grub.cfg bootloader safe on FAT, ext4, Btrfs, but not XFS. The sample size is just under a dozen tests, so it's not scientific. I have no idea if this should always work on ext4 and Btrfs, and never work on XFS, but that's what it looks like so far.
I've reproduced the fsfreeze -f hang on XFS, ergo it's not Btrfs specific. I've updated the dracut bug, advocating they revert the fsfreeze change because it absolutely makes things worse rather than better.
This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30 Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '27'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 27 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Fedora 27 changed to end-of-life (EOL) status on 2018-11-30. Fedora 27 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this bug.
Thank you for reporting this bug and we are sorry it could not be fixed.