Bug 1227736 - Minimal grub after a kernel update with gnome-software
Summary: Minimal grub after a kernel update with gnome-software
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: plymouth
Version: 28
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: François Cami
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker https://fedoraproject...
Depends On:
Blocks:
Reported: 2015-06-03 12:06 UTC by Sébastien Wilmet
Modified: 2019-05-28 22:14 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-28 22:14:11 UTC
Type: Bug
Embargoed:


Attachments
My /var/log/grubby (6.21 KB, text/plain)
2015-06-08 18:33 UTC, Sébastien Wilmet
My /var/log/grubby with Fedora 23 (7.41 KB, text/plain)
2015-12-02 16:32 UTC, Sébastien Wilmet
My /etc/default/grub (214 bytes, text/plain)
2015-12-02 16:33 UTC, Sébastien Wilmet
My /boot/grub2/grub.cfg (after grub reinstall) (5.60 KB, text/plain)
2015-12-02 19:20 UTC, Sébastien Wilmet
My /boot/grub2/grubenv (after grub reinstall) (1.00 KB, text/plain)
2015-12-02 19:21 UTC, Sébastien Wilmet
My /var/log/grubby (11.12 KB, text/plain)
2016-01-15 13:38 UTC, Sébastien Wilmet
My /boot/grub2/grub.cfg (5.62 KB, text/plain)
2016-01-15 13:40 UTC, Sébastien Wilmet
My /boot/grub2/grubenv (1.00 KB, text/plain)
2016-01-15 13:42 UTC, Sébastien Wilmet
F25: My /etc/default/grub (218 bytes, text/plain)
2016-12-21 11:45 UTC, Sébastien Wilmet
F25: My /var/log/grubby before re-installing grub (14.08 KB, text/plain)
2016-12-21 11:47 UTC, Sébastien Wilmet
F25: My /boot/grub2/grub.cfg before re-installing grub (5.70 KB, text/plain)
2016-12-21 11:48 UTC, Sébastien Wilmet
F25: My /boot/grub2/grubenv before re-installing grub (1.00 KB, text/plain)
2016-12-21 11:49 UTC, Sébastien Wilmet
F25: My /var/log/grubby after re-installing grub (14.08 KB, text/plain)
2016-12-21 11:57 UTC, Sébastien Wilmet
F25: My /boot/grub2/grub.cfg after re-installing grub (5.81 KB, text/plain)
2016-12-21 11:58 UTC, Sébastien Wilmet
F25: My /boot/grub2/grubenv after re-installing grub (1.00 KB, text/plain)
2016-12-21 11:59 UTC, Sébastien Wilmet
List of packages that I install (1.01 KB, text/plain)
2017-03-14 21:19 UTC, Sébastien Wilmet
xfs_repair log (63.51 KB, text/plain)
2017-03-15 05:19 UTC, Chris Murphy
xfs_repair log 2 (216.60 KB, text/plain)
2017-03-15 05:28 UTC, Chris Murphy
dmesg output ThinkPad L530, Fedora 25 (65.03 KB, text/plain)
2017-03-15 09:38 UTC, Sébastien Wilmet
offline update boot, systemd debug log (873.74 KB, text/plain)
2017-03-15 19:45 UTC, Chris Murphy

Description Sébastien Wilmet 2015-06-03 12:06:00 UTC
Description of problem:
After a recent package update with Fedora Workstation 22 (with gnome-software), on the next boot I got a minimal grub (with a message beginning with "Minimal BASH-like line editing is supported."). I had to re-install grub with a live DVD.

In the updated packages, there was the kernel 4.0.4. I think it's the only package that could have altered grub.

With
$ rpm -q --scripts kernel-core-4.0.4-301.fc22.x86_64

I see that /bin/kernel-install is called, and kernel-install is provided by systemd. So maybe it's a bug in systemd, I don't know.

Version-Release number of selected component (if applicable):
Fedora Workstation 22
kernel-4.0.4-301.fc22
systemd-219-15.fc22
gnome-software-3.16.2-2.fc22

Additional info:
If it matters, my / partition (that contains /boot) is in XFS. It's a primary partition, I don't use LVM.

I didn't try to reproduce the bug, since it's quite annoying (re-installing grub with a live CD/DVD/USB).

Comment 1 Brian Lane 2015-06-08 16:24:56 UTC
Please attach /var/log/grubby

Comment 2 Sébastien Wilmet 2015-06-08 18:33:17 UTC
Created attachment 1036453 [details]
My /var/log/grubby

Comment 3 Sébastien Wilmet 2015-12-02 16:30:12 UTC
I've had the same problem with Fedora 23. I did a fresh install (at beta time). It seems to happen only with gnome-software.

I have another issue, which is maybe related. When a new kernel is installed, the default grub entry is the second one, not the first one. I've edited /etc/default/grub; I don't remember the default content, but I now have GRUB_DEFAULT=0. So when that bug happens, I run:
# grub2-mkconfig -o /boot/grub2/grub.cfg

I'll attach my new /var/log/grubby.

This is quite a serious issue. You're lucky that I know how to reinstall grub from a live DVD, and that I'm still using Fedora.

Comment 4 Sébastien Wilmet 2015-12-02 16:32:24 UTC
Created attachment 1101558 [details]
My /var/log/grubby with Fedora 23

Comment 5 Sébastien Wilmet 2015-12-02 16:33:11 UTC
Created attachment 1101559 [details]
My /etc/default/grub

Comment 6 Brian Lane 2015-12-02 19:00:47 UTC
Could you also attach /boot/grub2/grub.cfg and /boot/grub2/grubenv? Thanks.

Comment 7 Sébastien Wilmet 2015-12-02 19:20:14 UTC
Created attachment 1101598 [details]
My /boot/grub2/grub.cfg (after grub reinstall)

Comment 8 Sébastien Wilmet 2015-12-02 19:21:13 UTC
Created attachment 1101600 [details]
My /boot/grub2/grubenv (after grub reinstall)

Comment 9 Brian Lane 2015-12-02 19:40:46 UTC
Nothing unusual in those. Next time it happens please save copies of the broken configs before fixing them. Without that there isn't much we can do.

Comment 10 Sébastien Wilmet 2016-01-15 13:36:44 UTC
I've decided to break my system again and do the update with gnome-software instead of dnf. See the following attachments.

Comment 11 Sébastien Wilmet 2016-01-15 13:38:30 UTC
Created attachment 1115164 [details]
My /var/log/grubby

Comment 12 Sébastien Wilmet 2016-01-15 13:40:28 UTC
Created attachment 1115166 [details]
My /boot/grub2/grub.cfg

The new attachments are the files present on the disk just after the gnome-software update, so with the minimal grub. _Before_ re-installing grub.

Comment 13 Sébastien Wilmet 2016-01-15 13:42:31 UTC
Created attachment 1115168 [details]
My /boot/grub2/grubenv

/boot/grub2/grubenv was actually a symbolic link to /boot/efi/EFI/fedora/grubenv, but I suppose it's normal.

Comment 14 Sébastien Wilmet 2016-04-29 08:58:38 UTC
Happens on Fedora 24 too.

Comment 15 Sébastien Wilmet 2016-12-21 11:42:00 UTC
So for me, this bug happens with F22, F23, F24 and now F25.

I'll attach the requested files for F25. Is this bug filed in the correct component? Do you have any clue what's going wrong? As a workaround I can do the updates with dnf, but for a "normal" user, I think the only choice is to install another distro. I take the time to report this bug and try to reproduce it with new Fedora versions, because it's a critical problem. This problem will still likely be present in Fedora Atomic/rpm-ostree Workstation, where it won't be possible to do the updates with dnf.

Comment 16 Sébastien Wilmet 2016-12-21 11:45:56 UTC
Created attachment 1234304 [details]
F25: My /etc/default/grub

Comment 17 Sébastien Wilmet 2016-12-21 11:47:14 UTC
Created attachment 1234305 [details]
F25: My /var/log/grubby before re-installing grub

Comment 18 Sébastien Wilmet 2016-12-21 11:48:43 UTC
Created attachment 1234307 [details]
F25: My /boot/grub2/grub.cfg before re-installing grub

Comment 19 Sébastien Wilmet 2016-12-21 11:49:38 UTC
Created attachment 1234308 [details]
F25: My /boot/grub2/grubenv before re-installing grub

Comment 20 Sébastien Wilmet 2016-12-21 11:53:23 UTC
The gnome-software update contained an update to kernel-4.8.14-300.fc25.

Comment 21 Sébastien Wilmet 2016-12-21 11:57:24 UTC
Created attachment 1234309 [details]
F25: My /var/log/grubby after re-installing grub

Comment 22 Sébastien Wilmet 2016-12-21 11:58:21 UTC
Created attachment 1234312 [details]
F25: My /boot/grub2/grub.cfg after re-installing grub

Comment 23 Sébastien Wilmet 2016-12-21 11:59:28 UTC
Created attachment 1234313 [details]
F25: My /boot/grub2/grubenv after re-installing grub

Comment 24 Sébastien Wilmet 2016-12-21 12:13:00 UTC
/var/log/grubby and /boot/grub2/grubenv have exactly the same content before and after re-installing grub.

In /boot/grub2/grub.cfg before re-installing grub, there is:

menuentry 'Fedora (4.8.14-300.fc25.x86_64) 25 (Workstation Edition)' --class fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-4.8.0-0.rc7.git0.1.fc25.x86_64-advanced-5d46062a-d140-441f-b55f-306b39fab6db' {

-> 4.8.0-0.rc7, the bug probably comes from that.
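
For reference, a minimal sketch of how the two versions in such a menuentry line could be compared across a whole grub.cfg. The function names are made up for illustration, and the patterns assume Fedora-style entries shaped like the one quoted above. Whether a mismatch is actually harmful is not established here; this only makes it visible.

```shell
# Kernel version from the menuentry title, e.g. 'Fedora (VERSION) 25 (...)'.
entry_title_version() {
    sed -n "s/^menuentry '[^(]*(\([^)]*\)).*/\1/p"
}

# Kernel version embedded in the 'gnulinux-VERSION-advanced-UUID' entry id.
entry_id_version() {
    sed -n "s/.*'gnulinux-\(.*\)-advanced-[^']*'.*/\1/p"
}

# Print one line per top-level menuentry whose title and id versions differ.
report_mismatched_entries() {
    grep '^menuentry ' "$1" | while IFS= read -r line; do
        t=$(printf '%s\n' "$line" | entry_title_version)
        i=$(printf '%s\n' "$line" | entry_id_version)
        if [ -n "$t" ] && [ -n "$i" ] && [ "$t" != "$i" ]; then
            printf 'title %s, id %s\n' "$t" "$i"
        fi
    done
}
```

Usage would be something like `report_mismatched_entries /boot/grub2/grub.cfg`, run against the broken copy before grub is re-installed.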

Comment 25 Sébastien Wilmet 2016-12-21 13:48:37 UTC
By looking at `dnf history`, kernel-4.8.0-0.rc7.git0.1.fc25.x86_64 was the first kernel removed (and thus the first installed). I did a fresh install of F25 beta (I always do a fresh install of new Fedora versions).

Comment 26 Sébastien Wilmet 2017-03-13 18:29:11 UTC
I've had the same bug on another computer (where I install more or less the same set of packages).

Someone else has had the same problem here:
https://ask.fedoraproject.org/en/question/99550/fedora-25-no-longer-boots-after-upgrade-how-to-fix/

Should I propose this bug as a blocker?

Comment 27 Heiko Adams 2017-03-13 20:17:50 UTC
Got the same issue after installing Fedora 26 Workstation from live cd

Comment 28 Alexander Ploumistos 2017-03-13 20:45:13 UTC
Is your system (U)EFI by any chance? If yes, you need to change the linux16 entries to linuxefi in grub.cfg. I've seen something similar on a dual boot laptop, where windows messed with the efi partition and when I repaired grub from a live USB, instead of kernel entries starting with linuxefi, I kept getting linux16.

Comment 29 Chris Murphy 2017-03-14 02:33:04 UTC
I've never seen this before, Fedora 22 through Fedora 26. It would be most helpful if there are exact reproduce steps, as tedious as that is.

I can't tell what step fixed the problem. Was it grub2-install or was it grub2-mkconfig? Getting a grub prompt suggests that core.img is loaded, and normal.mod is found which means it's reading XFS OK, but isn't finding the grub.cfg, and hence no menu. Does the problem manifest as:
grub> 
OR
grub rescue>
?
The latter suggests core.img is found, but not normal.mod. In all of my testing, I've never had a long-lived /boot on XFS. But there isn't enough information to know whether this is some on-disk change in XFS that an older embedded GRUB core.img isn't able to read, in which case grub2-install would fix it, or whether grubby is misfiring the modification of grub.cfg and the grub.cfg is therefore missing.

Basically we need a computer with the problem, and not fixed, and then find out from the grub prompt whether it can in fact see files in /boot/grub2 and read them: this can be done with the grub commands ls, cat, and configfile, and it's possible to use tab for path autocompletion.

Comment 30 Sébastien Wilmet 2017-03-14 20:42:43 UTC
(In reply to Alexander Ploumistos from comment #28)
> Is your system (U)EFI by any chance?

No, the UEFI/Legacy Boot option is set to "Legacy Only".

(In reply to Chris Murphy from comment #29)

I can reproduce the problem later this week and provide more info. I don't remember if the prompt is grub> or grub rescue>.

Comment 31 Chris Murphy 2017-03-14 20:57:14 UTC
> No, the UEFI/Legacy Boot option is set to "Legacy Only".

So it is UEFI, but it's using a compatibility support module to present a faux BIOS. This is really suboptimal and makes it an edge case.

Is this a multiboot system or Fedora only? If multiboot, what other OS's are present?


> I can reproduce the problem later this week and provide more info. I don't
> remember if the prompt is grub> or grub rescue>.

And the problem never happens with 'dnf' updates, but always happens with gnome-software updates? (I am assuming either form of update includes a kernel update.)

What's weird about that is the RPM is the same, and all the scripts run following kernel install are the same. I can't think of why a pk (PackageKit) offline update via systemd would make any difference with bootloader stuff compared to dnf; about the only thing I can think of is that maybe there's a more abrupt reboot with pk offline update. Maybe the drive is lying when it gets fsync, and a faster reboot means an orphaned/missing grub.cfg? Seems specious though...

Comment 32 Chris Murphy 2017-03-14 21:00:23 UTC
If the problem is reproduced, when booting from alternate media and mounting the file system for the first time, I'd like to see a complete dmesg from that environment. In particular I'm interested in seeing the messages for the first mount of the XFS volume, following the update resulting in a grub prompt. After that first mount, if you can umount the volume, and run xfs_repair -n and attach the results that might be useful.

Is this XFS volume originally created with Fedora 22, and the system has been upgraded to 23 > 24 > 25? Or was it ever clean installed (new file system created)?
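
The collection steps requested above could be scripted roughly as follows from the live environment. This is only a sketch: the function names, the output file names and the DRY_RUN switch are assumptions made for illustration, and the output directory must not live on the XFS volume being examined.

```shell
# Run a command, or only print it when DRY_RUN=1 (handy for reviewing the plan).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# collect_first_mount_diag DEVICE MOUNTPOINT OUTDIR
collect_first_mount_diag() {
    dev=$1 mnt=$2 out=$3
    run mkdir -p "$out" || return 1
    # First mount after the failed boot; the kernel logs any XFS log recovery.
    run mount "$dev" "$mnt" || return 1
    # Save the complete kernel log, including the messages for that mount.
    run sh -c 'dmesg > "$1"' sh "$out/dmesg-first-mount.txt"
    run umount "$mnt" || return 1
    # -n: report problems only, never modify the file system.
    run sh -c 'xfs_repair -n "$1" > "$2" 2>&1' sh "$dev" "$out/xfs_repair-n.txt"
}
```

For example, `DRY_RUN=1; collect_first_mount_diag /dev/sda1 /mnt /tmp/diag` prints the commands without touching the disk; without DRY_RUN it needs root.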

Comment 33 Sébastien Wilmet 2017-03-14 21:19:14 UTC
Created attachment 1263052 [details]
List of packages that I install

I always follow these steps when I install Fedora (for my own use):
1. Install Fedora Workstation
2. `dnf upgrade` + reboot (not with gnome-software).
3. Install attached list of packages
4. Disable selinux + reboot

Then each time there is a kernel update done by gnome-software, I get the minimal grub. There is no problem with a kernel update with dnf.

Comment 34 Sébastien Wilmet 2017-03-14 21:28:51 UTC
(In reply to Chris Murphy from comment #31)
> Is this a multiboot system or Fedora only? If multiboot, what other OS's are
> present?

Fedora only on one computer (with UEFI legacy mode). Dual-boot with Windows 7 on my day-job computer (I don't know if it's BIOS or UEFI or UEFI legacy, I can tell you that next week).

(In reply to Chris Murphy from comment #32)
> Is this XFS volume originally created with Fedora 22, and the system has
> been upgraded to 23 > 24 > 25? Or was it ever clean installed (new file
> system created)?

I always do a fresh install, by reformatting the / XFS partition (containing /boot).

Comment 35 Sébastien Wilmet 2017-03-14 21:51:31 UTC
My partitions:
- /dev/sda1 20GB XFS mounted on /
- /dev/sda2 4GB swap
- /dev/sda3 [rest of the disk] ext4 mounted on /home

Comment 36 Sébastien Wilmet 2017-03-14 22:04:41 UTC
See also comment #24, the /boot/grub2/grub.cfg file before re-installing grub was suspicious.

Comment 37 Chris Murphy 2017-03-14 22:15:23 UTC
(In reply to Sébastien Wilmet from comment #33)
> Created attachment 1263052 [details]
> List of packages that I install
> 
> I always follow these steps when I install Fedora (for my own use):
> 1. Install Fedora Workstation
> 2. `dnf upgrade` + reboot (not with gnome-software).
> 3. Install attached list of packages
> 4. Disable selinux + reboot
> 
> Then each time there is a kernel update done by gnome-software, I get the
> minimal grub. There is no problem with a kernel update with dnf.

I don't understand the last sentence in the context of everything that came before it. 'dnf update' after a clean install does a kernel update using dnf. You're saying you get different results if you do 'dnf update' vs 'dnf update kernel' ?

Comment 38 Chris Murphy 2017-03-14 22:32:48 UTC
> Then each time there is a kernel update done by gnome-software, I get the
> minimal grub. There is no problem with a kernel update with dnf.

Nevermind. I grok this now.

Comment 39 Chris Murphy 2017-03-15 04:38:48 UTC
OK I have a reproducer in a VM, very simple.
1. Install Fedora 25, custom partitioning, single standard partition mounted at /, format XFS. That's it, no other partitions.
2. Reboot from installer.
3. Drop to a VT shell, to avoid gnome-software from downloading updates.
4. dnf update --exclude=kernel-*
5. Reboot
6. Login to gnome-shell, launch Gnome Software, available OS update information shows only kernel packages to install. Click on Restart & Install.
## System reboots, I see plymouth splash with update status, and then reboot to grub> prompt.

Observations:
a. I can use ls and configfile to navigate to a grub.cfg - which is present but using configfile it doesn't load - I just get another grub> prompt with no error. So this file is empty or otherwise not readable.
b. Reboot with install media; and run xfs_repair -n.

Lots of crazy stuff that I'm not expecting. Many of these entries, each with a different inode number.
imap claims in-use inode 661691 is free, would correct imap
imap claims in-use inode 668038 is free, would correct imap

c. 
# mount -o ro,norecovery /dev/sda1 /mnt

The grub.cfg is a 0 length file.

d. 
# umount /mnt
# blockdev --setro /dev/sda1
# mount /dev/sda1 /mnt
mount: /dev/sda1 is write-protected, mounting read-only
mount: cannot mount /dev/sda1 read-only


[  880.621327] XFS (sda1): Unmounting Filesystem
[  884.909167] XFS (sda1): Mounting V5 Filesystem
[  884.951148] XFS (sda1): recovery required on read-only device.
[  884.951150] XFS (sda1): write access unavailable, cannot proceed.
[  884.951152] XFS (sda1): log mount/recovery failed: error -30
[  884.951197] XFS (sda1): log mount failed

# blockdev --setrw /dev/sda1
# mount /dev/sda1 /mnt

No error in user space. Kernel messages...

[ 1209.577452] XFS (sda1): Mounting V5 Filesystem
[ 1209.625671] XFS (sda1): Starting recovery (logdev: internal)
[ 1209.793035] XFS (sda1): Ending recovery (logdev: internal)

And now grub.cfg is not 0 length, 4970 bytes.

So clearly the file system is in a dirty state after pk offline update is done and systemd does a reboot. I guess it's not doing fsync or fdatasync, or not waiting long enough for the kernel to finish it - no idea. But the fs is dirty and therefore journal replay is necessary to make it consistent again, and GRUB can't do journal replay.

e. Now that journal replay has been done, xfs_repair -n comes up clean, and the system also reboots.
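
The zero-length check from observation c. can be wrapped in a small helper, so it is easy to repeat before and after journal replay. A sketch only: the function name and the default path (the volume mounted with -o ro,norecovery on /mnt, as above) are assumptions.

```shell
# check_grub_cfg [PATH]
# Exit status: 0 = file has content, 1 = zero length, 2 = missing.
check_grub_cfg() {
    cfg=${1:-/mnt/boot/grub2/grub.cfg}
    if [ ! -e "$cfg" ]; then
        echo "missing: $cfg"
        return 2
    elif [ ! -s "$cfg" ]; then
        echo "zero length: $cfg"
        return 1
    fi
    size=$(wc -c < "$cfg")
    # The arithmetic expansion strips any padding some wc implementations add.
    echo "ok: $cfg ($((size + 0)) bytes)"
}
```

In the case above it would report "zero length" on the norecovery mount and "ok" (4970 bytes) once the log has been replayed.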

Comment 40 Chris Murphy 2017-03-15 05:19:42 UTC
Created attachment 1263175 [details]
xfs_repair log

Comment 41 Chris Murphy 2017-03-15 05:28:07 UTC
Created attachment 1263176 [details]
xfs_repair log 2

previous one was truncated

Comment 42 Chris Murphy 2017-03-15 06:24:23 UTC
Sébastien can you do a clean boot of one of these machines, and attach dmesg file to this bug report. All I'm looking for is the kernel discovery of the drive you're using XFS on. Thanks.

Comment 43 Sébastien Wilmet 2017-03-15 09:38:21 UTC
Created attachment 1263254 [details]
dmesg output ThinkPad L530, Fedora 25

Comment 44 Eric Sandeen 2017-03-15 18:48:15 UTC
The core question here, at least in Chris' case, is why the log is dirty after what should be a clean reboot. After that, the errors in e.g. xfs_repair -n are expected until the log gets replayed.

Do you have logs from when the system rebooted and produced this dirty log?

Comment 45 Chris Murphy 2017-03-15 19:45:08 UTC
Created attachment 1263442 [details]
offline update boot, systemd debug log

This boot is the one after choosing Restart & Install in gnome-software; so this is the system update and reboot. I've set parameters
systemd.log_level=debug systemd.log_target=console console=ttyS0,38400
And capturing with 'virsh console log'

The gotcha though is the problem doesn't happen as reported with systemd debugging. The subsequent boot does have a GRUB menu but it's stale; it only shows one kernel. If I choose it, during boot there is log replay. I reboot again and now I have two kernel entries.

(This is the same as reported on XFS list, but is the entire log for the boot that results in the dirty fs.)

Comment 46 Chris Murphy 2017-03-15 19:51:55 UTC
(In reply to Eric Sandeen from comment #44)
> Do you have logs from when the system rebooted and produced this dirty log?

See comment 45 and let me know if that's what you're looking for.

During pk offline update, journald is writing to the persistent journal and it's possible to extract that and post it; it's a rather different perspective than the console output for whatever reason; however journald stops before the ro remount of root fs (where /boot/grub2/grub.cfg is located) so a bunch of stuff isn't logged.

Comment 47 Eric Sandeen 2017-03-15 20:12:55 UTC
I think Darrick is on the right track on the xfs list; something prevented xfs from remounting ro during shutdown, and so the log was not written out & cleared.  It'll take some system(d)-level sleuthing to prove or disprove that theory...

Unfortunately I don't think there are any xfs tracepoints currently in place that would help us figure out if the remount,ro was successful or not.

Comment 48 Chris Murphy 2017-03-15 20:31:04 UTC
The three remount attempts are the smoke, but they just make me ask more questions. Why three? Was the remount refused? Did XFS remount the third time or did systemd just give up after three and reboot anyway? Why would the kernel honor the systemd reboot if the fs is still dirty?

Comment 49 Chris Murphy 2017-03-17 03:59:30 UTC
I get the same three remounting messages on Btrfs.

Remounting '/' read-only with options 'seclabel,space_cache,subvolid=5,subvol=/'.
Remounting '/' read-only with options 'seclabel,space_cache,subvolid=5,subvol=/'.
Remounting '/' read-only with options 'seclabel,space_cache,subvolid=5,subvol=/'.
All filesystems unmounted.

And just like with XFS, the last line claims it's unmounted.
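
To compare captured shutdown logs across file systems, the repeated remount messages can be counted with a couple of one-liners. A sketch, assuming the console output was saved to a text file; the function names are made up, and the patterns only match the message wording shown above.

```shell
# Number of "Remounting '/' read-only" attempts in a saved console log.
count_ro_remounts() {
    grep -c "Remounting '/' read-only" "$1"
}

# Succeeds if the log ends up claiming that everything was unmounted.
claims_all_unmounted() {
    grep -q "All filesystems unmounted" "$1"
}
```

More than one attempt followed by the "unmounted" claim is the pattern seen here on both XFS and Btrfs.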

Comment 50 Chris Murphy 2017-03-21 20:10:46 UTC
OK so there are multiple problems going on:

1. Failure to remount ro due to plymouth, which tells us in the logs "Process 304 (plymouthd) has been marked to be excluded from killing. It is running from the root file system, and thus likely to block re-mounting of the root file system to read-only. Please consider moving it into an initrd file system instead."

2. Rootfs is not umounted, by design
https://github.com/systemd/systemd/blob/master/src/core/umount.c line 413

3. sync happens
https://github.com/systemd/systemd/blob/master/src/core/shutdown.c line 213

However, at least on XFS the changes are only reflected in the journal prior to reboot.

4. Systemd and kernel permit reboot of dirty file system: Both XFS and ext4 are left dirty, it's just that ext4 doesn't manifest by being unbootable, but both e2fsck and normal mount show that it is in fact left dirty after a pk offline update. Btrfs doesn't appear to be affected by any of this.

5. GRUB has no idea about reading the XFS journal, so it doesn't see the sync'd changes, and thus boot failure.


I have no idea really whose fault it is, seems like a bad idea between systemd and the kernel to do a reboot when the fs is still dirty. But the setup for this is Plymouth isn't in the initramfs as its own designers apparently want; so that means this is a dracut bug.

For reference:
https://lists.freedesktop.org/archives/systemd-devel/2017-March/038486.html
https://www.spinics.net/lists/linux-xfs/msg04957.html
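
The message quoted in point 1 is easy to search for in a saved systemd debug log, which helps confirm on other machines whether some process is exempt from being killed at shutdown. A sketch under the assumption that the log is a plain text file and uses exactly this wording; the function name is made up.

```shell
# Print "NAME PID" for every process that systemd reports as excluded from
# killing at shutdown, based on the message wording quoted above.
excluded_from_killing() {
    sed -n 's/.*Process \([0-9][0-9]*\) (\([^)]*\)) has been marked to be excluded from killing.*/\2 \1/p' "$1"
}
```

On the log from comment 45 this would be expected to print a plymouthd line; an empty result means no such exemption was logged.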

Comment 51 Chris Murphy 2017-03-28 17:35:11 UTC
This is a plymouth bug, it should only mark itself exempt from being killed if it runs from an initramfs. Plymouth needs to either remove this kill exemption, or it needs to get baked into the initramfs.

https://lists.freedesktop.org/archives/systemd-devel/2017-March/038527.html

Comment 52 Fedora Blocker Bugs Application 2017-03-28 17:55:49 UTC
Proposed as a Blocker for 26-beta by Fedora user chrismurphy using the blocker tracking app because:

 Beta "The installed system must be able to download and install updates with the default graphical package manager in all release-blocking desktops."
1. single volume XFS layout is valid, permitted by installer
2. system fails to boot following offline update
3. reproduces on baremetal and VM 

Making the installer require /boot on a separate volume seems to fix this, but since all layouts end up with dirty file systems after reboot and depend on fs recovery code fixing things, I think it's a fragile workaround. Ultimately the central problem needs to be fixed: plymouth needs to not inhibit the remount-ro.

Comment 54 Geoffrey Marr 2017-04-10 20:24:11 UTC
Discussed during the 2017-04-10 blocker review meeting: [1]

The decision was made to delay the classification of this bug as we need more data to make an informed decision.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2017-04-10/f26-blocker-review.2017-04-10-16.01.txt

Comment 56 Geoffrey Marr 2017-04-24 19:13:55 UTC
Discussed during the 2017-04-24 blocker review meeting: [1]

The decision to delay the classification of this bug was made as we need more information to classify this bug one way or the other.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2017-04-24/f26-blocker-review.2017-04-24-16.00.txt

Comment 57 Geoffrey Marr 2017-05-01 18:29:07 UTC
Discussed during the 2017-05-01 blocker review meeting: [1]

The decision to delay the classification of this bug was made as we need more data to make an informed decision. Adam W. will reach out to the necessary parties this week to get more information.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2017-05-01/f26-blocker-review.2017-05-01-16.02.txt

Comment 58 Chris Murphy 2017-05-08 03:59:45 UTC
Suggestion: Revert the commit in comment 55, and I'll test a compose to see what happens. 

That commit makes plymouth non-killable by systemd, and is why systemd can't remount-ro, and why the fs isn't cleanly unmounted, and why grub can't figure things out. I suspect the purpose of the commit is to prevent something ugly happening onscreen as systemd kills plymouth. But ugly is better than data loss, no matter how many more people are affected by ugly and how few experience data loss.

Comment 59 Mike Ruckman 2017-05-12 21:22:15 UTC
Discussed in 2017-05-08 Blocker Review Meeting. This bug doesn't qualify as a blocker due to the fact that it's a non-default installation method as well as how invasive the fix is.

Comment 60 Adam Williamson 2017-06-12 23:29:13 UTC
FWIW, I did mail Ray, Harald and Zbigniew about this one. Harald didn't reply, but both Ray and Zbigniew acknowledged there were things to improve in plymouth and systemd here. I'm not sure what the current status of actually fixing them is, however. Guys, any chance of doing something about this yet?

Comment 61 Chris Murphy 2017-06-13 00:18:36 UTC
I spoke at length with file system guys, both on the XFS devel list and fs-devel@, and they are kinda stalled on doing anything. The gist is that sync() on journaled file systems only guarantees it's crash safe: data and journal metadata are flushed to disk, i.e. the log is dirty and the file system itself is not updated. At next boot, the kernel replays the dirty journal and updates the file system.

The problem is bootloaders like GRUB, syslinux, uboot, depend on the file system metadata being correct, they cannot read a dirty journal. And this is not limited to XFS, it applies to ext3/4, and Btrfs even though this specific bug doesn't seem to manifest on those filesystems. And because the file system is not correct, the bootloader binary (the part that executes right after POST) is running before the kernel can clean things up, and fails to find the changed bootloader configuration file, or the new kernel, or the new initramfs.


So that means before rebooting one of three things must happen to make sure the file system is completely up to date, so that the bootloader can find the new shiny things:

a.) the file system is umounted
b.) the file system is remounted-ro
c.) the file system is frozen/unfrozen using fsfreeze

They also argue that because it's the bootloader that needs the file system in such a pristine state, it is the responsibility of the thing that updates the bootloader configuration to do an fsfreeze. i.e. they actually think this is a bootloader bug: specifically on Fedora that's grubby's new-kernel-pkg (and on other distros it would be either grub-mkconfig script, or if they use neither of these, whatever kernel package post-install script modifies the bootloader configuration).

Of course, systemd could come to the rescue and help obviate that work.

Interestingly enough, new-kernel-pkg *does* use fsfreeze for exactly this purpose on PPC64LE only. See lines 923-929:

https://github.com/rhboot/grubby/blob/master/new-kernel-pkg

So actually the fastest fix might be to use that existing code in new-kernel-pkg and just always freeze (?), or maybe it needs a test to check whether the file system all the files are on supports freeze, or error handling if it doesn't, i.e. do both sync and freeze and then just don't blow up if the file system doesn't freeze, like FAT or ext2.

Comment 62 Chris Murphy 2017-06-13 00:33:16 UTC
>always freeze
i.e. regardless of architecture.

So the change needed to that code (I'm sorta eduguessing here):
- comment out 927 and 929
- if /boot is a directory then freeze/unfreeze / (rootfs)
- if /boot is a mountpoint, freeze/unfreeze it

fsfreeze needs a mountpoint, so for this bug, /boot is a dir and I guess fsfreeze won't work on that, the command would need to freeze /. Alternatively, maybe it's acceptable to always sync and freeze/unfreeze both /boot and /?

Also when freezing FAT this is what I get:

[chris@f25s ~]$ sudo fsfreeze -f /boot/efi
fsfreeze: /boot/efi: freeze failed: Operation not supported

So some error handling may be needed, but maybe that's trivial compared to a systemd/plymouth solution.
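
A rough sketch of the "always sync, then freeze/unfreeze, and don't blow up on failure" idea. This is not the new-kernel-pkg code: the function names and the FSFREEZE override variable are assumptions for illustration, and the mount point lookup parses `df -P`, so it will not cope with mount points containing spaces.

```shell
# Mount point of the file system that holds PATH ("/" when /boot is just a
# directory on the root file system, "/boot" when it is its own volume).
mountpoint_of() {
    df -P -- "$1" | awk 'NR == 2 { print $6 }'
}

# flush_boot_fs [PATH]
# sync, then freeze and immediately thaw the file system holding PATH so the
# journal is written back and a bootloader can read the on-disk state.
flush_boot_fs() {
    target=${1:-/boot}
    freeze=${FSFREEZE:-fsfreeze}   # override (e.g. with echo) for a dry run
    mp=$(mountpoint_of "$target") || return 1
    sync
    if $freeze -f "$mp"; then
        $freeze -u "$mp"
    else
        # FAT, for example, fails with "Operation not supported".
        echo "warning: cannot freeze $mp, relying on sync only" >&2
    fi
}
```

Running it for real needs root; with `FSFREEZE=echo` it only prints the freeze and unfreeze arguments it would use.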

Comment 63 Chris Murphy 2017-07-19 04:51:16 UTC
I've opened a bug with grubby.
https://github.com/rhboot/grubby/issues/25

Comment 64 Fedora End Of Life 2017-11-16 19:50:14 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 65 Fedora End Of Life 2017-12-12 10:07:26 UTC
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 66 Chris Murphy 2017-12-13 10:47:02 UTC
Plymouth is doing the wrong thing by exempting itself from being killed. Systemd is doing the wrong thing by rebooting despite umount and remount ro failure, and needs to do an fsfreeze as a fallback. 

And finally the thing doing the bootloader configuration change is *the* central thing most responsible for making sure its changes are committed to disk: that's grub-mkconfig in the generic case and grubby in the Fedora case. Since I changed this bug from grubby to plymouth, the XFS devs made a very compelling case that this is mostly the fault of grubby (and grub), which don't even do sync() let alone fsfreeze. There is code in grubby to do fsfreeze on another arch but it should always do it.

I filed this with upstream grubby and there's been no action. No one has even read the bug in six months, so I wonder if grubby is effectively no longer being maintained. For that reason I'm reluctant to change the bug back to grubby.
https://github.com/rhboot/grubby/issues/25

Comment 67 François Cami 2017-12-13 11:11:41 UTC
FYI upstream grubby does fsync now:
https://github.com/rhboot/grubby/pull/24/commits/174b72ce989b9fee59351118f1d93b94a34d55fe

I'll look at what we are missing there.

Comment 68 Chris Murphy 2017-12-13 11:53:48 UTC
As I explicitly mention in the grubby bug, sync() is not sufficient on journaled file systems. It will work on non-journaled file systems like FAT, where sync() should force a flush of both data and metadata to the block device. But on journaled file systems, sync() only ensures that data and journal are flushed to the block device; it doesn't guarantee the file system itself is clean. To get the functional equivalent of either umount or remount read-only, and so have a clean file system on which the bootloader can find the new grub.cfg, it's necessary to use the FIFREEZE ioctl.
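
To make the distinction above concrete, here is a minimal sketch in Python (not grubby's actual C code). The helper names write_durably and freeze_then_thaw are hypothetical; the ioctl numbers are the Linux <linux/fs.h> values for FIFREEZE and FITHAW on common architectures such as x86_64. The first helper only flushes the file and its directory entry. The second does the freeze/thaw step that forces the journal to be written back, needs CAP_SYS_ADMIN, and fails on file systems that do not support freezing.

```python
import fcntl
import os
import tempfile

# Linux ioctl request numbers from <linux/fs.h>:
# FIFREEZE = _IOWR('X', 119, int), FITHAW = _IOWR('X', 120, int).
FIFREEZE = 0xC0045877
FITHAW = 0xC0045878


def write_durably(path, data):
    """Replace `path` atomically, then fsync the file and its directory.

    Hypothetical helper: this flushes data and the journal, but on a
    journaled file system it does not leave the file system clean.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    # fsync the directory so the rename itself reaches the disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def freeze_then_thaw(mount_point):
    """Freeze and immediately thaw the file system at `mount_point`.

    Hypothetical helper: requires CAP_SYS_ADMIN and raises OSError if the
    caller lacks privileges or the file system cannot be frozen.
    """
    fd = os.open(mount_point, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
        fcntl.ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)
```

A caller would run write_durably("/boot/grub2/grub.cfg", new_config) and then freeze_then_thaw("/boot"), assuming /boot is the mount point that holds grub.cfg. Note that mkstemp creates the file with mode 0600, so a real tool would also need to restore the original permissions.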

Comment 69 Chris Murphy 2017-12-13 12:00:08 UTC
I've filed an upstream bug with GRUB for the grub-mkconfig case (which doesn't really affect Fedora as it only gets used during anaconda installs).
https://savannah.gnu.org/bugs/index.php?52657

Comment 70 François Cami 2017-12-13 12:01:24 UTC
@Chris: Yes indeed.
However, grubby is far from the only piece of code where we need to do this.

Ray, as I'm actively looking at it, would you mind if I owned this bug?

Comment 71 Ray Strode [halfline] 2017-12-13 15:38:28 UTC
Sure, take it!

Fixing plymouth might just need a 

# plymouth update-root-fs --new-root-dir

call in the shutdown path to swivel back into the initramfs.

or it might need something more complicated like sending the drm fd to a fresh process and then exiting the existing process.

Comment 72 François Cami 2017-12-13 15:42:19 UTC
Taken, thanks for the information Ray.

Comment 73 Fedora End Of Life 2018-02-20 15:30:44 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 28 development cycle.
Changing version to '28'.

Comment 74 Chris Murphy 2018-04-23 18:03:40 UTC
DUP bug with discussion.
https://bugzilla.redhat.com/show_bug.cgi?id=1569970

Comment 75 Ben Cotton 2019-05-02 21:02:09 UTC
This message is a reminder that Fedora 28 is nearing its end of life.
On 2019-May-28 Fedora will stop maintaining and issuing updates for
Fedora 28. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '28'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 28 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 76 Ben Cotton 2019-05-28 22:14:11 UTC
Fedora 28 changed to end-of-life (EOL) status on 2019-05-28. Fedora 28 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora, please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

