Bug 1227736 - Minimal grub after a kernel update with gnome-software [NEEDINFO]
Status: NEW
Product: Fedora
Classification: Fedora
Component: plymouth
Version: 25
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assigned To: Ray Strode [halfline]
QA Contact: Fedora Extras Quality Assurance
Whiteboard: RejectedBlocker
Keywords: CommonBugs
Depends On:
Blocks:
Reported: 2015-06-03 08:06 EDT by Sébastien Wilmet
Modified: 2017-07-19 00:51 EDT (History)
CC List: 20 users

Doc Type: Bug Fix
Type: Bug
bugzilla: needinfo? (rstrode)


Attachments
My /var/log/grubby (6.21 KB, text/plain), 2015-06-08 14:33 EDT, Sébastien Wilmet
My /var/log/grubby with Fedora 23 (7.41 KB, text/plain), 2015-12-02 11:32 EST, Sébastien Wilmet
My /etc/default/grub (214 bytes, text/plain), 2015-12-02 11:33 EST, Sébastien Wilmet
My /boot/grub2/grub.cfg (after grub reinstall) (5.60 KB, text/plain), 2015-12-02 14:20 EST, Sébastien Wilmet
My /boot/grub2/grubenv (after grub reinstall) (1.00 KB, text/plain), 2015-12-02 14:21 EST, Sébastien Wilmet
My /var/log/grubby (11.12 KB, text/plain), 2016-01-15 08:38 EST, Sébastien Wilmet
My /boot/grub2/grub.cfg (5.62 KB, text/plain), 2016-01-15 08:40 EST, Sébastien Wilmet
My /boot/grub2/grubenv (1.00 KB, text/plain), 2016-01-15 08:42 EST, Sébastien Wilmet
F25: My /etc/default/grub (218 bytes, text/plain), 2016-12-21 06:45 EST, Sébastien Wilmet
F25: My /var/log/grubby before re-installing grub (14.08 KB, text/plain), 2016-12-21 06:47 EST, Sébastien Wilmet
F25: My /boot/grub2/grub.cfg before re-installing grub (5.70 KB, text/plain), 2016-12-21 06:48 EST, Sébastien Wilmet
F25: My /boot/grub2/grubenv before re-installing grub (1.00 KB, text/plain), 2016-12-21 06:49 EST, Sébastien Wilmet
F25: My /var/log/grubby after re-installing grub (14.08 KB, text/plain), 2016-12-21 06:57 EST, Sébastien Wilmet
F25: My /boot/grub2/grub.cfg after re-installing grub (5.81 KB, text/plain), 2016-12-21 06:58 EST, Sébastien Wilmet
F25: My /boot/grub2/grubenv after re-installing grub (1.00 KB, text/plain), 2016-12-21 06:59 EST, Sébastien Wilmet
List of packages that I install (1.01 KB, text/plain), 2017-03-14 17:19 EDT, Sébastien Wilmet
xfs_repair log (63.51 KB, text/plain), 2017-03-15 01:19 EDT, Chris Murphy
xfs_repair log 2 (216.60 KB, text/plain), 2017-03-15 01:28 EDT, Chris Murphy
dmesg output ThinkPad L530, Fedora 25 (65.03 KB, text/plain), 2017-03-15 05:38 EDT, Sébastien Wilmet
offline update boot, systemd debug log (873.74 KB, text/plain), 2017-03-15 15:45 EDT, Chris Murphy

Description Sébastien Wilmet 2015-06-03 08:06:00 EDT
Description of problem:
After a recent package update with Fedora Workstation 22 (with gnome-software), on the next boot I got a minimal grub (with a message beginning with "Minimal BASH-like line editing is supported."). I had to re-install grub with a live DVD.

In the updated packages, there was the kernel 4.0.4. I think it's the only package that could have altered grub.

With
$ rpm -q --scripts kernel-core-4.0.4-301.fc22.x86_64

I see that /bin/kernel-install is called, and kernel-install is provided by systemd. So it's maybe a bug in systemd, I don't know.

Version-Release number of selected component (if applicable):
Fedora Workstation 22
kernel-4.0.4-301.fc22
systemd-219-15.fc22
gnome-software-3.16.2-2.fc22

Additional info:
If it matters, my / partition (which contains /boot) is XFS. It's a primary partition, I don't use LVM.

I didn't try to reproduce the bug, since it's quite annoying (re-installing grub with a live CD/DVD/USB).
Comment 1 Brian Lane 2015-06-08 12:24:56 EDT
Please attach /var/log/grubby
Comment 2 Sébastien Wilmet 2015-06-08 14:33:17 EDT
Created attachment 1036453 [details]
My /var/log/grubby
Comment 3 Sébastien Wilmet 2015-12-02 11:30:12 EST
I've had the same problem with Fedora 23. I did a fresh install (at beta time). It seems to happen only with gnome-software.

I have another issue, which is maybe related. When a new kernel is installed, the default grub entry is the second one, not the first one. I've edited /etc/default/grub, I don't remember the default content but I have now GRUB_DEFAULT=0. So when that bug happens, I run:
# grub2-mkconfig -o /boot/grub2/grub.cfg

I'll attach my new /var/log/grubby.

This is quite a serious issue. You're lucky that I know how to reinstall grub from a live DVD, and that I'm still using Fedora.
Comment 4 Sébastien Wilmet 2015-12-02 11:32 EST
Created attachment 1101558 [details]
My /var/log/grubby with Fedora 23
Comment 5 Sébastien Wilmet 2015-12-02 11:33 EST
Created attachment 1101559 [details]
My /etc/default/grub
Comment 6 Brian Lane 2015-12-02 14:00:47 EST
Could you also attach /boot/grub2/grub.cfg and /boot/grub2/grubenv? Thanks.
Comment 7 Sébastien Wilmet 2015-12-02 14:20 EST
Created attachment 1101598 [details]
My /boot/grub2/grub.cfg (after grub reinstall)
Comment 8 Sébastien Wilmet 2015-12-02 14:21 EST
Created attachment 1101600 [details]
My /boot/grub2/grubenv (after grub reinstall)
Comment 9 Brian Lane 2015-12-02 14:40:46 EST
Nothing unusual in those. Next time it happens please save copies of the broken configs before fixing them. Without that there isn't much we can do.
Comment 10 Sébastien Wilmet 2016-01-15 08:36:44 EST
I've decided to break my system again and do the update with gnome-software instead of dnf. See the following attachments.
Comment 11 Sébastien Wilmet 2016-01-15 08:38 EST
Created attachment 1115164 [details]
My /var/log/grubby
Comment 12 Sébastien Wilmet 2016-01-15 08:40 EST
Created attachment 1115166 [details]
My /boot/grub2/grub.cfg

The new attachments are the files present on the disk just after the gnome-software update, so with the minimal grub. _Before_ re-installing grub.
Comment 13 Sébastien Wilmet 2016-01-15 08:42 EST
Created attachment 1115168 [details]
My /boot/grub2/grubenv

/boot/grub2/grubenv was actually a symbolic link to /boot/efi/EFI/fedora/grubenv, but I suppose it's normal.
Comment 14 Sébastien Wilmet 2016-04-29 04:58:38 EDT
Happens on Fedora 24 too.
Comment 15 Sébastien Wilmet 2016-12-21 06:42:00 EST
So for me, this bug happens with F22, F23, F24 and now F25.

I'll attach the requested files for F25. Is this bug filed in the correct component? Do you have any clue what's going wrong? As a workaround I can do the updates with dnf, but for a "normal" user, I think the only choice is to install another distro. I take the time to report this bug and try to reproduce it with new Fedora versions, because it's a critical problem. This problem will still likely be present in Fedora Atomic/rpm-ostree Workstation, where it won't be possible to do the updates with dnf.
Comment 16 Sébastien Wilmet 2016-12-21 06:45 EST
Created attachment 1234304 [details]
F25: My /etc/default/grub
Comment 17 Sébastien Wilmet 2016-12-21 06:47 EST
Created attachment 1234305 [details]
F25: My /var/log/grubby before re-installing grub
Comment 18 Sébastien Wilmet 2016-12-21 06:48 EST
Created attachment 1234307 [details]
F25: My /boot/grub2/grub.cfg before re-installing grub
Comment 19 Sébastien Wilmet 2016-12-21 06:49 EST
Created attachment 1234308 [details]
F25: My /boot/grub2/grubenv before re-installing grub
Comment 20 Sébastien Wilmet 2016-12-21 06:53:23 EST
The gnome-software update contained an update to kernel-4.8.14-300.fc25.
Comment 21 Sébastien Wilmet 2016-12-21 06:57 EST
Created attachment 1234309 [details]
F25: My /var/log/grubby after re-installing grub
Comment 22 Sébastien Wilmet 2016-12-21 06:58 EST
Created attachment 1234312 [details]
F25: My /boot/grub2/grub.cfg after re-installing grub
Comment 23 Sébastien Wilmet 2016-12-21 06:59 EST
Created attachment 1234313 [details]
F25: My /boot/grub2/grubenv after re-installing grub
Comment 24 Sébastien Wilmet 2016-12-21 07:13:00 EST
/var/log/grubby and /boot/grub2/grubenv have exactly the same content before and after re-installing grub.

In /boot/grub2/grub.cfg before re-installing grub, there is:

menuentry 'Fedora (4.8.14-300.fc25.x86_64) 25 (Workstation Edition)' --class fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-4.8.0-0.rc7.git0.1.fc25.x86_64-advanced-5d46062a-d140-441f-b55f-306b39fab6db' {

-> 4.8.0-0.rc7, the bug probably comes from that.
Comment 25 Sébastien Wilmet 2016-12-21 08:48:37 EST
By looking at `dnf history`, kernel-4.8.0-0.rc7.git0.1.fc25.x86_64 was the first kernel removed (and thus the first installed). I did a fresh install of F25 beta (I always do a fresh install of new Fedora versions).
Comment 26 Sébastien Wilmet 2017-03-13 14:29:11 EDT
I've had the same bug on another computer (where I install more or less the same set of packages).

Someone else has had the same problem here:
https://ask.fedoraproject.org/en/question/99550/fedora-25-no-longer-boots-after-upgrade-how-to-fix/

Should I propose this bug as a blocker?
Comment 27 Heiko Adams 2017-03-13 16:17:50 EDT
Got the same issue after installing Fedora 26 Workstation from live cd
Comment 28 Alexander Ploumistos 2017-03-13 16:45:13 EDT
Is your system (U)EFI by any chance? If yes, you need to change the linux16 entries to linuxefi in grub.cfg. I've seen something similar on a dual boot laptop, where windows messed with the efi partition and when I repaired grub from a live USB, instead of kernel entries starting with linuxefi, I kept getting linux16.
Comment 29 Chris Murphy 2017-03-13 22:33:04 EDT
I've never seen this before, Fedora 22 through Fedora 26. It would be most helpful to have exact steps to reproduce, as tedious as that is.

I can't tell what step fixed the problem. Was it grub2-install or was it grub2-mkconfig? Getting a grub prompt suggests that core.img is loaded, and normal.mod is found which means it's reading XFS OK, but isn't finding the grub.cfg, and hence no menu. Does the problem manifest as:
grub> 
OR
grub rescue>
?
The latter suggests core.img is found, but not normal.mod. In all of my testing, I've never had a long lived /boot on XFS. But there isn't enough information to know whether this is some on-disk change in XFS that an older embedded GRUB core.img isn't able to read, in which case grub2-install would fix it, or whether grubby is misfiring the modification of grub.cfg and the grub.cfg is therefore missing.

Basically we need a computer with the problem, not yet fixed, and then find out from the grub prompt whether it can in fact see files in /boot/grub2 and read them. This can be done with the grub commands ls and configfile, and it's possible to use tab for path autocompletion.
Comment 30 Sébastien Wilmet 2017-03-14 16:42:43 EDT
(In reply to Alexander Ploumistos from comment #28)
> Is your system (U)EFI by any chance?

No, the UEFI/Legacy Boot option is set to "Legacy Only".

(In reply to Chris Murphy from comment #29)

I can reproduce the problem later this week and provide more info. I don't remember if the prompt is grub> or grub rescue>.
Comment 31 Chris Murphy 2017-03-14 16:57:14 EDT
> No, the UEFI/Legacy Boot option is set to "Legacy Only".

So it is UEFI, but it's using a compatibility support module to present a faux BIOS. This is really suboptimal and makes it an edge case.

Is this a multiboot system or Fedora only? If multiboot, what other OS's are present?


> I can reproduce the problem later this week and provide more info. I don't
> remember if the prompt is grub> or grub rescue>.

And the problem never happens with 'dnf' updates, but always happens with gnome-software updates? (I am assuming either form of update includes a kernel update.)

What's weird about that is the RPM is the same, and all the scripts run following kernel install are the same. I can't think of why a pk offline update via systemd would make any difference to bootloader stuff compared to dnf; about the only thing I can think of is that maybe there's a more abrupt reboot with pk offline update. Maybe the drive is lying when it gets fsync, and a faster reboot means an orphaned/missing grub.cfg? Seems specious though...
Comment 32 Chris Murphy 2017-03-14 17:00:23 EDT
If the problem is reproduced, when booting from alternate media and mounting the file system for the first time, I'd like to see a complete dmesg from that environment. In particular I'm interested in seeing the messages for the first mount of the XFS volume, following the update resulting in a grub prompt. After that first mount, if you can umount the volume, and run xfs_repair -n and attach the results that might be useful.

Is this XFS volume originally created with Fedora 22, and the system has been upgraded to 23 > 24 > 25? Or was it ever clean installed (new file system created)?
Comment 33 Sébastien Wilmet 2017-03-14 17:19 EDT
Created attachment 1263052 [details]
List of packages that I install

I always follow these steps when I install Fedora (for my own use):
1. Install Fedora Workstation
2. `dnf upgrade` + reboot (not with gnome-software).
3. Install attached list of packages
4. Disable selinux + reboot

Then each time there is a kernel update done by gnome-software, I get the minimal grub. There is no problem with a kernel update with dnf.
Comment 34 Sébastien Wilmet 2017-03-14 17:28:51 EDT
(In reply to Chris Murphy from comment #31)
> Is this a multiboot system or Fedora only? If multiboot, what other OS's are
> present?

Fedora only on one computer (with UEFI legacy mode). Dual-boot with Windows 7 on my day-job computer (I don't know if it's BIOS or UEFI or UEFI legacy, I can tell you that next week).

(In reply to Chris Murphy from comment #32)
> Is this XFS volume originally created with Fedora 22, and the system has
> been upgraded to 23 > 24 > 25? Or was it ever clean installed (new file
> system created)?

I always do a fresh install, by reformatting the / XFS partition (containing /boot).
Comment 35 Sébastien Wilmet 2017-03-14 17:51:31 EDT
My partitions:
- /dev/sda1 20GB XFS mounted on /
- /dev/sda2 4GB swap
- /dev/sda3 [rest of the disk] ext4 mounted on /home
Comment 36 Sébastien Wilmet 2017-03-14 18:04:41 EDT
See also comment #24, the /boot/grub2/grub.cfg file before re-installing grub was suspicious.
Comment 37 Chris Murphy 2017-03-14 18:15:23 EDT
(In reply to Sébastien Wilmet from comment #33)
> Created attachment 1263052 [details]
> List of packages that I install
> 
> I always follow these steps when I install Fedora (for my own use):
> 1. Install Fedora Workstation
> 2. `dnf upgrade` + reboot (not with gnome-software).
> 3. Install attached list of packages
> 4. Disable selinux + reboot
> 
> Then each time there is a kernel update done by gnome-software, I get the
> minimal grub. There is no problem with a kernel update with dnf.

I don't understand the last sentence in the context of everything that came before it. 'dnf update' after a clean install does a kernel update using dnf. You're saying you get different results if you do 'dnf update' vs 'dnf update kernel' ?
Comment 38 Chris Murphy 2017-03-14 18:32:48 EDT
> Then each time there is a kernel update done by gnome-software, I get the
> minimal grub. There is no problem with a kernel update with dnf.

Nevermind. I grok this now.
Comment 39 Chris Murphy 2017-03-15 00:38:48 EDT
OK I have a reproducer in a VM, very simple.
1. Install Fedora 25, custom partitioning, single standard partition mounted at /, format XFS. That's it, no other partitions.
2. Reboot from installer.
3. Drop to a VT shell, to prevent gnome-software from downloading updates.
4. dnf update --exclude=kernel-*
5. Reboot
6. Login to gnome-shell, launch Gnome Software, available OS update information shows only kernel packages to install. Click on Restart & Install.
## System reboots, I see plymouth splash with update status, and then reboot to grub> prompt.

Observations:
a. I can use ls and configfile to navigate to a grub.cfg - which is present but using configfile it doesn't load - I just get another grub> prompt with no error. So this file is empty or otherwise not readable.
b. Reboot with install media; and run xfs_repair -n.

Lots of crazy stuff that I'm not expecting. Many of these entries, each with a different inode number.
imap claims in-use inode 661691 is free, would correct imap
imap claims in-use inode 668038 is free, would correct imap

c. 
# mount -o ro,norecovery /dev/sda1 /mnt

The grub.cfg is a 0 length file.

d. 
# umount /mnt
# blockdev --setro /dev/sda1
# mount /dev/sda1 /mnt
mount: /dev/sda1 is write-protected, mounting read-only
mount: cannot mount /dev/sda1 read-only


[  880.621327] XFS (sda1): Unmounting Filesystem
[  884.909167] XFS (sda1): Mounting V5 Filesystem
[  884.951148] XFS (sda1): recovery required on read-only device.
[  884.951150] XFS (sda1): write access unavailable, cannot proceed.
[  884.951152] XFS (sda1): log mount/recovery failed: error -30
[  884.951197] XFS (sda1): log mount failed

# blockdev --setrw /dev/sda1
# mount /dev/sda1 /mnt

No error in user space. Kernel messages...

[ 1209.577452] XFS (sda1): Mounting V5 Filesystem
[ 1209.625671] XFS (sda1): Starting recovery (logdev: internal)
[ 1209.793035] XFS (sda1): Ending recovery (logdev: internal)

And now grub.cfg is not 0 length, 4970 bytes.

So clearly the file system is in a dirty state after the pk offline update is done and systemd does a reboot. I guess it's not doing fsync or fdatasync, or not waiting long enough for the kernel to finish it; no idea. But the fs is dirty, so journal replay is necessary to make it consistent again, and GRUB can't do journal replay.

e. Now that journal replay has been done, xfs_repair -n comes up clean, and the system also reboots.
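
The check in observation c can be scripted from rescue media. What follows is only a minimal sketch: grub_cfg_state is a hypothetical helper name (it is not part of GRUB or grubby), and the device /dev/sda1 and mount point /mnt are assumptions taken from this reproducer.

```shell
#!/bin/sh
# Hypothetical helper (not part of GRUB or grubby): report whether a
# grub.cfg is missing, zero-length, or has content.
grub_cfg_state() {
    cfg=$1
    if [ ! -e "$cfg" ]; then
        echo missing
    elif [ ! -s "$cfg" ]; then
        # Present but 0 bytes: what GRUB sees before the journal is replayed.
        echo empty
    else
        echo ok
    fi
}

# Intended use from rescue media, as root (device and mount point are
# assumptions from this reproducer):
#   mount -o ro,norecovery /dev/sda1 /mnt
#   grub_cfg_state /mnt/boot/grub2/grub.cfg
#   umount /mnt
```

Mounting with ro,norecovery matters here: a normal mount replays the journal and would hide the state that GRUB actually saw.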
Comment 40 Chris Murphy 2017-03-15 01:19 EDT
Created attachment 1263175 [details]
xfs_repair log
Comment 41 Chris Murphy 2017-03-15 01:28 EDT
Created attachment 1263176 [details]
xfs_repair log 2

previous one was truncated
Comment 42 Chris Murphy 2017-03-15 02:24:23 EDT
Sébastien can you do a clean boot of one of these machines, and attach dmesg file to this bug report. All I'm looking for is the kernel discovery of the drive you're using XFS on. Thanks.
Comment 43 Sébastien Wilmet 2017-03-15 05:38 EDT
Created attachment 1263254 [details]
dmesg output ThinkPad L530, Fedora 25
Comment 44 Eric Sandeen 2017-03-15 14:48:15 EDT
The core question here, at least in Chris' case, is why the log is dirty after what should be a clean reboot.  After that, the errors in e.g. xfs_repair -n are expected until the log gets replayed.

Do you have logs from when the system rebooted and produced this dirty log?
Comment 45 Chris Murphy 2017-03-15 15:45 EDT
Created attachment 1263442 [details]
offline update boot, systemd debug log

This boot is the one after choosing Restart & Install in gnome-software; so this is the system update and reboot. I've set parameters
systemd.log_level=debug systemd.log_target=console console=ttyS0,38400
And capturing with 'virsh console log'

The gotcha though is the problem doesn't happen as reported with systemd debugging. The subsequent boot does have a GRUB menu but it's stale; it only shows one kernel. If I choose it, during boot there is log replay. I reboot again and now I have two kernel entries.

(This is the same as reported on XFS list, but is the entire log for the boot that results in the dirty fs.)
Comment 46 Chris Murphy 2017-03-15 15:51:55 EDT
(In reply to Eric Sandeen from comment #44)
> Do you have logs from when the system rebooted and produced this dirty log?

See comment 45 and let me know if that's what you're looking for.

During pk offline update, journald is writing to the persistent journal and it's possible to extract that and post it; it's a rather different perspective than the console output for whatever reason; however journald stops before the ro remount of root fs (where /boot/grub2/grub.cfg is located) so a bunch of stuff isn't logged.
Comment 47 Eric Sandeen 2017-03-15 16:12:55 EDT
I think Darrick is on the right track on the xfs list; something prevented xfs from remounting ro during shutdown, and so the log was not written out & cleared.  It'll take some system(d)-level sleuthing to prove or disprove that theory...

Unfortunately I don't think there are any xfs tracepoints currently in place that would help us figure out if the remount,ro was successful or not.
Comment 48 Chris Murphy 2017-03-15 16:31:04 EDT
The three remount attempts are the smoke, but they just make me ask more questions. Why three? Was the remount refused? Did XFS remount the third time or did systemd just give up after three and reboot anyway? Why would the kernel honor the systemd reboot if the fs is still dirty?
Comment 49 Chris Murphy 2017-03-16 23:59:30 EDT
I get the same three remounting messages on Btrfs.

Remounting '/' read-only with options 'seclabel,space_cache,subvolid=5,subvol=/'.
Remounting '/' read-only with options 'seclabel,space_cache,subvolid=5,subvol=/'.
Remounting '/' read-only with options 'seclabel,space_cache,subvolid=5,subvol=/'.
All filesystems unmounted.

And just like with XFS, the last line claims it's unmounted.
Comment 50 Chris Murphy 2017-03-21 16:10:46 EDT
OK so there are multiple problems going on:

1. Failure to remount ro due to plymouth, which tells us in the logs "Process 304 (plymouthd) has been marked to be excluded from killing. It is running from the root file system, and thus likely to block re-mounting of the root file system to read-only. Please consider moving it into an initrd file system instead."

2. Rootfs is not umounted, by design
https://github.com/systemd/systemd/blob/master/src/core/umount.c line 413

3. sync happens
https://github.com/systemd/systemd/blob/master/src/core/shutdown.c line 213

However, at least on XFS the changes are only reflected in the journal prior to reboot.

4. Systemd and kernel permit reboot of dirty file system: Both XFS and ext4 are left dirty, it's just that ext4 doesn't manifest by being unbootable, but both e2fsck and normal mount show that it is in fact left dirty after a pk offline update. Btrfs doesn't appear to be affected by any of this.

5. GRUB has no idea about reading the XFS journal, so it doesn't see the sync'd changes, and thus boot failure.
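
For anyone checking their own captured shutdown console log for the same symptoms, here is a rough sketch. scan_shutdown_log is a hypothetical helper, and the two message strings it searches for are copied from this report; other systemd versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: count the two shutdown-log signals described in
# this bug. The search strings are taken verbatim from this report.
scan_shutdown_log() {
    log=$1
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    remounts=$(grep -c "Remounting '/' read-only" "$log" || true)
    excluded=$(grep -c "has been marked to be excluded from killing" "$log" || true)
    echo "remount_attempts=$remounts excluded_processes=$excluded"
}

# Example (the log file name is an assumption):
#   scan_shutdown_log offline-update-console.log
```

Several remount attempts together with an excluded process match the pattern in items 1 and 4, but they do not prove that the file system was left dirty.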


I have no idea really whose fault it is, seems like a bad idea between systemd and the kernel to do a reboot when the fs is still dirty. But the setup for this is Plymouth isn't in the initramfs as its own designers apparently want; so that means this is a dracut bug.

For reference:
https://lists.freedesktop.org/archives/systemd-devel/2017-March/038486.html
https://www.spinics.net/lists/linux-xfs/msg04957.html
Comment 51 Chris Murphy 2017-03-28 13:35:11 EDT
This is a plymouth bug, it should only mark itself exempt from being killed if it runs from an initramfs. Plymouth needs to either remove this kill exemption, or it needs to get baked into the initramfs.

https://lists.freedesktop.org/archives/systemd-devel/2017-March/038527.html
Comment 52 Fedora Blocker Bugs Application 2017-03-28 13:55:49 EDT
Proposed as a Blocker for 26-beta by Fedora user chrismurphy using the blocker tracking app because:

 Beta "The installed system must be able to download and install updates with the default graphical package manager in all release-blocking desktops."
1. single volume XFS layout is valid, permitted by installer
2. system fails to boot following offline update
3. reproduces on baremetal and VM 

Making the installer require /boot on a separate volume seems to fix this, but since all layouts end up with dirty file systems after reboot and depend on fs recovery code fixing things, I think it's a fragile workaround. Ultimately the central problem needs to be fixed: plymouth needs to not inhibit the remount-ro.
Comment 54 Geoffrey Marr 2017-04-10 16:24:11 EDT
Discussed during the 2017-04-10 blocker review meeting: [1]

The decision was made to delay the classification of this bug as we need more data to make an informed decision.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2017-04-10/f26-blocker-review.2017-04-10-16.01.txt
Comment 56 Geoffrey Marr 2017-04-24 15:13:55 EDT
Discussed during the 2017-04-24 blocker review meeting: [1]

The decision to delay the classification of this bug was made as we need more information to classify this bug one way or the other.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2017-04-24/f26-blocker-review.2017-04-24-16.00.txt
Comment 57 Geoffrey Marr 2017-05-01 14:29:07 EDT
Discussed during the 2017-05-01 blocker review meeting: [1]

The decision to delay the classification of this bug was made as we need more data to make an informed decision. Adam W. will reach out to the necessary parties this week to get more information.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2017-05-01/f26-blocker-review.2017-05-01-16.02.txt
Comment 58 Chris Murphy 2017-05-07 23:59:45 EDT
Suggestion: Revert the commit in comment 55, and I'll test a compose to see what happens. 

That commit makes plymouth non-killable by systemd, and is why systemd can't remount-ro, and why the fs isn't cleanly unmounted, and why grub can't figure things out. I suspect the purpose of the commit is to prevent something ugly happening onscreen as systemd kills plymouth. But ugly is better than data loss, no matter how many more people are affected by ugly and how few experience data loss.
Comment 59 Mike Ruckman 2017-05-12 17:22:15 EDT
Discussed in 2017-05-08 Blocker Review Meeting. This bug doesn't qualify as a blocker due to the fact that it's a non-default installation method as well as how invasive the fix is.
Comment 60 Adam Williamson 2017-06-12 19:29:13 EDT
FWIW, I did mail Ray, Harald and Zbigniew about this one. Harald didn't reply, but both Ray and Zbigniew acknowledged there were things to improve in plymouth and systemd here. I'm not sure what the current status of actually fixing them is, however. Guys, any chance of doing something about this yet?
Comment 61 Chris Murphy 2017-06-12 20:18:36 EDT
I spoke at length with file system guys, both on the XFS devel list and fs-devel@, and they are kinda stalled on doing anything. The gist is that sync() on journaled file systems only guarantees it's crash safe: data and journal metadata are flushed to disk, i.e. the log is dirty and the file system itself is not updated. At next boot, the kernel replays the dirty journal and updates the file system.

The problem is bootloaders like GRUB, syslinux, uboot, depend on the file system metadata being correct, they cannot read a dirty journal. And this is not limited to XFS, it applies to ext3/4, and Btrfs even though this specific bug doesn't seem to manifest on those filesystems. And because the file system is not correct, the bootloader binary (the part that executes right after POST) is running before the kernel can clean things up, and fails to find the changed bootloader configuration file, or the new kernel, or the new initramfs.


So that means before rebooting one of three things must happen to make sure the file system is completely up to date, so that the bootloader can find the new shiny things:

a.) the file system is umounted
b.) the file system is remounted-ro
c.) the file system is frozen/unfrozen using fsfreeze

They also argue that because it's the bootloader that needs the file system in such a pristine state, it is the responsibility of the thing that updates the bootloader configuration to do an fsfreeze. i.e. they actually think this is a bootloader bug: specifically on Fedora that's grubby's new-kernel-pkg (and on other distros it would be either grub-mkconfig script, or if they use neither of these, whatever kernel package post-install script modifies the bootloader configuration).

Of course, systemd could come to the rescue and help obviate that work.

Interestingly enough, new-kernel-pkg *does* use fsfreeze for exactly this purpose on PPC64LE only. See lines 923-929:

https://github.com/rhboot/grubby/blob/master/new-kernel-pkg

So the fastest fix might actually be to use that existing code in new-kernel-pkg and just always freeze (?). Or maybe it needs a test to check whether the file system that all the files are on supports freezing, or error handling if it doesn't, i.e. do both sync and freeze and then just don't blow up if the file system doesn't freeze, like FAT or ext2.
Comment 62 Chris Murphy 2017-06-12 20:33:16 EDT
>always freeze
i.e. regardless of architecture.

So the change needed to that code (I'm sorta eduguessing here):
- comment out 927 and 929
- if /boot is a directory then freeze/unfreeze / (rootfs)
- if /boot is a mountpoint, freeze/unfreeze it

fsfreeze needs a mountpoint, so for this bug, /boot is a dir and I guess fsfreeze won't work on that, the command would need to freeze /. Alternatively, maybe it's acceptable to always sync and freeze/unfreeze both /boot and /?

Also when freezing FAT this is what I get:

[chris@f25s ~]$ sudo fsfreeze -f /boot/efi
fsfreeze: /boot/efi: freeze failed: Operation not supported

So some error handling may be needed, but maybe that's trivial compared to a systemd/plymouth solution.
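
To make the proposal concrete, here is a rough sketch of the logic described above. It is not the actual new-kernel-pkg code; pick_freeze_target and flush_boot_fs are hypothetical names. It assumes the util-linux tools mountpoint and fsfreeze are available, and flush_boot_fs has to run as root.

```shell
#!/bin/sh
# Sketch only, not the real new-kernel-pkg code.

# Pick the mount point to freeze: /boot if it is its own mount,
# otherwise the root file system (the single-volume layout in this bug).
pick_freeze_target() {
    boot_dir=${1:-/boot}
    if mountpoint -q "$boot_dir" 2>/dev/null; then
        echo "$boot_dir"
    else
        echo /
    fi
}

# Sync, then freeze and immediately unfreeze, so that pending changes
# reach the file system proper and not only the journal. Requires root.
# A failed freeze (e.g. FAT: "Operation not supported") is a warning only.
flush_boot_fs() {
    target=$(pick_freeze_target "${1:-/boot}")
    sync
    if fsfreeze -f "$target"; then
        fsfreeze -u "$target"
    else
        echo "warning: could not freeze $target, relying on sync only" >&2
    fi
}

# Intended use after the bootloader configuration has been rewritten:
#   flush_boot_fs /boot
```

Treating a failed freeze as a warning rather than an error is one way to cover the FAT case shown above; whether that is acceptable is up to the grubby maintainers.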
Comment 63 Chris Murphy 2017-07-19 00:51:16 EDT
I've opened a bug with grubby.
https://github.com/rhboot/grubby/issues/25
