Bug 1703700 - Newer kernels do not boot and have invalid grub.cfg entries (on Xen DomU guests)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: grub2
Version: 30
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Javier Martinez Canillas
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedFreezeException
Duplicates: 1679759
Depends On:
Blocks: F31FinalBlocker F31FinalFreezeException
 
Reported: 2019-04-27 17:12 UTC by Steven Haigh
Modified: 2019-10-24 18:47 UTC (History)

Fixed In Version: grub2-2.02-99.fc31 grub2-2.02-100.fc31 grub2-2.02-83.fc30
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-23 20:43:27 UTC
Type: Bug
Embargoed:


Attachments
grub.cfg which was not updated with new kernel packages. (7.28 KB, text/plain)
2019-08-08 16:07 UTC, Steven Haigh
grub.cfg after booting into the only working kernel and running grub2-mkconfig -o /boot/grub/grub.cfg (6.54 KB, text/plain)
2019-08-08 16:08 UTC, Steven Haigh
[PATCH] 99-grub-mkconfig: Disable BLS usage for Xen DomU guests (2.84 KB, patch)
2019-10-09 10:07 UTC, Javier Martinez Canillas
[PATCH] 99-grub-mkconfig: Disable BLS usage for Xen DomU guests (3.10 KB, patch)
2019-10-09 15:26 UTC, Javier Martinez Canillas
strace of the pygrub attempt (141.70 KB, text/plain)
2019-10-24 18:47 UTC, Adam Williamson

Description Steven Haigh 2019-04-27 17:12:36 UTC
I've just done a heap of updates from F29 -> F30 as Xen DomU guests.

If GRUB_ENABLE_BLSCFG=true is set in /etc/default/grub, then none of the systems can boot after the upgrade.

Setting GRUB_ENABLE_BLSCFG=false restores the boot menu - however GRUB_DEFAULT=0 does not seem to correctly apply when generating the grub.cfg.

If there are two boot entries - 1) the kernel, 2) the rescue image, even with GRUB_DEFAULT=0, the second entry (the rescue image) becomes the default boot target.

Current F30 grub packages installed:
# rpm -qa | grep grub | sort
grub2-common-2.02-78.fc30.noarch
grub2-pc-2.02-78.fc30.x86_64
grub2-pc-modules-2.02-78.fc30.noarch
grub2-tools-2.02-78.fc30.x86_64
grub2-tools-efi-2.02-78.fc30.x86_64
grub2-tools-extra-2.02-78.fc30.x86_64
grub2-tools-minimal-2.02-78.fc30.x86_64
grubby-8.40-30.fc30.x86_64

Comment 1 Steven Haigh 2019-04-27 17:14:08 UTC
Sorry, I can't edit the comment - but forgot one part:

After setting GRUB_ENABLE_BLSCFG=false, you then need to run grub2-mkconfig -o /boot/grub2/grub.cfg.

After doing that and rebooting, the default entry (GRUB_DEFAULT=0) is not the first entry in the list.
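
The workaround in the two comments above (disable BLS, then regenerate grub.cfg) can be sketched as a small helper. This is only an illustration: disable_blscfg is a hypothetical function name, not something shipped by grub2, and it assumes GNU sed as found on Fedora. It takes the defaults file as an argument so it can be tried on a copy first.

```bash
#!/bin/bash
# Sketch only: force GRUB_ENABLE_BLSCFG=false in a grub defaults file.
# disable_blscfg is a hypothetical helper, not part of any grub2 package.
disable_blscfg() {
    local defaults="${1:-/etc/default/grub}"
    if grep -q '^GRUB_ENABLE_BLSCFG=' "$defaults"; then
        # Rewrite the existing assignment, whatever its current value is.
        sed -i 's/^GRUB_ENABLE_BLSCFG=.*/GRUB_ENABLE_BLSCFG=false/' "$defaults"
    else
        # No assignment present yet: append one.
        echo 'GRUB_ENABLE_BLSCFG=false' >> "$defaults"
    fi
}

# Usage on a real guest (as root), followed by the regeneration step:
#   disable_blscfg /etc/default/grub
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

Matching any existing value, rather than only "true", keeps the helper safe to run more than once.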

Comment 2 Peter Bieringer 2019-05-01 09:02:41 UTC
This is not limited to Xen; it looks like GRUB_ENABLE_BLSCFG=true also causes issues on real hardware systems. See my comment here:

https://bugzilla.redhat.com/show_bug.cgi?id=1652806#c64

Comment 3 Adam Williamson 2019-05-01 15:05:16 UTC
Peter: this bug was specifically filed to be about the Xen case, not about any others. BLS is a big change in F30, it is entirely possible for there to be multiple different bugs in it (in fact there have been at least a dozen different ones so far). Just because you both have a case where the system fails to boot and it seems to be BLS-related does not mean you are hitting the same bug.

Please either follow up on https://bugzilla.redhat.com/show_bug.cgi?id=1652806 or file a new bug, but unless Javier determines that you and Steven are actually hitting the same problem, let's not assume you are...

Comment 4 Steven Haigh 2019-05-14 13:56:30 UTC
Thinking about this further - and noticing it being referenced on xen-devel mailing list, I would like to suggest the following - which may have been overlooked right now...

If the grub %post scripting checked to see if it was installing / upgrading in a Xen DomU, it could set 'GRUB_ENABLE_BLSCFG=false' in /etc/default/grub automatically. This would fix both new installs and upgrades.

The final fix would be figuring out why pygrub currently boots the *second* entry in the resulting grub.cfg - unlike how F29 worked. This may be either a fix on the grub2-mkconfig or pygrub side - I'm not quite sure yet. This would likely restore functionality completely. At least until something else more suitable is done?

Comment 5 Steven Haigh 2019-07-01 05:46:55 UTC
For what it's worth, newer kernels still don't appear in the grub menu.

I'm required to run 'grub2-mkconfig -o /boot/grub2/grub.cfg' manually every time a new kernel is installed.

I have tried with the grubby-deprecated package installed also with no resolution.

Comment 6 Javier Martinez Canillas 2019-07-01 08:20:34 UTC
(In reply to Steven Haigh from comment #5)
> For what it's worth, newer kernels still don't appear in the grub menu.
> 
> I'm required to run 'grub2-mkconfig -o /boot/grub2/grub.cfg' manually every
> time a new kernel is installed.
> 
> I have tried with the grubby-deprecated package installed also with no
> resolution.

And did you disable BLS (set GRUB_ENABLE_BLSCFG=false in /etc/default/grub and re-generate your grub.cfg with grub2-mkconfig) when installing the grubby-deprecated package?

Comment 7 Steven Haigh 2019-07-01 08:27:50 UTC
Yes. I set: GRUB_ENABLE_BLSCFG=false

I've had to recover many VMs that fail to boot because of a new kernel install - but after finding an old kernel that is still present on the disk, a manual run of grub2-mkconfig causes things to be fine again. Until the next kernel update.

Comment 8 Steven Haigh 2019-07-22 15:08:40 UTC
As further reference, even an upgrade to kernel 5.1.18 across the board has had me run grub2-mkconfig on several machines that fail to boot.

Most were last booted with 5.1.16 - which still appeared in the menu - but either didn't have 5.1.18, or the boot failed.

Booting into 5.1.16, running grub2-mkconfig and then rebooting allows a successful boot into kernel 5.1.18.

Any suggestions would be good, as this is a royal pain in the butt.

Comment 9 Steven Haigh 2019-07-22 15:15:25 UTC
# rpm -qa | grep grub | sort
grub2-common-2.02-81.fc30.noarch
grub2-pc-2.02-81.fc30.x86_64
grub2-pc-modules-2.02-81.fc30.noarch
grub2-tools-2.02-81.fc30.x86_64
grub2-tools-efi-2.02-81.fc30.x86_64
grub2-tools-extra-2.02-81.fc30.x86_64
grub2-tools-minimal-2.02-81.fc30.x86_64
grubby-8.40-31.fc30.x86_64
grubby-deprecated-8.40-31.fc30.x86_64

# cat /etc/default/grub 
GRUB_TIMEOUT=1
GRUB_DEFAULT=0
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="audit=0 selinux=0 console=hvc0"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=false

Comment 10 Javier Martinez Canillas 2019-07-22 15:28:07 UTC
(In reply to Steven Haigh from comment #8)
> As further reference, even an upgrade to kernel 5.1.18 across the board has
> had me run grub2-mkconfig on several machines that fail to boot.
> 
> Most were last booted with 5.1.16 - which still appeared in the menu - but
> either didn't have 5.1.18, or the boot failed.
> 
> Booting into 5.1.16, running grub2-mkconfig and then rebooting allows a
> successful boot into kernel 5.1.18.
> 
> Any suggestions would be good, as this is a royal pain in the butt.

Do you have the grubby-deprecated package installed?

Comment 11 Steven Haigh 2019-07-23 00:48:28 UTC
> Do you have the grubby-deprecated package installed?

Affirm. I added config + installed packages in Comment #9.

Comment 12 Javier Martinez Canillas 2019-07-31 21:39:44 UTC
(In reply to Steven Haigh from comment #11)
> > Do you have the grubby-deprecated package installed?
> 
> Affirm. I added config + installed packages in Comment #9.

I see. Then new entries should be added to your grub.cfg by the old grubby tool (that's installed by the grubby-deprecated package).

Can you please share your grub.cfg ?

Comment 13 Steven Haigh 2019-08-03 00:18:16 UTC
Does this need to be from a machine that is 'faulty' or after running grub2-mkconfig?

Comment 14 Steven Haigh 2019-08-08 16:07:35 UTC
Created attachment 1601869 [details]
grub.cfg which was not updated with new kernel packages.

Comment 15 Steven Haigh 2019-08-08 16:08:28 UTC
Created attachment 1601870 [details]
grub.cfg after booting into the only working kernel and running grub2-mkconfig -o /boot/grub/grub.cfg

Comment 16 Steven Haigh 2019-09-18 03:57:16 UTC
Looking into this further as it's still an issue....

I have removed everything to do with grubby, as looking at the kernel scripts, we run:

posttrans scriptlet (using /bin/sh):
/bin/kernel-install add 5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz || exit $?

Running kernel-install manually with the verbose step:

# /bin/kernel-install --verbose add 5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz
+/usr/lib/kernel/install.d/00-entry-directory.install add 5.2.14-200.fc30.x86_64 /boot/ebdacf59978342fdb2b3d376662bb059/5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/20-grub.install add 5.2.14-200.fc30.x86_64 /boot/ebdacf59978342fdb2b3d376662bb059/5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/20-grubby.install add 5.2.14-200.fc30.x86_64 /boot/ebdacf59978342fdb2b3d376662bb059/5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/50-depmod.install add 5.2.14-200.fc30.x86_64 /boot/ebdacf59978342fdb2b3d376662bb059/5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz 
Running depmod -a 5.2.14-200.fc30.x86_64
+/usr/lib/kernel/install.d/50-dracut.install add 5.2.14-200.fc30.x86_64 /boot/ebdacf59978342fdb2b3d376662bb059/5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/51-dracut-rescue.install add 5.2.14-200.fc30.x86_64 /boot/ebdacf59978342fdb2b3d376662bb059/5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/90-loaderentry.install add 5.2.14-200.fc30.x86_64 /boot/ebdacf59978342fdb2b3d376662bb059/5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz 
+/usr/lib/kernel/install.d/99-grub-mkconfig.install add 5.2.14-200.fc30.x86_64 /boot/ebdacf59978342fdb2b3d376662bb059/5.2.14-200.fc30.x86_64 /lib/modules/5.2.14-200.fc30.x86_64/vmlinuz

The resulting grub entry has the following:
menuentry 'Fedora (5.2.14-200.fc30.x86_64) 30 (Thirty)' --class fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-5.2.13-200.fc30.x86_64-advanced-a22c8698-28c0-44c9-87d7-58d7c88c0ea2' {
        load_video
        set gfxpayload=keep
        insmod gzio
        insmod part_msdos
        insmod ext2
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root  a22c8698-28c0-44c9-87d7-58d7c88c0ea2
        else
          search --no-floppy --fs-uuid --set=root a22c8698-28c0-44c9-87d7-58d7c88c0ea2
        fi
        linux   //boot/vmlinuz-5.2.14-200.fc30.x86_64 root=UUID=a22c8698-28c0-44c9-87d7-58d7c88c0ea2 ro audit=0 selinux=0 console=hvc0 xen_blkfront.max_indirect_segments=128 LANG=en_AU.UTF-8
        initrd //boot/initramfs-5.2.14-200.fc30.x86_64.img
}


A manual run of grub2-mkconfig -o /boot/grub2/grub.cfg results in the following:
menuentry 'Fedora (5.2.14-200.fc30.x86_64) 30 (Thirty)' --class fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-5.2.14-200.fc30.x86_64-advanced-a22c8698-28c0-44c9-87d7-58d7c88c0ea2' {
        load_video
        set gfxpayload=keep
        insmod gzio
        insmod part_msdos
        insmod ext2
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root  a22c8698-28c0-44c9-87d7-58d7c88c0ea2
        else
          search --no-floppy --fs-uuid --set=root a22c8698-28c0-44c9-87d7-58d7c88c0ea2
        fi
        linux   /boot/vmlinuz-5.2.14-200.fc30.x86_64 root=UUID=a22c8698-28c0-44c9-87d7-58d7c88c0ea2 ro audit=0 selinux=0 console=hvc0 xen_blkfront.max_indirect_segments=128 
        initrd  /boot/initramfs-5.2.14-200.fc30.x86_64.img
}

Comment 17 Steven Haigh 2019-09-18 04:02:30 UTC
It seems this bit of code kills the running of grub2-mkconfig:

# Is only needed for ppc64* since we can't assume a BLS capable bootloader there
if [[ $ARCH != "ppc64" && $ARCH != "ppc64le" ]]; then
    exit 0
fi

As such - grub2-mkconfig never gets run.

Comment 18 Steven Haigh 2019-09-18 04:08:18 UTC
Sorry, the above code is in the file: /usr/lib/kernel/install.d/99-grub-mkconfig.install

Without running grub2-mkconfig, the menu entry is *almost* correct, but the // at the start causes parsing of the grub.cfg file to fail - as the file //boot/vmlinuz-5.2.14-200.fc30.x86_64 is not found.

Running grub2-mkconfig fixes this.

Up for debate is if this should be fixed in the earlier scripts, or just let grub2-mkconfig fix the bug...
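
The malformed entries can be spotted before a reboot. A minimal sketch, assuming the config lives at /boot/grub2/grub.cfg and that entries use linux/initrd commands (optionally with a 16 or efi suffix); find_double_slash_paths is a hypothetical helper name, not part of grub2:

```bash
#!/bin/bash
# Sketch only: list linux/initrd lines whose path starts with '//'.
# Prints "LINENO:line" for each offending line; prints nothing if none.
find_double_slash_paths() {
    local cfg="${1:-/boot/grub2/grub.cfg}"
    grep -nE '^[[:space:]]*(linux|initrd)(16|efi)?[[:space:]]+//' "$cfg"
}

# Usage:
#   find_double_slash_paths /boot/grub2/grub.cfg
```

It only reports; regenerating the file with grub2-mkconfig remains the fix described in this comment.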

Comment 19 Steven Haigh 2019-09-18 04:11:02 UTC
For avoidance of doubt, here are the currently installed packages:

# rpm -qa | grep grub | sort
grub2-common-2.02-81.fc30.noarch
grub2-pc-2.02-81.fc30.x86_64
grub2-pc-modules-2.02-81.fc30.noarch
grub2-tools-2.02-81.fc30.x86_64
grub2-tools-efi-2.02-81.fc30.x86_64
grub2-tools-extra-2.02-81.fc30.x86_64
grub2-tools-minimal-2.02-81.fc30.x86_64

Comment 20 Chris Murphy 2019-09-19 17:18:52 UTC
I'm super confused by this bug because clearly the leading '//' is going to confuse anything, whether GRUB or pygrub. And that leading extra '/' as well as the trailing 'LANG=en_AU.UTF-8' strikes me as very much like the old real grubby being involved. But you're saying all traces of grubby are gone? So...maybe there's some post install script in the kernel package now editing grub.cfg's? I find that hard to believe.

Peter, Javier, what do you think about changing the GRUB_ENABLE_BLSCFG=false to just run 'grub2-mkconfig' making it more like upstream and other distros? That paradigm has never cared about the historic value in grub.cfg, it's always been about obliterating it in favor of the current truth - whatever that is. I don't really see how maintainable this is otherwise. And also it's decently likely, in the near term at least, that pygrub is going to learn how to parse grub.cfg+grubenv+bls snippets, and the sane legacy approach is to just use grub2-mkconfig after every kernel update.

Comment 21 Steven Haigh 2019-09-19 23:02:28 UTC
There is still the file: /usr/lib/kernel/install.d/20-grubby.install

That comes from:
systemd-udev-241-12.git1e19bcd.fc30.x86_64 : Rule-based device node and kernel event manager
Repo        : updates
Matched from:
Filename    : /usr/lib/kernel/install.d/20-grubby.install

Looking at the code in that file however, I'm not exactly sure how that could result in what we're seeing.

Yes, a fix would be to just run grub2-mkconfig and overwrite any previous entries generated by any other script. I would be happy with this.

Comment 22 Fedora Blocker Bugs Application 2019-09-30 15:19:19 UTC
Proposed as a Blocker for 31-final by Fedora user crcinau using the blocker tracking app because:

 The grub.cfg generated for kernels includes a double / on the initrd / vmlinuz lines which causes the entry to be unbootable. Hopefully a quick fix - but depends on further investigation / fixes.

Comment 23 Steven Haigh 2019-09-30 15:33:36 UTC
After testing with downgrading the kernel in F31, I ended up with the following differences between grub.cfg and the installed packages:

menuentry 'Fedora (5.3.1-300.fc31.x86_64) 31 (Thirty One)' --class fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-5.3.1-300.fc31.x86_64-advanced-e2f94071-1c3b-4b45-b6fb-22e3f952d4ae' {
menuentry 'Fedora (5.2.14-200.fc30.x86_64) 31 (Thirty One)' --class fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-5.2.14-200.fc30.x86_64-advanced-e2f94071-1c3b-4b45-b6fb-22e3f952d4ae' {

# rpm -qa | grep kernel | sort
kernel-5.3.0-1.fc31.x86_64
kernel-5.3.1-300.fc31.x86_64
kernel-core-5.3.0-1.fc31.x86_64
kernel-core-5.3.1-300.fc31.x86_64
kernel-headers-5.3.1-100.fc31.x86_64
kernel-modules-5.3.0-1.fc31.x86_64
kernel-modules-5.3.1-300.fc31.x86_64

Fixing via:
# grub2-mkconfig -o /boot/grub2/grub.cfg 
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.3.1-300.fc31.x86_64
Found initrd image: /boot/initramfs-5.3.1-300.fc31.x86_64.img
Found linux image: /boot/vmlinuz-5.3.0-1.fc31.x86_64
Found initrd image: /boot/initramfs-5.3.0-1.fc31.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-46e72612de204d5d8d6a9fe68e255ba3
Found initrd image: /boot/initramfs-0-rescue-46e72612de204d5d8d6a9fe68e255ba3.img
done

Generated entries are now correct:
menuentry 'Fedora (5.3.1-300.fc31.x86_64) 31 (Thirty One)' --class fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-5.3.1-300.fc31.x86_64-advanced-e2f94071-1c3b-4b45-b6fb-22e3f952d4ae' {
menuentry 'Fedora (5.3.0-1.fc31.x86_64) 31 (Thirty One)' --class fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-5.3.0-1.fc31.x86_64-advanced-e2f94071-1c3b-4b45-b6fb-22e3f952d4ae' {

Comment 24 Steven Haigh 2019-09-30 15:41:23 UTC
For the sake of the review, this issue seems to break in one of two ways:

1) The grub.cfg file is not updated at all (as per comment 23); or
2) The entries for initrd / kernel lines have // at the start of the path (as per comment 20).

I have been unable to replicate which specific actions trigger which specific method of failure.

In all cases, running 'grub2-mkconfig -o /boot/grub2/grub.cfg' will fix the problem.

In this configuration, the file /etc/default/grub contains similar to:
GRUB_TIMEOUT=1
GRUB_DEFAULT=0
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="audit=0 selinux=0 console=hvc0"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=false

There are no grubby packages installed.
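
The first failure mode (an installed kernel with no grub.cfg entry) can be checked for without rebooting. A minimal sketch, assuming kernels are installed as vmlinuz-<version> under /boot as in the grub2-mkconfig output shown in comment 23; kernels_missing_from_cfg is a hypothetical helper name:

```bash
#!/bin/bash
# Sketch only: print each kernel version that has a vmlinuz-<version> file
# in BOOT_DIR but is not mentioned anywhere in GRUB_CFG.
kernels_missing_from_cfg() {
    local boot_dir="${1:-/boot}" cfg="${2:-/boot/grub2/grub.cfg}"
    local image version
    for image in "$boot_dir"/vmlinuz-*; do
        [ -e "$image" ] || continue      # the glob matched nothing
        version="${image##*/vmlinuz-}"
        # -F: fixed string, so dots in the version are not regex wildcards.
        grep -qF "vmlinuz-$version" "$cfg" || echo "$version"
    done
}

# Usage:
#   kernels_missing_from_cfg /boot /boot/grub2/grub.cfg
```

Any output means the menu is stale and a manual grub2-mkconfig run is needed before the next reboot.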

Comment 25 Geoffrey Marr 2019-09-30 23:47:14 UTC
Discussed during the 2019-09-30 blocker review meeting: [0]

The decision to delay the classification of this as a blocker bug was made because the details of the issue are not yet entirely clear, so it's hard to decide if it's a blocker. We're also not sure yet whether we intend to complete the criterion change to block only on ec2 rather than all xen guest functionality.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2019-09-30/f31-blocker-review.2019-09-30-16.00.txt

Comment 26 Chris Murphy 2019-10-04 21:35:24 UTC
Basic criterion
"The installed system must be able appropriately to install, remove, and update software with the default console tool for the relevant software type (e.g. default console package manager). This includes downloading of packages to be installed/updated."
There's a note about "New kernels not default (and similar cases)"

Final criterion
"The release must boot successfully as Xen DomU with releases providing a functional, supported Xen Dom0 and widely used cloud providers utilizing Xen."

1. It does boot successfully on installation
2. kernel (critical path package) update fails to update the bootloader configuration: whether this is done by post-install script or by grubby-deprecated, I don't know, but I also don't think it matters per criterion, the resulting system fails to boot again
3. upgrades and updates are supposed to work
4. the latest kernel installed is expected to be used by default

I'm +1 final blocker. Any contra-arguments?

Comment 27 Chris Murphy 2019-10-04 21:59:41 UTC
The way we deploy any Fedora edition/spin/remix, since version 30, is without "real" grubby (a.k.a. grubby-deprecated). That means out of the box, kernel updates on Xen DomUs have not been possible (they're incomplete and the system is left unbootable). It requires rather substantial post-install work before the first kernel update:
a. remove grubby, install grubby-deprecated
b. edit /etc/default/grub such that GRUB_ENABLE_BLSCFG=false
c. grub2-mkconfig -o /boot/grub2/grub.cfg to apply the change in b.

That arguably constitutes a separate bug, and I think it's also a blocker per the release criteria. But it wasn't caught during the Fedora 30 development process that Xen DomU's would break as a part of the BLS by default feature.

Ergo, even if *this* bug is fixed via a grubby-deprecated update, it doesn't solve the out of the box problem above. And I'm not sure what to do about that in the given time frame.

Comment 28 Steven Haigh 2019-10-05 02:18:17 UTC
While I've gone as far as I can think of in troubleshooting this, I'll share my current workaround - as this applies to both F30 and F31 in its current states.

In the installation kickstart, I run the following:

-----------------------------
sed -i 's/GRUB_ENABLE_BLSCFG=true/GRUB_ENABLE_BLSCFG=false/' /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg

## Hot patch for screwed up grub config scripts...
cat << 'EOF' > /usr/lib/kernel/install.d/99-xx-force-grub2-mkconfig.install
#!/bin/bash
[[ -f /etc/default/grub ]] && . /etc/default/grub

COMMAND="$1"
case "$COMMAND" in
	add|remove)
		grub2-mkconfig --no-grubenv-update -o /boot/grub2/grub.cfg >& /dev/null
		;;
	*)
		;;
esac
EOF
chmod +x /usr/lib/kernel/install.d/99-xx-force-grub2-mkconfig.install
-----------------------------

I do not have any grubby* packages installed - and I didn't see them assist or make the problem worse in any way.

In the last week of testing, this seems to have worked properly - but I need a few more kernel updates to be pushed out to test if this is a complete workaround.

I don't pretend to think that this is a solution - as it could well have other effects in other environments...

I did think that it may be useful to use 'virt-what' in the installer - and then disable BLS if a Xen DomU is detected... but then there's more edge cases that might be caused by this - ie what if you use HVM (which works with BLS) but switch to PV or PVH etc...

Comment 29 Adam Williamson 2019-10-07 17:09:12 UTC
For the record, I want to note that this bug was brought up in the discussion about dropping the Xen criterion *back in May*, and Lars Kurth promised to get something done about it then. Full quote:

"== On [B1] / grub2-switch-to-blscfg  ==
This issue is about Fedora _domU_ and breaks the release criterion. And looks like, it wasn't tested at all.

"blscfg is okay in _dom0_ - it looks like the xen setup still gets put in non-blscfg format, and doesn't seem to matter in HVM _domU_."

"The big issue is _domU_ in PV which would need a fair amount of work in pygrub to fix properly, including reading variables from grubenv and extracting details from the loader files. This is really something to be fixed on the Xen side ... I do keep intending to have a look at it myself though I may not get around to it."

Instead of fixing pygrub, it would be better, more future proof and easier to "use pvgrub2 instead. To be honest, it's very unclear to me why would anyone want to use pygrub, when pvgrub2 exists. pygrub is much more fragile (as it needs to re-implement a parser for 3rd-party configuration format, without stable specification) and less secure - it does that in dom0, including mounting domU controlled disk.

That said, the pvgrub2 option also requires some work, because:
- Fedora grub2 packages do not include the "xen" target platform
- Non-Fedora grub2 packages don't have blscfg support
- If we'd talk about PVH (which isn't the case here), it requires grub
  2.04, which is at RC1 and isn't packaged for Fedora yet"

That would be much simpler, if blscfg was upstreamed into grub2 by Fedora community members. Do you know whether Fedora has plans to do this?

In any case, I have taken an action to get this resolved (aka find someone to do the work)."

However, it doesn't seem like he did find anyone to "do the work", as the bug is still sitting here.

Comment 30 Geoffrey Marr 2019-10-08 06:21:23 UTC
Discussed during the 2019-10-07 blocker review meeting: [0]

The decision to delay the classification of this as a blocker bug was made as we are going to pull the xen folks in on the bug this week and see if any progress can be made before finally deciding what to do next week.

[0] https://meetbot.fedoraproject.org/fedora-blocker-review/2019-10-07/f31-blocker-review.2019-10-07-16.02.txt

Comment 31 Javier Martinez Canillas 2019-10-08 15:05:39 UTC
Hello Chris,

(In reply to Chris Murphy from comment #20)
> I'm super confused by this bug because clearly the leading '//' is going to
> confuse anything, whether GRUB or pygrub. And that leading extra '/' as well
> as the trailing 'LANG=en_AU.UTF-8' strikes me as very much like the old real
> grubby being involved. But you're saying all traces of grubby are gone?
> So...maybe there's some post install script in the kernel package now
> editing grub.cfg's? I find that hard to believe.
> 
> Peter, Javier, what do you think about changing the GRUB_ENABLE_BLSCFG=false
> to just run 'grub2-mkconfig' making it more like upstream and other distros?
> That paradigm has never cared about the historic value in grub.cfg, it's
> always been about obliterating it in favor of the current truth - whatever
> that is. I don't really see how maintainable this is otherwise. And also
> it's decently likely, in the near term at least, that pygrub is going to
> learn how to parse grub.cfg+grubenv+bls snippets, and the sane legacy
> approach is to just use grub2-mkconfig after every kernel update.

Yes, agreed. As Steven mentioned, we also force regeneration of grub.cfg with grub2-mkconfig for ppc64le in /usr/lib/kernel/install.d/99-grub-mkconfig.install (we did that because, even though Petitboot has had BLS support since 1.8.0, we couldn't ensure that no ppc64le OPAL machine would have an older Petitboot without BLS support).

So I think that a workaround could be what you are proposing, to extend the /usr/lib/kernel/install.d/99-grub-mkconfig.install to not only cover ppc64le machines but also Xen VMs running as DomU guests. And also add what Steven did in Comment 28, to set GRUB_ENABLE_BLSCFG=false before re-generating the grub.cfg with grub2-mkconfig.

Steven, is there a way for user-space to check if the machine is running as a DomU guest? I read that this could be achieved by checking if /sys/hypervisor/uuid exists and the UUID is not all zeros. Is that correct?

Comment 32 Steven Haigh 2019-10-09 00:18:24 UTC
Looking at the proposal for checking /sys/hypervisor/uuid, I can confirm the following:

Xen PVH DomU:
09fc5229-191a-42ae-a3d1-f3bad5ba6836

Xen Domain-0:
00000000-0000-0000-0000-000000000000

Xen HVM DomU:
2ec9fc0b-15e6-4b97-bd0b-8789b3c93234

This is probably bad - as an HVM (fully emulated) host can use BLS - as it loads grub from the boot sector.

If we want to look at values in /sys/hypervisor/, I would suggest:

1) Check that /sys/hypervisor/type contains 'xen'; and
2) Check that /sys/hypervisor/guest_type contains 'PVH'.

If these two conditions are met, BLS will fail.

Extending this logic further, it may be good to also put this conditional logic in the grub logic that enables BLS in the first place. Maybe set GRUB_ENABLE_BLSCFG=false in /etc/default/grub if the above two conditions are also met...

I have not been able to test this using pvgrub bootloader with Xen - as Fedora doesn't build grub with xen options to create said bootloader.

Comment 33 Steven Haigh 2019-10-09 00:26:54 UTC
Just had a further thought here... When checking /sys/hypervisor/guest_type for the Domain-0, it returns PV.

I guess it is also a valid use case for PV - which will also fail under BLS - however the Domain-0 *can* use BLS.

Another option may be to use the 'virt-what' command. While this may mean adding a dep on it - it would resolve a few of these issues as:

Domain-0:
xen
xen-dom0

Xen HVM:
xen
xen-hvm

Xen PVH:
xen
xen-domU

.... Or I guess figure out how virt-what knows the difference between the above types and implement similar logic...
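
As an illustration of the virt-what idea, the output listed above could be classified like this. It is a sketch under assumptions: classify_virt_what is a hypothetical helper, it reads virt-what's output from stdin rather than running the tool, and it relies only on the three outputs reported in this comment.

```bash
#!/bin/bash
# Sketch only: map virt-what output (read from stdin) to a Xen domain kind.
classify_virt_what() {
    local facts
    facts="$(cat)"
    case "$facts" in
        *xen-dom0*) echo "dom0" ;;     # the Xen host
        *xen-hvm*)  echo "hvm" ;;      # boots grub from the boot sector
        *xen-domU*) echo "domU" ;;     # reported above for a PVH guest
        *)          echo "not-xen" ;;
    esac
}

# Usage (as root, with virt-what installed):
#   virt-what | classify_virt_what
```

Reading from stdin keeps the classification testable on machines where virt-what is not installed.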

Comment 34 Steven Haigh 2019-10-09 01:39:32 UTC
I figure I've got a lot of assumed knowledge - so to remove all doubt, I'll clarify the situation:

Xen Domain-0:
* the Xen host
* Can run BLS
* /sys/hypervisor/uuid = 00000000-0000-0000-0000-000000000000
* /sys/hypervisor/guest_type = PV

Xen PVH Domain:
* PVH guest
* Cannot run BLS
* /sys/hypervisor/uuid = non-zero
* /sys/hypervisor/guest_type = PVH

Xen HVM Domain:
* HVM Guest
* Can run BLS
* /sys/hypervisor/uuid = non-zero
* /sys/hypervisor/guest_type = HVM

Xen PV Domain:
* PV Guest
* Cannot run BLS
* /sys/hypervisor/uuid = non-zero
* /sys/hypervisor/guest_type = PV

Both PVH and PV domains (except Domain-0) are normally used with pygrub as the bootloader - which doesn't support BLS at all (yet - unknown future ETA).

I guess a valid scenario would be to check:
* If /sys/hypervisor/type == xen and /sys/hypervisor/guest_type == PV* (PV or PVH) and /sys/hypervisor/uuid != 00000000-0000-0000-0000-000000000000 - then run grub2-mkconfig with BLS disabled.

This would lead to being able to use BLS on both Xen HVMs or the Xen Domain-0 - which in theory should work fine.

It does leave the edge case of someone changing a Xen HVM config to PVH - which is also a valid thing some people do - but I'm not sure how that would be resolved other than just disabling BLS for anything Xen until pygrub catches up (if / when?) and allows BLS booting.
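
The scenario proposed in this comment can be sketched as a shell predicate. This is an illustration only: xen_guest_needs_mkconfig is a hypothetical name, and the directory argument exists purely so the logic can be exercised against fake files instead of the real /sys/hypervisor.

```bash
#!/bin/bash
# Sketch only: succeed when the three sysfs values describe a PV or PVH
# guest (not Domain-0), i.e. a domain that cannot boot with BLS.
xen_guest_needs_mkconfig() {
    local dir="${1:-/sys/hypervisor}"
    local hv_type guest_type uuid
    hv_type="$(cat "$dir/type" 2>/dev/null || true)"
    guest_type="$(cat "$dir/guest_type" 2>/dev/null || true)"
    uuid="$(cat "$dir/uuid" 2>/dev/null || true)"

    [ "$hv_type" = "xen" ] || return 1
    case "$guest_type" in
        PV|PVH) ;;            # candidates for pygrub
        *) return 1 ;;        # HVM (or unknown) is left alone
    esac
    # Domain-0 also reports PV but has an all-zero UUID.
    [ -n "$uuid" ] && [ "$uuid" != "00000000-0000-0000-0000-000000000000" ]
}

# Usage:
#   if xen_guest_needs_mkconfig; then echo "regenerate grub.cfg without BLS"; fi
```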

Comment 35 Chris Murphy 2019-10-09 03:01:07 UTC
Sounds like /sys/hypervisor/guest_type alone can be relied upon, and if it contains PVH, HVM, PV - then set BLSCFG=false. If this could be done in Anaconda that would be badass and solve the problem entirely just by avoiding it in the first place. That HVM could support BLS, I suggest ignoring in favor of consistency, and avoiding end user confusion why some VMs use BLS and others use traditional grub.cfg.

If there's anything else uniquely Xen available in sysfs, it might be useful to check for that too, mostly as just a sanity test. What if what's in /sys/hypervisor/guest_type isn't guaranteed to be unique to xen? But this sort of deconfliction is not my expertise, I'm just throwing it out there.

Comment 36 Chris Murphy 2019-10-09 04:03:26 UTC
Just realized I missed this from comment 34:

Xen Domain-0
* /sys/hypervisor/guest_type = PV

Xen PV Domain:
* /sys/hypervisor/guest_type = PV

Is there another way to distinguish between them? Otherwise it suggests the Dom0 also needs BLSCFG=false, which is not the end of the world but does cause a fragmentation on baremetal (some use BLS and some don't).

Comment 37 Chris Murphy 2019-10-09 04:07:46 UTC
Ahh OK so maybe ignore guest_type, and only check type and UUID.
type=xen + UUID=zeros = Dom0 and thus BLS OK
type=xen + UUID=nonzero = guest and thus BLS not OK (one type is OK for BLS but ignore it and just step on grub.cfg anyway for consistency).

Comment 38 Steven Haigh 2019-10-09 04:08:33 UTC
Have been talking about this matter with cmurf on #fedora-qa and debating options...

Currently, it seems that we can deduce the following two scenarios:

in /sys/hypervisor:

1) type == xen && uuid == all zeros, then this is BLS safe (the Domain-0).
2) type == xen && uuid != all zeros, then this is BLS *unsafe* (covers PV, HVM and PVH guests).

This may be the most sane / simple test to do for the moment...

Comment 39 Steven Haigh 2019-10-09 04:21:15 UTC
Posted the question to the xen-devel mailing list to see if we've missed any other combinations / obvious matters.

https://lists.xen.org/archives/html/xen-devel/2019-10/msg00697.html

Comment 40 Javier Martinez Canillas 2019-10-09 10:07:47 UTC
Created attachment 1623769 [details]
[PATCH] 99-grub-mkconfig: Disable BLS usage for Xen DomU guests

Thanks a lot for the comments Steven and Chris, I've attached an (untested) patch that does what we discussed in this bz and over irc. Please let me know if I'm missing anything.

I also did a scratch grub2 build that contains the attached patch for you to test:

https://koji.fedoraproject.org/koji/taskinfo?taskID=38168862

If this works correctly for you then I will also backport the patch for F30.

Comment 41 Lars Kurth 2019-10-09 12:01:47 UTC
(In reply to Adam Williamson from comment #29)
> For the record, I want to note that this bug was brought up in the
> discussion about dropping the Xen criterion *back in May*, and Lars Kurth
> promised to get something done about it then. Full quote:

I thought I had replied earlier in the week, but that didn't come through.

I did promise and there has been some progress: however it was not as
quick as I hoped. There is a patch posted which should fix the immediate
issue through a workaround which will hopefully make it into Xen 4.13
(and if not should be back-ported to 4.13.1). And there is a rough plan in 
place to change pygrub to support BLS. But all of this would have to be
backported to supported versions of Xen. Even if we had a fix we still need
to deal with versions of Xen that are out there in the wild.

I parked any of the testing related stuff which was discussed as the
underlying issue has to be addressed first.

We will discuss this bug (and the related stuff) in tomorrow's Xen
community call.

Regards
Lars

Comment 42 Javier Martinez Canillas 2019-10-09 15:26:02 UTC
Created attachment 1623842 [details]
[PATCH] 99-grub-mkconfig: Disable BLS usage for Xen DomU guests

After a conversation with the Xen folks, it was concluded that the best way to test whether a machine is a Xen Dom0 host or a DomU guest is to check whether /sys/hypervisor/type is set to xen and whether /proc/xen/capabilities contains the control_d string.

I've attached the latest patch that was tested by Steven and did a grub2-2.02-99.fc31 build including this fix.
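
As a rough sketch of that check (not the attached patch itself): the function name is_xen_domu and its two path arguments are hypothetical, with defaults set to the paths named above, and it assumes that control_d marks the Dom0, so its absence on a Xen system means a DomU guest.

```shell
# Sketch only: succeed (return 0) when the system looks like a Xen DomU.
# is_xen_domu and its path arguments are hypothetical names.
is_xen_domu() {
    local type_file caps_file hv_type
    type_file="${1:-/sys/hypervisor/type}"
    caps_file="${2:-/proc/xen/capabilities}"
    hv_type=""

    if [ -r "$type_file" ]; then
        read -r hv_type < "$type_file" || true
    fi

    # Not running on Xen at all.
    [ "$hv_type" = "xen" ] || return 1

    # Assumption: the Dom0 advertises control_d in the capabilities file.
    if [ -r "$caps_file" ] && grep -q control_d "$caps_file"; then
        return 1
    fi

    # Xen, but no control_d: treat as a DomU guest.
    return 0
}
```

A script could call it as `if is_xen_domu; then ...; fi` and fall back to a non-BLS grub.cfg in that branch.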

Comment 43 Fedora Update System 2019-10-09 16:03:28 UTC
FEDORA-2019-591c552fba has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2019-591c552fba

Comment 44 Fedora Update System 2019-10-09 23:05:45 UTC
grub2-2.02-99.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-591c552fba

Comment 45 Steven Haigh 2019-10-10 02:08:21 UTC
I've noticed an issue here... 

I can confirm that GRUB_ENABLE_BLSCFG does get set to false in the correct conditions, but the ARCH / Xen check fails - meaning the exit 0 runs and grub2-mkconfig never gets called.

Modified script:
    https://paste.centos.org/view/1124f1ed

When running via the CLI, I get:

# KERNEL_INSTALL_MACHINE_ID=BLAH ./99-grub-mkconfig.install add
hv_type = xen and XEN_DOM0 = 
Exiting on ARCH / HV_TYPE / XEN_DOM0 check

Adding 'set -x' to the top of the script results in:

# KERNEL_INSTALL_MACHINE_ID=BLAH ./99-grub-mkconfig.install add
+ [[ -n BLAH ]]
+ [[ -e /sys/hypervisor/type ]]
+ read HV_TYPE
+ [[ -e /proc/xen/capabilities ]]
+ [[ xen = \x\e\n ]]
+ [[ '' != \t\r\u\e ]]
+ grep -q '^GRUB_ENABLE_BLSCFG="*true"*\s*$' /etc/default/grub
++ uname -m
+ ARCH=x86_64
+ echo 'hv_type = xen and XEN_DOM0 = '
hv_type = xen and XEN_DOM0 = 
+ [[ x86_64 != \p\p\c\6\4 ]]
+ [[ x86_64 != \p\p\c\6\4\l\e ]]
+ echo 'Exiting on ARCH / HV_TYPE / XEN_DOM0 check'
Exiting on ARCH / HV_TYPE / XEN_DOM0 check
+ exit 0

As such, we never get to the Xen checks.

Hate to return to sender on this one, but it still seems to be buggy.
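
To make the control-flow problem concrete, here is one way the decision could be structured so that the Xen case is considered before the script exits. This is only a sketch and not the shipped fix: the function name needs_mkconfig and its three arguments are hypothetical, and the third argument is assumed to be "true" only when the Dom0 check succeeded (mirroring XEN_DOM0 in the trace above).

```shell
# Sketch only: decide whether grub2-mkconfig has to run.
# needs_mkconfig and its arguments are hypothetical names.
#   $1 = architecture (as printed by uname -m)
#   $2 = hypervisor type (contents of /sys/hypervisor/type, or empty)
#   $3 = "true" if the system is the Xen Dom0, anything else otherwise
needs_mkconfig() {
    local arch hv_type xen_dom0
    arch="$1"
    hv_type="$2"
    xen_dom0="$3"

    # Architectures the existing check already lets through.
    case "$arch" in
        ppc64|ppc64le) return 0 ;;
    esac

    # Xen guest (not Dom0): a regenerated, non-BLS grub.cfg is needed.
    if [ "$hv_type" = "xen" ] && [ "$xen_dom0" != "true" ]; then
        return 0
    fi

    return 1
}
```

With the values from the trace (x86_64, xen, empty XEN_DOM0) the function succeeds, so a caller written as `needs_mkconfig "$ARCH" "$HV_TYPE" "$XEN_DOM0" || exit 0` would carry on to grub2-mkconfig instead of exiting early.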

Comment 46 Fedora Update System 2019-10-10 07:42:17 UTC
FEDORA-2019-1265db97c0 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2019-1265db97c0

Comment 47 Fedora Update System 2019-10-10 07:46:17 UTC
FEDORA-2019-ad706bc4b9 has been submitted as an update to Fedora 30. https://bodhi.fedoraproject.org/updates/FEDORA-2019-ad706bc4b9

Comment 48 Fedora Update System 2019-10-10 14:34:00 UTC
grub2-2.02-100.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-1265db97c0

Comment 49 Adam Williamson 2019-10-10 15:03:15 UTC
Given we have a fix for this now, I'm at least +1 FE, let's get it in. Other votes?

Comment 50 Mohan Boddu 2019-10-10 17:23:55 UTC
Yeah, +1 FE

Comment 51 Kevin Fenzi 2019-10-10 17:26:11 UTC
+1 FE

Comment 52 Geoffrey Marr 2019-10-10 17:26:44 UTC
+1 FE

Comment 53 Fedora Update System 2019-10-10 17:29:31 UTC
grub2-2.02-83.fc30 has been pushed to the Fedora 30 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-ad706bc4b9

Comment 54 Adam Williamson 2019-10-10 17:31:57 UTC
That's +4, marking accepted FE.

Comment 55 Fedora Update System 2019-10-13 17:55:52 UTC
grub2-2.02-100.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.

Comment 56 Javier Martinez Canillas 2019-10-15 08:13:38 UTC
*** Bug 1679759 has been marked as a duplicate of this bug. ***

Comment 57 Fedora Update System 2019-10-15 22:39:39 UTC
grub2-2.02-83.fc30 has been pushed to the Fedora 30 stable repository. If problems still persist, please make note of it in this bug report.

Comment 58 Adam Williamson 2019-10-23 18:18:57 UTC
pbrobinson caught that we regressed this on F31: we pushed an *older* grub2 (-98) stable for another FE bug, it got pushed over the -100 that fixed this :( It seems the -98 update was never unpushed or obsoleted...

mboddu, we need to re-push -100 to fix this.

Comment 59 Adam Williamson 2019-10-23 20:43:27 UTC
-100 has been re-tagged and should show up in the next composes, so closing again.

Comment 60 Adam Williamson 2019-10-24 15:57:39 UTC
So... I don't know if I'm missing something here, but I ran an install of RC-1.9 in a Xen guest and it won't boot after install, with /var/log/xen/bootloader logs indicating that pygrub is "Unable to find partition containing kernel". Inspecting the image with guestfish, it seems to have been installed with BLS active: there's a populated /boot/loader/entries, and /boot/grub2/grub.cfg doesn't contain any 'kernel' or 'initrd' lines.

RC-1.9 does have grub2 -100 in it...

Comment 61 Adam Williamson 2019-10-24 16:03:16 UTC
grubby-deprecated is not installed, /etc/default/grub says GRUB_ENABLE_BLSCFG=true, and running grub2-mkconfig -o /boot/grub2/grub.cfg still writes a BLS-y config.

Comment 62 Adam Williamson 2019-10-24 17:20:01 UTC
So with Javier's anaconda PR:

https://github.com/rhinstaller/anaconda/pull/2201

I seem to get a non-BLS installed system - /boot/grub2/grub.cfg contains actual boot entries - but the VM still fails to run with the same error from pygrub logged in a /var/log/xen/bootloader.N.log file: "Unable to find partition containing kernel". Not really sure what's going on there. It'd be good if someone with more Xen expertise than me could test, both with and without the updates image...

Comment 63 Michael Young 2019-10-24 18:23:05 UTC
(In reply to Adam Williamson from comment #62)
> So with Javier's anaconda PR:
> 
> https://github.com/rhinstaller/anaconda/pull/2201
> 
> I seem to get a non-BLS installed system - /boot/grub2/grub.cfg contains
> actual boot entries - but the VM still fails to run with the same error from
> pygrub logged in a /var/log/xen/bootloader.N.log file: "Unable to find
> partition containing kernel". Not really sure what's going on there. It'd be
> good if someone more Xen expert than me could test, both with and without
> the updates image...

You could try running pygrub directly, e.g. /usr/libexec/xen/bin/pygrub --debug /path/to/vm
to see if it gives any more information.

Comment 64 Adam Williamson 2019-10-24 18:43:30 UTC
That gets me:

Traceback (most recent call last):
  File "/usr/libexec/xen/bin/pygrub", line 902, in <module>
    fs = xenfsimage.open(file, offset, bootfsoptions)
OSError: [Errno 95] Operation not supported
Traceback (most recent call last):
  File "/usr/libexec/xen/bin/pygrub", line 902, in <module>
    fs = xenfsimage.open(file, offset, bootfsoptions)
OSError: [Errno 95] Operation not supported
Traceback (most recent call last):
  File "/usr/libexec/xen/bin/pygrub", line 931, in <module>
    raise RuntimeError("Unable to find partition containing kernel")
RuntimeError: Unable to find partition containing kernel

Trying to do `xenfsimage.open('/var/lib/libvirt/images/guest.img')` in a Python shell fails the same way, but I'm not sure why.

Comment 65 Adam Williamson 2019-10-24 18:47:14 UTC
Created attachment 1628930 [details]
strace of the pygrub attempt

Here's strace output for the pygrub attempt.

