We encountered a boot issue with the latest GRUB2 versions available in Fedora 41 when running in VMware. Unfortunately, VMware provides limited logging, and the only error message available is:

"The firmware encountered an unexpected exception. The virtual machine cannot boot."

Summary: GRUB version 2.12.x causes boot failure in VMware environments.

For more details see: https://github.com/coreos/fedora-coreos-tracker/issues/1802

Reproducible: Always

Steps to Reproduce (tested scenarios):
* Built FCOS41 with the latest packages and created a VMware OVA → Boot fails with the error message above.
* Built FCOS41 with the latest packages, upgraded GRUB to the version below, and created a VMware OVA → Boot fails with the same error:
  GRUB version: 2.12-7.fc42
* Built FCOS41 with the latest packages, downgraded GRUB and FUSE packages as follows, and created a VMware OVA → Boot succeeds:
  GRUB version: 2.06-124.fc41
  FUSE version: 2.9.9-22.fc41
  FUSE-libs version: 2.9.9-22.fc41
* Built FCOS41 with the latest packages, downgraded GRUB and FUSE as follows, and created a VMware OVA → Boot fails with the same error:
  GRUB version: 2.12-3.fc41

Actual Results:
VMware does not boot

Expected Results:
VMware boots
Proposed as a Blocker for 41-final by Fedora user ravanelli using the blocker tracking app because: This bug significantly impacts the ability to use Fedora 41 on VMware environments. Fedora CoreOS (FCOS) is widely deployed in virtualized environments, and VMware remains one of the most commonly used platforms for Fedora CoreOS users. The failure of Fedora 41 to boot with GRUB version 2.12.x in VMware renders the operating system unusable in these environments, which could affect a large number of users relying on VMware for production and development workloads.
I don't see any mention of VMware on https://fedoraproject.org/wiki/Basic_Release_Criteria so this is probably not a blocker candidate, but it is definitely a freeze exception candidate.
I added the options pager=0 and debug=all to grub.cfg as suggested by Marta Lewandowska, but still nothing shows up in VMware; the boot fails before GRUB prints any output.
Thank you for trying that, Renata! It's too bad we don't have any normal debug from grub. If I would instrument a grub build for you-- a scratch build which would be incompatible with Secure Boot-- would you be able to put that into an image and try to boot it? The behavior I'm seeing on VMware from f-41 (grub2-2.12) is pretty weird. I tried installing from fedora-coreos-41.20240922.1.0-live.x86_64.iso and while that installs fine on BIOS, when I switch the firmware to UEFI, the VM cannot boot and shows a message that says: No compatible bootloader found. Trying to install on UEFI does not work because the same message shows up. I also tried installing Fedora-Server-netinst-x86_64-41_Beta-1.2.iso, and that one boots fine on BIOS, I can switch the firmware to UEFI, and that also boots, but it has the ia32 shim and grub binaries installed... when I install the x64 (64 bit versions) and remove the 32 bits ones, the system again can not boot and attempts to boot the installer instead.
(In reply to Marta Lewandowska from comment #4)
> Thank you for trying that, Renata! It's too bad we don't have any normal
> debug from grub. If I would instrument a grub build for you-- a scratch
> build which would be incompatible with Secure Boot-- would you be able to
> put that into an image and try to boot it?

Yes. If you build that via bodhi, just send me the link, or you can send me the rpms some other way.
(In reply to Renata Ravanelli from comment #5)
> Yes, if you build that via bodhi, just send me link, or you can send me the
> rpms in other way.

Doesn't even need to be bodhi, just a koji link to a scratch build should be good enough.
-4 in https://pagure.io/fedora-qa/blocker-review/issue/1682 , rejected as a blocker. We should definitely track this and try and fix if it's on our side, though, and document if it can't be fixed.
Renata and I have isolated it to /boot/grub/console.cfg, particularly the line: serial --speed=115200. She has tried a few variants of this line, specifying --port or --unit, and they appear to work. So now it's up to you all: what you need in that file in order for the VM to properly output grub content to the serial console, if that's what you want.
Hey Marta, Thank you for the update. From an end user point of view this looks like a regression considering the exact same config works on the previous version of GRUB. Were we just relying on buggy behavior and the bug was fixed in the newer version? This will pose a real problem for our existing users who choose to upgrade their bootloader. We would need to come up with some way to update the console settings using the migration script. Another way to say it is: was this change in behavior intended? If we hit this problem I imagine there will (eventually) be customers who do as well and we'll need an answer for them.
Example of someone updating the bootloader and hitting this problem: https://github.com/coreos/fedora-coreos-tracker/issues/1802#issuecomment-2377851057
What's also odd is that all the other platforms still only use `serial --speed=115200` without any issues, but VMware, when using UEFI, is the only one reporting problems.
Hi Dusty, I am not particularly knowledgeable about this, but all the examples in the upstream grub documentation: https://www.gnu.org/software/grub/manual/grub/grub.html#serial as well as our own beaker provisioning of (only) x86 UEFI machines-- check out the provisioning kickstart file and search for 'serial'-- have specified --port and/or --unit long before this version of grub was released, so maybe you were relying on buggy behavior? I will look to see if the rebase touched the serial command.
[adamw@xps13a grub (master)]$ git log --oneline grub-2.06..grub-2.12 grub-core/term/serial.c
f7a663c00 term/serial: Ensure proper NULL termination after grub_strncpy()
712309eaa term/serial: Use grub_strncpy() instead of grub_snprintf() when only copying string
8eb3d4df3 term/serial: Add support for PCI serial devices
35782e165 term/serial: Improve detection of duplicate serial ports
e37dbba66 term/serial: Avoid double lookup of serial ports
b73a44b28 term/serial: Replace usage of memcmp() with strncmp()
c4e801631 term/serial: Add ability to specify MMIO ports via "serial" command
7b192ec4c term/ns8250: Use ACPI SPCR table when available to configure serial
c2ef140a6 term/ns8250: Add configuration parameter when adding ports
b232f6f66 term: Remove trailing whitespaces

...so I'd say yeah, rebase from 2.06 to 2.12 definitely touched it. "Use ACPI SPCR table when available to configure serial", "Avoid double lookup of serial ports" and "Improve detection of duplicate serial ports" are the places I might start looking...
Poking at the upstream grub code a bit, it does seem to explicitly handle the case where you don't specify a device:

static grub_err_t
grub_cmd_serial (grub_extcmd_context_t ctxt, int argc, char **args)
{
  struct grub_arg_list *state = ctxt->state;
  char pname[40];
  const char *name = NULL;
  ...
  if (!name)
    name = "auto";
  ...
  port = grub_serial_find (name);
  ...
}

now if we look at grub_serial_find:

struct grub_serial_port *
grub_serial_find (const char *name)
{
  ...
#if (defined(__i386__) || defined(__x86_64__)) && !defined(GRUB_MACHINE_IEEE1275) && !defined(GRUB_MACHINE_QEMU)
  if (grub_strcmp (name, "auto") == 0)
    {
      /* Look for an SPCR if any. If not, default to com0. */
      port = grub_ns8250_spcr_init ();
      if (port != NULL)
        return port;
      FOR_SERIAL_PORTS (port)
        if (grub_strcmp (port->name, "com0") == 0)
          return port;
    }
#endif
  ...
  return NULL;
}

so autodetection is supposed to try and find a port on x86 when not using qemu or OFW (I think that's what IEEE1275 is about anyway), and fall through to just returning NULL otherwise (which still seems like it shouldn't prevent boot).

That code was touched extensively by c4e801631, 7b192ec4c and 35782e165f, so there's definitely change between 2.06 and 2.12 there. It seems like 7b192ec4cd7c4b3207db010202349dd283e72041 *introduced* this auto-detect code and before that not setting an explicit port would have just defaulted to doing `grub_serial_find ("com0");`, which would have returned any port named "com0", or NULL.

So...theory: this `grub_ns8250_spcr_init` autodetect code blows up on VMware for some reason?
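For anyone following along, the lookup order described in this comment can be modelled in a few lines of standalone C. This is only a simplified sketch and not GRUB code: `struct port`, `find_by_name`, `find_port`, `spcr_probe_fn` and `no_spcr` are hypothetical names made up for illustration, and it models only the x86 case where the "auto" branch is compiled in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, minimal stand-in for a registered serial port. */
struct port
{
  const char *name;
};

/* Hypothetical stand-in for the SPCR probe: returns the port described
   by the ACPI SPCR table, or NULL when there is no such table. */
typedef const struct port *(*spcr_probe_fn) (void);

/* A probe for a machine that has no SPCR table. */
static const struct port *
no_spcr (void)
{
  return NULL;
}

/* Linear search of the registered ports by exact name. */
static const struct port *
find_by_name (const struct port *ports, size_t n, const char *name)
{
  size_t i;

  for (i = 0; i < n; i++)
    if (strcmp (ports[i].name, name) == 0)
      return &ports[i];
  return NULL;
}

/* Model of the lookup order: exact name first, then for "auto" try the
   SPCR probe and finally fall back to a port named "com0". */
static const struct port *
find_port (const struct port *ports, size_t n, const char *name,
           spcr_probe_fn probe)
{
  const struct port *p;

  assert (name != NULL && probe != NULL);

  p = find_by_name (ports, n, name);
  if (p != NULL)
    return p;

  if (strcmp (name, "auto") == 0)
    {
      p = probe ();
      if (p != NULL)
        return p;
      return find_by_name (ports, n, "com0");
    }

  return NULL;
}
```

In this model, asking for "auto" on a machine without an SPCR table behaves like asking for "com0", and returns NULL when no such port is registered. That matches the expectation that a missing port should produce an error rather than stop the boot.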
CCing the author of 7b192ec4cd7c4b3207db010202349dd283e72041 , who seems to be on RHBZ - hi Ben! Hope you don't mind the ping. Does my analysis above seem plausible? Thanks!
Renata, I'm running a scratch build of grub2 with the ACPI SPCR serial port detection disabled by making grub_cmd_serial do `grub_serial_find ("com0");` like it used to, instead of `grub_serial_find ("auto");` - that should bypass the grub_ns8250_spcr_init call. It's running at https://koji.fedoraproject.org/koji/taskinfo?taskID=124858251 . Can you try it out when it's done and see if it avoids the bug?
huh. I also see another problem with that change: it apparently unintentionally breaks the old "default to com0" behaviour on anything that fails the `#if (defined(__i386__) || defined(__x86_64__)) && !defined(GRUB_MACHINE_IEEE1275) && !defined(GRUB_MACHINE_QEMU)` check. it should probably be something like this:

  if (grub_strcmp (name, "auto") == 0)
    {
#if (defined(__i386__) || defined(__x86_64__)) && !defined(GRUB_MACHINE_IEEE1275) && !defined(GRUB_MACHINE_QEMU)
      /* Look for an SPCR if any. */
      port = grub_ns8250_spcr_init ();
      if (port != NULL)
        return port;
#endif
      /* If not, default to com0. */
      FOR_SERIAL_PORTS (port)
        if (grub_strcmp (port->name, "com0") == 0)
          return port;
    }

I think?
ugh, actually, that's still inside a wider `#if (defined(__mips__) || defined (__i386__) || defined (__x86_64__)) && !defined(GRUB_MACHINE_EMU) && !defined(GRUB_MACHINE_ARC)` block so it's still not really right :/ we'd probably have to move the `if (grub_strcmp (name, "auto") == 0)` block up above the `#if (defined(__mips__) || defined (__i386__) || defined (__x86_64__)) && !defined(GRUB_MACHINE_EMU) && !defined(GRUB_MACHINE_ARC)` block to fix that. anyhow, it's a bit of a sidebar.
Hrm... this code has been upstream for a while (and in Amazon Linux for even longer) and I've tested in VMware... give me a bit of time to dig into this and try to reproduce. One thing I find confusing is that it seems this is being reproduced with UEFI ? I didn't think people would use the serial driver with UEFI instead of UEFI console, but I will look into it, I might have missed something.
Hrm... Broadcom website is not letting me download anything until I create an account involving a "site ID" but the page to retrieve this is broken and you can't open a support ticket unless an FAE helps you create a "Project" ! yay ! I'll continue trying to download vmware in the next few days but it doesn't look good. Sadly nothing unexpected from BCM ...
ugh, well that's repellent. I hadn't reproduced myself (yet), just poked at the code to try and see what might be up. https://www.techspot.com/downloads/189-vmware-workstation-for-windows.html is...probably an okay way to try and get it without signing up for anything? I disclaim all responsibility for consequences. :D
Ah ! I was going to try with the Linux version first :-) I'll give that a spin on a sandboxed machine thanks
(In reply to Adam Williamson from comment #17)
> Renata, I'm running a scratch build of grub2 with the ACPI SPCR serial port
> detection disabled by making grub_cmd_serial do `grub_serial_find ("com0");`
> like it used to, instead of `grub_serial_find ("auto");` - that should
> bypass the grub_ns8250_spcr_init call. It's running at
> https://koji.fedoraproject.org/koji/taskinfo?taskID=124858251 . Can you try
> it out when it's done and see if it avoids the bug?

Yes, it does avoid the bug, I can boot and access it as expected.
benjamin: the URL has windows in it, but the page looks like it has Windows and Linux downloads.
...and for your reference, the change I had renata test was just this:

[adamw@xps13a grub (master)]$ git diff
diff --git a/grub-core/term/serial.c b/grub-core/term/serial.c
index 8260dcb7a..72a6927b4 100644
--- a/grub-core/term/serial.c
+++ b/grub-core/term/serial.c
@@ -271,7 +271,7 @@ grub_cmd_serial (grub_extcmd_context_t ctxt, int argc, char **args)
     name = args[0];
 
   if (!name)
-    name = "auto";
+    name = "com0";
 
   port = grub_serial_find (name);
   if (!port)

I didn't include the patch I sent to grub-devel which reorders stuff, it was simply that, which I expect causes us to find and return a com0 in the `FOR_SERIAL_PORTS` loop at the start of `grub_serial_find`, and thus bypass all the later code. This is, AFAICT, probably how things worked on 2.06.
The auto-detection is necessary on ec2 and a number of other platforms. So I'm not too keen on taking it out. We need to understand why it's breaking on VMware. I've managed to install Workstation 17.6.1 on my machine, I'll do some testing... one thing I noticed is that if I just create a VM and install f40 on it, it doesn't come with a serial port at all (even the kernel doesn't see a ttyS0). This is on a BIOS boot. I've installed f40, updated grub to f41, so far so good, added serial --speed 115200 to the grub.cfg, and I see an error message about "auto" not being found, but it boots. I haven't tested with UEFI yet. Adding a serial port to the VM makes at least that bit work and not crash. I can also reproduce the crash in UEFI. I'll set up something I can actually compile things with now :-) (Fedora Core OS blows my brains...) and can boot both UEFI and BIOS, and I'll debug it.
"The auto-detection is necessary on ec2 and a number of other platforms. So I'm not too keen on taking it out." Oh, yeah, I absolutely wasn't proposing that as a fix - I just did it as a triage step: the fact that the build "works" means that the problem is indeed where we thought it was.
It's crashing inside grub_acpi_find_table() which is ... concerning
Bug in the grub ACPI code. This should fix it. I'll do more tests and submit upstream:

Subject: [PATCH] acpi: Fix out of bounds access in grub_acpi_xsdt_find_table()

The calculation of the size of the table was incorrect (copy/pasta from
grub_acpi_rsdt_find_table() I assume...). The entries are 64-bit long.

This causes us to access beyond the end of the table which is causing
crashes during boot on some systems. Typically this is causing a crash
on VMware when using UEFI and enabling serial autodetection, as

  grub_acpi_find_table (GRUB_ACPI_SPCR_SIGNATURE);

will go past the end of the table (the SPCR table doesn't exist)

Signed-off-by: Benjamin Herrenschmidt <benh.org>
---
 grub-core/kern/acpi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grub-core/kern/acpi.c b/grub-core/kern/acpi.c
index 48ded4e2e..8ff0835d5 100644
--- a/grub-core/kern/acpi.c
+++ b/grub-core/kern/acpi.c
@@ -75,7 +75,7 @@ grub_acpi_xsdt_find_table (struct grub_acpi_table_header *xsdt, const char *sig)
     return 0;
 
   ptr = (grub_unaligned_uint64_t *) (xsdt + 1);
-  s = (xsdt->length - sizeof (*xsdt)) / sizeof (grub_uint32_t);
+  s = (xsdt->length - sizeof (*xsdt)) / sizeof (grub_uint64_t);
   for (; s; s--, ptr++)
     {
       struct grub_acpi_table_header *tbl;
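To make the arithmetic behind the fix concrete, here is a small standalone sketch of the entry-count calculation. It is not GRUB code: `ACPI_HEADER_SIZE` and `acpi_entry_count` are hypothetical names, and the 36-byte figure is the size of the standard ACPI table header, assumed here to be what `sizeof (*xsdt)` evaluates to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the standard ACPI table header (assumption: this is
   what sizeof (*xsdt) amounts to in the patched function). */
#define ACPI_HEADER_SIZE 36u

/* Hypothetical helper: number of pointer entries that follow the header
   in a table of the given total length. entry_size is 4 for an RSDT
   (32-bit pointers) and 8 for an XSDT (64-bit pointers). */
static size_t
acpi_entry_count (uint32_t table_length, size_t entry_size)
{
  assert (table_length >= ACPI_HEADER_SIZE);
  assert (entry_size == sizeof (uint32_t) || entry_size == sizeof (uint64_t));

  return (table_length - ACPI_HEADER_SIZE) / entry_size;
}
```

For an XSDT that really holds 5 entries (length 36 + 5 * 8 = 76), dividing by `sizeof (uint32_t)` gives 10, so a loop stepping through 64-bit entries would read 5 entries past the end of the table. Dividing by `sizeof (uint64_t)` gives the correct 5. The overrun only matters when the signature is not found early, which fits the VMware case where no SPCR table exists.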
That's awesome! Thanks so much for figuring that out so quickly, we really appreciate it.
I don't like bugs :-)
Thanks all for working on it ;) I really appreciate the quick fix and the time spent on it!
Scratch build with Ben's proposed upstream fix is running at https://koji.fedoraproject.org/koji/taskinfo?taskID=124892097 - Renata, if you could test that when it's done that'd be great.
(In reply to Adam Williamson from comment #34)
> Scratch build with Ben's proposed upstream fix is running at
> https://koji.fedoraproject.org/koji/taskinfo?taskID=124892097 - Renata, if
> you could test that when it's done that'd be great.

Thanks Adam, it worked as expected, booted fine.
Thanks. Nicolas, could you possibly do official builds for Rawhide and F41, and an update for F41? It'd be great to have this fixed for F41 release.
Sure, I'll do it today.
FEDORA-2024-7d58433dd5 (grub2-2.12-10.fc41) has been submitted as an update to Fedora 41. https://bodhi.fedoraproject.org/updates/FEDORA-2024-7d58433dd5
+3 FE in https://pagure.io/fedora-qa/blocker-review/issue/1682 , marking accepted FE.
FEDORA-2024-7d58433dd5 has been pushed to the Fedora 41 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-7d58433dd5` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-7d58433dd5 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2024-7d58433dd5 (grub2-2.12-10.fc41) has been pushed to the Fedora 41 stable repository. If problem still persists, please make note of it in this bug report.