Bug 2317048 - Grub2 not working in VMware
Summary: Grub2 not working in VMware
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: grub2
Version: 41
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Nicolas Frayer
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: RejectedBlocker AcceptedFreezeException
Depends On:
Blocks: F41FinalFreezeException
Reported: 2024-10-07 19:49 UTC by Renata Ravanelli
Modified: 2024-10-25 10:41 UTC (History)

Fixed In Version: grub2-2.12-10.fc41
Clone Of:
Environment:
Last Closed: 2024-10-18 21:20:02 UTC
Type: ---
Embargoed:



Description Renata Ravanelli 2024-10-07 19:49:56 UTC
We encountered a boot issue with the latest GRUB2 versions available in Fedora 41 when running in VMware. Unfortunately, VMware provides limited logging, and the only error message available is:

"The firmware encountered an unexpected exception. The virtual machine cannot boot."
Tested Scenarios:

Summary:

The GRUB version 2.12.x causes boot failure in VMware environments.

For more details see: https://github.com/coreos/fedora-coreos-tracker/issues/1802


Reproducible: Always

Steps to Reproduce:
* Built FCOS41 with the latest packages and created a VMware OVA → Boot fails with the error message above.

* Built FCOS41 with the latest packages, upgraded GRUB to the versions below, and created a VMware OVA → Boot fails with the same error:
GRUB version: 2.12-7.fc42

* Built FCOS41 with the latest packages, downgraded GRUB and FUSE packages as follows, and created a VMware OVA → Boot succeeds:
GRUB version: 2.06-124.fc41
FUSE version: 2.9.9-22.fc41
FUSE-libs version: 2.9.9-22.fc41

* Built FCOS41 with the latest packages, downgraded GRUB and FUSE as follows, and created a VMware OVA → Boot fails with the same error:
GRUB version: 2.12-3.fc41


Actual Results:  
VMware does not boot

Expected Results:  
VMware boots

Comment 1 Fedora Blocker Bugs Application 2024-10-07 20:20:55 UTC
Proposed as a Blocker for 41-final by Fedora user ravanelli using the blocker tracking app because:

 This bug significantly impacts the ability to use Fedora 41 on VMware environments. Fedora CoreOS (FCOS) is widely deployed in virtualized environments, and VMware remains one of the most commonly used platforms for Fedora CoreOS users. The failure of Fedora 41 to boot with GRUB version 2.12.x in VMware renders the operating system unusable in these environments, which could affect a large number of users relying on VMware for production and development workloads.

Comment 2 Dusty Mabe 2024-10-07 21:31:05 UTC
I don't see any mention of VMware on https://fedoraproject.org/wiki/Basic_Release_Criteria so this is probably not a blocker candidate, but it is definitely a freeze exception candidate.

Comment 3 Renata Ravanelli 2024-10-08 02:12:39 UTC
I added the options pager=0 and debug=all to grub.cfg, as suggested by Marta Lewandowska, but still nothing shows up in VMware; the failure happens even before that point.

Comment 4 Marta Lewandowska 2024-10-08 13:08:53 UTC
Thank you for trying that, Renata! It's too bad we don't have any normal debug from grub. If I would instrument a grub build for you-- a scratch build which would be incompatible with Secure Boot-- would you be able to put that into an image and try to boot it?

The behavior I'm seeing on VMware from f-41 (grub2-2.12) is pretty weird. I tried installing from fedora-coreos-41.20240922.1.0-live.x86_64.iso and while that installs fine on BIOS, when I switch the firmware to UEFI, the VM cannot boot and shows a message that says: No compatible bootloader found. Trying to install on UEFI does not work because the same message shows up.

I also tried installing Fedora-Server-netinst-x86_64-41_Beta-1.2.iso, and that one boots fine on BIOS. I can switch the firmware to UEFI, and that also boots, but it has the ia32 shim and grub binaries installed... when I install the x64 (64-bit) versions and remove the 32-bit ones, the system again cannot boot and attempts to boot the installer instead.

Comment 5 Renata Ravanelli 2024-10-08 14:54:10 UTC
(In reply to Marta Lewandowska from comment #4)
> Thank you for trying that, Renata! It's too bad we don't have any normal
> debug from grub. If I would instrument a grub build for you-- a scratch
> build which would be incompatible with Secure Boot-- would you be able to
> put that into an image and try to boot it?
Yes, if you build that via bodhi, just send me the link, or you can send me the rpms some other way.

Comment 6 Dusty Mabe 2024-10-08 17:15:38 UTC
(In reply to Renata Ravanelli from comment #5)

> Yes, if you build that via bodhi, just send me the link, or you can send me
> the rpms some other way.

Doesn't even need to be bodhi, just a koji link to a scratch build should be good enough.

Comment 8 Adam Williamson 2024-10-13 16:40:12 UTC
-4 in https://pagure.io/fedora-qa/blocker-review/issue/1682 , rejected as a blocker. We should definitely track this and try and fix if it's on our side, though, and document if it can't be fixed.

Comment 9 Marta Lewandowska 2024-10-15 06:47:19 UTC
Renata and I have isolated it to /boot/grub/console.cfg, particularly the line: serial --speed=115200. She has tried a few variants of this line, specifying --port or --unit, and they appear to work. So now it's up to you all to decide what you need in that file in order for the VM to properly output grub content to the serial console, if that's what you want.

Comment 10 Dusty Mabe 2024-10-15 12:41:32 UTC
Hey Marta,

Thank you for the update. From an end user point of view this looks like a regression considering the exact same config works on the previous version of GRUB. Were we just relying on buggy behavior and the bug was fixed in the newer version?

This will pose a real problem for our existing users who choose to upgrade their bootloader. We would need to come up with some way to update the console settings using the migration script.

Another way to say it is: was this change in behavior intended? If we hit this problem I imagine there will (eventually) be customers who do as well and we'll need an answer for them.

Comment 11 Dusty Mabe 2024-10-15 12:44:38 UTC
Example of someone updating the bootloader and hitting this problem: https://github.com/coreos/fedora-coreos-tracker/issues/1802#issuecomment-2377851057

Comment 12 Renata Ravanelli 2024-10-15 14:08:11 UTC
What's also odd is that all the other platforms still only use `serial --speed=115200` without any issues, but VMware, when using UEFI, is the only one reporting problems.

Comment 13 Marta Lewandowska 2024-10-15 16:04:17 UTC
Hi Dusty,

I am not particularly knowledgeable about this, but all the examples in the upstream grub documentation: https://www.gnu.org/software/grub/manual/grub/grub.html#serial as well as our own beaker provisioning of (only) x86 UEFI machines-- check out the provisioning kickstart file and search for 'serial'-- have specified --port and/or --unit long before this version of grub was released, so maybe you were relying on buggy behavior?

I will look to see if the rebase touched the serial command.

Comment 14 Adam Williamson 2024-10-15 17:02:55 UTC
[adamw@xps13a grub (master)]$ git log --oneline grub-2.06..grub-2.12 grub-core/term/serial.c
f7a663c00 term/serial: Ensure proper NULL termination after grub_strncpy()
712309eaa term/serial: Use grub_strncpy() instead of grub_snprintf() when only copying string
8eb3d4df3 term/serial: Add support for PCI serial devices
35782e165 term/serial: Improve detection of duplicate serial ports
e37dbba66 term/serial: Avoid double lookup of serial ports
b73a44b28 term/serial: Replace usage of memcmp() with strncmp()
c4e801631 term/serial: Add ability to specify MMIO ports via "serial" command
7b192ec4c term/ns8250: Use ACPI SPCR table when available to configure serial
c2ef140a6 term/ns8250: Add configuration parameter when adding ports
b232f6f66 term: Remove trailing whitespaces

...so I'd say yeah, rebase from 2.06 to 2.12 definitely touched it. "Use ACPI SPCR table when available to configure serial", "Avoid double lookup of serial ports" and "Improve detection of duplicate serial ports" are the places I might start looking...

Comment 15 Adam Williamson 2024-10-15 17:21:03 UTC
Poking at the upstream grub code a bit, it does seem to explicitly handle the case where you don't specify a device:

static grub_err_t
grub_cmd_serial (grub_extcmd_context_t ctxt, int argc, char **args)
{
  struct grub_arg_list *state = ctxt->state;
  char pname[40];
  const char *name = NULL;
...
  if (!name)
    name = "auto";
...
  port = grub_serial_find (name);
...
}

now if we look at grub_serial_find:

struct grub_serial_port *
grub_serial_find (const char *name)
{
...
#if (defined(__i386__) || defined(__x86_64__)) && !defined(GRUB_MACHINE_IEEE1275) && !defined(GRUB_MACHINE_QEMU)
  if (grub_strcmp (name, "auto") == 0)
    {
      /* Look for an SPCR if any. If not, default to com0. */
      port = grub_ns8250_spcr_init ();
      if (port != NULL)
        return port;
      FOR_SERIAL_PORTS (port)
        if (grub_strcmp (port->name, "com0") == 0)
          return port;
    }
#endif
...
  return NULL;
}

so autodetection is supposed to try and find a port on x86 when not using qemu or OFW (I think that's what IEEE1275 is about anyway), and fall through to just returning NULL otherwise (which still seems like it shouldn't prevent boot). That code was touched extensively by c4e801631, 7b192ec4c and 35782e165f, so there's definitely change between 2.06 and 2.12 there. It seems like 7b192ec4cd7c4b3207db010202349dd283e72041 *introduced* this auto-detect code and before that not setting an explicit port would have just defaulted to doing `grub_serial_find ("com0");`, which would have returned any port named "com0", or NULL.

So...theory: this `grub_ns8250_spcr_init` autodetect code blows up on VMware for some reason?
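
To make the lookup order discussed in this comment easier to follow, here is a minimal, self-contained C sketch. It is not GRUB code: `struct port`, `find_port`, and the `spcr_port` parameter are hypothetical stand-ins (the last one models whatever `grub_ns8250_spcr_init ()` would return), and the initial by-name loop is an assumption about how `grub_serial_find` begins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct grub_serial_port; only the name matters here. */
struct port
{
  const char *name;
};

/*
 * Sketch of the lookup order described in this comment.
 * `ports` / `nports` model the list of registered ports, and `spcr_port`
 * models the result of the SPCR probe (NULL when no SPCR table is found).
 */
const struct port *
find_port (const struct port *ports, size_t nports,
           const struct port *spcr_port, const char *name)
{
  size_t i;

  assert (name != NULL);

  /* Assumed first step: a registered port with exactly this name wins. */
  for (i = 0; i < nports; i++)
    if (strcmp (ports[i].name, name) == 0)
      return &ports[i];

  if (strcmp (name, "auto") == 0)
    {
      /* Prefer the port described by the SPCR table, if there is one. */
      if (spcr_port != NULL)
        return spcr_port;

      /* Otherwise fall back to com0. */
      for (i = 0; i < nports; i++)
        if (strcmp (ports[i].name, "com0") == 0)
          return &ports[i];
    }

  /* Nothing matched. */
  return NULL;
}
```

With a registered `com0` and no SPCR port, `find_port (ports, n, NULL, "auto")` returns the `com0` entry; with no registered ports at all it returns NULL, which corresponds to the "fall through to just returning NULL" case above.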

Comment 16 Adam Williamson 2024-10-15 17:24:01 UTC
CCing the author of 7b192ec4cd7c4b3207db010202349dd283e72041 , who seems to be on RHBZ - hi Ben! Hope you don't mind the ping. Does my analysis above seem plausible? Thanks!

Comment 17 Adam Williamson 2024-10-15 17:31:55 UTC
Renata, I'm running a scratch build of grub2 with the ACPI SPCR serial port detection disabled by making grub_cmd_serial do `grub_serial_find ("com0");` like it used to, instead of `grub_serial_find ("auto");` - that should bypass the grub_ns8250_spcr_init call. It's running at https://koji.fedoraproject.org/koji/taskinfo?taskID=124858251 . Can you try it out when it's done and see if it avoids the bug?

Comment 18 Adam Williamson 2024-10-15 17:37:45 UTC
huh. I also see another problem with that change: it apparently unintentionally breaks the old "default to com0" behaviour on anything that fails the `#if (defined(__i386__) || defined(__x86_64__)) && !defined(GRUB_MACHINE_IEEE1275) && !defined(GRUB_MACHINE_QEMU)` check. it should probably be something like this:

  if (grub_strcmp (name, "auto") == 0)
    {
#if (defined(__i386__) || defined(__x86_64__)) && !defined(GRUB_MACHINE_IEEE1275) && !defined(GRUB_MACHINE_QEMU)
      /* Look for an SPCR if any. */
      port = grub_ns8250_spcr_init ();
      if (port != NULL)
        return port;
#endif
      /* If not, default to com0. */
      FOR_SERIAL_PORTS (port)
        if (grub_strcmp (port->name, "com0") == 0)
          return port;
    }

I think?

Comment 19 Adam Williamson 2024-10-15 17:46:59 UTC
ugh, actually, that's still inside a wider `#if (defined(__mips__) || defined (__i386__) || defined (__x86_64__)) && !defined(GRUB_MACHINE_EMU) && !defined(GRUB_MACHINE_ARC)` block so it's still not really right :/ we'd probably have to move the `if (grub_strcmp (name, "auto") == 0)` block up above the `#if (defined(__mips__) || defined (__i386__) || defined (__x86_64__)) && !defined(GRUB_MACHINE_EMU) && !defined(GRUB_MACHINE_ARC)` block to fix that. anyhow, it's a bit of a sidebar.

Comment 20 Benjamin Herrenschmidt 2024-10-15 22:07:30 UTC
Hrm... this code has been upstream for a while (and in Amazon Linux for even longer) and I've tested in VMware... give me a bit of time to dig into this and try to reproduce.

One thing I find confusing is that it seems this is being reproduced with UEFI ? I didn't think people would use the serial driver with UEFI instead of UEFI console, but I will look into it, I might have missed something.

Comment 21 Benjamin Herrenschmidt 2024-10-15 22:34:25 UTC
Hrm... Broadcom website is not letting me download anything until I file an account involving a "site ID" but the page to retrieve this is broken and you can't open a support ticket unless an FAE helps you create a "Project" ! yay ! I'll continue trying to download vmware in the next few days but it doesn't look good. Sadly nothing unexpected from BCM ...

Comment 22 Adam Williamson 2024-10-15 22:53:58 UTC
ugh, well that's repellent. I hadn't reproduced myself (yet), just poked at the code to try and see what might be up. https://www.techspot.com/downloads/189-vmware-workstation-for-windows.html is...probably an okay way to try and get it without signing up for anything? I disclaim all responsibility for consequences. :D

Comment 23 Benjamin Herrenschmidt 2024-10-15 23:00:46 UTC
Ah ! I was going to try with the Linux version first :-) I'll give that a spin on a sandboxed machine thanks

Comment 24 Renata Ravanelli 2024-10-15 23:27:29 UTC
(In reply to Adam Williamson from comment #17)
> Renata, I'm running a scratch build of grub2 with the ACPI SPCR serial port
> detection disabled by making grub_cmd_serial do `grub_serial_find ("com0");`
> like it used to, instead of `grub_serial_find ("auto");` - that should
> bypass the grub_ns8250_spcr_init call. It's running at
> https://koji.fedoraproject.org/koji/taskinfo?taskID=124858251 . Can you try
> it out when it's done and see if it avoids the bug?

Yes, it does avoid the bug, I can boot and access it as expected.

Comment 25 Adam Williamson 2024-10-15 23:35:05 UTC
benjamin: the URL has windows in it, but the page looks like it has Windows and Linux downloads.

Comment 26 Adam Williamson 2024-10-15 23:37:44 UTC
...and for your reference, the change I had renata test was just this:

[adamw@xps13a grub (master)]$ git diff
diff --git a/grub-core/term/serial.c b/grub-core/term/serial.c
index 8260dcb7a..72a6927b4 100644
--- a/grub-core/term/serial.c
+++ b/grub-core/term/serial.c
@@ -271,7 +271,7 @@ grub_cmd_serial (grub_extcmd_context_t ctxt, int argc, char **args)
     name = args[0];
 
   if (!name)
-    name = "auto";
+    name = "com0";
 
   port = grub_serial_find (name);
   if (!port)

I didn't include the patch I sent to grub-devel which reorders stuff, it was simply that, which I expect causes us to find and return a com0 in the `FOR_SERIAL_PORTS` loop at the start of `grub_serial_find`, and thus bypass all the later code. This is, AFAICT, probably how things worked on 2.06.

Comment 27 Benjamin Herrenschmidt 2024-10-16 02:04:08 UTC
The auto-detection is necessary on ec2 and a number of other platforms. So I'm not too keen on taking it out. We need to understand why it's breaking on VMWare. I've managed to install Workstation 17.6.1 on my machine, I'll do some testing... one thing I noticed is that if I just create a VM and install f40 on it, it doesn't come with a serial port at all (even the kernel doesn't see a ttyS0). This is on a BIOS boot. I've installed f40, updated grub to f41, so far so good, added serial --speed 115200 to the grub.cfg, and I see an error message about "auto" not being found, but it boots. I haven't tested with UEFI yet.

Adding a serial port to the VM makes at least that bit work and not crash.

I can also reproduce the crash in UEFI. I'll set up something I can actually compile things with now :-) (Fedora Core OS blows my brains...) and can boot both UEFI and BIOS, and I'll debug it.

Comment 28 Adam Williamson 2024-10-16 04:22:29 UTC
"The auto-detection is necessary on ec2 and a number of other platforms. So I'm not too keen on taking it out."

Oh, yeah, I absolutely wasn't proposing that as a fix - I just did it as a triage step: the fact that the build "works" means that the problem is indeed where we thought it was.

Comment 29 Benjamin Herrenschmidt 2024-10-16 04:27:53 UTC
It's crashing inside grub_acpi_find_table() which is ... concerning

Comment 30 Benjamin Herrenschmidt 2024-10-16 05:00:21 UTC
Bug in the grub ACPI code. This should fix it. I'll do more tests and submit upstream:

Subject: [PATCH] acpi: Fix out of bounds access in grub_acpi_xsdt_find_table()

The calculation of the size of the table was incorrect (copy/pasta from
grub_acpi_rsdt_find_table() I assume...). The entries are 64-bit long.

This causes us to access beyond the end of the table which is causing
crashes during boot on some systems. Typically this is causing a crash
on VMWare when using UEFI and enabling serial autodetection, as

grub_acpi_find_table (GRUB_ACPI_SPCR_SIGNATURE);

will go past the end of the table (the SPCR table doesn't exist).

Signed-off-by: Benjamin Herrenschmidt <benh.org>
---
 grub-core/kern/acpi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grub-core/kern/acpi.c b/grub-core/kern/acpi.c
index 48ded4e2e..8ff0835d5 100644
--- a/grub-core/kern/acpi.c
+++ b/grub-core/kern/acpi.c
@@ -75,7 +75,7 @@ grub_acpi_xsdt_find_table (struct grub_acpi_table_header *xsdt, const char *sig)
     return 0;
 
   ptr = (grub_unaligned_uint64_t *) (xsdt + 1);
-  s = (xsdt->length - sizeof (*xsdt)) / sizeof (grub_uint32_t);
+  s = (xsdt->length - sizeof (*xsdt)) / sizeof (grub_uint64_t);
   for (; s; s--, ptr++)
     {
       struct grub_acpi_table_header *tbl;
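
As a worked illustration of why the divisor matters, the following small C sketch computes the number of entries in an RSDT/XSDT-style table. It is not GRUB code: `acpi_entry_count` and `ACPI_HEADER_SIZE` are hypothetical names, and the 36-byte value is the size of the standard ACPI table header, which is what `sizeof (*xsdt)` is assumed to evaluate to in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the common ACPI table header that precedes the entries. */
#define ACPI_HEADER_SIZE 36u

/*
 * Number of entries in a table of `length` bytes whose entries are
 * `entry_size` bytes each: 4 for the RSDT (32-bit pointers),
 * 8 for the XSDT (64-bit pointers).
 */
size_t
acpi_entry_count (uint32_t length, size_t entry_size)
{
  assert (length >= ACPI_HEADER_SIZE);
  assert (entry_size == 4 || entry_size == 8);
  return (length - ACPI_HEADER_SIZE) / entry_size;
}
```

For an XSDT of 76 bytes (a 36-byte header plus five 8-byte entries), `acpi_entry_count (76, 8)` gives 5, while the old divisor gives `acpi_entry_count (76, 4)` = 10. Since the loop advances a 64-bit pointer on every iteration, ten iterations read 80 bytes of entries from a table that only holds 40, so a search for a signature that is not present (such as SPCR here) runs 40 bytes past the end of the table.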

Comment 31 Adam Williamson 2024-10-16 05:50:41 UTC
That's awesome! Thanks so much for figuring that out so quickly, we really appreciate it.

Comment 32 Benjamin Herrenschmidt 2024-10-16 06:44:52 UTC
I don't like bugs :-)

Comment 33 Renata Ravanelli 2024-10-16 16:08:02 UTC
Thanks all for working on it ;) I really appreciate the quick fix and the time spent on it!

Comment 34 Adam Williamson 2024-10-16 16:42:00 UTC
Scratch build with Ben's proposed upstream fix is running at https://koji.fedoraproject.org/koji/taskinfo?taskID=124892097 - Renata, if you could test that when it's done that'd be great.

Comment 35 Renata Ravanelli 2024-10-16 18:42:33 UTC
(In reply to Adam Williamson from comment #34)
> Scratch build with Ben's proposed upstream fix is running at
> https://koji.fedoraproject.org/koji/taskinfo?taskID=124892097 - Renata, if
> you could test that when it's done that'd be great.

Thanks Adam, it worked as expected, booted fine.

Comment 36 Adam Williamson 2024-10-16 18:53:34 UTC
Thanks. Nicolas, could you possibly do official builds for Rawhide and F41, and an update for F41? It'd be great to have this fixed for F41 release.

Comment 37 Nicolas Frayer 2024-10-17 05:52:33 UTC
Sure, I'll do it today.

Comment 38 Fedora Update System 2024-10-17 09:05:17 UTC
FEDORA-2024-7d58433dd5 (grub2-2.12-10.fc41) has been submitted as an update to Fedora 41.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-7d58433dd5

Comment 39 Adam Williamson 2024-10-17 16:17:43 UTC
+3 FE in https://pagure.io/fedora-qa/blocker-review/issue/1682 , marking accepted FE.

Comment 40 Fedora Update System 2024-10-18 01:47:16 UTC
FEDORA-2024-7d58433dd5 has been pushed to the Fedora 41 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-7d58433dd5`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-7d58433dd5

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 41 Fedora Update System 2024-10-18 21:20:02 UTC
FEDORA-2024-7d58433dd5 (grub2-2.12-10.fc41) has been pushed to the Fedora 41 stable repository.
If the problem still persists, please make note of it in this bug report.

