Bug 1999577 - RHCOS live ISO can fail to boot in UEFI mode; drops to grub shell
Summary: RHCOS live ISO can fail to boot in UEFI mode; drops to grub shell
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Benjamin Gilbert
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1981999
Blocks: 2000696
Reported: 2021-08-31 11:24 UTC by Bob Liu
Modified: 2023-09-15 01:14 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2000696
Environment:
Last Closed: 2021-10-18 17:50:21 UTC
Target Upstream Version:
Embargoed:


Attachments
Errors after booting from the USB stick (68.55 KB, image/jpeg), 2021-08-31 11:24 UTC, Bob Liu
Boot from RHEL 7.9 ISO (122.56 KB, image/jpeg), 2021-09-01 07:18 UTC, Bob Liu
Boot from RHEL 8.3 ISO (129.39 KB, image/jpeg), 2021-09-01 07:20 UTC, Bob Liu
Boot from latest Fedora CoreOS ISO (50.66 KB, image/jpeg), 2021-09-01 07:21 UTC, Bob Liu


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | coreos coreos-assembler pull 2404 | 0 | None | None | None | 2021-09-04 14:08:12 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:50:25 UTC

Description Bob Liu 2021-08-31 11:24:12 UTC
Created attachment 1819329 [details]
Errors after booting from the USB stick

Description of problem:
I tried to use rhcos-live.x86_64.iso to install bootstrap/master/worker nodes, but it failed to boot with errors on physical servers (Lenovo model x3650 M4 or x3650 M5). The same image works on VMs.

Version-Release number of selected component (if applicable):
4.7.13 and 4.8.2

How reproducible:

Steps to Reproduce:
1. Download rhcos-live.x86_64.iso from https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.8/latest/
2. Use command "dd bs=4M if=rhcos-live.x86_64.iso.0 of=/dev/sdb && sync" to burn it to a USB stick
3. Try to boot OS from the USB stick on physical servers(Lenovo model x3650 M4 or x3650 M5) with UEFI or Legacy mode
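
Because a bad write to the USB stick is one possible cause of a boot failure, it may help to confirm that the stick holds the same bytes as the ISO before testing. The following is a minimal sketch, assuming GNU cmp (for the -n byte limit); verify_image is a hypothetical helper name, not part of any RHCOS tooling.

```shell
#!/usr/bin/env bash
# Sketch: check that the first bytes of a target (e.g. the USB device)
# match the ISO that was written to it.
# verify_image is a hypothetical helper introduced for illustration.
verify_image() {
    local iso="$1" target="$2" size
    # Refuse to continue if the ISO cannot be read.
    [ -r "$iso" ] || return 2
    # Size of the ISO in bytes; the arithmetic expansion strips whitespace.
    size=$(( $(wc -c < "$iso") ))
    # The device is normally larger than the ISO, so limit the comparison
    # to the ISO's length (-n is a GNU cmp option).
    cmp -n "$size" "$iso" "$target"
}
```

For the command in step 2 this would be called as `verify_image rhcos-live.x86_64.iso.0 /dev/sdb` (reading the device usually requires root). An exit status of 0 means the written bytes match. Re-plugging the stick first avoids comparing against cached data.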

Actual results:
Either GRUB reports an error (see the attachment) or the server restarts without any error. I have tried two different release images and two different USB sticks, with the same symptom.

Expected results:
Can load CoreOS on the bootable USB stick

Additional info:
I tried writing another image, rhcos-metal.x86_64.raw.gz, to the same USB stick with dd. It boots on the same server, but it requires a username/password at the login prompt, so it can't be used for installing bootstrap/master/worker nodes.

Comment 1 Zane Bitter 2021-08-31 13:52:58 UTC
It's difficult to imagine what this has to do with the cluster-baremetal-operator. Reassigning to RHCOS.

Comment 2 Micah Abbott 2021-08-31 14:19:52 UTC
The error in the screenshot says that grub hit an out of memory condition.  Looking at the code, it appears grub is trying to allocate an amount of memory:

https://git.savannah.gnu.org/cgit/grub.git/tree/grub-core/kern/mm.c#n319

How much memory is installed in these servers?  

Is it possible to try booting a RHEL8 ISO on these servers?  Or perhaps the latest Fedora CoreOS live ISO?

https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/34.20210808.3.0/x86_64/fedora-coreos-34.20210808.3.0-live.x86_64.iso

This might be an issue we need the grub team to look at, but let's gather some more info first.

Comment 3 Bob Liu 2021-09-01 07:17:22 UTC
@miabbott There is 256 GB of memory in total on each server.

I booted the RHEL 8.3 ISO (written with dd to the same USB stick) with no problem, but the latest Fedora CoreOS live ISO (fedora-coreos-34.20210808.3.0-live.x86_64.iso) failed with the same symptom as before.

By the way, I can also boot the RHEL 7.9 ISO on the same server and install the OS without any problem. I will attach the screenshots in the next comments. Thanks.

Comment 4 Bob Liu 2021-09-01 07:18:49 UTC
Created attachment 1819591 [details]
Boot from RHEL 7.9 ISO

Comment 5 Bob Liu 2021-09-01 07:20:51 UTC
Created attachment 1819592 [details]
Boot from RHEL 8.3 ISO

Comment 6 Bob Liu 2021-09-01 07:21:42 UTC
Created attachment 1819593 [details]
Boot from latest Fedora CoreOS ISO

Comment 8 Micah Abbott 2021-09-01 19:39:53 UTC
I used the same `dd` command to write out the latest RHCOS 4.8 live ISO (48.84.202107202156-0) and the latest FCOS stable release (34.20210808.3.0) to a USB stick (on separate tries, of course).

On an older desktop system with 16 GB of RAM, I was able to boot into the live environment successfully in legacy BIOS mode with both versions of the live ISO.

However, when I tried booting the same ISOs in UEFI mode, I was dropped to a grub shell.

So there may be something amiss here with the live ISO + UEFI, but it doesn't appear the "out of memory" error is easily reproducible.

We'll continue to debug this.

Comment 9 Dusty Mabe 2021-09-01 20:22:57 UTC
Just tried to boot rhcos-4.8.2-x86_64-live.x86_64.iso on an intel NUC. In EFI mode I was dropped to grub shell.

Comment 10 Dusty Mabe 2021-09-01 20:57:16 UTC
Opened upstream issue against FCOS: https://github.com/coreos/fedora-coreos-tracker/issues/953

Comment 11 Bob Liu 2021-09-01 23:35:34 UTC
In my testing, UEFI mode always triggers a system reset, and Legacy mode drops to a grub shell most of the time. I saw the detailed "out of memory" error only once, after selecting the "Continue boot" option. By the way, I upgraded the IMM2/UEFI firmware of the x3650 M5 servers to the latest version, and the symptom is the same before and after the upgrade. FYI, thanks.

Comment 12 Benjamin Gilbert 2021-09-02 07:27:59 UTC
https://github.com/coreos/coreos-assembler/pull/2404 should solve the GRUB shell issue, though it's not clear if it will solve the original out-of-memory errors.  I'll post here once we have a fixed ISO image for testing.

Comment 13 Micah Abbott 2021-09-02 13:23:41 UTC
You can try a fixed version of the FCOS live ISO here - https://miabbott.fedorapeople.org/fedora-coreos-34.20210902.dev.0-live.x86_64.iso

Comment 15 Micah Abbott 2021-09-02 14:38:04 UTC
Retitling this BZ to cover the case of the live ISO dropping to grub shell

Marking as a blocker for 4.9

Comment 16 Micah Abbott 2021-09-02 15:17:43 UTC
@lbseraph if the "out of memory" problem is repeatable, please open a new BZ against RHEL 8/grub

Comment 17 RHCOS Bug Bot 2021-09-02 16:48:06 UTC
This bug has been reported fixed in a new RHCOS build.  Do not move this bug to MODIFIED until the fix has landed in a new bootimage.

Comment 18 Bob Liu 2021-09-03 05:19:36 UTC
@miabbott I tried fedora-coreos-34.20210902.dev.0-live.x86_64.iso. Only the x3650 M4 can boot it in Legacy mode; the x3650 M5 can boot it only via the "Boot From File" method. UEFI mode still does not work on either server. At least it is a little better than before (rhcos-live.x86_64.iso). May I know which RHCOS build will include the fix? Thanks.

Comment 19 Benjamin Gilbert 2021-09-03 05:22:47 UTC
RHCOS 49.84.202109022216-0 should have the fix.  Are you still seeing the same out-of-memory error?

Comment 20 Bob Liu 2021-09-03 06:03:56 UTC
No. Booting in UEFI mode resets both servers immediately, and the x3650 M5 booted in Legacy mode drops to a grub shell without any errors. The out-of-memory error might be a corner case (e.g. a hardware issue); I didn't see it again in the last few attempts. Thanks.

Comment 21 Benjamin Gilbert 2021-09-03 06:09:17 UTC
Okay, thanks.  To be clear: in UEFI mode, do the servers reset repeatedly?  The expected pattern is: shim installs a boot entry, then the machine reboots and boots from that boot entry.

For the grub shell in legacy mode, could you type "set" and post a screenshot?
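
One way to observe the UEFI pattern described above is to inspect the firmware boot entries from a booted Linux system with efibootmgr. The following is a minimal sketch, assuming efibootmgr's usual "BootCurrent: NNNN" and "BootOrder: NNNN,NNNN" output lines; the helper names are hypothetical and the exact entries shown will differ per machine.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; both read `efibootmgr` output on stdin.

# Print the number of the entry the firmware booted from, e.g. "0001".
current_boot_entry() {
    awk '/^BootCurrent:/ { print $2 }'
}

# Print the firmware boot order as a comma-separated list, e.g. "0001,0000".
boot_order() {
    awk '/^BootOrder:/ { print $2 }'
}
```

Typical use would be `efibootmgr | current_boot_entry` and `efibootmgr | boot_order`. Comparing the boot order before and after the automatic reboot would show whether a new entry was added.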

Comment 22 Bob Liu 2021-09-03 08:27:59 UTC
@bgilbert Thanks for your feedback. You are right: after testing again in UEFI mode, the fix in fedora-coreos-34.20210902.dev.0-live.x86_64.iso is working (previously I didn't wait for the boot from the boot entry that shim installed). But in legacy mode, I can't see it in grub shell in last few attempts due to the servers reset before stopping in grub shell.

By the way, will the fix cherry pick for the OCP 4.8? The build RHCOS 49.84.202109022216-0 you mentioned should be for OCP 4.9? Thanks.

Comment 23 Benjamin Gilbert 2021-09-03 08:40:51 UTC
> But in legacy mode, I can't see it in grub shell in last few attempts due to the servers reset before stopping in grub shell.

I'm not sure I understand.  Are you saying the machine just reboots when you boot in legacy mode?

> By the way, will the fix cherry pick for the OCP 4.8?

Yes, see bug 2000696.

Comment 24 Bob Liu 2021-09-03 10:38:48 UTC
> I'm not sure I understand.  Are you saying the machine just reboots when you boot in legacy mode?

Yes, in my last few attempts. In the original issue it dropped to a grub shell; I have no idea what changed. Anyway, with UEFI mode working in the new build, I can install RHCOS on the physical servers. Thanks for the help.

Comment 27 Benjamin Gilbert 2021-09-03 18:26:47 UTC
@lbosh.com, great, thanks.  Please feel free to file a new BZ for the legacy boot issue.

Comment 28 RHCOS Bug Bot 2021-09-22 18:37:26 UTC
The fix for this bug has landed in a bootimage bump, as tracked in bug 1981999 (now in status MODIFIED).  Moving this bug to MODIFIED.

Comment 30 Michael Nguyen 2021-09-27 22:20:28 UTC
Verified on RHCOS 49.84.202109241334-0.

Downloaded the RHCOS live ISO and wrote to USB drive
dd bs=4M if=rhcos-49.84.202109241334-0-live.x86_64.iso of=/dev/sdb oflag=sync status=progress

Booted USB stick in UEFI mode and verified boot was successful
Also booted in BIOS mode and verified boot was successful
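
To confirm which firmware mode a booted live environment actually used, one common check on Linux is whether /sys/firmware/efi exists. The following is a minimal sketch; boot_mode is a hypothetical helper, and its optional argument exists only so the check can be exercised against a test directory.

```shell
#!/usr/bin/env bash
# Sketch: report whether the running system was booted via UEFI or legacy BIOS.
# boot_mode is a hypothetical helper; the optional argument overrides the
# sysfs root and is intended for testing only.
boot_mode() {
    local sysfs_root="${1:-/sys}"
    # The kernel exposes this directory only when booted through EFI firmware.
    if [ -d "$sysfs_root/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}
```

Running `boot_mode` with no arguments in the live environment prints either UEFI or BIOS.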

Comment 33 errata-xmlrpc 2021-10-18 17:50:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

Comment 34 Red Hat Bugzilla 2023-09-15 01:14:28 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

