Bug 1999577
Summary: | RHCOS live ISO can fail to boot in UEFI mode; drops to grub shell |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Bob Liu <lbosh> |
Component: | RHCOS | Assignee: | Benjamin Gilbert <bgilbert> |
Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
Severity: | high | Docs Contact: | |
Priority: | urgent |
Version: | 4.8 | CC: | aos-bugs, bfournie, bgilbert, calfonso, dornelas, dustymabe, jligon, miabbott, mrussell, nstielau, smilner, tsedovic |
Target Milestone: | --- |
Target Release: | 4.9.0 |
Hardware: | x86_64 |
OS: | Linux |
Whiteboard: | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | |
: | 2000696 | Environment: | |
Last Closed: | 2021-10-18 17:50:21 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |
Bug Depends On: | 1981999 |
Bug Blocks: | 2000696 |
Attachments: |
|
Description
Bob Liu
2021-08-31 11:24:12 UTC
It's difficult to imagine what this has to do with the cluster-baremetal-operator. Reassigning to RHCOS.

The error in the screenshot says that grub hit an out of memory condition. Looking at the code, it appears grub is trying to allocate an amount of memory: https://git.savannah.gnu.org/cgit/grub.git/tree/grub-core/kern/mm.c#n319 How much memory is installed in these servers? Is it possible to try booting a RHEL8 ISO on these servers? Or perhaps the latest Fedora CoreOS live ISO? https://builds.coreos.fedoraproject.org/prod/streams/stable/builds/34.20210808.3.0/x86_64/fedora-coreos-34.20210808.3.0-live.x86_64.iso This might be an issue we need the grub team to look at, but let's gather some more info first.

@miabbott There is a total of 256 GB of memory on each server. I booted the RHEL 8.3 ISO (written with dd to the same USB stick) without problems, but the latest Fedora CoreOS live ISO (fedora-coreos-34.20210808.3.0-live.x86_64.iso) failed with the same symptom as before. By the way, I can also boot the RHEL 7.9 ISO on the same server and install the OS without any problem. I will attach the screenshots in the next comment. Thanks.

Created attachment 1819591 [details]
Boot from RHEL 7.9 ISO
Created attachment 1819592 [details]
Boot from RHEL 8.3 ISO
Created attachment 1819593 [details]
Boot from latest Fedora CoreOS ISO
I used the same `dd` command to write out the latest RHCOS 4.8 live ISO (48.84.202107202156-0) and the latest FCOS stable release (34.20210808.3.0) to a USB stick (on separate tries, of course). On an older desktop system with 16 GB of RAM, I was able to boot into the live environment successfully in legacy BIOS mode with both versions of the live ISO. However, when I tried booting the same ISOs in UEFI mode, I was dropped to a grub shell. So there may be something amiss here with the live ISO + UEFI, but it doesn't appear the "out of memory" error is easily reproducible. We'll continue to debug this.

Just tried to boot rhcos-4.8.2-x86_64-live.x86_64.iso on an Intel NUC. In EFI mode I was dropped to a grub shell.

Opened upstream issue against FCOS: https://github.com/coreos/fedora-coreos-tracker/issues/953

In my testing, UEFI mode always triggers a system reset, and Legacy mode drops to a grub shell most of the time. I saw the detailed "out of memory" error only once, after selecting the "Continue boot" option. By the way, I have upgraded the IMM2/UEFI firmware of the x3650 M5 servers to the latest version and tested before and after, with the same symptom. FYI, thanks.

https://github.com/coreos/coreos-assembler/pull/2404 should solve the GRUB shell issue, though it's not clear if it will solve the original out-of-memory errors. I'll post here once we have a fixed ISO image for testing.

You can try a fixed version of the FCOS live ISO here: https://miabbott.fedorapeople.org/fedora-coreos-34.20210902.dev.0-live.x86_64.iso

Retitling this BZ to cover the case of the live ISO dropping to grub shell.

Marking as a blocker for 4.9.

@lbseraph if the "out of memory" problem is repeatable, please open a new BZ against RHEL 8/grub.

This bug has been reported fixed in a new RHCOS build. Do not move this bug to MODIFIED until the fix has landed in a new bootimage.
@miabbott I tried fedora-coreos-34.20210902.dev.0-live.x86_64.iso. Only the x3650 M4 can boot it in Legacy mode; the x3650 M5 can boot it only via the "Boot From File" method. UEFI mode is still not working on either server. At least it is a little better than before (rhcos-live.x86_64.iso). May I know which RHCOS build will include the fix? Thanks.

RHCOS 49.84.202109022216-0 should have the fix. Are you still seeing the same out-of-memory error?

No. A UEFI mode boot triggers a reset on both servers directly, and the x3650 M5 in Legacy mode drops to a grub shell without any errors. The out-of-memory error might be a corner case here (e.g. a hardware issue); I didn't see it again in the last few attempts. Thanks.

Okay, thanks. To be clear: in UEFI mode, do the servers reset repeatedly? The expected pattern is: shim installs a boot entry, then the machine reboots and boots from that boot entry. For the grub shell in legacy mode, could you type "set" and post a screenshot?

@bgilbert Thanks for your feedback. You are right: after testing again in UEFI mode, the fix in fedora-coreos-34.20210902.dev.0-live.x86_64.iso is working (previously I didn't wait for the boot from the boot entry that shim installed). But in legacy mode, I couldn't see it in the grub shell in the last few attempts because the servers reset before stopping in the grub shell. By the way, will the fix be cherry-picked for OCP 4.8? The build you mentioned, RHCOS 49.84.202109022216-0, should be for OCP 4.9? Thanks.

> But in legacy mode, I couldn't see it in the grub shell in the last few attempts because the servers reset before stopping in the grub shell.
I'm not sure I understand. Are you saying the machine just reboots when you boot in legacy mode?
> By the way, will the fix be cherry-picked for OCP 4.8?
Yes, see bug 2000696.

> I'm not sure I understand. Are you saying the machine just reboots when you boot in legacy mode?
Yes, in my last few attempts. In the original issue it dropped to the grub shell; I have no idea what happened. Anyway, UEFI mode works fine in the new build, which allows me to install RHCOS on physical servers. Thanks for the help.
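When comparing UEFI and legacy results like the ones in this thread, it can help to confirm which firmware mode a booted Linux environment actually used. The following is a minimal sketch, not something taken from this bug: it assumes a Linux system with sysfs mounted at /sys, and `firmware_mode` with its optional sysfs-root argument is a hypothetical helper added for illustration. It relies on the kernel exposing /sys/firmware/efi only when booted through UEFI.

```shell
#!/bin/sh
# Report whether the running Linux system was booted via UEFI or legacy BIOS.
# firmware_mode is a hypothetical helper; the optional argument (default /sys)
# only exists so the check can be pointed at a different sysfs root.
firmware_mode() {
    sysfs_root="${1:-/sys}"
    # The kernel creates /sys/firmware/efi only on UEFI boots.
    if [ -d "$sysfs_root/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}

# Print the mode of the currently running system.
firmware_mode
```

Running this from the live environment after each boot attempt would show whether the firmware really used the boot path that was selected in the setup menu.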
@lbosh.com, great, thanks. Please feel free to file a new BZ for the legacy boot issue.

The fix for this bug has landed in a bootimage bump, as tracked in bug 1981999 (now in status MODIFIED). Moving this bug to MODIFIED.

Verified on RHCOS 49.84.202109241334-0. Downloaded the RHCOS live ISO and wrote it to a USB drive:
dd bs=4M if=rhcos-49.84.202109241334-0-live.x86_64.iso of=/dev/sdb oflag=sync status=progress
Booted the USB stick in UEFI mode and verified boot was successful. Also booted in BIOS mode and verified boot was successful.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days. |
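The write step from the verification comment can be sketched as a small shell helper. This is a sketch under stated assumptions, not the procedure QA used: it assumes GNU `dd` and `cmp`, and `write_image` is a hypothetical name that does not appear in this bug. The read-back comparison is only a basic sanity check, since the data may be served from the page cache rather than from the device.

```shell
#!/bin/sh
# write_image SRC DST: copy an ISO image to a target and compare the result.
# Hypothetical helper for illustration; the dd flags match the verification
# comment (bs=4M, oflag=sync, status=progress) and require GNU dd.
write_image() {
    src="$1"
    dst="$2"
    # Refuse to continue if the source image is missing.
    [ -f "$src" ] || { echo "no such image: $src" >&2; return 1; }
    # oflag=sync requests synchronized I/O, so dd returns once data is written out.
    dd bs=4M if="$src" of="$dst" oflag=sync status=progress || return 1
    # Compare only the first <image size> bytes: a USB stick is normally
    # larger than the image written to it.
    size=$(( $(wc -c < "$src") ))
    cmp -n "$size" "$src" "$dst"
}

# Example (destructive: overwrites the target device; needs root):
# write_image rhcos-49.84.202109241334-0-live.x86_64.iso /dev/sdb
```

The target device name must be double-checked before running the example call, because `dd` overwrites whatever device it is given.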