Bug 2011306 - SNO zero touch provisioning redeployment via virtual media sometimes fails due to modified boot table entry
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: RHCOS Bug Triage
QA Contact: Michael Nguyen
URL:
Whiteboard:
Duplicates: 1978314
Depends On:
Blocks:
Reported: 2021-10-06 11:31 UTC by Marius Cornea
Modified: 2022-03-15 18:00 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-15 18:00:22 UTC
Target Upstream Version:
Embargoed:
bfournie: needinfo-


Attachments
virtualmediaboot-2021-10-06_14.16.34.mkv (4.26 MB, application/x-matroska)
2021-10-06 11:31 UTC, Marius Cornea


Links
System ID Private Priority Status Summary Last Updated
OpenStack Storyboard 2008763 0 None None None 2021-10-12 15:00:40 UTC

Description Marius Cornea 2021-10-06 11:31:56 UTC
Created attachment 1829804
virtualmediaboot-2021-10-06_14.16.34.mkv

Description of problem:

SNO deployment on an HPE e910 blade fails because the node does not boot from virtual media and instead falls back to booting the previous installation on the internal drive.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-10-05-004711

How reproducible:
100%

Steps to Reproduce:
1. Deploy SNO node by following the ZTP procedure

Actual results:

When the node defined in the BMH object first boots it does not boot from the ISO image attached via Virtual Media.

Expected results:

When the node defined in the BMH object first boots it boots from the ISO image attached via Virtual Media.

Additional info:

Attaching must-gather and video recording of the boot process.

Comment 4 Derek Higgins 2021-10-07 11:05:44 UTC
Looks like this entry in your boot table was causing the instruction to boot from CD to do the wrong thing:
    "PciRoot(0x0)/Pci(0x1C,0x4)/Pci(0x0,0x4)/USB(0x1,0x0)/CDROM(0x1)/\\EFI\\redhat\\shimx64.efi",

After removing it, the host booted correctly from the ISO.

I've hit this before, https://storyboard.openstack.org/#!/story/2008763
I'm not sure what to do about it, ironic can't clean out the entry if it can't boot IPA to clean it...

I believe RHCOS creates this entry when it boots from CD.

Comment 6 Marius Cornea 2021-10-08 12:30:04 UTC
The installation process moves forward after manually removing the boot entry, so I am lowering the severity.

Comment 7 Bob Fournier 2021-10-12 16:07:22 UTC
This can happen when an image other than RHCOS has been booted; consistently booting only the RHCOS image will not result in this problem. This should be a doc change indicating that any such entries in the boot table (see comment #4) should be removed manually.

Comment 8 Marius Cornea 2021-10-15 08:25:14 UTC
This also happened on a Dell machine that is used only for OCP deployments.

workaround:
efibootmgr -v | grep shimx64.efi | grep CDROM
Boot000A* Red Hat Enterprise Linux	PciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(0,0)/USB(2,0)/Unit(0)/CDROM(1,0x221,0xd2a)/File(\EFI\redhat\shimx64.efi)
efibootmgr -B -b 000A
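
The grep-and-delete workaround above could be sketched as a small parser, in case several stale entries need to be found at once. This is only a sketch under assumptions: the function names and the SAMPLE text are hypothetical (SAMPLE is stitched together from lines quoted in this bug), and the code only prints the `efibootmgr -B -b` commands rather than running them.

```python
import re

# Matches entry lines such as "Boot000A* Red Hat Enterprise Linux  PciRoot(...)".
# Header lines like "BootCurrent:" or "BootOrder:" do not match because the
# four characters after "Boot" must be hexadecimal digits.
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})\*?\s+(.*)$")


def find_stale_cdrom_entries(efibootmgr_output):
    """Return boot numbers whose text contains both CDROM( and shimx64.efi."""
    stale = []
    for line in efibootmgr_output.splitlines():
        match = ENTRY_RE.match(line.strip())
        if not match:
            continue
        number, rest = match.groups()
        if "CDROM(" in rest and "shimx64.efi" in rest.lower():
            stale.append(number.upper())
    return stale


def removal_commands(boot_numbers):
    """Build the delete command for each boot number (not executed here)."""
    return [f"efibootmgr -B -b {number}" for number in boot_numbers]


# Hypothetical sample assembled from entries quoted in this bug report.
SAMPLE = r"""BootCurrent: 0005
BootOrder: 0005,000A
Boot0005* Red Hat Enterprise Linux  HD(2,GPT,1e8869d4-1225-4915-866c-9e18550a9a72,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot000A* Red Hat Enterprise Linux  PciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(0,0)/USB(2,0)/Unit(0)/CDROM(1,0x221,0xd2a)/File(\EFI\redhat\shimx64.efi)
"""

if __name__ == "__main__":
    for command in removal_commands(find_stale_cdrom_entries(SAMPLE)):
        print(command)
```

The HD-backed entry is left alone because it has no CDROM node in its device path; only the entry pointing at shimx64.efi on the virtual CD is flagged.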

Comment 11 Ian Miller 2021-10-20 02:19:09 UTC
I believe this may be the same issue we are also tracking/investigating under BZ 1978314

Comment 13 Marius Cornea 2021-10-20 08:53:25 UTC
Adding another data point: the `CDROM(1,0x221,0xd2a)/File(\EFI\redhat\shimx64.efi)` boot entry gets created when deploying OCP 4.8 but not when deploying OCP 4.9, so this issue manifests when OCP 4.8 is deployed first and OCP 4.9 is then deployed on the same machine via the ZTP process.

Comment 17 Ian Miller 2021-10-28 17:38:51 UTC
*** Bug 1978314 has been marked as a duplicate of this bug. ***

Comment 28 Ian Miller 2021-12-09 22:40:46 UTC
I tested with the latest 4.9 install, which includes a backport of the fix for BZ 2004449 (https://github.com/coreos/coreos-assembler/pull/2436), and did not experience this issue. Specifics of the test:

Starting with a clean UEFI boot order (deleted all entries with efibootmgr -B -b 000x)
# efibootmgr -v
BootCurrent: 0002
No BootOrder is set; firmware will attempt recovery
MirroredPercentageAbove4G: 0.00
MirrorMemoryBelow4GB: false
#

OCP 4.9.10 installed (first install)
# efibootmgr -v
BootCurrent: 0005
BootOrder: 0005,0000,0001,0003,0004
Boot0000* Virtual Floppy        PciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(0,0)/USB(2,0)/Unit(1)
Boot0001* Virtual CD    PciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(0,0)/USB(2,0)/Unit(0)
Boot0003* Integrated NIC 1 Port 1 Partition 1   VenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)
Boot0004* Integrated NIC 1 Port 3 Partition 1   VenHw(d227c733-f75f-4341-b749-4d1759ec8538)
Boot0005* Red Hat Enterprise Linux      HD(2,GPT,1e8869d4-1225-4915-866c-9e18550a9a72,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
MirroredPercentageAbove4G: 0.00
MirrorMemoryBelow4GB: false

# uname -a
Linux cnfocto1.ptp.lab.eng.bos.redhat.com 4.18.0-305.28.1.el8_4.x86_64 #1 SMP Mon Nov 8 07:45:47 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
# egrep Core /etc/motd 
Red Hat Enterprise Linux CoreOS 49.84.202111292103-0

Second install succeeded. System booted directly into the boot ISO.
# efibootmgr -v
BootCurrent: 0002
BootOrder: 0002,0000,0001,0003,0004
Boot0000* Virtual Floppy        PciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(0,0)/USB(2,0)/Unit(1)
Boot0001* Virtual CD    PciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(0,0)/USB(2,0)/Unit(0)
Boot0002* Red Hat Enterprise Linux      HD(2,GPT,1e8869d4-1225-4915-866c-9e18550a9a72,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
Boot0003* Integrated NIC 1 Port 1 Partition 1   VenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)
Boot0004* Integrated NIC 1 Port 3 Partition 1   VenHw(d227c733-f75f-4341-b749-4d1759ec8538)
MirroredPercentageAbove4G: 0.00
MirrorMemoryBelow4GB: false
# uname -a
Linux cnfocto1.ptp.lab.eng.bos.redhat.com 4.18.0-305.28.1.el8_4.x86_64 #1 SMP Mon Nov 8 07:45:47 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
# egrep Core /etc/motd 
Red Hat Enterprise Linux CoreOS 49.84.202111292103-0
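
The check performed by eye on the listings above (first BootOrder entry is the disk, and no CDROM file entry is left behind) could be sketched as follows. This is an illustrative sketch only: `parse_boot_table`, `classify`, `summarize`, and the labels they return are hypothetical names, and the classification relies on plain substring matching of the device-path text rather than on any efibootmgr feature.

```python
import re

ENTRY = re.compile(r"^Boot([0-9A-Fa-f]{4})\*?\s+(.*)$")


def parse_boot_table(output):
    """Return (boot_order, entries) parsed from `efibootmgr -v` text."""
    order, entries = [], {}
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("BootOrder:"):
            numbers = line.split(":", 1)[1].split(",")
            order = [n.strip().upper() for n in numbers if n.strip()]
            continue
        match = ENTRY.match(line)
        if match:
            entries[match.group(1).upper()] = match.group(2)
    return order, entries


def classify(entry_text):
    """Label an entry by the device-path node that appears in its text."""
    if "CDROM(" in entry_text:
        return "cdrom-file"
    if "HD(" in entry_text:
        return "hard-disk"
    return "other"


def summarize(output):
    """Return (kind of first BootOrder entry, whether any CDROM file entry exists)."""
    order, entries = parse_boot_table(output)
    first = classify(entries[order[0]]) if order and order[0] in entries else None
    has_cdrom_file = any(classify(text) == "cdrom-file" for text in entries.values())
    return first, has_cdrom_file


# Listing copied from the "first install" output in this comment.
FIRST_INSTALL = r"""BootCurrent: 0005
BootOrder: 0005,0000,0001,0003,0004
Boot0000* Virtual Floppy        PciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(0,0)/USB(2,0)/Unit(1)
Boot0001* Virtual CD    PciRoot(0x0)/Pci(0x14,0x0)/USB(13,0)/USB(0,0)/USB(2,0)/Unit(0)
Boot0003* Integrated NIC 1 Port 1 Partition 1   VenHw(3a191845-5f86-4e78-8fce-c4cff59f9daa)
Boot0004* Integrated NIC 1 Port 3 Partition 1   VenHw(d227c733-f75f-4341-b749-4d1759ec8538)
Boot0005* Red Hat Enterprise Linux      HD(2,GPT,1e8869d4-1225-4915-866c-9e18550a9a72,0x1000,0x3f800)/File(\EFI\redhat\shimx64.efi)
"""

if __name__ == "__main__":
    print(summarize(FIRST_INSTALL))
```

For a table with no BootOrder set, such as the clean starting state shown earlier, `summarize` returns `None` for the first entry instead of raising.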

Comment 29 Bob Fournier 2021-12-10 00:17:06 UTC
Ian - thanks, that's great news.

Marius - similar to what Ian did, could you clear out the entries that are causing the problem and try it with the recent version? If that works we could probably close this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2004449.

Comment 30 Marius Cornea 2021-12-14 13:02:33 UTC
(In reply to Bob Fournier from comment #29)
> Ian - thanks, that's great news.
> 
> Marius - similar to what Ian did, could you clear out the entries that are
> causing the problem and try it with the recent version? If that works we
> could probably close this as a duplicate of
> https://bugzilla.redhat.com/show_bug.cgi?id=2004449.

The issue didn't reproduce with rhcos-49.84.202111292103-0, but an older image is still referenced in openshift-installer, assisted-service, and the openshift dependencies, so I believe these need to be updated to the more recent rhcos build so we can consume the fix:

openshift-installer: https://github.com/openshift/installer/blob/release-4.9/data/data/rhcos.json#L118-L121
assisted-service: https://github.com/openshift/assisted-service/blob/master/data/default_os_images.json#L26-L28
dependencies: https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.9/4.9.0/

Comment 31 Bob Fournier 2021-12-14 13:24:34 UTC
(In reply to Marius Cornea from comment #30)

Marius - great, thanks for checking that.

I changed the Product+Component to reflect that the build needs to be updated and that this isn't a shim issue.

Comment 33 Matthew Staebler 2022-03-14 15:08:01 UTC
From a brief perusal of this BZ, it looks like the action that needs to be taken is to update the RHCOS version in the stream data. I am sending this BZ to the CoreOS team as they own that.

Comment 34 Micah Abbott 2022-03-14 17:29:37 UTC
(In reply to Marius Cornea from comment #30)

I'm a bit confused here. We fixed BZ#2004449 first with a change to `coreos-assembler` (https://github.com/coreos/coreos-assembler/pull/2435) that dropped the `shim` fallback.efi from the live ISO. New bootimages were generated and then used as part of the update to `openshift-install` in https://github.com/openshift/installer/pull/5231 (and then updated again in https://github.com/openshift/installer/pull/5279 to use 49.84.202110081407-0).

If this problem is the same as what is reported in BZ#2004449, then no further changes should be needed.

It's not clear to me if the images referenced in `openshift-install` are still experiencing a problem and if additional changes are needed.

@Marius could you please try to reproduce this issue with the latest 4.9 images found in https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.9/4.9.0/?

If the issue persists, then we need to understand what changed between 49.84.202110081407-0 (latest image referenced in openshift-installer) and 49.84.202111292103-0 (image reported fixed in comment #30)
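
To make the ordering of the two builds explicit, a rough sketch of comparing RHCOS build IDs follows. It assumes, and this is only an assumption based on the IDs quoted in this bug, that the third dot-separated field is a YYYYMMDDHHMM build timestamp followed by a `-N` suffix; `build_timestamp` is a hypothetical helper, not part of any RHCOS tooling.

```python
from datetime import datetime


def build_timestamp(build_id):
    """Extract the timestamp from an ID shaped like '49.84.202111292103-0'.

    Assumption: the third dot-separated field is YYYYMMDDHHMM, then '-N'.
    """
    stamp = build_id.split(".")[2].split("-")[0]
    return datetime.strptime(stamp, "%Y%m%d%H%M")


if __name__ == "__main__":
    installer_image = "49.84.202110081407-0"  # referenced in openshift-installer
    tested_image = "49.84.202111292103-0"     # reported fixed in comment #30
    print(build_timestamp(installer_image) < build_timestamp(tested_image))
```

Under that assumption the installer-referenced build predates the build tested in comment #30, which is why any behavioral difference between the two would need explaining.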

Comment 35 Marius Cornea 2022-03-15 18:00:22 UTC
I was not able to reproduce the issue when using the latest 4.9 images found in https://mirror.openshift.com/pub/openshift-v4/dependencies/rhcos/4.9/4.9.0/, so I am closing this BZ.

