Bug 2093511 - Install does not begin if secure boot was enabled for the first time
Summary: Install does not begin if secure boot was enabled for the first time
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.13.0
Assignee: Jacob Anders
QA Contact: yliu1
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-04 00:33 UTC by yliu1
Modified: 2023-03-09 01:20 UTC (History)

Fixed In Version:
Doc Type: Release Note
Doc Text:
When installing OCP with the bootMode set to UEFISecureBoot on a node where SecureBoot is currently disabled, the install will fail to start. A subsequent attempt to install with SecureBoot enabled will proceed normally.
Clone Of:
Environment:
Last Closed: 2023-03-09 01:20:49 UTC
Target Upstream Version:
Embargoed:


Description yliu1 2022-06-04 00:33:47 UTC
Description of problem:
If secure boot is currently disabled and the user attempts to enable it via ZTP, the install does not begin the first time ZTP is triggered.

When secure boot is enabled via ZTP, the boot options are configured before the virtual CD is attached, so the first boot goes to the existing HD with secure boot on. The install then gets stuck because boot from CD is never triggered.


Version-Release number of selected component (if applicable):
4.10

How reproducible:
Always

Steps to Reproduce:
1. Secure boot is currently disabled in BIOS
2. Attempt to deploy a cluster with secure boot enabled via ZTP

Actual results:
- spoke cluster got booted with secure boot option toggled, into existing HD
- spoke cluster did not boot into virtual CD, thus install never started.
- agentclusterinstall gets stuck here:
    State:       insufficient
    State Info:  Cluster is not ready for install


Expected results:
- installation started and completed successfully

Additional info:

Secure boot config used in ZTP siteconfig:
http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/ff814164cdcd355ed980f1edf269dbc2afbe09aa/siteconfig/master-2.yaml#L40

Comment 2 jun 2022-06-20 12:31:14 UTC
Jim, can you triage this one while Ian is away?

Comment 3 Ian Miller 2022-06-23 22:04:04 UTC
Does this issue occur when the disk is blank, or does the installation happen correctly in that case (skipping over the blank disk and booting from the ISO)?

Comment 5 yliu1 2022-06-29 18:58:43 UTC
@imiller the disk was not blank. It booted into the existing HD.

Comment 6 Ian Miller 2022-07-20 13:32:52 UTC
@yliu1 if the disk is erased prior to an installation where secureboot is enabled, does it succeed? What I am hoping to understand is whether this issue is limited to a re-deployment of a cluster (where secureboot is turned on) versus affecting deployment of new (blank) servers.

Comment 7 yliu1 2022-07-28 23:21:53 UTC
@imiller install started when disk was wiped. It attempted HD boot - failed, then fell back to CD boot.

Comment 8 Yuval Kashtan 2022-08-10 16:05:47 UTC
But then, it seems like the real issue is that ZTP does not set "boot from virtual CD once" to override any existing boot order.
In which case, I suspect a similar issue will exist in any installation mode.

It should never boot with anything that exists on the disk, in any case.
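
For reference, a one-time boot override is expressed in Redfish as a PATCH to the ComputerSystem resource. Below is a minimal sketch of the request body in Python. The helper name is hypothetical, and whether a given iDRAC firmware accepts `BootSourceOverrideMode` in the same request is an assumption, so it is off by default. The sketch only builds the body and does not claim to show what ZTP or Ironic actually sends.

```python
import json


def one_time_cd_boot_payload(include_uefi_mode: bool = False) -> dict:
    """Build a Redfish body asking for a single boot from (virtual) CD.

    Hypothetical helper for illustration only. The property names are
    standard Redfish ComputerSystem "Boot" properties.
    """
    boot = {
        "BootSourceOverrideTarget": "Cd",     # boot from CD / virtual media
        "BootSourceOverrideEnabled": "Once",  # applies to the next boot only
    }
    if include_uefi_mode:
        # Assumption: the BMC accepts the mode alongside the target.
        boot["BootSourceOverrideMode"] = "UEFI"
    return {"Boot": boot}


if __name__ == "__main__":
    print(json.dumps(one_time_cd_boot_payload(), indent=2))
```

The body would be sent with PATCH to the system resource (for example /redfish/v1/Systems/System.Embedded.1 on iDRAC). Because "Once" covers only the next boot, an extra reboot before the CD boot would use up the override.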

Comment 9 Ian Miller 2022-08-10 17:40:04 UTC
I'm not sure what specific option is selected but the boot menu flashes a message "boot from virtual CD" and re-installs of clusters where there is an existing image already on disks is something we do regularly without problems.

Comment 10 Yuval Kashtan 2022-08-10 18:35:07 UTC
@yliu1
are you sure machine (BIOS) is set to UEFI?

Comment 11 Yuval Kashtan 2022-08-10 21:02:51 UTC
also can you add machine info?
is that a Dell machine? something else?
what's the iDRAC version?

Comment 14 Ian Miller 2022-10-03 20:46:31 UTC
Reproduced again. System had 4.11.6 installed without secureboot. Attempted ZTP (assisted service in ACM 2.5.2) installation of 4.10.30 with secureboot enabled. The BMH CR contents:

spec:
  automatedCleaningMode: disabled
  bmc:
    address: idrac-virtualmedia+https://X.X.X.X/redfish/v1/Systems/System.Embedded.1
    credentialsName: cnfocto1-master0
    disableCertificateVerification: true
  bootMACAddress: E4:43:4B:F6:12:E0
  bootMode: UEFISecureBoot
  online: true
  rootDeviceHints:
    hctl: "1:2:0:0"

The system booted back into the 4.11.6 installation already on disk.

The system is a Dell PowerEdge R740xd
idrac version: 4.40.40.00

Comment 16 yliu1 2022-11-01 14:55:36 UTC
@ykashtan yes it's set to UEFI.

It is a Dell machine with the following firmware:
BIOS Version: 2.14.2
iDRAC Firmware Version: 5.10.10.00


And I believe @dpenney also encountered this issue with later idrac firmware, e.g., 5.10.30.00.

Comment 17 Don Penney 2022-11-01 16:36:05 UTC
There seems to be a separate issue with the latest iDRAC firmware, 5.10.50.00, where the response to BIOS config request to enable secure boot is seen as an error, leading to the bmh being put into a "deprovisioning" state and getting powered off without completing the BIOS config change. Downgrading to 5.10.30.00 results in the same behaviour described in above comments. It seems as though the BIOS config request to change secure boot value is seen as completed immediately, but it's only "pending" in iDRAC. The "first boot" is then used to actually perform the config change job and then the node is rebooted. After this reboot, then, the node goes back to the default boot order and boots from disk.

Similarly, if your node is already configured as secure boot and the bmh enters a "deprovisioning" state, another BIOS config change is triggered to set secure boot to false and then the node is powered off. However, this leaves the config change in a pending state, as well, and the bmh gets stuck in this "deprovisioning" state (as the secure boot state is "true", with pending config change, so it seems to keep trying to set it to "false").
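
The stuck state described above can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not the baremetal-operator or Ironic logic: the function and its parameters are hypothetical, and the only behaviour modelled is that a requested change stays pending until a reboot applies it.

```python
from typing import Optional


def requests_until_converged(
    current: bool,
    desired: bool,
    reboot_after_request: bool,
    max_attempts: int = 5,
) -> Optional[int]:
    """Count config requests needed before the reported value matches.

    Returns None if the value never converges within max_attempts.
    Toy model: a request only takes effect when a reboot follows it.
    """
    for attempt in range(max_attempts):
        if current == desired:
            return attempt          # number of requests issued so far
        pending = desired           # the request only schedules a job
        if reboot_after_request:
            current = pending       # the job runs during the reboot
        # Without a reboot the reported value stays unchanged, so the
        # next pass sees the same mismatch and requests the change again.
    return None


if __name__ == "__main__":
    # Node powered off after the request: never converges.
    print(requests_until_converged(True, False, reboot_after_request=False))
    # Node rebooted after the request: converges after one request.
    print(requests_until_converged(True, False, reboot_after_request=True))
```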

As an odd side note, if I manually wipe the disk prior to triggering install with a change to enable secure boot, then the first boot after the config change completes does fall through to booting from the ISO, as expected. However, when the first "boot from disk" is attempted as part of the install, the node boots back to ISO again, as though the pre-wiped disk didn't get set up as a bootable disk prior to the reboot for some reason. At this point, the system blocks, requiring a manual reboot ("pending user action") to get it to boot from disk.

Comment 19 Dmitry Tantsur 2022-11-04 10:17:42 UTC
We're consulting with Dell engineers wrt path forward here. Meanwhile,

> There seems to be a separate issue with the latest iDRAC firmware, 5.10.50.00, where the response to BIOS config request to enable secure boot is seen as an error, leading to the bmh being put into a "deprovisioning" state and getting powered off without completing the BIOS config change

Could you file this bug separately, providing the logs? This is definitely worth fixing.

Comment 20 Dmitry Tantsur 2022-11-08 11:24:07 UTC
Got some information from a Dell engineer:

11:36 <dtantsur> ajya: could you check if SecureBootEnable is changed instantly or only after a reboot? If the latter, we can probably try rebooting and checking the updated value.
11:41 <ajya> dtantsur: only after reboot, PATCHing only creates a job in iDRAC that is in Scheduled state. Only during reboot it is started. The workflow is the same as for BIOS attribute update because it is a BIOS attribute change (probably, could do the same with BIOS clean step).
11:57 <dtantsur> ajya: okay, so if I update the redfish code to check if the value has changed immediately, reboot if not, and check again, it will work?
12:06 <ajya> dtantsur: yes, GET /SecureBoot returns old value until job is finished
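
The check / reboot / re-check flow from the conversation above could look roughly like the following Python sketch. It is not the actual Ironic or sushy change. The `SecureBootBMC` protocol, `ensure_secure_boot`, and `FakeIdrac` are hypothetical names introduced for illustration, and the fake only mimics the behaviour described in the chat (the value reported by GET stays old until a reboot runs the scheduled job).

```python
import time
from typing import Callable, Optional, Protocol


class SecureBootBMC(Protocol):
    """Hypothetical BMC client interface; not a real library API."""

    def get_secure_boot(self) -> bool: ...
    def request_secure_boot(self, enabled: bool) -> None: ...
    def reboot(self) -> None: ...


def ensure_secure_boot(
    bmc: SecureBootBMC,
    enabled: bool,
    retries: int = 30,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the BMC reports the desired value, else False."""
    if bmc.get_secure_boot() == enabled:
        return True                   # nothing to change
    bmc.request_secure_boot(enabled)  # may only schedule a job
    if bmc.get_secure_boot() == enabled:
        return True                   # applied instantly, no reboot needed
    bmc.reboot()                      # the scheduled job runs during reboot
    for _ in range(retries):
        if bmc.get_secure_boot() == enabled:
            return True
        sleep(delay)
    return False


class FakeIdrac:
    """In-memory stand-in: changes stay pending until the next reboot."""

    def __init__(self) -> None:
        self.current = False
        self.pending: Optional[bool] = None
        self.reboots = 0

    def get_secure_boot(self) -> bool:
        return self.current

    def request_secure_boot(self, enabled: bool) -> None:
        self.pending = enabled

    def reboot(self) -> None:
        self.reboots += 1
        if self.pending is not None:
            self.current, self.pending = self.pending, None
```

With this ordering, the extra reboot happens before any one-time boot device is selected, so the virtual CD boot would not be consumed by the reboot that applies the BIOS change.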

Comment 21 Dmitry Tantsur 2022-11-08 11:29:44 UTC
Ian, if I prototype a patch, could we collaborate to test it on your environment? I have some Dell machines, but it may take me longer to reproduce the problem.

Comment 23 Ian Miller 2022-11-14 16:21:56 UTC
Dmitry, sure, I'm happy to provide access to my system for testing

Comment 37 Shiftzilla 2023-03-09 01:20:49 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9303

