Description of problem:
Deployment using Redfish fails on a Dell PowerEdge R640 with iDRAC version 4.10.10.10. Ironic can set up the server to boot from PXE without any error, but after writing RHCOS to disk it attempts to set the boot device to the RAID or hard drive, and that fails with:

2020-04-27 22:20:35.769 1 ERROR ironic.drivers.modules.agent_base_vendor [req-fb2466b7-a334-4188-bcd4-82f0652e0ed8 - - - - -] Asynchronous exception: Node failed to move to active state. Exception: Failed to change the boot device to disk when deploying node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: Redfish exception occurred. Error: Redfish set boot device failed for node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: HTTP PATCH https://10.75.112.61/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Unable to Process the request because the value entered for the parameter Continuous is not supported by the implementation. for node 6b498786-6a39-4972-8c62-8567f0f2275e: ironic.common.exception.InstanceDeployFailure: Failed to change the boot device to disk when deploying node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: Redfish exception occurred. Error: Redfish set boot device failed for node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: HTTP PATCH https://10.75.112.61/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Unable to Process the request because the value entered for the parameter Continuous is not supported by the implementation.

The deployment fails as the node is moved to the Failed state.

Version-Release number of selected component (if applicable):
4.4.0-rc10

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OpenShift cluster using Redfish on a Dell R640 with firmware 4.10.10.10
2. Wait until introspection is done and the image is written to disk

Actual results:
Deployment fails before rebooting the nodes.

Expected results:
Deployment succeeds.

Additional info:
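For context, here is a minimal sketch of what the rejected PATCH body probably looks like. It uses the standard Redfish ComputerSystem properties BootSourceOverrideTarget and BootSourceOverrideEnabled. The request body is not shown in the log, so the assumption here is that the "Continuous" in the error message is the value sent for BootSourceOverrideEnabled when a persistent boot device is requested. The helper name build_boot_override is hypothetical.

```python
import json


def build_boot_override(target: str, persistent: bool) -> dict:
    """Build a Redfish boot-override body for a ComputerSystem resource.

    Assumption: a persistent change is expressed as
    BootSourceOverrideEnabled = "Continuous", a one-time change as "Once".
    """
    enabled = "Continuous" if persistent else "Once"
    return {
        "Boot": {
            "BootSourceOverrideTarget": target,
            "BootSourceOverrideEnabled": enabled,
        }
    }


if __name__ == "__main__":
    # Presumed shape of the request that returned HTTP 400 in the log.
    print(json.dumps(build_boot_override("Hdd", persistent=True), indent=2))
    # A one-time override, which avoids the "Continuous" value entirely.
    print(json.dumps(build_boot_override("Hdd", persistent=False), indent=2))
```

If that assumption holds, the firmware rejects only the persistent variant, which would explain why the initial PXE boot setup succeeds while the final switch to disk fails.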
This is unfortunately a known issue caused by some changes made to the Dell iDRAC firmware. Dell is aware that they created an incompatibility and has been working to correct it, although I thought it was fixed in 4.10.10.10. I'm following up with our Dell contacts to clarify.
Our Dell contacts indicate that they believe the fix is still pending release in firmware. They anticipate following up later today.
The suggested temporary workaround is to set the force_persistent_boot_device flag to True in a node's driver_info. There is, however, a huge caveat (one that does not manifest itself in an OpenStack context): if you ever want to reboot your node, you need to configure the boot sequence correctly (in the BIOS, outside of ironic). You have to make sure that the local disk comes *first*, followed by network boot. It's an unusual configuration. Failing to do so will result in the node booting into the introspection ramdisk on the next reboot.
> force_persistent_boot_device flag to True

Sorry, I meant "to Never".
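A minimal sketch of the workaround as a JSON Patch document for the ironic node API (PATCH /v1/nodes/{node_ident}), using the corrected value "Never". The helper name persistent_boot_patch is hypothetical, and how you reach the ironic API in an OpenShift deployment is left out; this only shows the shape of the change to driver_info.

```python
import json


def persistent_boot_patch(value: str = "Never") -> list:
    """Return a JSON Patch that sets force_persistent_boot_device in driver_info."""
    return [
        {
            "op": "add",
            "path": "/driver_info/force_persistent_boot_device",
            "value": value,
        }
    ]


if __name__ == "__main__":
    # Node UUID taken from the log in the bug description.
    node = "6b498786-6a39-4972-8c62-8567f0f2275e"
    print(f"PATCH /v1/nodes/{node}")
    print(json.dumps(persistent_boot_patch(), indent=2))
    # With the OpenStack client, the equivalent should be:
    #   openstack baremetal node set <node> \
    #       --driver-info force_persistent_boot_device=Never
```

With this set, ironic should stop requesting persistent boot device changes for the node, which is why the boot order then has to be arranged in the BIOS as described above.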
So, after some discussions and clarification from our Dell contacts about the availability of the fix, it seems we're going to have to implement the workaround and, I suspect, put up a giant warning. I should be able to whip up baremetal operator and upstream documentation changes to address this after my next call.
A nicer workaround that is limited to broken nodes and handles reboots: https://review.opendev.org/725239
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196