Bug 1828885 - IPI deployment fails on Dell r640 nodes using redfish
Summary: IPI deployment fails on Dell r640 nodes using redfish
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dmitry Tantsur
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On: 1841216
Blocks:
Reported: 2020-04-28 14:16 UTC by Michael Zamot
Modified: 2020-10-27 15:58 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Certain Dell firmware versions dropped support for configuring persistent boot via Redfish. A workaround has been provided to ensure successful deployment on such servers.
Clone Of:
: 1841216 (view as bug list)
Environment:
Last Closed: 2020-10-27 15:58:32 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 725239 0 None MERGED redfish: handle hardware that is unable to set persistent boot 2020-12-17 13:27:02 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:58:56 UTC

Description Michael Zamot 2020-04-28 14:16:55 UTC
Description of problem:

Deployment using Redfish fails on Dell PowerEdge R640 with iDRAC version 4.10.10.10. Ironic can set the server to boot from PXE without any error, but after writing RHCOS to disk, it attempts to set the boot device to the RAID or hard drive and fails with:

2020-04-27 22:20:35.769 1 ERROR ironic.drivers.modules.agent_base_vendor [req-fb2466b7-a334-4188-bcd4-82f0652e0ed8 - - - - -] Asynchronous exception: Node failed to move to active state. Exception: Failed to change the boot device to disk when deploying node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: Redfish exception occurred. Error: Redfish set boot device failed for node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: HTTP PATCH https://10.75.112.61/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Unable to Process the request because the value entered for the parameter Continuous is not supported by the implementation. for node 6b498786-6a39-4972-8c62-8567f0f2275e: ironic.common.exception.InstanceDeployFailure: Failed to change the boot device to disk when deploying node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: Redfish exception occurred. Error: Redfish set boot device failed for node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: HTTP PATCH https://10.75.112.61/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Unable to Process the request because the value entered for the parameter Continuous is not supported by the implementation.

The deployment fails as the node is moved to Failed state.
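
For readers unfamiliar with the request behind this error, here is a minimal sketch of the kind of body a Redfish client sends when setting a boot override. It assumes the standard Redfish `Boot` properties `BootSourceOverrideTarget` and `BootSourceOverrideEnabled`; the helper name `boot_override_payload` is hypothetical, and the exact body Ironic sent is not shown in the log above.

```python
import json


def boot_override_payload(target: str, persistent: bool) -> dict:
    """Build a Redfish boot override body (hypothetical helper).

    Assumption: a persistent override maps to "Continuous" and a
    one-time override maps to "Once", as in the Redfish schema.
    """
    enabled = "Continuous" if persistent else "Once"
    return {
        "Boot": {
            "BootSourceOverrideTarget": target,
            "BootSourceOverrideEnabled": enabled,
        }
    }


# The persistent variant carries the "Continuous" value that the
# firmware rejects with HTTP 400 in the log above.
print(json.dumps(boot_override_payload("Hdd", persistent=True), indent=2))
```

The one-time variant (`persistent=False`) would carry `Once` instead, which matches the observation that setting PXE boot works while the persistent switch to disk does not.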

Version-Release number of selected component (if applicable):
4.4.0-rc10

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OpenShift cluster using Redfish on a Dell R640 with iDRAC firmware 4.10.10.10
2. Wait until introspection is done and the image is written to disk


Actual results:
Deployment fails before rebooting the nodes.

Expected results:
Deployment succeeds. 

Additional info:

Comment 1 Julia Kreger 2020-04-28 14:43:07 UTC
Unfortunately, this is a known issue caused by changes that were made to the Dell iDRAC firmware. Dell is aware that they created an incompatibility and was working to correct the issue, although I thought it was fixed in 4.10.10.10. I'm following up with our Dell contacts to clarify.

Comment 2 Julia Kreger 2020-04-28 15:04:43 UTC
Our Dell contacts indicate that they believe the fix is still pending release in firmware. They anticipate following up later today.

Comment 3 Dmitry Tantsur 2020-04-28 15:15:05 UTC
The suggested temporary workaround is to set the force_persistent_boot_device flag to True in a node's driver_info. There is, however, a huge caveat (one that does not manifest itself in an OpenStack context): if you ever want to reboot your node, you need to configure the boot sequence correctly (in BIOS, outside of ironic). You have to make sure that the local disk comes *first*, followed by network boot. It's an unusual configuration. Failure to do so will result in the node booting into the introspection ramdisk on the next reboot.

Comment 4 Dmitry Tantsur 2020-04-29 12:52:10 UTC
> force_persistent_boot_device flag to True

sorry, I meant "to Never"
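
As an illustration of the corrected workaround, the sketch below builds the JSON Patch document that would set the flag on a node through the Ironic REST API. The helper name `force_persistent_patch` is hypothetical, and the accepted values (`Default`, `Always`, `Never`) are stated here as an assumption based on the Ironic documentation rather than on this bug report.

```python
import json

# Assumption: these are the values Ironic accepts for this option.
VALID_VALUES = ("Default", "Always", "Never")


def force_persistent_patch(value: str = "Never") -> list:
    """Return a JSON Patch that sets force_persistent_boot_device.

    Hypothetical helper; the document would be sent as the body of a
    PATCH request against the node resource.
    """
    if value not in VALID_VALUES:
        raise ValueError(f"unsupported value: {value!r}")
    return [
        {
            "op": "add",
            "path": "/driver_info/force_persistent_boot_device",
            "value": value,
        }
    ]


print(json.dumps(force_persistent_patch("Never")))
```

Where the `openstack baremetal` CLI is available, the same change should be expressible as `openstack baremetal node set <node> --driver-info force_persistent_boot_device=Never`. The caveat about BIOS boot order from comment 3 still applies.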

Comment 5 Julia Kreger 2020-04-29 13:58:24 UTC
So, after some discussions and clarity provided by our Dell contacts as to the availability of the fix, it seems we're going to have to implement the workaround and, I suspect, go ahead and put up a giant warning. I should be able to whip up a baremetal operator change and upstream documentation changes to address this after my next call.

Comment 7 Dmitry Tantsur 2020-05-12 09:11:19 UTC
A nicer workaround that is limited to broken nodes and handles reboots: https://review.opendev.org/725239
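
The general idea of a per-node fallback can be sketched as follows. This is not the code from the linked review, only an illustration under stated assumptions: the names `BootDeviceError`, `boot_payload`, `set_boot_device` and `fake_idrac` are all hypothetical, and the fake BMC simply rejects the `Continuous` value the way the affected firmware does.

```python
class BootDeviceError(Exception):
    """Stand-in for the HTTP 400 returned by the BMC (hypothetical)."""


def boot_payload(target, enabled):
    """Build a Redfish-style boot override body."""
    return {
        "Boot": {
            "BootSourceOverrideTarget": target,
            "BootSourceOverrideEnabled": enabled,
        }
    }


def set_boot_device(send_patch, target, persistent=True):
    """Try a persistent override first, then fall back to one-time.

    send_patch is any callable that submits the body and raises
    BootDeviceError when the BMC refuses it. Returns the mode used.
    """
    if persistent:
        try:
            send_patch(boot_payload(target, "Continuous"))
            return "Continuous"
        except BootDeviceError:
            # Only nodes that refuse the persistent setting end up here.
            pass
    send_patch(boot_payload(target, "Once"))
    return "Once"


def fake_idrac(payload):
    """Simulated BMC that rejects persistent boot, like the broken firmware."""
    if payload["Boot"]["BootSourceOverrideEnabled"] == "Continuous":
        raise BootDeviceError("parameter Continuous is not supported")


mode = set_boot_device(fake_idrac, "Hdd")
print(mode)
```

Because a one-time override is consumed by the next boot, a caller using this approach would have to re-apply it before each later reboot; how the merged change deals with that is not shown here.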

Comment 18 errata-xmlrpc 2020-10-27 15:58:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

