Bug 1828885

Summary: IPI deployment fails on Dell r640 nodes using redfish
Product: OpenShift Container Platform Reporter: Michael Zamot <mzamot>
Component: Bare Metal Hardware Provisioning Assignee: Dmitry Tantsur <dtantsur>
Bare Metal Hardware Provisioning sub component: ironic QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: beth.white, cpaquin, dtantsur, imelofer, jkreger, stbenjam
Version: 4.4 Keywords: OtherQA, Triaged
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Certain Dell firmware versions dropped support for configuring persistent boot via Redfish. A workaround has been provided to ensure successful deployment on such servers.
Story Points: ---
Clone Of:
: 1841216 (view as bug list) Environment:
Last Closed: 2020-10-27 15:58:32 UTC Type: Bug
Bug Depends On: 1841216    
Bug Blocks:    

Description Michael Zamot 2020-04-28 14:16:55 UTC
Description of problem:

Deployment using Redfish fails on Dell PowerEdge R640 with iDRAC version 4.10.10.10. Ironic can set up the server to boot from PXE without any error, but after writing RHCOS to disk, it attempts to set the boot device to the RAID or Hard Drive, and that fails with:

2020-04-27 22:20:35.769 1 ERROR ironic.drivers.modules.agent_base_vendor [req-fb2466b7-a334-4188-bcd4-82f0652e0ed8 - - - - -] Asynchronous exception: Node failed to move to active state. Exception: Failed to change the boot device to disk when deploying node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: Redfish exception occurred. Error: Redfish set boot device failed for node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: HTTP PATCH https://10.75.112.61/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Unable to Process the request because the value entered for the parameter Continuous is not supported by the implementation. for node 6b498786-6a39-4972-8c62-8567f0f2275e: ironic.common.exception.InstanceDeployFailure: Failed to change the boot device to disk when deploying node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: Redfish exception occurred. Error: Redfish set boot device failed for node 6b498786-6a39-4972-8c62-8567f0f2275e. Error: HTTP PATCH https://10.75.112.61/redfish/v1/Systems/System.Embedded.1 returned code 400. Base.1.5.GeneralError: Unable to Process the request because the value entered for the parameter Continuous is not supported by the implementation.

The deployment fails as the node is moved to Failed state.
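
For readers unfamiliar with the request behind this error, here is a minimal sketch of a Redfish boot-override PATCH body. The exact body Ironic sent is not shown in the log, so this is an assumption based on the standard Redfish ComputerSystem "Boot" properties and on the "Continuous" value named in the error. The helper name boot_override_body is hypothetical.

```python
import json

# Redfish ComputerSystem resource named in the error message.
SYSTEM_PATH = "/redfish/v1/Systems/System.Embedded.1"


def boot_override_body(target, enabled):
    """Build the JSON body of a Redfish boot-override PATCH (hypothetical helper)."""
    allowed = {"Disabled", "Once", "Continuous"}
    if enabled not in allowed:
        raise ValueError(f"unsupported BootSourceOverrideEnabled: {enabled!r}")
    return {
        "Boot": {
            "BootSourceOverrideTarget": target,
            "BootSourceOverrideEnabled": enabled,
        }
    }


# Persistent override: the kind of request the affected firmware answers with HTTP 400.
persistent = boot_override_body("Hdd", "Continuous")
# One-time override: applies to the next boot only.
one_time = boot_override_body("Hdd", "Once")

print("PATCH", SYSTEM_PATH)
print(json.dumps(persistent, indent=2))
print(json.dumps(one_time, indent=2))
```

The only difference between the two bodies is the BootSourceOverrideEnabled value, which matches the parameter the firmware complains about.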

Version-Release number of selected component (if applicable):
4.4.0-rc10

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OpenShift cluster using Redfish on a Dell R640 with iDRAC firmware 4.10.10.10
2. Wait until introspection is done and the image is written to disk


Actual results:
Deployment fails before rebooting the nodes.

Expected results:
Deployment succeeds. 

Additional info:

Comment 1 Julia Kreger 2020-04-28 14:43:07 UTC
This is unfortunately a known issue caused by changes made to the Dell iDRAC firmware. Dell is aware that they created an incompatibility and were working to correct the issue, although I thought it was fixed in 4.10.10.10. I'm following up with our Dell contacts to clarify.

Comment 2 Julia Kreger 2020-04-28 15:04:43 UTC
Our Dell contacts indicate that they believe the fix is still pending release in firmware. They anticipate following up later today.

Comment 3 Dmitry Tantsur 2020-04-28 15:15:05 UTC
The suggested temporary workaround is to set the force_persistent_boot_device flag to True in a node's driver_info. There is, however, a huge caveat (one that does not manifest itself in an OpenStack context): if you ever want to reboot your node, you need to configure the boot sequence correctly (in BIOS, outside of ironic). You have to make sure that the local disk comes *first*, followed by network boot. It's an unusual configuration. Failure to do so will result in the node going into the introspection ramdisk on the next reboot.

Comment 4 Dmitry Tantsur 2020-04-29 12:52:10 UTC
> force_persistent_boot_device flag to True

sorry, I meant "to Never"
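
A minimal sketch of what the corrected workaround amounts to at the Ironic API level, assuming direct access to the API: node updates are sent as a JSON Patch document to PATCH /v1/nodes/<node>. Authentication and how to reach Ironic in an IPI deployment are not covered here, and the helper name driver_info_patch is hypothetical.

```python
import json


def driver_info_patch(key, value):
    """Return a JSON Patch document that sets one driver_info field (hypothetical helper)."""
    return [{"op": "add", "path": f"/driver_info/{key}", "value": value}]


# Node UUID taken from the log in the description.
node_id = "6b498786-6a39-4972-8c62-8567f0f2275e"
patch = driver_info_patch("force_persistent_boot_device", "Never")

# This only prints the request; it does not contact any Ironic endpoint.
print(f"PATCH /v1/nodes/{node_id}")
print(json.dumps(patch))
```

With the value set to Never, ironic makes no persistent boot device changes, which is why the BIOS boot order caveat from comment 3 applies.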

Comment 5 Julia Kreger 2020-04-29 13:58:24 UTC
So, after some discussions and clarity provided by our Dell contacts as to the fix being available, it seems we're going to have to implement the workaround, and I suspect we should go ahead and put up a giant warning. I should be able to whip up baremetal operator and upstream documentation changes to address this after my next call.

Comment 7 Dmitry Tantsur 2020-05-12 09:11:19 UTC
A nicer workaround that is limited to broken nodes and handles reboots: https://review.opendev.org/725239
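
The linked review is not quoted in this report, so the following is only an illustration of the general idea of a workaround that is "limited to broken nodes" and "handles reboots", not the actual patch. It assumes a fallback from a persistent override to a one-time override that is remembered and re-applied before a reboot. All names here (PersistentBootUnsupported, set_boot_device, before_reboot, the fake BMC callables) are hypothetical.

```python
class PersistentBootUnsupported(Exception):
    """Hypothetical error standing in for the BMC's HTTP 400 response."""


def set_boot_device(set_override, node_state, device):
    """Try a persistent override; fall back to a one-time override on failure.

    set_override(device, enabled) is a hypothetical callable that talks to the BMC.
    node_state is a dict used to remember the device for later reboots.
    """
    try:
        set_override(device, "Continuous")
        node_state.pop("boot_device", None)  # healthy node: nothing to remember
        return "Continuous"
    except PersistentBootUnsupported:
        set_override(device, "Once")
        node_state["boot_device"] = device  # remember it for the next reboot
        return "Once"


def before_reboot(set_override, node_state):
    """Re-apply the remembered one-time override ahead of a reboot."""
    device = node_state.get("boot_device")
    if device is not None:
        set_override(device, "Once")


def broken_bmc(device, enabled):
    """Fake BMC that rejects persistent overrides, like the affected firmware."""
    if enabled == "Continuous":
        raise PersistentBootUnsupported(enabled)


def healthy_bmc(device, enabled):
    """Fake BMC that accepts every override."""


broken_state, healthy_state = {}, {}
print(set_boot_device(broken_bmc, broken_state, "Hdd"), broken_state)
print(set_boot_device(healthy_bmc, healthy_state, "Hdd"), healthy_state)
```

In this sketch only nodes whose BMC rejects the persistent request take the fallback path; nodes with working firmware keep the normal persistent behaviour.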

Comment 18 errata-xmlrpc 2020-10-27 15:58:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196