Bug 1873305

Summary: Failed to power on/inspect node when using Redfish protocol
Product: OpenShift Container Platform Reporter: Sai Sindhur Malleni <smalleni>
Component: Bare Metal Hardware Provisioning Assignee: Bob Fournier <bfournie>
Bare Metal Hardware Provisioning sub component: ironic QA Contact: Sai Sindhur Malleni <smalleni>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: medium CC: beth.white, bfournie, cdearbor, dblack, dtantsur, jkreger, lshilin, mifiedle, pablo.iranzo, rbartal, rlopez, tsedovic
Version: 4.5 Keywords: TestBlocker, Triaged
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
On a Dell FC640, the Redfish PowerState was always reported as 'On' and did not match the actual host power state. This was fixed in the Dell firmware release 4.22.00.53 for the FC640.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:16:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1831748    
Attachments:
Description | Flags
One inspection log | none
output of redfish/v1/Systems/System.Embedded.1 on Dell FC640 | none

Description Sai Sindhur Malleni 2020-08-27 19:51:46 UTC
Description of problem:
In 4.5.7, using the redfish:// protocol during installation fails to inspect nodes and, in fact, fails even to power them on.

Seeing errors like:

 2020-08-27 18:30:34.877 1 ERROR ironic.conductor.manager [req-a9c020ea-1500-4155-bf08-617ccb59ae9e - - - - -] Failed to inspect node c375af92-f2bf-4df7-acd8-3687b239f7dc: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node c375af92-f2bf-4df7-acd8-3687b239f7dc when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b04-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.: ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node c375af92-f2bf-4df7-acd8-3687b239f7dc when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b04-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.

Powering on the node directly over Redfish (outside of the installer) works:

[smalleni@localhost arsenal]$ curl -k https://mgmt-e16-h12-b02-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset -H "Content-Type: application/json" -i -X POST -u quads:XXXXX -d '{"ResetType": "On"}'
HTTP/1.1 204 No Content
Date: Thu, 27 Aug 2020 22:50:36 GMT
Server: Apache
OData-EntityId: /redfish/v1/Systems/System.Embedded.1
X-Frame-Options: DENY
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
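
For anyone scripting the same check, the curl call above can be sketched in Python with only the standard library. This is a minimal illustration, not part of the installer: the function name `build_reset_request` and the host `bmc.example.com` are hypothetical, and authentication (`-u`) and certificate handling (`-k`) are deliberately left out. It only builds the request so that the URL and body can be inspected before anything is sent.

```python
import json
import urllib.request


def build_reset_request(bmc_host, reset_type="On", system_id="System.Embedded.1"):
    """Build (but do not send) a Redfish ComputerSystem.Reset POST request.

    bmc_host is a hypothetical BMC hostname; credentials and TLS options
    are intentionally omitted from this sketch.
    """
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    # Same JSON payload as the curl -d argument.
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_reset_request("bmc.example.com")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending it would additionally need `urllib.request.urlopen` with suitable credentials and an SSL context, which depend on the lab setup.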


Version-Release number of selected component (if applicable):
4.5.7
iDRAC Firmware Version 4.22.00.00

How reproducible:
100% 

Steps to Reproduce:
1. Try an install with redfish:// in install-config
2. Use iDRAC version 4.22.00.00

Actual results:
Install fails; the nodes do not even power on

Expected results:
Install should succeed

Additional info:

Comment 2 Dmitry Tantsur 2020-08-31 16:14:17 UTC
I've looked at the logs. I don't see iDRAC reporting *any* PowerState value until we try changing the power state (and fail). Then it shows "On". I don't understand how to interpret this yet.

Comment 3 Sai Sindhur Malleni 2020-08-31 18:05:07 UTC
This is a Dell FC640 node, not sure if that matters. Downgrading the firmware also results in the same error.
Firmware Version = 4.20.20.20 System BIOS Version = 2.8.1

So at this point I have tried with both 4.20.20.20 and 4.22.00.00.

Comment 4 Tomas Sedovic 2020-09-01 10:08:59 UTC
Is this a regression or are you using HW/FW that hasn't been tested yet?

Comment 5 rlopez 2020-09-01 12:53:32 UTC
Tomas, I'd consider this a regression, as Redfish should work with this system. Per the IPI on bare metal docs, the only requirement is that the system can run RHEL 8, which these can, and they are listed in the RHEL Certified servers list: https://catalog.redhat.com/hardware/servers/search?p=1&c_version=Red%20Hat%20Enterprise%20Linux%208&ch_architecture=x86_64&q=fc640

Comment 7 Dmitry Tantsur 2020-09-03 12:53:40 UTC
Created attachment 1713600 [details]
One inspection log

Attached is an extract containing one inspection request for one node. The most surprising thing is that PowerState is missing from most of the Redfish System representations. I'm not sure why ironic assumes the nodes are powered on, though.

I'll probably need to involve Dell folks to understand what is going on.
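
To make the "PowerState is missing" observation easier to reproduce from captured responses, here is a small sketch that tallies the PowerState values found in a list of Redfish System JSON documents. The helper names and the sample payloads are hypothetical; the only thing taken from Redfish itself is the `PowerState` property name.

```python
def read_power_state(system):
    """Return the PowerState string from a Redfish System dict, or None if absent."""
    state = system.get("PowerState")
    return state if isinstance(state, str) else None


def summarize_power_states(systems):
    """Count how often each PowerState value (or a missing one) appears."""
    counts = {}
    for system in systems:
        key = read_power_state(system) or "<missing>"
        counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up sample payloads, standing in for responses extracted from a log.
samples = [
    {"Id": "System.Embedded.1"},
    {"Id": "System.Embedded.1", "PowerState": None},
    {"Id": "System.Embedded.1", "PowerState": "On"},
]
print(summarize_power_states(samples))
```

A large `<missing>` count next to a handful of `On` entries would match the pattern described in this comment.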

Comment 8 Dmitry Tantsur 2020-09-03 14:19:39 UTC
Could you try iDRAC firmware 4.10.10.10, assuming that version is available for the FC640?

Comment 9 Sai Sindhur Malleni 2020-09-03 16:23:44 UTC
Based on https://github.com/openshift-kni/baremetal-deploy/blob/master/ansible-ipi-install/roles/node-prep/tasks/10_validation.yml#L385 it looks like we need a version greater than or equal to 4.20.20.20 for redfish to be supported by the installer?
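
For reference, a "greater than or equal to 4.20.20.20" check on dotted iDRAC version strings can be sketched as below. This is an illustration only and makes no claim about how the linked Ansible validation implements it; the function names are hypothetical, and the versions are compared field by field as integers rather than as strings.

```python
def parse_idrac_version(version):
    """Split a dotted version such as '4.22.00.00' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(version, minimum="4.20.20.20"):
    """Return True if version is at least minimum, comparing field by field."""
    return parse_idrac_version(version) >= parse_idrac_version(minimum)


for candidate in ("4.10.10.10", "4.20.20.20", "4.22.00.00"):
    print(candidate, meets_minimum(candidate))
```

Comparing integer tuples avoids the pitfalls of plain string comparison when fields have different widths.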

Comment 10 Sai Sindhur Malleni 2020-09-03 20:45:00 UTC
Did you still need 4.10.10.10 tested?

Comment 12 Sai Sindhur Malleni 2020-09-10 00:07:33 UTC
Roger,

Any comments on the firmware version we are being asked to test? I believe the minimum firmware version you mentioned as working with redfish is higher than the one being requested here. I need your input.

Comment 13 rlopez 2020-09-10 02:03:57 UTC
Hey Sai,

I think it's worth bringing it down to iDRAC 4.10.10.10 (even though we recommend 4.20.20.20 or higher) so that Dell can narrow down the issue in the higher versions of firmware. The link to the firmware is here: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=ktc95&oscode=wst14&productcode=poweredge-fc640

Comment 14 Sai Sindhur Malleni 2020-09-10 18:50:01 UTC
OK, so with iDRAC at 4.10.10.10 the node at least powers on. Earlier, even that was not happening. I don't have a successful deploy yet, since PXE seems to have failed and I'm still investigating. In the interest of time, I just wanted to confirm here that nodes boot up with 4.10.10.10 and redfish.

Comment 15 Bob Fournier 2020-09-10 18:59:46 UTC
Thanks Sai. When you have some downtime and things are stable, could you grab the ironic log again? We'd like to compare the PowerState being returned from the iDRAC in this case.

Comment 16 Sai Sindhur Malleni 2020-09-11 15:40:27 UTC
So, I'm back with hopefully a more concrete data point. The one time the nodes did power on, it turns out the boot mode was set to UEFI. Somehow the firmware downgrade seems to have changed the boot mode when going from 4.20.20.20 to 4.10.10.10. After reverting to BIOS, the nodes don't power on. To clarify, all of the data on this BZ was with BIOS, except the one time with 4.10.10.10 in comment #14 when the nodes powered on.

Still seeing:
2020-09-11 15:25:14.095 1 ERROR ironic.conductor.manager [req-7e06811d-f525-4f79-84ac-d5e3e5fcd2d3 - - - - -] Failed to inspect node b1083361-2adb-404a-bba3-701551529451: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node b1083361-2adb-404a-bba3-701551529451 when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b02-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.: ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node b1083361-2adb-404a-bba3-701551529451 when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b02-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.

Comment 19 Sai Sindhur Malleni 2020-09-11 19:32:29 UTC
The version I'm using is 4.5.7. Is this problem expected in 4.5?

Comment 24 Bob Fournier 2020-09-21 20:29:50 UTC
Created attachment 1715588 [details]
output of redfish/v1/Systems/System.Embedded.1 on Dell FC640

This shows that PowerState is reported as On while the server is off.
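
A stale "On" would explain the 409 in the ironic log: a reboot that first powers off any host it believes is on will send a power-off request to a host that is already off. The sketch below simulates that sequence. It is a simplified illustration, not ironic's actual code; `plan_reboot`, `FakeBmc`, and `reboot` are hypothetical names, and the fake BMC only mimics the one behaviour seen in the log (409 on `ForceOff` when the host is really off).

```python
def plan_reboot(reported_power_state):
    """Return the ResetType sequence for a power-off-then-on style reboot."""
    if reported_power_state == "On":
        return ["ForceOff", "On"]
    return ["On"]


class FakeBmc:
    """Hypothetical stand-in for a BMC whose reported state may be stale."""

    def __init__(self, actual_state, reported_state):
        self.actual_state = actual_state
        self.reported_state = reported_state

    def reset(self, reset_type):
        # Mimic the 409 from the log: powering off a host that is already off.
        if reset_type == "ForceOff" and self.actual_state == "Off":
            return 409
        self.actual_state = "Off" if reset_type == "ForceOff" else "On"
        return 204


def reboot(bmc):
    """Send the planned resets; return (reset_type, status) on the first failure."""
    for reset_type in plan_reboot(bmc.reported_state):
        status = bmc.reset(reset_type)
        if status != 204:
            return reset_type, status
    return None


# Host is really off but the BMC claims "On": the first step fails with 409.
print(reboot(FakeBmc(actual_state="Off", reported_state="On")))
# Host is off and reported as off: only "On" is sent and it succeeds.
print(reboot(FakeBmc(actual_state="Off", reported_state="Off")))
```

With a correctly reported "Off", the power-off step is skipped entirely, which is consistent with the standalone curl power-on succeeding.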

Comment 27 Raviv Bar-Tal 2020-10-19 05:44:26 UTC
Hey,
Can you please verify this BZ on your system?
Thanks
Raviv

Comment 28 Bob Fournier 2020-10-26 12:07:44 UTC
Need to update the firmware on these Dells to 4.22.00.53 and retest in the cluster.

Comment 30 Bob Fournier 2021-01-28 12:31:23 UTC
Hi Sai - I think this can be closed as it's working after updating the firmware.

Comment 31 Sai Sindhur Malleni 2021-01-29 19:04:54 UTC
(In reply to Bob Fournier from comment #30)
> Hi Sai - I think this can be closed as its working after updating the
> firmware.

Ack. I did still have deployment issues with redfish even after the update, though. We can open a separate bug for that if needed. Good to close this.

Comment 32 Bob Fournier 2021-01-29 19:31:47 UTC
Thanks Sai. I will close this out as the power issue is resolved. Let's follow up on the next set of problems.

Comment 34 errata-xmlrpc 2021-02-24 15:16:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633