Bug 1873305 - Failed to power on /inspect node when using Redfish protocol
Summary: Failed to power on /inspect node when using Redfish protocol
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Bob Fournier
QA Contact: Sai Sindhur Malleni
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-08-27 19:51 UTC by Sai Sindhur Malleni
Modified: 2021-02-24 15:17 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
The issue was that on a Dell FC640, the reported Redfish PowerState was always 'On' and did not match the actual host power state. This was fixed in Dell firmware release 4.22.0.53 for the FC640.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:16:47 UTC
Target Upstream Version:
Embargoed:


Attachments
One inspection log (52.11 KB, text/plain)
2020-09-03 12:53 UTC, Dmitry Tantsur
output of redfish/v1/Systems/System.Embedded.1 on Dell FC640 (8.82 KB, text/plain)
2020-09-21 20:29 UTC, Bob Fournier


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:17:27 UTC

Description Sai Sindhur Malleni 2020-08-27 19:51:46 UTC
Description of problem:
In 4.5.7, an installation that uses the redfish:// protocol fails to inspect nodes and, in fact, fails even to power them on.

Seeing errors like

 2020-08-27 18:30:34.877 1 ERROR ironic.conductor.manager [req-a9c020ea-1500-4155-bf08-617ccb59ae9e - - - - -] Failed to inspect node c375af92-f2bf-4df7-acd8-3687b239f7dc: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node c375af92-f2bf-4df7-acd8-3687b239f7dc when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b04-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.: ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node c375af92-f2bf-4df7-acd8-3687b239f7dc when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b04-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.

Powering on the node directly through Redfish (outside of the installer) works:

[smalleni@localhost arsenal]$ curl -k https://mgmt-e16-h12-b02-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset -H "Content-Type: application/json" -i -X POST -u quads:XXXXX -d '{"ResetType": "On"}'
HTTP/1.1 204 No Content
Date: Thu, 27 Aug 2020 22:50:36 GMT
Server: Apache
OData-EntityId: /redfish/v1/Systems/System.Embedded.1
X-Frame-Options: DENY
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
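
For reference, the same reset call can be sketched in Python with only the standard library. This is a minimal illustration under stated assumptions, not installer or Ironic code: the helper name `build_reset_request` and the host `bmc.example.com` are hypothetical, authentication is left out, and the request is only constructed, never sent.

```python
import json
import urllib.request


def build_reset_request(base_url, system_id, reset_type):
    """Build (but do not send) a Redfish ComputerSystem.Reset POST request.

    Hypothetical helper for illustration; it mirrors the curl command above.
    """
    url = (
        f"{base_url.rstrip('/')}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    # The action body carries the requested ResetType, e.g. "On" or "ForceOff".
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host, not a real BMC address.
req = build_reset_request("https://bmc.example.com", "System.Embedded.1", "On")
print(req.get_method(), req.full_url)
```

Sending it would additionally need credentials and a decision about certificate verification (the curl call above uses -k), both of which are omitted here.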


Version-Release number of selected component (if applicable):
4.5.7
iDRAC Firmware Version 4.22.00.00

How reproducible:
100% 

Steps to Reproduce:
1. Try an install with redfish:// in install-config
2. Use iDRAC version 4.22.00.00

Actual results:
Install fails with nodes not even powering on

Expected results:
Install should succeed

Additional info:

Comment 2 Dmitry Tantsur 2020-08-31 16:14:17 UTC
I've looked at the logs and I don't see iDRAC reporting *any* PowerState value until we try changing the power state (and fail). Then it shows "On". I don't yet understand how to interpret this.

Comment 3 Sai Sindhur Malleni 2020-08-31 18:05:07 UTC
This is a Dell FC640 node, not sure if that matters. Downgrading the firmware also results in the same error.
Firmware Version = 4.20.20.20 System BIOS Version = 2.8.1

So at this point I tried with 4.20.20.20 and 4.22.00.00

Comment 4 Tomas Sedovic 2020-09-01 10:08:59 UTC
Is this a regression or are you using HW/FW that hasn't been tested yet?

Comment 5 rlopez 2020-09-01 12:53:32 UTC
Tomas, I'd consider this a regression, as Redfish should work with this system. Per the IPI on BM docs, the only requirement is that the system can run RHEL 8, which these can; they are listed in the RHEL Certified servers list: https://catalog.redhat.com/hardware/servers/search?p=1&c_version=Red%20Hat%20Enterprise%20Linux%208&ch_architecture=x86_64&q=fc640

Comment 7 Dmitry Tantsur 2020-09-03 12:53:40 UTC
Created attachment 1713600 [details]
One inspection log

Attached is an extract containing one inspection request for one node. The most surprising thing is that PowerState is missing from most of Redfish System representations. I'm not sure why ironic assumes they're powered on though.

I'll probably need to involve Dell folks to understand what is going on.
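
To make the "PowerState is missing" case concrete, here is a minimal sketch of reading that property from a Redfish System payload while tolerating its absence. The helper name `get_power_state` and both sample payloads are hypothetical; this is not how Ironic or sushy read the value.

```python
from typing import Optional


def get_power_state(system: dict) -> Optional[str]:
    """Return the PowerState of a Redfish System resource, or None if absent."""
    state = system.get("PowerState")
    # Treat an explicit JSON null the same as a missing property.
    return state if isinstance(state, str) else None


# Hypothetical payloads, trimmed to the fields of interest.
with_state = {"Id": "System.Embedded.1", "PowerState": "On"}
without_state = {"Id": "System.Embedded.1"}

print(get_power_state(with_state))
print(get_power_state(without_state))
```

Returning None rather than a default keeps "unknown" distinct from "On" or "Off", so a caller cannot silently assume a power state that the BMC never reported.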

Comment 8 Dmitry Tantsur 2020-09-03 14:19:39 UTC
Could you try iDRAC firmware 4.10.10.10, assuming that version is available for the FC640?

Comment 9 Sai Sindhur Malleni 2020-09-03 16:23:44 UTC
Based on https://github.com/openshift-kni/baremetal-deploy/blob/master/ansible-ipi-install/roles/node-prep/tasks/10_validation.yml#L385 it looks like we need a version greater than or equal to 4.20.20.20 for Redfish to be supported by the installer?
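
A "greater than or equal to 4.20.20.20" check can be sketched as a numeric, component-by-component comparison. This is only an illustration and not the logic of the linked playbook; the helper names `parse_fw_version` and `meets_minimum` are hypothetical, and the default minimum is the value quoted in this comment.

```python
def parse_fw_version(version: str) -> tuple:
    """Split a dotted firmware version such as '4.22.00.00' into integers."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(version: str, minimum: str = "4.20.20.20") -> bool:
    """Compare versions numerically, component by component."""
    return parse_fw_version(version) >= parse_fw_version(minimum)


for fw in ("4.22.00.00", "4.20.20.20", "4.10.10.10"):
    print(fw, meets_minimum(fw))
```

Comparing integer tuples avoids the pitfall of plain string comparison, where a component such as "9" would sort after "20".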

Comment 10 Sai Sindhur Malleni 2020-09-03 20:45:00 UTC
Did you still need 4.10.10.10 tested?

Comment 12 Sai Sindhur Malleni 2020-09-10 00:07:33 UTC
Roger,

Any comments on the firmware version that has been requested for testing? I believe the minimum firmware version you mentioned as working with Redfish is higher than the one being requested here. I need some input from you.

Comment 13 rlopez 2020-09-10 02:03:57 UTC
Hey Sai,

I think it's worth bringing it down to iDRAC 4.10.10.10 (even though we recommend 4.20.20.20 or higher) so that Dell can narrow down the issue in the higher firmware versions. The link to the firmware is here: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=ktc95&oscode=wst14&productcode=poweredge-fc640

Comment 14 Sai Sindhur Malleni 2020-09-10 18:50:01 UTC
OK, so with iDRAC at 4.10.10.10 the node at least powers on. Earlier, even that was not happening. I don't have a successful deploy yet, since PXE seems to have failed and I'm still investigating. In the interest of time, I just wanted to confirm here that nodes boot up with 4.10.10.10 and Redfish.

Comment 15 Bob Fournier 2020-09-10 18:59:46 UTC
Thanks Sai. When you have some downtime and things are stable, could you grab the ironic log again? We'd like to compare the PowerState being returned from the iDRAC in this case.

Comment 16 Sai Sindhur Malleni 2020-09-11 15:40:27 UTC
So, I'm back with hopefully a more concrete datapoint. The one time the nodes did power on, it turns out the boot mode was set to UEFI. Somehow the firmware downgrade seems to have changed the boot mode when going from 4.20.20.20 to 4.10.10.10. After reverting to BIOS, the nodes don't power on. To clarify, all of the data on this BZ was collected with BIOS, except the one time with 4.10.10.10 in comment #14 when the nodes powered on.

Still seeing
2020-09-11 15:25:14.095 1 ERROR ironic.conductor.manager [req-7e06811d-f525-4f79-84ac-d5e3e5fcd2d3 - - - - -] Failed to inspect node b1083361-2adb-404a-bba3-701551529451: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node b1083361-2adb-404a-bba3-701551529451 when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b02-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.: ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node b1083361-2adb-404a-bba3-701551529451 when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b02-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.

Comment 19 Sai Sindhur Malleni 2020-09-11 19:32:29 UTC
The version I'm using is 4.5.7. So it's not clear if this problem is expected in 4.5?

Comment 24 Bob Fournier 2020-09-21 20:29:50 UTC
Created attachment 1715588 [details]
output of redfish/v1/Systems/System.Embedded.1 on Dell FC640

This shows that the PowerState is On when the server is off.
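
How a stale PowerState can lead to the 409 seen in the logs can be sketched with a toy simulation. Everything here is an assumption for illustration: `FakeBMC`, `ConflictError` and `reboot` are hypothetical stand-ins, and the reboot logic is a simplified guess at what a client might do, not Ironic's actual implementation.

```python
class ConflictError(Exception):
    """Stands in for an HTTP 409 response from the BMC."""


class FakeBMC:
    """Toy model of a BMC whose reported PowerState can be stale."""

    def __init__(self, actual_on, reported_state):
        self.actual_on = actual_on
        self.reported_state = reported_state  # what a GET of the System returns

    def reset(self, reset_type):
        if reset_type == "ForceOff":
            if not self.actual_on:
                raise ConflictError("Server is already powered OFF.")
            self.actual_on = False
        elif reset_type == "On":
            self.actual_on = True
        else:
            raise ValueError(f"unsupported ResetType: {reset_type}")


def reboot(bmc):
    """Power-cycle based on the *reported* state (assumed client logic)."""
    if bmc.reported_state == "On":
        bmc.reset("ForceOff")  # rejected if the host is really off
    bmc.reset("On")


# Host is really off, but the BMC claims it is on.
stale = FakeBMC(actual_on=False, reported_state="On")
try:
    reboot(stale)
    outcome = "rebooted"
except ConflictError as exc:
    outcome = f"409: {exc}"
print(outcome)
```

In this model the power-off step is rejected and the power-on step is never reached, which matches the symptom of nodes never powering on during inspection.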

Comment 27 Raviv Bar-Tal 2020-10-19 05:44:26 UTC
Hey,
Can you please verify this BZ on your system?
Thanks
Raviv

Comment 28 Bob Fournier 2020-10-26 12:07:44 UTC
Need to update F/W on these Dells to 4.22.00.53 and retest in cluster.

Comment 30 Bob Fournier 2021-01-28 12:31:23 UTC
Hi Sai - I think this can be closed as it's working after updating the firmware.

Comment 31 Sai Sindhur Malleni 2021-01-29 19:04:54 UTC
(In reply to Bob Fournier from comment #30)
> Hi Sai - I think this can be closed as it's working after updating the
> firmware.

Ack. I did still have deployment issues with Redfish even after the update, though. We can open a separate bug for that if needed. Good to close this.

Comment 32 Bob Fournier 2021-01-29 19:31:47 UTC
Thanks Sai. I will close this out as the power issue is resolved. Let's follow up with the next set of problems.

Comment 34 errata-xmlrpc 2021-02-24 15:16:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

