Description of problem:
In 4.5.7, using the redfish:// protocol during installation fails to inspect nodes and, in fact, fails even to power them on. Seeing errors like:

2020-08-27 18:30:34.877 1 ERROR ironic.conductor.manager [req-a9c020ea-1500-4155-bf08-617ccb59ae9e - - - - -] Failed to inspect node c375af92-f2bf-4df7-acd8-3687b239f7dc: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node c375af92-f2bf-4df7-acd8-3687b239f7dc when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b04-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.: ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node c375af92-f2bf-4df7-acd8-3687b239f7dc when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b04-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.

Powering on the node using Redfish works outside of the installer:

[smalleni@localhost arsenal]$ curl -k https://mgmt-e16-h12-b02-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset -H "Content-Type: application/json" -i -X POST -u quads:XXXXX -d '{"ResetType": "On"}'
HTTP/1.1 204 No Content
Date: Thu, 27 Aug 2020 22:50:36 GMT
Server: Apache
OData-EntityId: /redfish/v1/Systems/System.Embedded.1
X-Frame-Options: DENY
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload

Version-Release number of selected component (if applicable):
4.5.7
iDRAC Firmware Version 4.22.00.00

How reproducible:
100%

Steps to Reproduce:
1. Try an install with redfish:// in install-config
2. Use iDRAC version 4.22.00.00

Actual results:
Install fails with nodes not even powering on

Expected results:
Install should succeed

Additional info:
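For anyone who wants to repeat the manual power-on check from a script rather than curl, here is a minimal Python sketch of the same POST. It only builds the request and does not send it. The function name, its parameters and the default system ID are illustrative assumptions; the URL path and the {"ResetType": "On"} body are taken from the curl command in the description.

```python
import base64
import json
import urllib.request


def build_reset_request(bmc_host, username, password, reset_type="On",
                        system_id="System.Embedded.1"):
    """Build (but do not send) a Redfish ComputerSystem.Reset POST request.

    Hypothetical helper: mirrors the curl call from the description.
    """
    url = (f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
           "/Actions/ComputerSystem.Reset")
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    # HTTP basic auth, the equivalent of curl's -u user:password
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
```

Sending it would be a matter of passing the request to urllib.request.urlopen; matching curl's -k would additionally need an SSL context with certificate verification turned off, which is only reasonable in a lab setup like this one.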
I've looked at the logs and I don't see iDRAC reporting *any* PowerState value until we try changing the power state (and fail). Then it shows "On". I don't understand how to interpret this yet.
This is a Dell FC640 node; not sure if that matters. Downgrading the firmware also results in the same error.
Firmware Version = 4.20.20.20
System BIOS Version = 2.8.1
So at this point I have tried both 4.20.20.20 and 4.22.00.00.
Is this a regression or are you using HW/FW that hasn't been tested yet?
Tomas, I'd consider this a regression, as Redfish should work with this system. Per the IPI on BM docs, the only requirement is that the system can run RHEL 8, which these can, and they are listed in the RHEL certified servers list: https://catalog.redhat.com/hardware/servers/search?p=1&c_version=Red%20Hat%20Enterprise%20Linux%208&ch_architecture=x86_64&q=fc640
Created attachment 1713600 [details]
One inspection log

Attached is an extract containing one inspection request for one node. The most surprising thing is that PowerState is missing from most of the Redfish System representations. I'm not sure why ironic assumes they're powered on, though. I'll probably need to involve Dell folks to understand what is going on.
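To make the "PowerState is missing" observation easier to check against a saved System payload, here is a small Python sketch. The helper name is hypothetical, and it is not how ironic or sushy reads the value. It only separates "property absent" from "property present" in the JSON returned by /redfish/v1/Systems/System.Embedded.1.

```python
import json


def read_power_state(system_json):
    """Return the PowerState string from a Redfish System payload.

    Returns None when the property is absent or null, so the caller can
    tell "BMC did not report a state" apart from "On" or "Off".
    """
    payload = json.loads(system_json)
    state = payload.get("PowerState")
    return state if isinstance(state, str) else None
```

Returning None rather than a default keeps the missing case visible instead of silently treating the node as powered on or off.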
Could you try iDRAC firmware 4.10.10.10, assuming that version is available for the FC640?
Based on https://github.com/openshift-kni/baremetal-deploy/blob/master/ansible-ipi-install/roles/node-prep/tasks/10_validation.yml#L385 it looks like we need a version greater than or equal to 4.20.20.20 for Redfish to be supported by the installer?
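As a rough illustration of that kind of minimum-version check, here is a Python sketch that compares dotted iDRAC firmware versions numerically. The function names are hypothetical and the 4.20.20.20 threshold is simply the value quoted in this comment; this is not the logic of the linked Ansible task.

```python
def parse_idrac_version(version):
    """Turn a dotted version such as '4.22.00.00' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


# Assumed threshold, taken from the comment above rather than verified.
MIN_REDFISH_FW = parse_idrac_version("4.20.20.20")


def meets_minimum(version, minimum=MIN_REDFISH_FW):
    """True when the firmware version is greater than or equal to the minimum."""
    return parse_idrac_version(version) >= minimum
```

Comparing integer tuples avoids the pitfalls of comparing the raw strings, where differing field widths would sort incorrectly.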
Did you still need 4.10.10.10 tested?
Roger, any comments on the firmware version being requested for testing? I believe the minimum firmware version you mentioned as working with Redfish is higher than what is being requested here. I need some input from you.
Hey Sai, I think it's worth bringing it down to iDRAC 4.10.10.10 (even though we recommend 4.20.20.20 or higher) so that Dell can narrow down the issue in the higher firmware versions. The firmware is available here: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=ktc95&oscode=wst14&productcode=poweredge-fc640
OK, so with iDRAC at 4.10.10.10 the node at least powers on. Earlier, even that was not happening. I don't have a successful deploy yet, since PXE seems to have failed and I'm still investigating. In the interest of time, I just wanted to confirm here that nodes boot up with 4.10.10.10 and Redfish.
Thanks Sai. When you have some downtime and things are stable, could you grab the ironic log again? We'd like to compare the PowerState being returned from the iDRAC in this case.
So, I'm back with hopefully a more concrete data point. The one time the nodes did power on, it turns out the boot mode was set to UEFI. Somehow the firmware downgrade seems to have changed the boot mode when going from 4.20.20.20 to 4.10.10.10. After reverting to BIOS, the nodes don't power on. To clarify, all of the data on this BZ was with BIOS, except the one time with 4.10.10.10 in comment #14 when the nodes powered on.

Still seeing:

2020-09-11 15:25:14.095 1 ERROR ironic.conductor.manager [req-7e06811d-f525-4f79-84ac-d5e3e5fcd2d3 - - - - -] Failed to inspect node b1083361-2adb-404a-bba3-701551529451: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node b1083361-2adb-404a-bba3-701551529451 when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b02-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.: ironic.common.exception.HardwareInspectionFailure: Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Reboot failed for node b1083361-2adb-404a-bba3-701551529451 when setting power state to power off. Error: HTTP POST https://mgmt-e16-h12-b02-fc640.rdu2.scalelab.redhat.com/redfish/v1/Systems/System.Embedded.1/Actions/ComputerSystem.Reset returned code 409. Base.1.5.GeneralError: Server is already powered OFF.
The version I'm using is 4.5.7. So it's not clear if this problem is expected in 4.5?
Created attachment 1715588 [details]
output of redfish/v1/Systems/System.Embedded.1 on Dell FC640

This shows that the PowerState is On when the server is off.
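Given a stale PowerState like this, a client that trusts "On" will attempt a power off and get the 409 seen in the logs. Purely as an illustration, and not what ironic actually does, a caller could recognize that specific reply as "the requested state is already reached". The helper name below is hypothetical and the matched text is the message from the logs in this report.

```python
def power_off_already_satisfied(status_code, body_text):
    """True when a reset reply is a 409 saying the server is already off.

    The substring check is based on the iDRAC message quoted in this bug
    ("Server is already powered OFF."); other BMCs may word it differently.
    """
    return status_code == 409 and "already powered off" in body_text.lower()
```

Matching on message text is fragile, so this is only a diagnostic aid for triaging logs, not a substitute for the BMC reporting the correct PowerState.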
Hey, can you please verify this BZ on your system?
Thanks,
Raviv
Need to update F/W on these Dells to 4.22.00.53 and retest in cluster.
Hi Sai - I think this can be closed as it's working after updating the firmware.
(In reply to Bob Fournier from comment #30)
> Hi Sai - I think this can be closed as its working after updating the
> firmware.

Ack, I did have deployment issues with redfish though even after the update. We can open a separate bug for that if needed. Good to close this.
Thanks Sai. I will close this out as the power issue is resolved. Let's follow up with the next set of problems.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633