Bug 1944246

Summary: Ironic fails to inspect and moves the node to "manageable", but 'oc get bmh' still shows it as "inspecting"
Product: OpenShift Container Platform
Reporter: Raviv Bar-Tal <rbartal>
Component: Bare Metal Hardware Provisioning
Sub component: baremetal-operator
Assignee: Zane Bitter <zbitter>
QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: aos-bugs, kiran, lshilin
Version: 4.8
Keywords: Triaged
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2021-07-27 22:56:24 UTC
Type: Bug
Attachments: conductor.log (flags: none)

Description Raviv Bar-Tal 2021-03-29 15:21:38 UTC
Created attachment 1767392: conductor.log

Description of problem:
In this case I tried to provision a node using redfish-virtualmedia,
but the node failed to mount the boot ISO.
The Ironic conductor moved the node from "inspecting" to "inspect failed", with target provision state "manageable" (see the attached log for the full error message):
"oved to provision state "inspect failed" from state "inspecting"; target provision state is "manageable":"

However, 'oc get bmh' shows the node's state as "inspecting",
and this state never changes:

NAME                         STATE                    CONSUMER                         ONLINE   ERROR
openshift-dworker-titan112   inspecting                                                true     

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-03-15-105049

How reproducible:


Steps to Reproduce:
1. Try to add a node using redfish-virtualmedia whose BMC cannot mount the ISO.
2. (In our lab, try adding seal or titan.)

Actual results:


Expected results:
The BMH changes state to "manageable" or to an error state, and shows the error message.

Additional info:
The Ironic conductor log is attached.

Comment 1 Zane Bitter 2021-03-30 16:14:43 UTC
The provisioning state is as expected, but we should be reporting an error message and are not.

I noticed a bug here: https://github.com/metal3-io/baremetal-operator/pull/817#issuecomment-809824437 which means we are not reporting introspection errors correctly. I need to check the logs, but there's a good chance this is the cause.

Comment 2 Zane Bitter 2021-03-30 18:29:09 UTC
Without the baremetal-operator log it's impossible to say what's going on. However, the issue mentioned in my previous comment would have caused the baremetal-operator to begin another inspection attempt immediately after detecting the failure. The error would be reported, but it would be cleared again as soon as the retry was detected to be in progress. That's not inconsistent with what we see in the conductor log, so the linked fix should keep the reported error visible for longer, as the retries get further apart.
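A minimal Python sketch of why the retry timing matters, assuming an exponential backoff between inspection attempts. The functions `retry_delay` and `error_visible_windows`, the 10-second base delay and the 600-second cap are all hypothetical and are not taken from baremetal-operator; the sketch only models how long an error stays set before the next attempt clears it.

```python
# Hypothetical sketch, NOT the baremetal-operator's actual code.
# Models how long a reported error stays visible before the next
# inspection attempt starts and clears it.

BASE_DELAY_S = 10   # assumed initial backoff, in seconds
MAX_DELAY_S = 600   # assumed cap on the backoff, in seconds


def retry_delay(failure_count: int, immediate: bool = False) -> int:
    """Seconds between recording a failure and starting the next attempt."""
    if immediate:
        # Behaviour described in the comment above: the retry starts right
        # away, so the error is cleared almost as soon as it is reported.
        return 0
    # Exponential backoff: 10, 20, 40, ... seconds, capped at MAX_DELAY_S.
    return min(BASE_DELAY_S * 2 ** (failure_count - 1), MAX_DELAY_S)


def error_visible_windows(failures: int, immediate: bool = False) -> list:
    """How long the error message stays set after each failed attempt."""
    return [retry_delay(n, immediate) for n in range(1, failures + 1)]


if __name__ == "__main__":
    print(error_visible_windows(5, immediate=True))  # [0, 0, 0, 0, 0]
    print(error_visible_windows(8))  # [10, 20, 40, 80, 160, 320, 600, 600]
```

With an immediate retry the error is never visible for a meaningful length of time, whereas with a backoff each successive failure leaves the error in place longer.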

According to the logs, the error is:

unable to start inspection: HTTP POST https://10.35.77.3/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia returned code 400.

If that's unexpected then let's open a separate bug to address it.

Comment 3 Zane Bitter 2021-05-11 15:47:51 UTC
The error-reporting fix is available in the 4.8 branch.

Comment 7 errata-xmlrpc 2021-07-27 22:56:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438