Bug 1959749

Summary: sushy_oem_idrac fails during adding bmh
Product: OpenShift Container Platform
Reporter: Polina Rabinovich <prabinov>
Component: Bare Metal Hardware Provisioning
Sub component: ironic
Assignee: Tomas Sedovic <tsedovic>
QA Contact: Amit Ugol <augol>
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
CC: imelofer, rbartal, rhalle, sasha
Version: 4.8   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Other   
Whiteboard: Telco
Last Closed: 2021-05-19 07:46:57 UTC
Type: Bug
Attachments:
Description                      Flags
metal3-ironic-inspector logs     none
metal3-ironic conductor logs     none
metal3-baremetal-operator.log    none

Description Polina Rabinovich 2021-05-12 09:33:21 UTC
Created attachment 1782320
metal3-ironic-inspector logs

Description of problem:
----------------
While adding the bmh, we got errors and hit a problem with the iDRAC.

[kni@ocp-edge24 ~]$ oc get bmh -A -o yaml

status:
    errorCount: 1
    errorMessage: 'Failed to inspect hardware. Reason: unable to start inspection:
      Redfish exception occurred. Error: iDRAC Redfish set boot device failed for
      node ee79f546-4aff-4d91-9828-5ebf08450dbe, because system 4c4c4544-0037-3610-8050-b1c04f325732
      has no manager which could.'
    errorType: inspection error

A second problem: sushy_oem_idrac could not proceed with the operation because of jobs running on the iDRAC, even after we cleared the jobs.
We also noticed a warning about pending attribute changes or a configuration job in progress, but as mentioned, the jobs had been cleared.

----------------
2021-05-12 06:47:33.604 1 ERROR sushy_oem_idrac.resources.manager.manager [req-720fe7ce-66d5-4bda-8081-94e04fd8a8d5 ironic-user - - - -] Too many (10) retries, bailing out.: sushy.exceptions.BadRequestError: HTTP POST https://<ip>/redfish/v1/Managers/iDRAC.Embedded.1/Actions/Oem/EID_674_Manager.ImportSystemConfiguration returned code 400. Base.1.7.GeneralError: Unable to perform the import or export operation because there are pending attribute changes or a configuration job is in progress. Extended information: [{'Message': 'Unable to perform the import or export operation because there are pending attribute changes or a configuration job is in progress.', 'MessageArgs': [], 'MessageArgs': 0, 'MessageId': 'IDRAC.2.2.LC068', 'RelatedProperties': [], 'RelatedProperties': 0, 'Resolution': 'Apply or cancel any pending attribute changes. Changes can be applied by creating a targeted configuration job, or the changes can be cancelled by invoking the DeletePendingConfiguration method. If a configuration job is in progress, wait until it is completed before retrying the import or export system configuration operation.', 'Severity': 'Warning'}]
----------------
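
The extended information in the log line above is printed as a Python-style list of dicts, so it can be pulled out for triage. A minimal sketch, assuming the list always follows the literal text "Extended information: " and runs to the end of the line; parse_extended_info and the abbreviated sample line are illustrative and not part of sushy or sushy-oem-idrac:

```python
import ast

MARKER = "Extended information: "


def parse_extended_info(log_line):
    """Return the list of extended-info dicts embedded in a log line.

    Assumes the list is printed as a Python literal after MARKER and
    runs to the end of the line; returns [] when MARKER is absent.
    """
    _, found, tail = log_line.partition(MARKER)
    if not found:
        return []
    return ast.literal_eval(tail.strip())


# Abbreviated copy of the error above (the Resolution text is omitted).
sample = (
    "Base.1.7.GeneralError: Unable to perform the import or export "
    "operation because there are pending attribute changes or a "
    "configuration job is in progress. Extended information: "
    "[{'Message': 'Unable to perform the import or export operation "
    "because there are pending attribute changes or a configuration "
    "job is in progress.', 'MessageId': 'IDRAC.2.2.LC068', "
    "'Severity': 'Warning'}]"
)

for entry in parse_extended_info(sample):
    print(entry["MessageId"], entry["Severity"])
```

Searching for the MessageId (here IDRAC.2.2.LC068) is usually quicker than matching on the full message text.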

Version-Release number of selected component (if applicable):
iDRAC 4.40.0.0, Dell server R640

-----------------
How reproducible:
100%

Steps to Reproduce:
1. vi add-bmh.yaml
-----------------
apiVersion: v1
kind: Secret
metadata:
 name: rwn-5-secret
type: Opaque
data:
 username: <base-64-username>
 password: <base-64-password>
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
 name: openshift-worker-4
spec:
 online: true
 bmc:
   address: idrac-virtualmedia://<ip>/redfish/v1/Systems/System.Embedded.1
   credentialsName: rwn-5-secret
   disableCertificateVerification: True
   username: <username>
   password: <password>
 bootMACAddress: <mac-address>

-----------------
2. oc create -f add-bmh.yaml -n openshift-machine-api
3. oc get bmh openshift-worker-4 -n openshift-machine-api -o yaml
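
The username and password values under "data:" in the Secret from step 1 must be base64-encoded. A minimal sketch of producing them; encode_secret_value is a hypothetical helper and the credentials shown are placeholders, not values from this report:

```python
import base64


def encode_secret_value(value):
    """Base64-encode a string for use under "data:" in a Secret."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


# Placeholder credentials, not taken from the report.
print("username:", encode_secret_value("example-user"))
print("password:", encode_secret_value("example-password"))
```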

-----------------
Actual results:

NAME                 STATE                    CONSUMER                         ONLINE   ERROR
openshift-master-0   externally provisioned   ocp-edge3-ws94g-master-0         true
openshift-master-1   externally provisioned   ocp-edge3-ws94g-master-1         true
openshift-master-2   externally provisioned   ocp-edge3-ws94g-master-2         true
openshift-worker-0   provisioned              ocp-edge3-ws94g-worker-0-jkqn7   true
openshift-worker-1   provisioned              ocp-edge3-ws94g-worker-0-jhzcb   true
openshift-worker-4   inspecting                                                true     inspection error

-----------------
[kni@ocp-edge24 ~]$ oc get bmh openshift-worker-4 -A -o yaml
status:
    errorCount: 1
    errorMessage: 'Failed to inspect hardware. Reason: unable to start inspection:
      Redfish exception occurred. Error: iDRAC Redfish set boot device failed for
      node ee79f546-4aff-4d91-9828-5ebf08450dbe, because system 4c4c4544-0037-3610-8050-b1c04f325732
      has no manager which could.'
    errorType: inspection error
-----------------
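
To pick failing hosts out of a longer listing, the same status fields can be read from JSON output instead. A minimal sketch, assuming the output of "oc get bmh -A -o json" is a list object with an "items" array; hosts_with_errors and the sample document are illustrative, with values abbreviated from the status above:

```python
import json


def hosts_with_errors(bmh_list_json):
    """Return (name, errorType, errorMessage) for each host reporting an error."""
    doc = json.loads(bmh_list_json)
    failed = []
    for item in doc.get("items", []):
        status = item.get("status", {})
        if status.get("errorMessage"):
            failed.append(
                (
                    item["metadata"]["name"],
                    status.get("errorType", ""),
                    status["errorMessage"],
                )
            )
    return failed


# Hand-written sample; the error message is shortened from the report.
sample = json.dumps(
    {
        "items": [
            {
                "metadata": {"name": "openshift-worker-0"},
                "status": {"errorMessage": ""},
            },
            {
                "metadata": {"name": "openshift-worker-4"},
                "status": {
                    "errorCount": 1,
                    "errorType": "inspection error",
                    "errorMessage": "Failed to inspect hardware.",
                },
            },
        ]
    }
)

for name, error_type, message in hosts_with_errors(sample):
    print(f"{name}: {error_type}: {message}")
```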

Expected results:

The "PROVISIONING STATUS" of the newly added bmh switches to "inspecting" and, once inspection finishes, to "ready".
-----------------
NAME                 STATE                    CONSUMER                         ONLINE   ERROR
openshift-master-0   externally provisioned   ocp-edge3-ws94g-master-0         true
openshift-master-1   externally provisioned   ocp-edge3-ws94g-master-1         true
openshift-master-2   externally provisioned   ocp-edge3-ws94g-master-2         true
openshift-worker-0   provisioned              ocp-edge3-ws94g-worker-0-jkqn7   true
openshift-worker-1   provisioned              ocp-edge3-ws94g-worker-0-jhzcb   true
openshift-worker-4   ready                                                
-----------------
Additional info:
I attached logs from the ironic containers:
metal3-ironic-inspector logs
metal3-ironic-conductor logs
metal3-baremetal-operator logs

Comment 1 Polina Rabinovich 2021-05-12 09:34:49 UTC
Created attachment 1782321
metal3-ironic conductor logs

Comment 2 Polina Rabinovich 2021-05-12 09:35:53 UTC
Created attachment 1782322
metal3-baremetal-operator.log

Comment 3 Dmitry Tantsur 2021-05-12 10:41:06 UTC
> even after we clear the jobs

Have you also tried a complete iDRAC reset?

Comment 4 Rei 2021-05-12 10:43:31 UTC
(In reply to Dmitry Tantsur from comment #3)
> > even after we clear the jobs
> 
> Have you also tried a complete iDRAC reset?

Yes, simple reset and configuration factory reset (except the NIC)

Comment 5 Rei 2021-05-12 10:43:51 UTC
(In reply to Dmitry Tantsur from comment #3)
> > even after we clear the jobs
> 
> Have you also tried a complete iDRAC reset?

Yes, simple reset and configuration factory reset (except the NIC)

Comment 6 Iury Gregory Melo Ferreira 2021-05-18 16:12:40 UTC
Can you try the following approach and tell us the result:
<ajya> iurygregory: ok, interesting. Can they do anything else with the system? That is, can they manually (iDRAC web) create a BIOS or RAID job? Can they do Import manually? (Configuration->Server Configuration Profile->Import).
<ajya> If it is possible manually, then there is something wrong going in sushy-oem-idrac, if they can't, then it's iDRAC

Comment 7 Polina Rabinovich 2021-05-19 07:12:19 UTC
It looks like this was an issue with the upgrade process: 4.40.0.0 seems to have been installed while there was a problem, so it needed a re-install. After we re-installed 4.40.0.0, we moved on to the next issue (provisioning network and IPv6).

Comment 8 Iury Gregory Melo Ferreira 2021-05-19 07:30:25 UTC
Polina, thank you! Do you mind if I close this bz?

Comment 9 Polina Rabinovich 2021-05-19 07:41:35 UTC
Yes, I think you can close it, thank you!