Bug 1941636

Summary: BM worker nodes deployment with virtual media failed while trying to clean raid
Product: OpenShift Container Platform Reporter: Sabina Aledort <saledort>
Component: Bare Metal Hardware ProvisioningAssignee: Iury Gregory Melo Ferreira <imelofer>
Bare Metal Hardware Provisioning sub component: baremetal-operator QA Contact: Amit Ugol <augol>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: bfournie, imelofer, lshilin, sasha, ykashtan, zbitter
Version: 4.8Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:54:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1935419    
Attachments:
Description Flags
metal3-baremetal-operator.log
none
kube-rbac-proxy.log none

Description Sabina Aledort 2021-03-22 14:14:08 UTC
Created attachment 1765288 [details]
metal3-baremetal-operator.log

Description of problem:
The BM worker nodes deployment failed while trying to cleanup raid.
The cluster was deployed with virtual media and the BMs are set to UEFI boot.
Were able to deploy the same cluster with ipmi successfully. 

[root@cnfdt07-installer ~]# oc get bmh -A
NAMESPACE               NAME               STATE                    CONSUMER                 ONLINE   ERROR
openshift-machine-api   cnfdt07-master-0   externally provisioned   cnfdt07-lghjm-master-0   true     
openshift-machine-api   cnfdt07-master-1   externally provisioned   cnfdt07-lghjm-master-1   true     
openshift-machine-api   cnfdt07-master-2   externally provisioned   cnfdt07-lghjm-master-2   true     
openshift-machine-api   cnfdt07-worker-0   preparing                                         true     preparation error
openshift-machine-api   cnfdt07-worker-1   preparing                                         true     preparation error

error message from 'oc describe bmh': "Node failed to start the first cleaning step. Error: node does not support this clean step: {'interface': 'raid', 'step': 'delete_configuration'}"

It seems that this cleanup stage was added is this PR:
https://github.com/metal3-io/baremetal-operator/pull/292/files

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-03-21-033801

How reproducible:
Deploy a cluster with virtual media and BMs set to UEFI boot.

Actual results:

[root@cnfdt07-installer ~]# oc get node
NAME                                  STATUS   ROLES            AGE   VERSION
dhcp-55-172.lab.eng.tlv2.redhat.com   Ready    master,virtual   24h   v1.20.0+39c0afe
dhcp-55-213.lab.eng.tlv2.redhat.com   Ready    master,virtual   24h   v1.20.0+39c0afe
dhcp-55-221.lab.eng.tlv2.redhat.com   Ready    master,virtual   24h   v1.20.0+39c0afe

[root@cnfdt07-installer ~]# oc get bmh -A
NAMESPACE               NAME               STATE                    CONSUMER                 ONLINE   ERROR
openshift-machine-api   cnfdt07-master-0   externally provisioned   cnfdt07-lghjm-master-0   true     
openshift-machine-api   cnfdt07-master-1   externally provisioned   cnfdt07-lghjm-master-1   true     
openshift-machine-api   cnfdt07-master-2   externally provisioned   cnfdt07-lghjm-master-2   true     
openshift-machine-api   cnfdt07-worker-0   preparing                                         true     preparation error
openshift-machine-api   cnfdt07-worker-1   preparing                                         true     preparation error

Expected results:
The bms should be provisioned successfully. 

Additional info:
Attached baremetal-operator logs and must-gather.

Comment 1 Sabina Aledort 2021-03-22 14:15:01 UTC
Created attachment 1765290 [details]
kube-rbac-proxy.log

Comment 3 Yuval Kashtan 2021-03-22 19:30:27 UTC
BM node is R640 with the PERC H740P Mini (Embedded)
FW Version: 50.9.4-3025

Comment 5 Zane Bitter 2021-03-23 20:14:30 UTC
It looks like the baremetal-operator now attempts that clean step whenever the BMC driver's RAIDInterface() method returns a non-empty string.

In this case the driver is idrac-virtualmedia and the RAID interface is set to "no-raid", so it attempts to run the clean step.

The regular iDRAC driver returns an empty string for the RAID interface, which is why it's not affected.

Comment 6 Iury Gregory Melo Ferreira 2021-03-23 20:30:23 UTC
Zane, yeah I've found that. I'm adding a check for the "no-raid" to BMO (will push the PR for it)

Comment 8 Iury Gregory Melo Ferreira 2021-03-29 13:39:13 UTC
Moving to

Comment 10 Lubov 2021-04-06 07:39:42 UTC
Could you verify the BZ, please? We don't have the same setup

Comment 11 Sabina Aledort 2021-04-06 10:25:12 UTC
The cluster was deployed successfully with 4.8.0-0.nightly-2021-04-03-092337

Comment 14 errata-xmlrpc 2021-07-27 22:54:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438