Bug 1941636 - BM worker nodes deployment with virtual media failed while trying to clean raid
Summary: BM worker nodes deployment with virtual media failed while trying to clean raid
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Iury Gregory Melo Ferreira
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks: 1935419
 
Reported: 2021-03-22 14:14 UTC by Sabina Aledort
Modified: 2021-07-27 22:54 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:54:36 UTC
Target Upstream Version:
Embargoed:


Attachments
metal3-baremetal-operator.log (13.20 MB, text/plain)
2021-03-22 14:14 UTC, Sabina Aledort
kube-rbac-proxy.log (517 bytes, text/plain)
2021-03-22 14:15 UTC, Sabina Aledort


Links
Github metal3-io baremetal-operator pull 829 (closed): Change default of RAIDInterface (last updated 2021-03-24 21:28:34 UTC)
Github openshift baremetal-operator pull 138 (closed): Bug 1941636 - BM worker nodes deployment with virtual media failed while trying to clean raid (last updated 2021-03-24 21:28:22 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:54:57 UTC)

Description Sabina Aledort 2021-03-22 14:14:08 UTC
Created attachment 1765288
metal3-baremetal-operator.log

Description of problem:
Deployment of the BM worker nodes failed while trying to clean up RAID.
The cluster was deployed with virtual media and the BMs are set to UEFI boot.
We were able to deploy the same cluster with IPMI successfully.

[root@cnfdt07-installer ~]# oc get bmh -A
NAMESPACE               NAME               STATE                    CONSUMER                 ONLINE   ERROR
openshift-machine-api   cnfdt07-master-0   externally provisioned   cnfdt07-lghjm-master-0   true     
openshift-machine-api   cnfdt07-master-1   externally provisioned   cnfdt07-lghjm-master-1   true     
openshift-machine-api   cnfdt07-master-2   externally provisioned   cnfdt07-lghjm-master-2   true     
openshift-machine-api   cnfdt07-worker-0   preparing                                         true     preparation error
openshift-machine-api   cnfdt07-worker-1   preparing                                         true     preparation error

error message from 'oc describe bmh': "Node failed to start the first cleaning step. Error: node does not support this clean step: {'interface': 'raid', 'step': 'delete_configuration'}"

It seems that this cleaning step was added in this PR:
https://github.com/metal3-io/baremetal-operator/pull/292/files
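
For reference, that step maps to Ironic's manual-cleaning API: cleaning is triggered by a provision-state change with target "clean" and a list of clean steps. Below is a minimal Go sketch of such a request body, matching the step named in the error above; the struct and names are illustrative assumptions, not the operator's actual code.

package main

import (
	"encoding/json"
	"fmt"
)

// CleanStep mirrors Ironic's manual-cleaning step format, which is what
// shows up in the error: {'interface': 'raid', 'step': 'delete_configuration'}.
type CleanStep struct {
	Interface string `json:"interface"`
	Step      string `json:"step"`
}

func main() {
	// Body of the provision-state change request that triggers manual
	// cleaning with a RAID delete_configuration step.
	body := struct {
		Target     string      `json:"target"`
		CleanSteps []CleanStep `json:"clean_steps"`
	}{
		Target:     "clean",
		CleanSteps: []CleanStep{{Interface: "raid", Step: "delete_configuration"}},
	}
	out, err := json.MarshalIndent(body, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}

Ironic rejects such a request when the node's RAID interface does not implement delete_configuration, which is exactly the "node does not support this clean step" error shown above.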

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-03-21-033801

How reproducible:
Deploy a cluster with virtual media and BMs set to UEFI boot.

Actual results:

[root@cnfdt07-installer ~]# oc get node
NAME                                  STATUS   ROLES            AGE   VERSION
dhcp-55-172.lab.eng.tlv2.redhat.com   Ready    master,virtual   24h   v1.20.0+39c0afe
dhcp-55-213.lab.eng.tlv2.redhat.com   Ready    master,virtual   24h   v1.20.0+39c0afe
dhcp-55-221.lab.eng.tlv2.redhat.com   Ready    master,virtual   24h   v1.20.0+39c0afe

[root@cnfdt07-installer ~]# oc get bmh -A
NAMESPACE               NAME               STATE                    CONSUMER                 ONLINE   ERROR
openshift-machine-api   cnfdt07-master-0   externally provisioned   cnfdt07-lghjm-master-0   true     
openshift-machine-api   cnfdt07-master-1   externally provisioned   cnfdt07-lghjm-master-1   true     
openshift-machine-api   cnfdt07-master-2   externally provisioned   cnfdt07-lghjm-master-2   true     
openshift-machine-api   cnfdt07-worker-0   preparing                                         true     preparation error
openshift-machine-api   cnfdt07-worker-1   preparing                                         true     preparation error

Expected results:
The BMs should be provisioned successfully.

Additional info:
Attached are the baremetal-operator logs and the must-gather.

Comment 1 Sabina Aledort 2021-03-22 14:15:01 UTC
Created attachment 1765290
kube-rbac-proxy.log

Comment 3 Yuval Kashtan 2021-03-22 19:30:27 UTC
The BM node is an R640 with a PERC H740P Mini (Embedded) RAID controller, FW version 50.9.4-3025.

Comment 5 Zane Bitter 2021-03-23 20:14:30 UTC
It looks like the baremetal-operator now attempts that clean step whenever the BMC driver's RAIDInterface() method returns a non-empty string.

In this case the driver is idrac-virtualmedia and the RAID interface is set to "no-raid", so it attempts to run the clean step.

The regular iDRAC driver returns an empty string for the RAID interface, which is why it's not affected.
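
In Go terms, the guard amounts to "a non-empty RAID interface name means RAID cleaning is supported", which is wrong for drivers that report the explicit "no-raid" interface. A minimal sketch of the flawed check (illustrative names, not the actual baremetal-operator code):

// raidInterface is whatever the BMC access driver reports, e.g. "" for
// plain idrac, "no-raid" for idrac-virtualmedia.
func needsRAIDClean(raidInterface string) bool {
	// Flawed: "no-raid" is a non-empty string, so this returns true,
	// the delete_configuration clean step gets scheduled, and Ironic
	// rejects it with "node does not support this clean step".
	return raidInterface != ""
}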

Comment 6 Iury Gregory Melo Ferreira 2021-03-23 20:30:23 UTC
Zane, yeah, I've found that as well. I'm adding a check for "no-raid" to BMO (will push the PR for it).
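
A minimal sketch of what such a check could look like (again illustrative, not the merged code; see the linked PRs for the actual change):

// Corrected guard: treat Ironic's explicit "no-raid" interface the
// same as no RAID interface at all, so no delete_configuration clean
// step is scheduled for drivers like idrac-virtualmedia.
func needsRAIDClean(raidInterface string) bool {
	return raidInterface != "" && raidInterface != "no-raid"
}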

Comment 8 Iury Gregory Melo Ferreira 2021-03-29 13:39:13 UTC
Moving to

Comment 10 Lubov 2021-04-06 07:39:42 UTC
Could you verify the BZ, please? We don't have the same setup.

Comment 11 Sabina Aledort 2021-04-06 10:25:12 UTC
The cluster was deployed successfully with 4.8.0-0.nightly-2021-04-03-092337

Comment 14 errata-xmlrpc 2021-07-27 22:54:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

