Bug 1941636 - BM worker nodes deployment with virtual media failed while trying to clean raid
Summary: BM worker nodes deployment with virtual media failed while trying to clean raid
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Iury Gregory Melo Ferreira
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks: 1935419
 
Reported: 2021-03-22 14:14 UTC by Sabina Aledort
Modified: 2021-07-27 22:54 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:54:36 UTC
Target Upstream Version:
Embargoed:


Attachments
metal3-baremetal-operator.log (13.20 MB, text/plain)
2021-03-22 14:14 UTC, Sabina Aledort
kube-rbac-proxy.log (517 bytes, text/plain)
2021-03-22 14:15 UTC, Sabina Aledort


Links
Github metal3-io baremetal-operator pull 829 (closed): Change default of RAIDInterface (last updated 2021-03-24 21:28:34 UTC)
Github openshift baremetal-operator pull 138 (closed): Bug 1941636 - BM worker nodes deployment with virtual media failed while trying to clean raid (last updated 2021-03-24 21:28:22 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:54:57 UTC)

Description Sabina Aledort 2021-03-22 14:14:08 UTC
Created attachment 1765288
metal3-baremetal-operator.log

Description of problem:
Deployment of the BM worker nodes failed while trying to clean up RAID.
The cluster was deployed with virtual media and the BMs are set to UEFI boot.
We were able to deploy the same cluster with IPMI successfully.

[root@cnfdt07-installer ~]# oc get bmh -A
NAMESPACE               NAME               STATE                    CONSUMER                 ONLINE   ERROR
openshift-machine-api   cnfdt07-master-0   externally provisioned   cnfdt07-lghjm-master-0   true     
openshift-machine-api   cnfdt07-master-1   externally provisioned   cnfdt07-lghjm-master-1   true     
openshift-machine-api   cnfdt07-master-2   externally provisioned   cnfdt07-lghjm-master-2   true     
openshift-machine-api   cnfdt07-worker-0   preparing                                         true     preparation error
openshift-machine-api   cnfdt07-worker-1   preparing                                         true     preparation error

error message from 'oc describe bmh': "Node failed to start the first cleaning step. Error: node does not support this clean step: {'interface': 'raid', 'step': 'delete_configuration'}"

It seems that this cleaning step was added in this PR:
https://github.com/metal3-io/baremetal-operator/pull/292/files
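
For reference, that step maps to Ironic's manual-cleaning API: cleaning is triggered by a provision-state change with target "clean" and a list of clean steps. Below is a minimal Go sketch of such a request body, matching the step named in the error above; the struct and names are illustrative assumptions, not the operator's actual code.

package main

import (
	"encoding/json"
	"fmt"
)

// CleanStep mirrors Ironic's manual-cleaning step format, which is what
// shows up in the error: {'interface': 'raid', 'step': 'delete_configuration'}.
type CleanStep struct {
	Interface string `json:"interface"`
	Step      string `json:"step"`
}

func main() {
	// Body of the provision-state change request that triggers manual
	// cleaning with a RAID delete_configuration step.
	body := struct {
		Target     string      `json:"target"`
		CleanSteps []CleanStep `json:"clean_steps"`
	}{
		Target:     "clean",
		CleanSteps: []CleanStep{{Interface: "raid", Step: "delete_configuration"}},
	}
	out, err := json.MarshalIndent(body, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}

Ironic rejects such a request when the node's RAID interface does not implement delete_configuration, which is exactly the "node does not support this clean step" error shown above.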

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-03-21-033801

How reproducible:
Deploy a cluster with virtual media and BMs set to UEFI boot.

Actual results:

[root@cnfdt07-installer ~]# oc get node
NAME                                  STATUS   ROLES            AGE   VERSION
dhcp-55-172.lab.eng.tlv2.redhat.com   Ready    master,virtual   24h   v1.20.0+39c0afe
dhcp-55-213.lab.eng.tlv2.redhat.com   Ready    master,virtual   24h   v1.20.0+39c0afe
dhcp-55-221.lab.eng.tlv2.redhat.com   Ready    master,virtual   24h   v1.20.0+39c0afe

[root@cnfdt07-installer ~]# oc get bmh -A
NAMESPACE               NAME               STATE                    CONSUMER                 ONLINE   ERROR
openshift-machine-api   cnfdt07-master-0   externally provisioned   cnfdt07-lghjm-master-0   true     
openshift-machine-api   cnfdt07-master-1   externally provisioned   cnfdt07-lghjm-master-1   true     
openshift-machine-api   cnfdt07-master-2   externally provisioned   cnfdt07-lghjm-master-2   true     
openshift-machine-api   cnfdt07-worker-0   preparing                                         true     preparation error
openshift-machine-api   cnfdt07-worker-1   preparing                                         true     preparation error

Expected results:
The BMs should be provisioned successfully.

Additional info:
Attached are the baremetal-operator logs and the must-gather.

Comment 1 Sabina Aledort 2021-03-22 14:15:01 UTC
Created attachment 1765290
kube-rbac-proxy.log

Comment 3 Yuval Kashtan 2021-03-22 19:30:27 UTC
The BM node is an R640 with a PERC H740P Mini (Embedded) RAID controller, FW version 50.9.4-3025.

Comment 5 Zane Bitter 2021-03-23 20:14:30 UTC
It looks like the baremetal-operator now attempts that clean step whenever the BMC driver's RAIDInterface() method returns a non-empty string.

In this case the driver is idrac-virtualmedia and the RAID interface is set to "no-raid", so it attempts to run the clean step.

The regular iDRAC driver returns an empty string for the RAID interface, which is why it's not affected.
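
In Go terms, the guard amounts to "a non-empty RAID interface name means RAID cleaning is supported", which is wrong for drivers that report the explicit "no-raid" interface. A minimal sketch of the flawed check (illustrative names, not the actual baremetal-operator code):

// raidInterface is whatever the BMC access driver reports, e.g. "" for
// plain idrac, "no-raid" for idrac-virtualmedia.
func needsRAIDClean(raidInterface string) bool {
	// Flawed: "no-raid" is a non-empty string, so this returns true,
	// the delete_configuration clean step gets scheduled, and Ironic
	// rejects it with "node does not support this clean step".
	return raidInterface != ""
}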

Comment 6 Iury Gregory Melo Ferreira 2021-03-23 20:30:23 UTC
Zane, yeah, I've found that as well. I'm adding a check for "no-raid" to BMO (will push the PR for it).
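
A minimal sketch of what such a check could look like (again illustrative, not the merged code; see the linked PRs for the actual change):

// Corrected guard: treat Ironic's explicit "no-raid" interface the
// same as no RAID interface at all, so no delete_configuration clean
// step is scheduled for drivers like idrac-virtualmedia.
func needsRAIDClean(raidInterface string) bool {
	return raidInterface != "" && raidInterface != "no-raid"
}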

Comment 8 Iury Gregory Melo Ferreira 2021-03-29 13:39:13 UTC
Moving to

Comment 10 Lubov 2021-04-06 07:39:42 UTC
Could you verify the BZ, please? We don't have the same setup.

Comment 11 Sabina Aledort 2021-04-06 10:25:12 UTC
The cluster was deployed successfully with 4.8.0-0.nightly-2021-04-03-092337

Comment 14 errata-xmlrpc 2021-07-27 22:54:36 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

