Bug 1986656

Summary: [OCP4.9 Bug] Ironic node enters the clean failed state when the target node doesn't have a RAID controller.
Product: OpenShift Container Platform
Reporter: Fujitsu container team <fj-lsoft-rh-cnt>
Component: Bare Metal Hardware Provisioning
Assignee: Steven Hardy <shardy>
Bare Metal Hardware Provisioning sub component: baremetal-operator
QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: ecosystem-partners-infrastructure, fj-lsoft-bm, hfukumot, imelofer, janders, jniu, kahara, rbartal, shardy, tsedovic
Version: 4.9
Keywords: Triaged
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Doc Type: If docs needed, set a value
Last Closed: 2021-10-18 17:42:48 UTC
Type: Bug
Bug Blocks: 1920358

Description Fujitsu container team 2021-07-28 02:32:21 UTC
Description of Problem:

 Note: This bug report is public at the request of Red Hat.

  The Ironic node enters the clean failed state after the delete_configuration clean step fails.
  This happens when hardwareRAIDVolumes is nil and the target node doesn't have a RAID controller.
  The cause is that the BuildRAIDCleanSteps function does not handle this case and always runs delete_configuration.

  We have already created a PR in the Metal3 community to address this case.
  https://github.com/metal3-io/baremetal-operator/pull/942

  This PR adds handling for the case where hardwareRAIDVolumes is nil: the actual RAID configuration is kept (delete_configuration is not run).
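
  The intended behavior can be sketched in Go as follows. This is a minimal,
  self-contained illustration and not the code from the PR: the types
  RAIDConfig, HardwareRAIDVolume and CleanStep and the function
  buildRAIDCleanSteps are simplified, hypothetical stand-ins for the operator's
  real definitions. Treating an empty (non-nil) volume list as an explicit
  request to clear the configuration is also an assumption made for this sketch.

```go
package main

import "fmt"

// HardwareRAIDVolume is a hypothetical, reduced stand-in for the operator's
// hardware RAID volume type.
type HardwareRAIDVolume struct {
	Name  string
	Level string
}

// RAIDConfig is a hypothetical, reduced stand-in for the RAID section of a host spec.
// Assumption: a nil HardwareRAIDVolumes means "not specified", while an empty
// non-nil slice means "explicitly no volumes".
type RAIDConfig struct {
	HardwareRAIDVolumes []HardwareRAIDVolume
}

// CleanStep is a hypothetical, reduced form of an Ironic clean step.
type CleanStep struct {
	Interface string
	Step      string
}

// buildRAIDCleanSteps returns the RAID clean steps for the requested target.
func buildRAIDCleanSteps(target *RAIDConfig) []CleanStep {
	// Nothing was requested: keep the node's current RAID configuration and
	// emit no steps, so a node without a RAID controller is never asked to
	// run delete_configuration.
	if target == nil || target.HardwareRAIDVolumes == nil {
		return nil
	}

	// A configuration was requested: clear the existing one first.
	steps := []CleanStep{{Interface: "raid", Step: "delete_configuration"}}

	// Create the new volumes only if any were listed.
	if len(target.HardwareRAIDVolumes) > 0 {
		steps = append(steps, CleanStep{Interface: "raid", Step: "create_configuration"})
	}
	return steps
}

func main() {
	unspecified := &RAIDConfig{}
	explicitEmpty := &RAIDConfig{HardwareRAIDVolumes: []HardwareRAIDVolume{}}
	oneVolume := &RAIDConfig{HardwareRAIDVolumes: []HardwareRAIDVolume{{Name: "vol0", Level: "1"}}}

	fmt.Println(len(buildRAIDCleanSteps(unspecified)))   // 0 steps
	fmt.Println(len(buildRAIDCleanSteps(explicitEmpty))) // 1 step
	fmt.Println(len(buildRAIDCleanSteps(oneVolume)))     // 2 steps
}
```

  With this shape, a worker without a RAID card and without a RAID section in
  its host definition produces no RAID clean steps at all, which is the case
  this bug is about.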

Version-Release number of selected component:

  This issue was detected in the Pre-GA version.

  Red Hat OpenShift Container Platform Version Number: 4.9.0-0.nightly-2021-07-26-071921
  Release Number: 4.9
  Kubernetes Version: 1.21
  Cri-o Version: 0.1.0
  Related Component: None
  Related Middleware/Application: None
  Underlying RHCOS Release Number: 4.9
  Underlying RHCOS Architecture: x86_64
  Underlying RHCOS Kernel Version: 4.18.0

Drivers or hardware or architecture dependency:

  This error occurs when the target node doesn't have a RAID controller.

How reproducible:

  Always

Steps to Reproduce:

  1. Create install-config.yaml in clusterconfigs:
     The worker machine does not have a RAID card installed.
     $ vim ~/clusterconfigs/install-config.yaml

  2. Create manifests:

     $ openshift-baremetal-install --dir ~/clusterconfigs create manifests

  3. Create cluster:

     $ openshift-baremetal-install --dir ~/clusterconfigs --log-level debug create cluster

Actual Results:

  Ironic node enters the clean failed state.

Expected Results:

  Ironic node does not enter the clean failed state.

Summary of actions taken to resolve issue:

  We need to merge the upstream (Metal3) and downstream (RHOCP) PRs.
  - Upstream: https://github.com/metal3-io/baremetal-operator/pull/942
  - Downstream: https://github.com/openshift/baremetal-operator/pull/170

Location of diagnostic data:

  None

Hardware configuration:

  Model: RX2540 M4

Target Release:

  RHOCP4.9

Additional Info:

  None

Comment 8 Lubov 2021-08-23 15:58:34 UTC
Could you please verify this BZ? We don't have Fujitsu machines to verify it.

Comment 9 Fujitsu container team 2021-08-24 02:28:07 UTC
Hi, Lubov

Yes, Fujitsu is going to verify it, please wait.

Best Regards,
Yasuhiro Futakawa

Comment 10 Fujitsu container team 2021-08-26 01:27:07 UTC
Hi, Lubov,

Fujitsu verified that it works correctly with 4.9.0-0.nightly-2021-08-23-192406.
We also confirmed the fix of this BZ was included in this nightly build.

Best Regards,
Yasuhiro Futakawa

Comment 11 Lubov 2021-08-26 12:15:23 UTC
Good news, closing

Comment 14 errata-xmlrpc 2021-10-18 17:42:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759