Bug 1988879

Summary: Virtual media based deployment fails on Dell servers due to pending Lifecycle Controller jobs
Product: OpenShift Container Platform Reporter: Jacob Anders <janders>
Component: Bare Metal Hardware Provisioning Assignee: Jacob Anders <janders>
Bare Metal Hardware Provisioning sub component: ironic QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: high CC: agurenko, dphillip, lshilin, tsedovic
Version: 4.8 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: x86_64   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Release Note text: Previously, it was observed that virtual media based deployments of OpenShift may intermittently fail on iDRAC hardware types, if there are outstanding Lifecycle Controller jobs. This is fixed by adding an automated step of purging any existing Lifecycle Controller jobs while registering iDRAC hardware. (BZ#1988879)
Cause: On iDRAC hardware type, outstanding Lifecycle Controller jobs may clash with virtual media configuration requests.
Consequence: Virtual media based OpenShift installation may intermittently fail.
Fix: Automatically purge Lifecycle Controller job queue prior to deployment.
Result: Virtual media based deployments on iDRAC hardware types
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-03-12 04:36:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jacob Anders 2021-08-02 00:03:55 UTC
Description of problem:

We are observing intermittent deploy failures on Dell machines caused by virtual media failing to attach. An example of this (at the inspection stage):

Failed to inspect hardware. Reason: unable to start inspection: HTTP POST https://10.19.0.84/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia returned code 500. Base.1.2.GeneralError: The request failed due to an internal service error.  The service is still operational. Extended information: [{'Message': 'The request failed due to an internal service error.  The service is still operational.', 'MessageArgs': [], 'MessageArgs': 0, 'MessageId': 'Base.1.2.InternalError', 'RelatedProperties': [], 'RelatedProperties': 0, 'Resolution': 'Resubmit the request.  If the problem persists, consider resetting the service.', 'Severity': 'Critical'}]

Version-Release number of selected component (if applicable):

OpenShift 4.8

How reproducible:

The issue isn't easily reproducible on demand, but it occurs regularly. It appears to be correlated with pending jobs stuck in the Lifecycle Controller. At a high level, it could be described as iDRAC ending up in a corrupt state.

Clearing Lifecycle Controller jobs and resetting iDRAC seems to reliably resolve this issue.


Steps to Reproduce:
1.
2.
3.

Actual results:

Virtual media based deployment fails.


Expected results:


Virtual media based deployment succeeds.

Additional info:

Both Dell and Red Hat have been working upstream to address this in Ironic. Existing Ironic code contributed by Dell can be used to automatically clear Lifecycle Controller jobs and reset iDRAC prior to deployment to ensure a known good state. This is essentially a programmatic way of applying the same fix we have so far been applying manually whenever the issue is observed. The upstream reference stories are linked to this BZ.
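
For illustration only, the manual workaround (clear the Lifecycle Controller job queue, then reset iDRAC) could be sketched as the two Redfish requests below. This is not the Ironic code referred to above. The action paths, the "JID_CLEARALL" job ID and the "GracefulRestart" reset type are assumptions based on Dell's Redfish OEM extensions and should be checked against the iDRAC firmware in use. The helper names are hypothetical. The sketch only builds the requests and does not send anything.

```python
import json
import urllib.request

# Manager ID as it appears in the error message quoted in this report.
MANAGER_ID = "iDRAC.Embedded.1"

# Assumption: Dell OEM Redfish action paths; verify against your iDRAC firmware.
CLEAR_JOBS_PATH = (
    "/redfish/v1/Dell/Managers/{mgr}/DellJobService"
    "/Actions/DellJobService.DeleteJobQueue"
)
RESET_PATH = "/redfish/v1/Managers/{mgr}/Actions/Manager.Reset"


def build_recovery_requests(bmc_address, manager_id=MANAGER_ID):
    """Return (url, payload) pairs: clear the job queue first, then reset iDRAC."""
    base = "https://" + bmc_address
    return [
        # Assumption: "JID_CLEARALL" asks iDRAC to delete every queued job.
        (base + CLEAR_JOBS_PATH.format(mgr=manager_id), {"JobID": "JID_CLEARALL"}),
        # Assumption: "GracefulRestart" restarts the iDRAC itself, not the host.
        (base + RESET_PATH.format(mgr=manager_id), {"ResetType": "GracefulRestart"}),
    ]


def to_http_request(url, payload, auth_header):
    """Wrap one (url, payload) pair in a POST request object; nothing is sent."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )
```

The requests are kept in order because the reset makes the BMC unreachable for a while, so the job queue has to be cleared first. A caller would pass each request to urllib.request.urlopen() with suitable credentials and TLS settings, and wait for iDRAC to come back before retrying the virtual media attach.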

Comment 3 Jacob Anders 2021-10-05 03:51:32 UTC
Ironic patches implementing a fix are up for review.

Comment 4 Jacob Anders 2021-10-15 12:05:49 UTC
https://github.com/metal3-io/ironic-image/pull/311 has merged; it now needs to be included in ocp/4.10 (through a cherry-pick or regular sync).

Comment 5 Jacob Anders 2021-10-31 22:59:45 UTC
https://github.com/openshift/ironic-image/pull/224 has also merged, which brings this change into OCP 4.10.

Setting status to MODIFIED.

Comment 8 Jacob Anders 2021-11-30 00:48:35 UTC
*** Bug 2022426 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2022-03-12 04:36:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056