Bug 1988879 - Virtual media based deployment fails on Dell servers due to pending Lifecycle Controller jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: x86_64
OS: All
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jacob Anders
QA Contact: Lubov
URL:
Whiteboard:
Duplicates: 2022426
Depends On:
Blocks:
Reported: 2021-08-02 00:03 UTC by Jacob Anders
Modified: 2022-03-12 04:37 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Release Note text: Previously, virtual media based deployments of OpenShift could intermittently fail on iDRAC hardware types when outstanding Lifecycle Controller jobs were present. This is fixed by adding an automated step that purges any existing Lifecycle Controller jobs while registering iDRAC hardware. (BZ#1988879) Cause: On the iDRAC hardware type, outstanding Lifecycle Controller jobs may clash with virtual media configuration requests. Consequence: Virtual media based OpenShift installation may intermittently fail. Fix: Automatically purge the Lifecycle Controller job queue prior to deployment. Result: Virtual media based deployments on iDRAC hardware types complete successfully.
Clone Of:
Environment:
Last Closed: 2022-03-12 04:36:27 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github metal3-io ironic-image pull 311 0 None Merged Enable Lifecycle Controller job queue clear by default 2021-10-08 08:29:31 UTC
Github openshift ironic-image pull 224 0 None Merged Add support for Verify steps and Lifecycle Controller reset 2021-10-31 22:58:19 UTC
OpenStack Storyboard 2007617 0 None None None 2021-08-02 00:03:55 UTC
OpenStack Storyboard 2009025 0 None Closed How to clean up and reinstall a failed Self-hosted Engine if SHE Hosts are running further VMs 2022-05-20 09:23:15 UTC
OpenStack gerrit 800001 0 None MERGED Add support for verify steps 2021-10-31 22:58:52 UTC
OpenStack gerrit 804032 0 None MERGED Make iDRAC management steps verify steps 2021-10-31 22:58:53 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:37:05 UTC

Description Jacob Anders 2021-08-02 00:03:55 UTC
Description of problem:

We are observing intermittent deploy failures on Dell machines caused by virtual media failing to attach. An example of this (at the inspection stage):

Failed to inspect hardware. Reason: unable to start inspection: HTTP POST https://10.19.0.84/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia returned code 500. Base.1.2.GeneralError: The request failed due to an internal service error.  The service is still operational. Extended information: [{'Message': 'The request failed due to an internal service error.  The service is still operational.', 'MessageArgs': [], '[email protected]': 0, 'MessageId': 'Base.1.2.InternalError', 'RelatedProperties': [], '[email protected]': 0, 'Resolution': 'Resubmit the request.  If the problem persists, consider resetting the service.', 'Severity': 'Critical'}]

Version-Release number of selected component (if applicable):

OpenShift 4.8

How reproducible:

The issue isn't easily reproducible, but it happens regularly. It appears correlated with pending jobs stuck in the Lifecycle Controller; at a high level, the iDRAC ends up in a corrupt state.
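One way to spot this state is to look for jobs in the iDRAC job queue that never reached a terminal state. A minimal sketch of that check (the sample payload and terminal-state set below are illustrative, not captured from the affected machine; a real queue would be fetched from the iDRAC's Redfish job collection or `racadm jobqueue view`):

```python
# Hypothetical sketch: flag Lifecycle Controller jobs that are not in a
# terminal state, given a Redfish-style job collection payload.

def pending_jobs(job_queue):
    """Return IDs of jobs that have not reached a terminal JobState."""
    terminal = {"Completed", "Failed", "CompletedWithErrors"}
    return [job["Id"] for job in job_queue.get("Members", [])
            if job.get("JobState") not in terminal]

# Illustrative payload (not from the affected hardware).
sample = {
    "Members": [
        {"Id": "JID_000000000001", "JobState": "Scheduled"},
        {"Id": "JID_000000000002", "JobState": "Completed"},
        {"Id": "JID_000000000003", "JobState": "Running"},
    ]
}

print(pending_jobs(sample))  # ['JID_000000000001', 'JID_000000000003']
```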

Clearing Lifecycle Controller jobs and resetting iDRAC seems to reliably resolve this issue.
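For reference, the manual workaround can be approximated with two Redfish calls against the iDRAC. This is only a sketch: the `DellJobService.DeleteJobQueue` action and the `JID_CLEARALL` job ID are Dell OEM extensions, and the address and credentials are placeholders.

```shell
#!/bin/sh
# Sketch of the manual recovery steps; substitute your own BMC address
# and credentials before use.
IDRAC=https://10.19.0.84   # BMC address from the report; placeholder
CREDS='user:password'      # placeholder credentials

# 1. Purge the Lifecycle Controller job queue (Dell OEM action;
#    JID_CLEARALL deletes every job in the queue).
curl -ks -u "$CREDS" -X POST \
  -H 'Content-Type: application/json' \
  -d '{"JobID": "JID_CLEARALL"}' \
  "$IDRAC/redfish/v1/Dell/Managers/iDRAC.Embedded.1/DellJobService/Actions/DellJobService.DeleteJobQueue"

# 2. Reset the iDRAC itself (standard Redfish Manager.Reset action).
curl -ks -u "$CREDS" -X POST \
  -H 'Content-Type: application/json' \
  -d '{"ResetType": "GracefulRestart"}' \
  "$IDRAC/redfish/v1/Managers/iDRAC.Embedded.1/Actions/Manager.Reset"
```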


Steps to Reproduce:
1.
2.
3.

Actual results:

Virtual media based deployment fails.


Expected results:

Virtual media based deployment succeeds.

Additional info:

Work to address this in Ironic has been undertaken upstream by both Dell and Red Hat. There is existing Ironic code contributed by Dell which can be used to automatically clear Lifecycle Controller jobs and reset iDRAC prior to deployment to ensure a known good state. This is essentially a programmatic way of applying the exact same fix we have been applying manually so far to help resolve this issue when it's observed.  Reference upstream stories are linked to the BZ.
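The Dell-contributed steps can also be triggered manually through Ironic's cleaning workflow. A sketch assuming the standalone `baremetal` client and a node named `node-0` (both placeholders; the `clear_job_queue` and `reset_idrac` steps live on the idrac hardware type's management interface):

```shell
# Sketch: run the iDRAC recovery steps as manual cleaning steps on one node.
baremetal node clean node-0 --clean-steps '[
  {"interface": "management", "step": "clear_job_queue"},
  {"interface": "management", "step": "reset_idrac"}
]'
```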

Comment 3 Jacob Anders 2021-10-05 03:51:32 UTC
Ironic patches implementing a fix are up for review.

Comment 4 Jacob Anders 2021-10-15 12:05:49 UTC
https://github.com/metal3-io/ironic-image/pull/311 has merged, now it needs to be included in ocp/4.10 (through cherry pick or regular sync).

Comment 5 Jacob Anders 2021-10-31 22:59:45 UTC
https://github.com/openshift/ironic-image/pull/224 has also merged, bringing this change into OCP 4.10.

Setting status to MODIFIED.

Comment 8 Jacob Anders 2021-11-30 00:48:35 UTC
*** Bug 2022426 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2022-03-12 04:36:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

