Bug 1988879 - Virtual media based deployment fails on Dell servers due to pending Lifecycle Controller jobs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: x86_64
OS: All
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Jacob Anders
QA Contact: Lubov
URL:
Whiteboard:
Duplicates: 2022426
Depends On:
Blocks:
Reported: 2021-08-02 00:03 UTC by Jacob Anders
Modified: 2022-03-12 04:37 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Release Note text: Previously, virtual media based deployments of OpenShift could intermittently fail on iDRAC hardware types when outstanding Lifecycle Controller jobs were present. This is fixed by adding an automated step that purges any existing Lifecycle Controller jobs while registering iDRAC hardware. (BZ#1988879) Cause: On the iDRAC hardware type, outstanding Lifecycle Controller jobs may clash with virtual media configuration requests. Consequence: Virtual media based OpenShift installation may intermittently fail. Fix: Automatically purge the Lifecycle Controller job queue prior to deployment. Result: Virtual media based deployments on iDRAC hardware types complete successfully.
Clone Of:
Environment:
Last Closed: 2022-03-12 04:36:27 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github metal3-io ironic-image pull 311 0 None Merged Enable Lifecycle Controller job queue clear by default 2021-10-08 08:29:31 UTC
Github openshift ironic-image pull 224 0 None Merged Add support for Verify steps and Lifecycle Controller reset 2021-10-31 22:58:19 UTC
OpenStack Storyboard 2007617 0 None None None 2021-08-02 00:03:55 UTC
OpenStack Storyboard 2009025 0 None Closed How to clean up and reinstall a failed Self-hosted Engine if SHE Hosts are running further VMs 2022-05-20 09:23:15 UTC
OpenStack gerrit 800001 0 None MERGED Add support for verify steps 2021-10-31 22:58:52 UTC
OpenStack gerrit 804032 0 None MERGED Make iDRAC management steps verify steps 2021-10-31 22:58:53 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:37:05 UTC

Description Jacob Anders 2021-08-02 00:03:55 UTC
Description of problem:

We are observing intermittent deploy failures on Dell machines caused by virtual media failing to attach. An example of this (at the inspection stage):

Failed to inspect hardware. Reason: unable to start inspection: HTTP POST https://10.19.0.84/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.InsertMedia returned code 500. Base.1.2.GeneralError: The request failed due to an internal service error.  The service is still operational. Extended information: [{'Message': 'The request failed due to an internal service error.  The service is still operational.', 'MessageArgs': [], '[email protected]': 0, 'MessageId': 'Base.1.2.InternalError', 'RelatedProperties': [], '[email protected]': 0, 'Resolution': 'Resubmit the request.  If the problem persists, consider resetting the service.', 'Severity': 'Critical'}]

Version-Release number of selected component (if applicable):

OpenShift 4.8

How reproducible:

The issue isn't easily reproducible, but it happens regularly. It appears correlated with pending jobs stuck in the Lifecycle Controller; at a high level, the iDRAC ends up in a corrupt state.
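One way to spot this state is to look for jobs in the iDRAC job queue that never reached a terminal state. A minimal sketch of that check (the sample payload and terminal-state set below are illustrative, not captured from the affected machine; a real queue would be fetched from the iDRAC's Redfish job collection or `racadm jobqueue view`):

```python
# Hypothetical sketch: flag Lifecycle Controller jobs that are not in a
# terminal state, given a Redfish-style job collection payload.

def pending_jobs(job_queue):
    """Return IDs of jobs that have not reached a terminal JobState."""
    terminal = {"Completed", "Failed", "CompletedWithErrors"}
    return [job["Id"] for job in job_queue.get("Members", [])
            if job.get("JobState") not in terminal]

# Illustrative payload (not from the affected hardware).
sample = {
    "Members": [
        {"Id": "JID_000000000001", "JobState": "Scheduled"},
        {"Id": "JID_000000000002", "JobState": "Completed"},
        {"Id": "JID_000000000003", "JobState": "Running"},
    ]
}

print(pending_jobs(sample))  # ['JID_000000000001', 'JID_000000000003']
```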

Clearing Lifecycle Controller jobs and resetting iDRAC seems to reliably resolve this issue.
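For reference, the manual workaround can be approximated with two Redfish calls against the iDRAC. This is only a sketch: the `DellJobService.DeleteJobQueue` action and the `JID_CLEARALL` job ID are Dell OEM extensions, and the address and credentials are placeholders.

```shell
#!/bin/sh
# Sketch of the manual recovery steps; substitute your own BMC address
# and credentials before use.
IDRAC=https://10.19.0.84   # BMC address from the report; placeholder
CREDS='user:password'      # placeholder credentials

# 1. Purge the Lifecycle Controller job queue (Dell OEM action;
#    JID_CLEARALL deletes every job in the queue).
curl -ks -u "$CREDS" -X POST \
  -H 'Content-Type: application/json' \
  -d '{"JobID": "JID_CLEARALL"}' \
  "$IDRAC/redfish/v1/Dell/Managers/iDRAC.Embedded.1/DellJobService/Actions/DellJobService.DeleteJobQueue"

# 2. Reset the iDRAC itself (standard Redfish Manager.Reset action).
curl -ks -u "$CREDS" -X POST \
  -H 'Content-Type: application/json' \
  -d '{"ResetType": "GracefulRestart"}' \
  "$IDRAC/redfish/v1/Managers/iDRAC.Embedded.1/Actions/Manager.Reset"
```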


Steps to Reproduce:
1.
2.
3.

Actual results:

Virtual media based deployment fails.


Expected results:

Virtual media based deployment succeeds.

Additional info:

Work to address this in Ironic has been undertaken upstream by both Dell and Red Hat. There is existing Ironic code contributed by Dell which can be used to automatically clear Lifecycle Controller jobs and reset iDRAC prior to deployment to ensure a known good state. This is essentially a programmatic way of applying the exact same fix we have been applying manually so far to help resolve this issue when it's observed.  Reference upstream stories are linked to the BZ.
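The Dell-contributed steps can also be triggered manually through Ironic's cleaning workflow. A sketch assuming the standalone `baremetal` client and a node named `node-0` (both placeholders; the `clear_job_queue` and `reset_idrac` steps live on the idrac hardware type's management interface):

```shell
# Sketch: run the iDRAC recovery steps as manual cleaning steps on one node.
baremetal node clean node-0 --clean-steps '[
  {"interface": "management", "step": "clear_job_queue"},
  {"interface": "management", "step": "reset_idrac"}
]'
```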

Comment 3 Jacob Anders 2021-10-05 03:51:32 UTC
Ironic patches implementing a fix are up for review.

Comment 4 Jacob Anders 2021-10-15 12:05:49 UTC
https://github.com/metal3-io/ironic-image/pull/311 has merged, now it needs to be included in ocp/4.10 (through cherry pick or regular sync).

Comment 5 Jacob Anders 2021-10-31 22:59:45 UTC
https://github.com/openshift/ironic-image/pull/224 has also merged, bringing this change into OCP 4.10.

Setting status to MODIFIED.

Comment 8 Jacob Anders 2021-11-30 00:48:35 UTC
*** Bug 2022426 has been marked as a duplicate of this bug. ***

Comment 11 errata-xmlrpc 2022-03-12 04:36:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

