1873014 – Installer should fail when one of the workers fails to boot

Summary: Installer should fail when one of the workers fails to boot

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Gal Zaidman
QA Contact:	Lucie Leistnerova
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-08-27 07:29 UTC by Jan Zmeskal
Modified:	2020-12-09 15:37 UTC (History)
CC List:	0 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-12-09 15:37:27 UTC
Target Upstream Version:
Embargoed:

Description Jan Zmeskal 2020-08-27 07:29:59 UTC

Description of problem:
I was recently deploying OCP 4.6. The installer finished successfully, yet one of the worker machines got stuck in the Provisioning state indefinitely.

# oc get machine -n openshift-machine-api
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-dzszz-master-0         Running                               14h
primary-dzszz-master-1         Running                               14h
primary-dzszz-master-2         Running                               14h
primary-dzszz-worker-0-ks4x7   Provisioning                          14h
primary-dzszz-worker-0-sj47r   Running                               14h
primary-dzszz-worker-0-wdvnq   Running                               14h

After some investigation, we found that the machine, despite showing Up status in RHV Manager, never actually finished booting. I believe the installer should fail in such a case. Otherwise, the user may be in for a nasty surprise if they do not manually check oc get machine after the installation has finished.
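Until the installer reports this itself, the manual check can be scripted. The following is only a rough sketch, not part of the installer: it assumes that `oc get machine -n openshift-machine-api -o json` returns a list whose items carry `metadata.name` and `status.phase`, and the function names (`machines_not_running`, `fetch_machines`, `check_cluster_machines`) are made up for illustration.

```python
import json
import subprocess


def machines_not_running(machine_list):
    """Return (name, phase) pairs for every Machine whose phase is not 'Running'."""
    stuck = []
    for item in machine_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        # Assumption: a Machine that reports no phase yet counts as not running.
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Running":
            stuck.append((name, phase))
    return stuck


def fetch_machines(namespace="openshift-machine-api"):
    """Run `oc get machine -o json` and parse it (needs a logged-in `oc` on PATH)."""
    result = subprocess.run(
        ["oc", "get", "machine", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def check_cluster_machines(machine_list):
    """Print every non-Running machine; return 1 if any exist, 0 otherwise."""
    stuck = machines_not_running(machine_list)
    for name, phase in stuck:
        print(f"{name}: {phase}")
    return 1 if stuck else 0
```

Calling `sys.exit(check_cluster_machines(fetch_machines()))` right after the installer returns would give a non-zero exit status for the cluster shown above, since primary-dzszz-worker-0-ks4x7 is still in Provisioning.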

See this comment and the two that follow it for details: https://bugzilla.redhat.com/show_bug.cgi?id=1817853#c32

Version-Release number of the following components:
4.6.0-0.nightly-2020-08-26-032807
RHV 4.3.11.2-0.1.el7

How reproducible:
Happened to me once

Steps to Reproduce:
1. Start OCP cluster deployment
2. Prevent one of the worker machines from successfully booting
3. Check the exit status of the installer

Actual results:
Installer finishes successfully

Expected results:
Installer should fail

Comment 1 Sandro Bonazzola 2020-09-01 11:42:55 UTC


*** This bug has been marked as a duplicate of bug 1871795 ***

Comment 3 Sandro Bonazzola 2020-09-14 09:56:19 UTC

No capacity in the current sprint.

Comment 4 Sandro Bonazzola 2020-10-22 11:38:57 UTC

Due to capacity constraints, we will revisit this bug in the upcoming sprint.

Comment 5 Sandro Bonazzola 2020-12-03 12:53:46 UTC

May be related to missing event on node shutdown, needs further investigation.

Comment 6 Gal Zaidman 2020-12-09 15:37:27 UTC

(In reply to Sandro Bonazzola from comment #5)
> May be related to missing event on node shutdown, needs further
> investigation.

Closing this bug. The installer succeeded because 2 workers are up, which is what the installer requires: if the cluster and its operators have finished and the cluster is stable, the installation finishes.
This is not related to the event bug.