Bug 1873014

Summary: Installer should fail when one of the workers fails to boot
Product: OpenShift Container Platform
Reporter: Jan Zmeskal <jzmeskal>
Component: Installer
Assignee: Gal Zaidman <gzaidman>
Installer sub component: OpenShift on RHV
QA Contact: Lucie Leistnerova <lleistne>
Status: CLOSED NOTABUG
Severity: low
Priority: low
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2020-12-09 15:37:27 UTC
Type: Bug
Regression: ---

Description Jan Zmeskal 2020-08-27 07:29:59 UTC
Description of problem:
It happened to me recently while deploying OCP 4.6 that the installer finished successfully, yet one of the worker machines got stuck in the Provisioning phase indefinitely.

# oc get machine -n openshift-machine-api
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-dzszz-master-0         Running                               14h
primary-dzszz-master-1         Running                               14h
primary-dzszz-master-2         Running                               14h
primary-dzszz-worker-0-ks4x7   Provisioning                          14h
primary-dzszz-worker-0-sj47r   Running                               14h
primary-dzszz-worker-0-wdvnq   Running                               14h

After some investigation, we found out that the machine, while reporting Up status in the RHV Manager, never actually finished the boot process. I believe the installer should fail in such a case. Otherwise there might be a nasty surprise if the user does not manually check oc get machine after the installation has finished.

See this and two following comments for details: https://bugzilla.redhat.com/show_bug.cgi?id=1817853#c32
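The manual check mentioned above could be sketched as a small post-install script (hypothetical, not part of the installer): it parses the phase column of oc get machine and fails if any Machine is not Running. It assumes oc is configured against the cluster and that the phase is the second column, as in the output above.

```shell
# Hypothetical post-install check: exit non-zero if any Machine is not in
# the Running phase. Reads `oc get machine`-style output on stdin.
check_machines() {
  # Skip the header row (NR > 1); flag any row whose PHASE column ($2)
  # is not "Running", then exit with 1 if anything was flagged.
  awk 'NR > 1 && $2 != "Running" { print "machine " $1 " is in phase " $2; bad = 1 }
       END { exit bad }'
}

# Usage against a live cluster (assumes a configured `oc`):
#   oc get machine -n openshift-machine-api | check_machines \
#     || echo "WARNING: not all machines came up"
```

With the output shown in this report, such a check would have flagged primary-dzszz-worker-0-ks4x7 even though the installer itself exited successfully.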


Version-Release number of the following components:
4.6.0-0.nightly-2020-08-26-032807
RHV 4.3.11.2-0.1.el7

How reproducible:
Happened to me once

Steps to Reproduce:
1. Start OCP cluster deployment
2. Prevent one of the worker machines from successfully booting
3. Check the exit status of the installer
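Step 3 can be sketched as follows. The installer invocation is shown only as a comment and stubbed with `true`, since a real run needs a configured RHV environment; the asset directory name is an example. The point of this report is that the real installer also exits 0 despite the stuck worker.

```shell
# Sketch of step 3: capture and report the installer's exit status.
# A real run would be something like:
#   openshift-install create cluster --dir ./primary
run_installer() { true; }   # stand-in for the openshift-install invocation

run_installer
status=$?
echo "installer exit status: $status"
```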

Actual results:
Installer finishes successfully

Expected results:
Installer should fail

Comment 1 Sandro Bonazzola 2020-09-01 11:42:55 UTC

*** This bug has been marked as a duplicate of bug 1871795 ***

Comment 3 Sandro Bonazzola 2020-09-14 09:56:19 UTC
No capacity in the current sprint.

Comment 4 Sandro Bonazzola 2020-10-22 11:38:57 UTC
Due to capacity constraints, we will be revisiting this bug in the upcoming sprint.

Comment 5 Sandro Bonazzola 2020-12-03 12:53:46 UTC
May be related to missing event on node shutdown, needs further investigation.

Comment 6 Gal Zaidman 2020-12-09 15:37:27 UTC
(In reply to Sandro Bonazzola from comment #5)
> May be related to missing event on node shutdown, needs further
> investigation.

Closing this bug. The installer was successful because two workers were up, and that is all the installer requires: once the cluster operators have finished rolling out and the cluster is stable, the installation finishes.
Not related to the event bug.