Bug 1873014

Summary: Installer should fail when one of the workers fails to boot
Product: OpenShift Container Platform
Reporter: Jan Zmeskal <jzmeskal>
Component: Installer
Assignee: Gal Zaidman <gzaidman>
Installer sub component: OpenShift on RHV
QA Contact: Lucie Leistnerova <lleistne>
Status: CLOSED NOTABUG
Severity: low
Priority: low
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2020-12-09 15:37:27 UTC
Type: Bug
Regression: ---

Description Jan Zmeskal 2020-08-27 07:29:59 UTC
Description of problem:
It happened to me recently while deploying OCP 4.6 that the installer finished successfully, yet one of the worker machines got stuck in the Provisioning phase indefinitely.

# oc get machine -n openshift-machine-api
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-dzszz-master-0         Running                               14h
primary-dzszz-master-1         Running                               14h
primary-dzszz-master-2         Running                               14h
primary-dzszz-worker-0-ks4x7   Provisioning                          14h
primary-dzszz-worker-0-sj47r   Running                               14h
primary-dzszz-worker-0-wdvnq   Running                               14h

After some investigation, we found out that the machine, while reporting Up status in the RHV Manager, never actually finished the boot process. I believe the installer should fail in such a case. Otherwise there might be a nasty surprise if the user does not manually check oc get machine after the installation has finished.

See this and two following comments for details: https://bugzilla.redhat.com/show_bug.cgi?id=1817853#c32
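The manual check mentioned above could be sketched as a small post-install script (hypothetical, not part of the installer): it parses the phase column of oc get machine and fails if any Machine is not Running. It assumes oc is configured against the cluster and that the phase is the second column, as in the output above.

```shell
# Hypothetical post-install check: exit non-zero if any Machine is not in
# the Running phase. Reads `oc get machine`-style output on stdin.
check_machines() {
  # Skip the header row (NR > 1); flag any row whose PHASE column ($2)
  # is not "Running", then exit with 1 if anything was flagged.
  awk 'NR > 1 && $2 != "Running" { print "machine " $1 " is in phase " $2; bad = 1 }
       END { exit bad }'
}

# Usage against a live cluster (assumes a configured `oc`):
#   oc get machine -n openshift-machine-api | check_machines \
#     || echo "WARNING: not all machines came up"
```

With the output shown in this report, such a check would have flagged primary-dzszz-worker-0-ks4x7 even though the installer itself exited successfully.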


Version-Release number of the following components:
4.6.0-0.nightly-2020-08-26-032807
RHV 4.3.11.2-0.1.el7

How reproducible:
Happened to me once

Steps to Reproduce:
1. Start OCP cluster deployment
2. Prevent one of the worker machines from successfully booting
3. Check the exit status of the installer
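Step 3 can be sketched as follows. The installer invocation is shown only as a comment and stubbed with `true`, since a real run needs a configured RHV environment; the asset directory name is an example. The point of this report is that the real installer also exits 0 despite the stuck worker.

```shell
# Sketch of step 3: capture and report the installer's exit status.
# A real run would be something like:
#   openshift-install create cluster --dir ./primary
run_installer() { true; }   # stand-in for the openshift-install invocation

run_installer
status=$?
echo "installer exit status: $status"
```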

Actual results:
Installer finishes successfully

Expected results:
Installer should fail

Comment 1 Sandro Bonazzola 2020-09-01 11:42:55 UTC

*** This bug has been marked as a duplicate of bug 1871795 ***

Comment 3 Sandro Bonazzola 2020-09-14 09:56:19 UTC
No capacity in the current sprint.

Comment 4 Sandro Bonazzola 2020-10-22 11:38:57 UTC
Due to capacity constraints, we will be revisiting this bug in the upcoming sprint.

Comment 5 Sandro Bonazzola 2020-12-03 12:53:46 UTC
May be related to missing event on node shutdown, needs further investigation.

Comment 6 Gal Zaidman 2020-12-09 15:37:27 UTC
(In reply to Sandro Bonazzola from comment #5)
> May be related to missing event on node shutdown, needs further
> investigation.

Closing this bug. The installer was successful because two workers were up, and that is all the installer requires: once the cluster operators have finished rolling out and the cluster is stable, the installation finishes.
Not related to the event bug.