Bug 1873014 - Installer should fail when one of the workers fails to boot
Summary: Installer should fail when one of the workers fails to boot
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Gal Zaidman
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-08-27 07:29 UTC by Jan Zmeskal
Modified: 2020-12-09 15:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-09 15:37:27 UTC
Target Upstream Version:
Embargoed:



Description Jan Zmeskal 2020-08-27 07:29:59 UTC
Description of problem:
I was recently deploying OCP 4.6. The installer finished successfully, yet one of the worker machines got stuck in the Provisioning state indefinitely.

# oc get machine -n openshift-machine-api
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-dzszz-master-0         Running                               14h
primary-dzszz-master-1         Running                               14h
primary-dzszz-master-2         Running                               14h
primary-dzszz-worker-0-ks4x7   Provisioning                          14h
primary-dzszz-worker-0-sj47r   Running                               14h
primary-dzszz-worker-0-wdvnq   Running                               14h

After some investigation, we found that the machine, despite showing Up status in RHV Manager, never actually finished booting. I believe the installer should fail in such a case. Otherwise the user may be in for a nasty surprise if they do not manually check oc get machine after the installation has finished.
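
The manual check mentioned above can be automated. The following is a minimal sketch, assuming the tabular output of oc get machine looks like the listing pasted earlier in this description; the function name machines_not_running and the SAMPLE constant are hypothetical and not part of any OpenShift tooling.

```python
from typing import List, Tuple

# Sample copied from the `oc get machine -n openshift-machine-api` output above.
SAMPLE = """\
NAME                           PHASE          TYPE   REGION   ZONE   AGE
primary-dzszz-master-0         Running                               14h
primary-dzszz-master-1         Running                               14h
primary-dzszz-master-2         Running                               14h
primary-dzszz-worker-0-ks4x7   Provisioning                          14h
primary-dzszz-worker-0-sj47r   Running                               14h
primary-dzszz-worker-0-wdvnq   Running                               14h
"""


def machines_not_running(table: str) -> List[Tuple[str, str]]:
    """Return (name, phase) for every machine whose phase is not Running.

    Assumption: columns are whitespace-separated and PHASE is the second
    column. If PHASE is empty, the next non-empty column is picked up
    instead, so the machine is still reported as not Running.
    """
    stuck = []
    for line in table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        name, phase = fields[0], fields[1]
        if phase != "Running":
            stuck.append((name, phase))
    return stuck


for name, phase in machines_not_running(SAMPLE):
    print(f"{name} is in phase {phase}")
```

Parsing the human-readable table is fragile; it is used here only because it matches the output shown in this report.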

See this comment and the two that follow it for details: https://bugzilla.redhat.com/show_bug.cgi?id=1817853#c32


Version-Release number of the following components:
4.6.0-0.nightly-2020-08-26-032807
RHV 4.3.11.2-0.1.el7

How reproducible:
Happened to me once

Steps to Reproduce:
1. Start OCP cluster deployment
2. Prevent one of the worker machines from successfully booting
3. Check the exit status of the installer

Actual results:
Installer finishes successfully

Expected results:
Installer should fail
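
Until the installer reports this itself, the expected behaviour can be approximated with a check that runs after the installer returns. This is a rough sketch, not installer code: it assumes oc is on the PATH and logged in to the cluster, and that each Machine object exposes its phase under status.phase in the JSON output. The function names non_running_phases and check_machines are hypothetical.

```python
import json
import subprocess
import sys


def non_running_phases(machine_list_json: str) -> dict:
    """Map machine name -> phase for machines whose phase is not Running.

    Assumption: the input is the JSON list produced by
    `oc get machine -o json`, with the phase stored in status.phase.
    """
    items = json.loads(machine_list_json).get("items", [])
    result = {}
    for item in items:
        name = item["metadata"]["name"]
        # A machine without a reported phase is treated as not Running.
        phase = item.get("status", {}).get("phase", "<unknown>")
        if phase != "Running":
            result[name] = phase
    return result


def check_machines(namespace: str = "openshift-machine-api") -> int:
    """Query the cluster and return 1 if any machine is not Running, else 0."""
    out = subprocess.run(
        ["oc", "get", "machine", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    stuck = non_running_phases(out)
    for name, phase in stuck.items():
        print(f"{name}: {phase}", file=sys.stderr)
    return 1 if stuck else 0
```

Calling sys.exit(check_machines()) after the installer has returned would turn a stuck worker into a non-zero exit status for the surrounding automation.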

Comment 1 Sandro Bonazzola 2020-09-01 11:42:55 UTC

*** This bug has been marked as a duplicate of bug 1871795 ***

Comment 3 Sandro Bonazzola 2020-09-14 09:56:19 UTC
No capacity in the current sprint.

Comment 4 Sandro Bonazzola 2020-10-22 11:38:57 UTC
Due to capacity constraints, we will be revisiting this bug in the upcoming sprint.

Comment 5 Sandro Bonazzola 2020-12-03 12:53:46 UTC
May be related to missing event on node shutdown, needs further investigation.

Comment 6 Gal Zaidman 2020-12-09 15:37:27 UTC
(In reply to Sandro Bonazzola from comment #5)
> May be related to missing event on node shutdown, needs further
> investigation.

Closing this bug. The installer succeeded because 2 workers are up, and that is all the installer requires: if the cluster and its operators have finished and the cluster is stable, the installation finishes.
This is not related to the event bug.

