Bug 1965092

Summary: [Assisted-4.7] [Staging][OLM] Operators deployments start before all workers finished installation
Product: OpenShift Container Platform Reporter: Lital Alon <lalon>
Component: assisted-installer Assignee: Piotr Kliczewski <pkliczew>
assisted-installer sub component: assisted-service QA Contact: Yuri Obshansky <yobshans>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: achernet, alazar, aos-bugs
Version: 4.7   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: AI-Team-Projects AI-Cloud
Fixed In Version: v1.0.21.3 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 23:10:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
example none
cluster_logs none
must-gather none

Description Lital Alon 2021-05-26 19:44:50 UTC
Created attachment 1787358 [details]
example

Description of problem:
I was installing a cluster with 3 masters and 3 workers.
I also chose to deploy the OCS and CNV operators.
Installation finished for all nodes except worker-0-1, which hung at stage 7/8.

The issue is that while worker-0-1 was still installing, both the OCS and CNV operator deployments failed (meaning the operator deployments started before all workers had fully joined the cluster).
OCS requires 3 worker nodes, so we need to make sure all workers are available before deploying operators.

Must-gather and cluster logs are attached.

Version-Release number of selected component (if applicable):
Staging v1.0.20.3
OCS 4.7

Steps to Reproduce:
1. Install a 3-master, 3-worker cluster with the OCS and CNV operators selected.
2. During installation, simulate a failure on one worker node (e.g., kill the installer).
3. Wait for the cluster installation to complete.


Actual results:
Operator deployment started and failed before all workers finished installation.

Expected results:
Wait for all workers to be installed before deploying OLM operators.
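
The expected behavior can be sketched as a simple readiness check. This is only an illustration, not the assisted-service implementation or the actual fix: the `Host` class, the `ready_for_operators` function, and the `min_workers` parameter are hypothetical names, and the status strings are only modeled on the ones quoted in this report.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical host model used only for this illustration.
    name: str
    role: str    # "master" or "worker"
    status: str  # e.g. "installed", "installing-in-progress", "error"


def ready_for_operators(hosts, min_workers):
    """Return True only if enough workers exist and all of them are installed."""
    workers = [h for h in hosts if h.role == "worker"]
    return len(workers) >= min_workers and all(
        h.status == "installed" for h in workers
    )


# State resembling the one in this report: worker-0-1 is still installing.
hosts = [
    Host("master-0-0", "master", "installed"),
    Host("worker-0-0", "worker", "installed"),
    Host("worker-0-1", "worker", "installing-in-progress"),
    Host("worker-0-2", "worker", "installed"),
]

# OCS needs 3 workers, so operator deployment should not start yet.
print(ready_for_operators(hosts, min_workers=3))
```

With a check like this, operator deployment would be held back until worker-0-1 either finishes installing or the cluster is judged against the minimum worker count.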

Additional info:
1. No insight into why this worker failed to deploy.
2. It seems the console and CVO operators started before the entire cluster was done. Roughly 2 minutes after they were installed successfully, OCS and CNV failed:

CVO status_updated_at: 2021-05-26T13:32:08.455Z
OCS status_updated_at: 2021-05-26T13:34:40.289Z
CNV status_updated_at: 2021-05-26T13:34:40.041Z


5/26/2021, 5:16:42 PM	error Host worker-0-1: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)
5/26/2021, 4:34:47 PM	Successfully finished installing cluster edge34-cluster-cnv-ocs-0
5/26/2021, 4:32:08 PM	Cluster version status: available message: Done applying 4.7.9
5/26/2021, 4:24:08 PM	Cluster version status: progressing message: Unable to apply 4.7.9: the cluster operator authentication has not yet successfully rolled out
5/26/2021, 4:21:08 PM	Cluster version status: progressing message: Unable to apply 4.7.9: some cluster operators have not yet rolled out
5/26/2021, 4:19:47 PM	Updated status of cluster edge34-cluster-cnv-ocs-0 to finalizing
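
The gap between the CVO status update and the OCS and CNV status updates listed above can be checked with a short script. This is a minimal sketch: the timestamps are copied from the status_updated_at values above, while `FMT` and `gap_seconds` are names made up for the illustration.

```python
from datetime import datetime

# Format of the status_updated_at values quoted above (UTC, millisecond precision).
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"

CVO = "2021-05-26T13:32:08.455Z"
OCS = "2021-05-26T13:34:40.289Z"
CNV = "2021-05-26T13:34:40.041Z"


def gap_seconds(earlier, later):
    """Seconds elapsed between two timestamps in the format above."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds()


for name, ts in (("OCS", OCS), ("CNV", CNV)):
    print(f"{name} status updated {gap_seconds(CVO, ts):.1f}s after CVO")
```

Both gaps come out at roughly two and a half minutes, which matches the observation that OCS and CNV failed shortly after CVO reported success.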

Comment 1 Lital Alon 2021-05-26 19:45:26 UTC
Created attachment 1787360 [details]
cluster_logs

Comment 2 Lital Alon 2021-05-26 20:02:47 UTC
Created attachment 1787373 [details]
must-gather

Comment 3 Piotr Kliczewski 2021-05-31 13:12:12 UTC
This bug should be fixed by changes done for https://issues.redhat.com/browse/MGMT-4668

Comment 4 Piotr Kliczewski 2021-06-07 09:23:45 UTC
The epic mentioned in comment #3 is Done. Please retest to make sure it works now.

Comment 5 Lital Alon 2021-06-07 10:47:34 UTC
Will be verified. Please move the bug to ON_QA and add fix_in_version.

Comment 6 Lital Alon 2021-06-08 09:50:55 UTC
Verified in Staging, v1.0.21.3

Comment 9 errata-xmlrpc 2021-07-27 23:10:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438