Bug 1965092 - [Assisted-4.7] [Staging][OLM] Operators deployments start before all workers finished installation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Piotr Kliczewski
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Projects AI-Cloud
Depends On:
Blocks:
Reported: 2021-05-26 19:44 UTC by Lital Alon
Modified: 2021-07-27 23:10 UTC

Fixed In Version: v1.0.21.3
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:10:12 UTC
Target Upstream Version:
Embargoed:


Attachments
example (137.64 KB, image/png)
2021-05-26 19:44 UTC, Lital Alon
cluster_logs (126.00 KB, application/x-tar)
2021-05-26 19:45 UTC, Lital Alon
must-gather (19.25 MB, application/gzip)
2021-05-26 20:02 UTC, Lital Alon


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2021:2438 | 0 | None | None | None | 2021-07-27 23:10:40 UTC

Description Lital Alon 2021-05-26 19:44:50 UTC
Created attachment 1787358 [details]
example

Description of problem:
I was installing a cluster with 3 masters and 3 workers, and I chose to deploy the OCS and CNV operators.
Installation finished on all nodes except worker-0-1, which hung at stage 7/8.

While worker-0-1 was still installing, both the OCS and CNV operator deployments failed, meaning the operator deployments started before all workers had fully joined the cluster.
OCS requires 3 worker nodes, so all workers must be available before operators are deployed.

The must-gather and cluster logs are attached.

Version-Release number of selected component (if applicable):
Staging v1.0.20.3
OCS 4.7

Steps to Reproduce:
1. Install a cluster with 3 masters and 3 workers, with the OCS and CNV operators selected.
2. During installation, simulate a failure on one worker node (e.g. kill the installer).
3. Wait for the cluster installation to complete.


Actual results:
Operator deployment started and failed before all workers had finished installing.

Expected results:
The installer should wait for all workers to be installed before deploying OLM operators.
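
As a rough illustration of the expected behavior, the gate could look like the Python sketch below. This is not the assisted-installer's actual code. The `Host` record, the `workers_ready_for_operators` function, the `min_workers` parameter, and the "installed" status string are all assumptions made for this example (only "installing-in-progress" and "error" appear in the events quoted in this report).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Host:
    """Hypothetical host record, for illustration only."""
    name: str
    role: str    # "master" or "worker"
    status: str  # e.g. "installing-in-progress", "installed" (assumed), "error"


def workers_ready_for_operators(hosts: Iterable[Host], min_workers: int = 3) -> bool:
    """Return True only when at least min_workers workers exist
    and every one of them has finished installing."""
    workers = [h for h in hosts if h.role == "worker"]
    installed = [h for h in workers if h.status == "installed"]
    return len(workers) >= min_workers and len(installed) == len(workers)


# Scenario from this report: worker-0-1 is still installing.
hosts = [
    Host("master-0-0", "master", "installed"),
    Host("master-0-1", "master", "installed"),
    Host("master-0-2", "master", "installed"),
    Host("worker-0-0", "worker", "installed"),
    Host("worker-0-1", "worker", "installing-in-progress"),
    Host("worker-0-2", "worker", "installed"),
]
print(workers_ready_for_operators(hosts))
```

With worker-0-1 still in "installing-in-progress", the gate returns False, so operator deployment would be held back until that worker finishes or fails.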

Additional info:
1. There is no insight into why this worker failed to deploy.
2. It seems the console and CVO operators kicked off before the entire cluster was done. About 2 minutes after they were installed successfully, OCS and CNV failed:

CVO status_updated_at: 2021-05-26T13:32:08.455Z
OCS status_updated_at: 2021-05-26T13:34:40.289Z
CNV status_updated_at: 2021-05-26T13:34:40.041Z
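
The gap between the CVO and operator status updates can be checked with a short Python sketch. The `parse_ts` and `gap_seconds` helper names are made up for this example; the timestamps are the three values listed above.

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp such as 2021-05-26T13:32:08.455Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def gap_seconds(earlier: str, later: str) -> float:
    """Seconds elapsed between two timestamps."""
    return (parse_ts(later) - parse_ts(earlier)).total_seconds()


cvo = "2021-05-26T13:32:08.455Z"
ocs = "2021-05-26T13:34:40.289Z"
cnv = "2021-05-26T13:34:40.041Z"

print(f"OCS status updated {gap_seconds(cvo, ocs):.3f}s after the CVO status update")
print(f"CNV status updated {gap_seconds(cvo, cnv):.3f}s after the CVO status update")
```

Both gaps come out at roughly 152 seconds, i.e. about two and a half minutes after the CVO update.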


5/26/2021, 5:16:42 PM	error Host worker-0-1: updated status from "installing-in-progress" to "error" (Host failed to install because its installation stage Joined took longer than expected 1h0m0s)
5/26/2021, 4:34:47 PM	Successfully finished installing cluster edge34-cluster-cnv-ocs-0
5/26/2021, 4:32:08 PM	Cluster version status: available message: Done applying 4.7.9
5/26/2021, 4:24:08 PM	Cluster version status: progressing message: Unable to apply 4.7.9: the cluster operator authentication has not yet successfully rolled out
5/26/2021, 4:21:08 PM	Cluster version status: progressing message: Unable to apply 4.7.9: some cluster operators have not yet rolled out
5/26/2021, 4:19:47 PM	Updated status of cluster edge34-cluster-cnv-ocs-0 to finalizing

Comment 1 Lital Alon 2021-05-26 19:45:26 UTC
Created attachment 1787360 [details]
cluster_logs

Comment 2 Lital Alon 2021-05-26 20:02:47 UTC
Created attachment 1787373 [details]
must-gather

Comment 3 Piotr Kliczewski 2021-05-31 13:12:12 UTC
This bug should be fixed by changes done for https://issues.redhat.com/browse/MGMT-4668

Comment 4 Piotr Kliczewski 2021-06-07 09:23:45 UTC
The epic mentioned in comment #3 is Done. Please retest to make sure it works now.

Comment 5 Lital Alon 2021-06-07 10:47:34 UTC
Will be verified. Please move the bug to ON_QA and add fix_in_version.

Comment 6 Lital Alon 2021-06-08 09:50:55 UTC
Verified in Staging, v1.0.21.3

Comment 9 errata-xmlrpc 2021-07-27 23:10:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

