Bug 1919040 - [Assisted][Staging] Cluster deployment failed by Reason: Timeout while waiting for cluster version to be available [NEEDINFO]
Summary: [Assisted][Staging] Cluster deployment failed by Reason: Timeout while waitin...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Igal Tsoiref
QA Contact: Udi Kalifon
URL:
Whiteboard:
Duplicates: 1889813
Depends On:
Blocks:
 
Reported: 2021-01-22 00:48 UTC by Yuri Obshansky
Modified: 2021-02-16 19:19 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-16 19:19:14 UTC
Target Upstream Version:
ercohen: needinfo? (itsoiref)


Attachments
installation logs (95.50 KB, application/x-tar)
2021-01-22 00:48 UTC, Yuri Obshansky
installation logs (85.50 KB, application/x-tar)
2021-01-22 13:16 UTC, Yuri Obshansky

Description Yuri Obshansky 2021-01-22 00:48:30 UTC
Created attachment 1749596
installation logs

Description of problem:
Deployment of a cluster with the OCP 4.7 image failed. Cluster events:
1/21/2021, 7:40:06 PM	error Host worker-0-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
1/21/2021, 7:40:05 PM	error Host worker-0-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
1/21/2021, 7:40:05 PM	error Host master-0-2: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
1/21/2021, 7:40:05 PM	error Host master-0-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
1/21/2021, 7:40:05 PM	error Host master-0-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
1/21/2021, 7:39:16 PM	critical Failed installing cluster ocp-cluster-f20-h22-0. Reason: Timeout while waiting for cluster version to be available
1/21/2021, 7:38:16 PM	Update cluster installation progress: Cluster version is available: false , message: Unable to apply 4.7.0-fc.0: the cluster operator console is degraded

Version-Release number of selected component (if applicable):
v1.0.15.1
Assisted-ui-lib version:  1.5.4

How reproducible:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/f584f16f-0199-48ed-9a50-1116b5a71c41
user:nshidlin-aiqe1-u1
password:L7uzs7oUcRJ/SgY4qi9Aupk7u425cFa2

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yuri Obshansky 2021-01-22 13:16:01 UTC
Created attachment 1749775
installation logs

Comment 2 Yuri Obshansky 2021-01-22 13:16:14 UTC
Reproduced with the OCP 4.6 image as well:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/fd380a0c-3a14-4b2a-94b8-7ab0e899ed46
Log files attached.

Comment 3 Ronnie Lazar 2021-01-24 17:02:31 UTC
yobshans@redhat.com Did this happen only on the scale test machines, or also in non-scale scenarios?

Comment 5 Igal Tsoiref 2021-01-24 17:28:29 UTC
The current timeout is 2 hours. It looks like the installation simply takes much longer than that because of a lack of resources.
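
For readers unfamiliar with this kind of failure, the following is a minimal sketch of a poll-until-deadline loop of the sort that produces a "Timeout while waiting for ..." error. It is not the assisted installer's actual code; `wait_until`, its parameters, and the 60-second interval are hypothetical and chosen only for illustration. The 2-hour figure is the one mentioned in this comment.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or timeout_s seconds have elapsed.

    Returns True if the condition was met in time, False on timeout.
    clock and sleep are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical usage: a 2-hour timeout, polled once a minute.
# cluster_version_available is a stand-in for a real status query.
# ok = wait_until(cluster_version_available, timeout_s=2 * 60 * 60, interval_s=60)
```

On a resource-starved host the condition may eventually become true, but only after the deadline, which is consistent with the explanation given in this comment.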

Comment 6 Omri Hochman 2021-01-25 14:14:54 UTC
(In reply to Igal Tsoiref from comment #5)
> current timeout is 2 hours and looks due to lack of resources it just takes
> much more

We are running the tests in a virtual environment according to the minimum requirements specified:
Boot the Discovery ISO on hardware that should become part of this bare metal cluster. Hosts connected to the internet will be inspected and automatically appear below. Three master hosts are required with at least 4 CPU cores, 16 GB of RAM, and 20 GB of filesystem storage each. Two or more additional worker hosts are recommended with at least 2 CPU cores, 8 GB of RAM, and 20GB of filesystem storage each.
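
The minimum requirements quoted above can be written down as a simple per-VM check. This is only a sketch: `HostSpec`, `MINIMUMS`, and `shortfalls` are hypothetical names introduced for illustration and are not part of the assisted installer. The numbers are taken from the quoted text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HostSpec:
    cpu_cores: int
    ram_gb: int
    disk_gb: int


# Minimums as stated in the Discovery ISO text quoted in this comment.
MINIMUMS = {
    "master": HostSpec(cpu_cores=4, ram_gb=16, disk_gb=20),
    "worker": HostSpec(cpu_cores=2, ram_gb=8, disk_gb=20),
}


def shortfalls(role: str, host: HostSpec) -> List[str]:
    """Return a description of every resource where host is below the minimum for role."""
    minimum = MINIMUMS[role]
    problems = []
    for field in ("cpu_cores", "ram_gb", "disk_gb"):
        have, need = getattr(host, field), getattr(minimum, field)
        if have < need:
            problems.append(f"{field}: have {have}, need at least {need}")
    return problems
```

A check like this only looks at what each VM was configured with. It says nothing about whether the physical host can actually deliver those resources to all VMs at once.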

Comment 7 Igal Tsoiref 2021-01-25 19:03:55 UTC
@ohochman@redhat.com The problem is mainly not the VMs but the host those VMs are running on. If you set 4 vCPUs per VM but the host has only 8 CPUs, then under high load some VMs will not get any CPU time at all.
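
The oversubscription described here can be made concrete with a small calculation. This is an illustrative sketch only: `vcpu_overcommit_ratio` is a hypothetical helper, and the VM sizes below are an assumption (three masters at 4 vCPUs and two workers at 2 vCPUs, matching the documented minimums), since the actual VM sizes and host CPU count are not stated in this bug.

```python
from typing import Sequence


def vcpu_overcommit_ratio(vm_vcpus: Sequence[int], host_cpus: int) -> float:
    """Ratio of vCPUs allocated across all VMs to physical CPUs on the host.

    A value above 1.0 means the VMs cannot all run at full load simultaneously.
    """
    if host_cpus <= 0:
        raise ValueError("host_cpus must be positive")
    return sum(vm_vcpus) / host_cpus


# Assumed layout: 3 masters x 4 vCPUs + 2 workers x 2 vCPUs on an 8-CPU host.
vms = [4, 4, 4, 2, 2]
ratio = vcpu_overcommit_ratio(vms, host_cpus=8)
print(f"{sum(vms)} vCPUs on 8 host CPUs -> overcommit ratio {ratio:.1f}")
```

Under these assumptions the host is committed at twice its physical CPU count, so each VM can meet the per-VM minimum on paper while still being starved during the installation.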

Comment 8 Omri Hochman 2021-01-26 15:21:27 UTC
Notes:
- The issue reproduced with a single cluster deployed on the same physical host.
- QE should attempt to reproduce the issue without CVO, as it has been disabled.

Maybe we need to adjust the requirements; this should be discussed with PM.

Comment 9 Eran Cohen 2021-01-27 08:23:11 UTC
As @itsoiref@redhat.com said, the issue isn't with the VM specs; it is with the host running the VMs.

Comment 10 Igal Tsoiref 2021-01-31 20:33:56 UTC
*** Bug 1889813 has been marked as a duplicate of this bug. ***

