Bug 1935159 - [Assisted-4.7][Staging] Cluster deployment failed Reason: Timeout while waiting for console to become available
Keywords:
Status: CLOSED DUPLICATE of bug 1931505
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Importance: high urgent
Target Milestone: ---
Target Release: ---
Assignee: Igal Tsoiref
QA Contact: Udi Kalifon
URL:
Whiteboard: AI-Team-Core
Depends On:
Blocks:
 
Reported: 2021-03-04 13:17 UTC by Yuri Obshansky
Modified: 2021-04-27 14:01 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-27 14:01:47 UTC
Target Upstream Version:
Embargoed:


Attachments
installation logs (69.00 KB, application/x-tar), 2021-03-04 13:17 UTC, Yuri Obshansky
must-gather (12.57 MB, application/gzip), 2021-03-11 20:57 UTC, Yuri Obshansky
sos-report master 0 (3.18 MB, application/x-xz), 2021-03-11 20:58 UTC, Yuri Obshansky
sos-report master 1 (3.89 MB, application/x-xz), 2021-03-11 20:59 UTC, Yuri Obshansky
sos-report master 2 (7.17 MB, application/x-xz), 2021-03-11 20:59 UTC, Yuri Obshansky
sos-report worker 0 (2.54 MB, application/x-xz), 2021-03-11 21:00 UTC, Yuri Obshansky
sos-report worker 1 (2.29 MB, application/x-xz), 2021-03-11 21:00 UTC, Yuri Obshansky
installation logs (12.81 MB, application/x-tar), 2021-03-11 21:01 UTC, Yuri Obshansky
NEW must-gather (12.83 MB, application/gzip), 2021-03-24 11:39 UTC, Yuri Obshansky
NEW sos-report master 0 (4.69 MB, application/x-xz), 2021-03-24 11:40 UTC, Yuri Obshansky
NEW sos-report master 1 (5.29 MB, application/x-xz), 2021-03-24 11:42 UTC, Yuri Obshansky
NEW sos-report master 2 (6.60 MB, application/x-xz), 2021-03-24 11:42 UTC, Yuri Obshansky
NEW installation logs (13.04 MB, application/x-tar), 2021-03-24 11:43 UTC, Yuri Obshansky

Description Yuri Obshansky 2021-03-04 13:17:53 UTC
Created attachment 1760681 [details]
installation logs

Description of problem:
Assisted Service on the Staging environment.
Cluster Events:
3/3/2021, 6:04:35 PM  error     Host worker-0-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/3/2021, 6:04:35 PM  error     Host master-0-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/3/2021, 6:04:35 PM  error     Host master-0-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/3/2021, 6:04:35 PM  error     Host worker-0-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/3/2021, 6:04:35 PM  error     Host master-0-2: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/3/2021, 6:03:09 PM  critical  Failed installing cluster ocp-cluster-f13-h05-0. Reason: Timeout while waiting for console to become available
3/3/2021, 4:52:07 PM  Updated status of cluster ocp-cluster-f13-h05-0 to finalizing
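
For reference, "Timeout while waiting for console to become available" means the console ClusterOperator never reported Available=True within the installer's deadline. A minimal way to inspect it once the cluster API is reachable (a sketch, assuming cluster-admin credentials):

  # The assisted installer waits on this operator during the finalizing stage
  oc get clusteroperator console

  # The status conditions explain why it is not Available yet
  oc get clusteroperator console -o yaml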

Version-Release number of selected component (if applicable):
v1.0.17.1

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yuri Obshansky 2021-03-11 20:56:43 UTC
@itsoiref
@ronnie.lazar

Another failed cluster deployment
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/64c67213-46ea-4dcb-b675-8fd6faed3148
with the same error
3/11/2021, 2:30:53 PM	error Host master-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/11/2021, 2:30:52 PM	error Host master-2-2: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/11/2021, 2:30:52 PM	error Host master-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/11/2021, 2:30:52 PM	error Host worker-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/11/2021, 2:30:52 PM	error Host worker-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/11/2021, 2:30:48 PM	critical Failed installing cluster ocp-cluster-f13-h06-2. Reason: Timeout while waiting for console to become available

I was able to run must-gather and sos reports for the nodes this time.
See attachments.
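
For reference, a minimal sketch of how these artifacts are typically collected (assumes cluster-admin access; the node name is a placeholder, and older sos versions use `sosreport` instead of `sos report`):

  # Cluster-level debug archive
  oc adm must-gather --dest-dir=./must-gather

  # sos report from an RHCOS node, via a debug shell
  oc debug node/<node-name>
  chroot /host
  toolbox
  sos report --batch --all-logs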

Comment 2 Yuri Obshansky 2021-03-11 20:57:26 UTC
Created attachment 1762827 [details]
must-gather

Comment 3 Yuri Obshansky 2021-03-11 20:58:33 UTC
Created attachment 1762829 [details]
sos-report master 0

Comment 4 Yuri Obshansky 2021-03-11 20:59:03 UTC
Created attachment 1762830 [details]
sos-report master 1

Comment 5 Yuri Obshansky 2021-03-11 20:59:39 UTC
Created attachment 1762831 [details]
sos-report master 2

Comment 6 Yuri Obshansky 2021-03-11 21:00:13 UTC
Created attachment 1762832 [details]
sos-report worker 0

Comment 7 Yuri Obshansky 2021-03-11 21:00:46 UTC
Created attachment 1762833 [details]
sos-report worker 1

Comment 8 Yuri Obshansky 2021-03-11 21:01:34 UTC
Created attachment 1762835 [details]
installation logs

Comment 11 Igal Tsoiref 2021-03-21 11:56:56 UTC
@yobshans these sos reports are from another installation; the VMs in them have a different IP. Is there a way to get sos reports from this run?

Comment 12 Igal Tsoiref 2021-03-21 12:25:46 UTC
On master-2-0 the console pod didn't start because of:
          E0311 19:31:38.279726       1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp-cluster-f13-h06-2.rdu2.scalelab.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.ocp-cluster-f13-h06-2.rdu2.scalelab.redhat.com": dial tcp 192.168.125.10:443: connect: connection refused

On the two other masters everything looks OK.
I didn't find any errors relevant to this.
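
A minimal triage sketch for this symptom, using only names that appear in the log above (assumes cluster-admin access):

  # Console pod status and logs
  oc -n openshift-console get pods -o wide
  oc -n openshift-console logs deployment/console

  # Probe the OAuth route the console failed to reach; the host
  # resolves to the ingress VIP (192.168.125.10 in this report)
  curl -kv https://oauth-openshift.apps.ocp-cluster-f13-h06-2.rdu2.scalelab.redhat.com/healthz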

Comment 14 Igal Tsoiref 2021-03-21 14:56:43 UTC
192.168.125.10 is the ingress VIP.

Console is stuck in Progressing:
{Progressing True 2021-03-11 18:24:32 +0000 UTC SyncLoopRefresh_InProgress SyncLoopRefreshProgressing: Working toward version 4.7.0}
{Available False 2021-03-11 18:14:39 +0000 UTC Deployment_FailedUpdate DeploymentAvailable: 2 replicas ready at version 4.7.0}
{Upgradeable True 2021-03-11 18:10:49 +0000 UTC AsExpected All is well}

In the deployment we can see 2 Ready replicas out of 2. So why is there a third replica? A rollout?
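
One way to check whether a rollout is in flight (a sketch; `openshift-console`/`console` are the standard namespace and deployment names):

  # Desired vs. updated vs. available replica counts, plus ReplicaSet history
  oc -n openshift-console get deployment console
  oc -n openshift-console get replicasets

  # Waits for the rollout to finish, or shows where it is stuck
  oc -n openshift-console rollout status deployment/console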

Comment 15 Yuri Obshansky 2021-03-24 11:38:00 UTC
@itsoiref
@ronnie.lazar

I'm adding new logs from the last failure 
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/59167322-0675-41ff-b60b-4a8f7bbcef1d
The environment is still running. Ping me on Slack to get access.

Comment 16 Yuri Obshansky 2021-03-24 11:39:46 UTC
Created attachment 1765899 [details]
NEW must-gather

Comment 17 Yuri Obshansky 2021-03-24 11:40:19 UTC
Created attachment 1765900 [details]
NEW sos-report master 0

Comment 18 Yuri Obshansky 2021-03-24 11:42:14 UTC
Created attachment 1765901 [details]
NEW sos-report master 1

Comment 19 Yuri Obshansky 2021-03-24 11:42:47 UTC
Created attachment 1765902 [details]
NEW sos-report master 2

Comment 20 Yuri Obshansky 2021-03-24 11:43:38 UTC
Created attachment 1765903 [details]
NEW installation logs

Comment 21 Igal Tsoiref 2021-03-29 12:04:55 UTC
To tell the truth, it looks like some environment issue. Everything appears up and running, but the console and the ingress operator's canary check both have a problem: they get "connection refused" when trying to reach ingress_vip:443. It sounds like something is dropping their calls, or the VIP is somehow not configured on the worker (though it looks like it is), or there is an external LB doing something wrong.
@yobshans I would be very glad to connect to the setup if possible.
Regarding sos reports: it would be nice to use the variant that includes the networking commands, and sos reports from the workers are nice to have too, since ingress always runs on a worker.
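
One way to narrow this down, sketched under the assumption that the ingress VIP (192.168.125.10, per comment 14) is held by keepalived on one of the workers:

  # Which node currently holds the VIP?
  for n in $(oc get nodes -o name); do
    oc debug "$n" -- chroot /host ip -4 addr show 2>/dev/null | grep -q 192.168.125.10 && echo "$n holds the VIP"
  done

  # Probe the VIP directly; "connection refused" here reproduces the symptom
  curl -kv --connect-timeout 5 https://192.168.125.10:443

  # Where are the router pods actually running?
  oc -n openshift-ingress get pods -o wide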

Comment 23 Igal Tsoiref 2021-03-30 06:58:15 UTC
We found the reason for the current failure: it is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1931505

Comment 24 Igal Tsoiref 2021-04-27 14:01:47 UTC

*** This bug has been marked as a duplicate of bug 1931505 ***

