Bug 1945619 - [Assisted-4.7][Staging] Cluster deployment failed Reason: Timeout while waiting for cluster version to be available
Summary: [Assisted-4.7][Staging] Cluster deployment failed Reason: Timeout while waiting for cluster version to be available
Keywords:
Status: CLOSED DUPLICATE of bug 1957015
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: aos-network-edge-staff
QA Contact: Hongan Li
URL:
Whiteboard: AI-Team-Core
Depends On:
Blocks:
 
Reported: 2021-04-01 12:56 UTC by Yuri Obshansky
Modified: 2022-08-04 22:32 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-05 22:10:03 UTC
Target Upstream Version:
Embargoed:


Attachments
installation logs (79.00 KB, application/x-tar)
2021-04-01 12:56 UTC, Yuri Obshansky
must-gather (12.58 MB, application/gzip)
2021-04-01 12:57 UTC, Yuri Obshansky

Description Yuri Obshansky 2021-04-01 12:56:29 UTC
Created attachment 1768248
installation logs

Description of problem:
The issue was detected during a performance test run against the Staging service: 6 cluster deployments failed.

3/31/2021, 10:13:05 PM	error Host worker-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-2: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host worker-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:10:58 PM	critical Failed installing cluster ocp-cluster-f34-h18-2. Reason: Timeout while waiting for cluster version to be available
3/31/2021, 9:10:58 PM	Update cluster installation progress: Cluster version is available: false , message: Unable to apply 4.7.2: the cluster operator console has not yet successfully rolled out
3/31/2021, 9:07:58 PM	Update cluster installation progress: Cluster version is available: false , message: Working towards 4.7.2: 654 of 668 done (97% complete)
3/31/2021, 9:01:59 PM	Update cluster installation progress: Cluster version is available: false , message: Unable to apply 4.7.2: some cluster operators have not yet rolled out
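
For context, the CVO messages above mean at least one ClusterOperator never reported Available=True. A minimal sketch of how the stuck operators could be listed from the failed cluster, assuming the kubernetes Python client and a working kubeconfig (ClusterOperator and its Available/Progressing/Degraded conditions are the standard config.openshift.io/v1 API; the script itself is illustrative, not part of this bug):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# ClusterOperator is a cluster-scoped OpenShift custom resource.
cos = api.list_cluster_custom_object("config.openshift.io", "v1", "clusteroperators")
for co in cos["items"]:
    name = co["metadata"]["name"]
    # Each operator reports Available/Progressing/Degraded status conditions.
    conds = {c["type"]: c["status"] for c in co.get("status", {}).get("conditions", [])}
    if conds.get("Available") != "True":
        print(f"{name}: Available={conds.get('Available')} Degraded={conds.get('Degraded')}")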

Version-Release number of selected component (if applicable):
v1.0.18.1

How reproducible:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/5e46337e-d0c3-4f11-8600-45e8f41671d3

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yuri Obshansky 2021-04-01 12:57:15 UTC
Created attachment 1768249
must-gather

Comment 2 Igal Tsoiref 2021-05-02 09:02:29 UTC
Console failed to contact OAuth:

2021-04-01T02:08:33.467259692Z E0401 02:08:33.467186       1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com": dial tcp 192.168.125.10:443: connect: connection refused

Didn't find anything specific to point to.
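
A minimal sketch of the probe the console is performing, assuming direct network access to the cluster; the hostname and port are taken from the log line above, and the script is illustrative only:

import socket

# Hostname from the console error above; it resolves to the ingress VIP.
host = "oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com"
addr = socket.gethostbyname(host)
print(f"{host} -> {addr}")
try:
    # The console dials <VIP>:443; with this bug the connect is refused.
    with socket.create_connection((addr, 443), timeout=5):
        print("TCP connect to 443 succeeded")
except ConnectionRefusedError:
    # Matches: dial tcp 192.168.125.10:443: connect: connection refused
    print("connection refused -- nothing is listening on 443 at that address")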

Comment 3 Igal Tsoiref 2021-05-02 09:03:09 UTC
@slaznick maybe you have an idea why we got connection refused?

Comment 4 Standa Laznicka 2021-05-03 07:07:14 UTC
The authentication operator is reporting healthy and the pods are running, which means a route exists through which the connection can succeed. The console most probably got routed improperly; I'd propose starting by checking that DNS is correct.
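
A minimal sketch of the DNS check proposed here: confirm the route hostnames resolve to the expected ingress VIP. The VIP value is taken from the dial error in comment 2; the console route hostname is the standard OpenShift one and is an assumption, not taken from this bug's logs:

import socket

EXPECTED_INGRESS_VIP = "192.168.125.10"  # from the dial error in comment 2
for host in (
    # Standard route hostnames under this cluster's *.apps domain (assumed).
    "console-openshift-console.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com",
    "oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com",
):
    addr = socket.gethostbyname(host)
    status = "ok" if addr == EXPECTED_INGRESS_VIP else "MISMATCH"
    print(f"{host} -> {addr} [{status}]")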

Comment 5 Igal Tsoiref 2021-05-04 06:41:55 UTC
@bnemec could this be a problem with the ingress IP not being freed from the first master?
It looks like the case, to tell the truth, but I am not 100% sure.
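
One way to check this theory, sketched with the kubernetes Python client: on baremetal/assisted clusters the ingress VIP is held by keepalived, so listing the keepalived pods shows which nodes can currently own it. The openshift-kni-infra namespace is the usual location for these pods and is assumed here, not taken from this bug's logs:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# keepalived pods manage the API and ingress VIPs on baremetal clusters.
for pod in v1.list_namespaced_pod("openshift-kni-infra").items:
    if "keepalived" in pod.metadata.name:
        print(f"{pod.metadata.name} on node {pod.spec.node_name} ({pod.status.phase})")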

Comment 6 Standa Laznicka 2021-05-04 08:24:53 UTC
Moving this to routing, as they might be more helpful than me when it comes to ingresses/DNS. This seems like a platform-dependent bug, though.

Comment 7 Stephen Greene 2021-05-04 14:36:58 UTC
Please include a must-gather as well as more details about the cluster's platform. Has this been reproduced on a later 4.7.z?

Comment 8 Ben Nemec 2021-05-04 21:00:35 UTC
This does sound like https://bugzilla.redhat.com/show_bug.cgi?id=1931505. We haven't been seeing that on 4.7, but I'm not aware of any reason it couldn't happen. We had already planned to backport that fix to 4.7 anyway, so I'm going to mark this as a duplicate. Feel free to reopen if it continues to happen after the backport merges.

*** This bug has been marked as a duplicate of bug 1957015 ***

Comment 10 Omri Hochman 2021-05-05 22:10:03 UTC

*** This bug has been marked as a duplicate of bug 1957015 ***

