Bug 1945619 - [Assisted-4.7][Staging] Cluster deployment failed Reason: Timeout while waiting for cluster version to be available
Summary: [Assisted-4.7][Staging] Cluster deployment failed Reason: Timeout while waiting for cluster version to be available
Keywords:
Status: CLOSED DUPLICATE of bug 1957015
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: aos-network-edge-staff
QA Contact: Hongan Li
URL:
Whiteboard: AI-Team-Core
Depends On:
Blocks:
 
Reported: 2021-04-01 12:56 UTC by Yuri Obshansky
Modified: 2022-08-04 22:32 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-05 22:10:03 UTC
Target Upstream Version:
Embargoed:


Attachments
installation logs (79.00 KB, application/x-tar)
2021-04-01 12:56 UTC, Yuri Obshansky
must-gather (12.58 MB, application/gzip)
2021-04-01 12:57 UTC, Yuri Obshansky

Description Yuri Obshansky 2021-04-01 12:56:29 UTC
Created attachment 1768248
installation logs

Description of problem:
The issue was detected during a performance test run against the Staging service: 6 cluster deployments failed.

3/31/2021, 10:13:05 PM	error Host worker-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-2: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host worker-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:10:58 PM	critical Failed installing cluster ocp-cluster-f34-h18-2. Reason: Timeout while waiting for cluster version to be available
3/31/2021, 9:10:58 PM	Update cluster installation progress: Cluster version is available: false , message: Unable to apply 4.7.2: the cluster operator console has not yet successfully rolled out
3/31/2021, 9:07:58 PM	Update cluster installation progress: Cluster version is available: false , message: Working towards 4.7.2: 654 of 668 done (97% complete)
3/31/2021, 9:01:59 PM	Update cluster installation progress: Cluster version is available: false , message: Unable to apply 4.7.2: some cluster operators have not yet rolled out
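
For context, the CVO messages above mean at least one ClusterOperator never reported Available=True. A minimal sketch of how the stuck operators could be listed from the failed cluster, assuming the kubernetes Python client and a working kubeconfig (ClusterOperator and its Available/Progressing/Degraded conditions are the standard config.openshift.io/v1 API; the script itself is illustrative, not part of this bug):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# ClusterOperator is a cluster-scoped OpenShift custom resource.
cos = api.list_cluster_custom_object("config.openshift.io", "v1", "clusteroperators")
for co in cos["items"]:
    name = co["metadata"]["name"]
    # Each operator reports Available/Progressing/Degraded status conditions.
    conds = {c["type"]: c["status"] for c in co.get("status", {}).get("conditions", [])}
    if conds.get("Available") != "True":
        print(f"{name}: Available={conds.get('Available')} Degraded={conds.get('Degraded')}")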

Version-Release number of selected component (if applicable):
v1.0.18.1

How reproducible:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/5e46337e-d0c3-4f11-8600-45e8f41671d3

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yuri Obshansky 2021-04-01 12:57:15 UTC
Created attachment 1768249
must-gather

Comment 2 Igal Tsoiref 2021-05-02 09:02:29 UTC
Console failed to contact OAuth:

2021-04-01T02:08:33.467259692Z E0401 02:08:33.467186       1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com": dial tcp 192.168.125.10:443: connect: connection refused

Didn't find anything specific to point to.
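
A minimal sketch of the probe the console is performing, assuming direct network access to the cluster; the hostname and port are taken from the log line above, and the script is illustrative only:

import socket

# Hostname from the console error above; it resolves to the ingress VIP.
host = "oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com"
addr = socket.gethostbyname(host)
print(f"{host} -> {addr}")
try:
    # The console dials <VIP>:443; with this bug the connect is refused.
    with socket.create_connection((addr, 443), timeout=5):
        print("TCP connect to 443 succeeded")
except ConnectionRefusedError:
    # Matches: dial tcp 192.168.125.10:443: connect: connection refused
    print("connection refused -- nothing is listening on 443 at that address")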

Comment 3 Igal Tsoiref 2021-05-02 09:03:09 UTC
@slaznick maybe you have an idea why we got connection refused?

Comment 4 Standa Laznicka 2021-05-03 07:07:14 UTC
The authentication operator is reporting healthy and the pods are running, which means a route exists through which the connection can succeed. The console most probably got routed improperly; I'd propose starting by checking that DNS is correct.
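
A minimal sketch of the DNS check proposed here: confirm the route hostnames resolve to the expected ingress VIP. The VIP value is taken from the dial error in comment 2; the console route hostname is the standard OpenShift one and is an assumption, not taken from this bug's logs:

import socket

EXPECTED_INGRESS_VIP = "192.168.125.10"  # from the dial error in comment 2
for host in (
    # Standard route hostnames under this cluster's *.apps domain (assumed).
    "console-openshift-console.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com",
    "oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com",
):
    addr = socket.gethostbyname(host)
    status = "ok" if addr == EXPECTED_INGRESS_VIP else "MISMATCH"
    print(f"{host} -> {addr} [{status}]")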

Comment 5 Igal Tsoiref 2021-05-04 06:41:55 UTC
@bnemec could this be a problem with the ingress IP not being freed from the first master?
It looks like the case, to tell the truth, but I am not 100% sure.
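
One way to check this theory, sketched with the kubernetes Python client: on baremetal/assisted clusters the ingress VIP is held by keepalived, so listing the keepalived pods shows which nodes can currently own it. The openshift-kni-infra namespace is the usual location for these pods and is assumed here, not taken from this bug's logs:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# keepalived pods manage the API and ingress VIPs on baremetal clusters.
for pod in v1.list_namespaced_pod("openshift-kni-infra").items:
    if "keepalived" in pod.metadata.name:
        print(f"{pod.metadata.name} on node {pod.spec.node_name} ({pod.status.phase})")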

Comment 6 Standa Laznicka 2021-05-04 08:24:53 UTC
Moving this to routing, as they might be more helpful than me when it comes to ingresses/DNS. This seems like a platform-dependent bug, though.

Comment 7 Stephen Greene 2021-05-04 14:36:58 UTC
Please include a must-gather as well as more details about the cluster's platform. Has this been reproduced on a later 4.7.z?

Comment 8 Ben Nemec 2021-05-04 21:00:35 UTC
This does sound like https://bugzilla.redhat.com/show_bug.cgi?id=1931505. We haven't been seeing that on 4.7, but I'm not aware of any reason it couldn't happen. We had already planned to backport that fix to 4.7 anyway, so I'm going to mark this as a duplicate. Feel free to reopen if it continues to happen after the backport merges.

*** This bug has been marked as a duplicate of bug 1957015 ***

Comment 10 Omri Hochman 2021-05-05 22:10:03 UTC

*** This bug has been marked as a duplicate of bug 1957015 ***

