Bug 1945619

Summary: [Assisted-4.7][Staging] Cluster deployment failed Reason: Timeout while waiting for cluster version to be available
Product: OpenShift Container Platform
Reporter: Yuri Obshansky <yobshans>
Component: Networking
Assignee: aos-network-edge-staff <aos-network-edge-staff>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: medium
Priority: unspecified
CC: alazar, aos-bugs, bnemec, itsoiref, mfojtik, ohochman, sgreene, slaznick, sttts
Version: 4.7
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: AI-Team-Core
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-05 22:10:03 UTC
Type: Bug
Attachments:
installation logs (flags: none)
must-gather (flags: none)

Description Yuri Obshansky 2021-04-01 12:56:29 UTC
Created attachment 1768248 [details]
installation logs

Description of problem:
The issue was detected during a performance test run against the Staging service:
6 cluster deployments failed

3/31/2021, 10:13:05 PM	error Host worker-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-2: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host worker-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:10:58 PM	critical Failed installing cluster ocp-cluster-f34-h18-2. Reason: Timeout while waiting for cluster version to be available
3/31/2021, 9:10:58 PM	Update cluster installation progress: Cluster version is available: false , message: Unable to apply 4.7.2: the cluster operator console has not yet successfully rolled out
3/31/2021, 9:07:58 PM	Update cluster installation progress: Cluster version is available: false , message: Working towards 4.7.2: 654 of 668 done (97% complete)
3/31/2021, 9:01:59 PM	Update cluster installation progress: Cluster version is available: false , message: Unable to apply 4.7.2: some cluster operators have not yet rolled out
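For reference, the "Cluster version is available" state above comes from the Available condition on the ClusterVersion object, and the "cluster operator console has not yet successfully rolled out" message from the per-operator conditions. A minimal sketch of how one might poll these on the affected cluster (assuming the kubernetes Python client and a kubeconfig for the spoke cluster; this is illustrative, not part of the assisted-service code):

# Sketch: report the ClusterVersion Available condition and any cluster operators
# that have not finished rolling out. Assumes the `kubernetes` Python client and a
# kubeconfig pointing at the affected cluster.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

cv = api.get_cluster_custom_object("config.openshift.io", "v1", "clusterversions", "version")
available = next((c["status"] for c in cv["status"]["conditions"] if c["type"] == "Available"), "Unknown")
print(f"ClusterVersion Available: {available}")

cos = api.list_cluster_custom_object("config.openshift.io", "v1", "clusteroperators")
for co in cos["items"]:
    conds = {c["type"]: c["status"] for c in co["status"].get("conditions", [])}
    if conds.get("Available") != "True" or conds.get("Progressing") == "True":
        print(f"{co['metadata']['name']} not fully rolled out: {conds}")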

Version-Release number of selected component (if applicable):
v1.0.18.1

How reproducible:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/5e46337e-d0c3-4f11-8600-45e8f41671d3

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yuri Obshansky 2021-04-01 12:57:15 UTC
Created attachment 1768249 [details]
must-gather

Comment 2 Igal Tsoiref 2021-05-02 09:02:29 UTC
Console failed to contact the OAuth server:

2021-04-01T02:08:33.467259692Z E0401 02:08:33.467186       1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com": dial tcp 192.168.125.10:443: connect: connection refused

Didn't find anything specific to point to.
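As a quick way to repeat the probe the console keeps retrying, a HEAD request against the OAuth route endpoint from the error above can be used (a sketch assuming only the Python requests library; the hostname is copied from the log):

# Sketch: repeat the console's probe of the OAuth route endpoint from the error above.
import requests

OAUTH_URL = "https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com"

try:
    resp = requests.head(OAUTH_URL, verify=False, timeout=10)
    print(f"HEAD {OAUTH_URL} -> {resp.status_code}")
except requests.exceptions.ConnectionError as err:
    # "connection refused" here matches the console error: nothing is listening on
    # port 443 at the address the route hostname resolves to.
    print(f"connection failed: {err}")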

Comment 3 Igal Tsoiref 2021-05-02 09:03:09 UTC
@slaznick maybe you have an idea why we got connection refused?

Comment 4 Standa Laznicka 2021-05-03 07:07:14 UTC
The authentication operator is reporting healthy and the pods are running, which means there is a route through which the connection can succeed. The console most probably got routed improperly; I'd propose starting by checking that DNS is correct.
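A minimal sketch of that DNS check (plain Python socket resolution; the OAuth hostname and the expected ingress VIP 192.168.125.10 are taken from the error in comment 2, while the console route hostname follows the standard naming and is an assumption):

# Sketch: resolve the *.apps route hostnames and compare them with the ingress VIP
# seen in the console error (192.168.125.10). Hostnames are illustrative for this cluster.
import socket

EXPECTED_INGRESS_VIP = "192.168.125.10"
HOSTS = [
    "oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com",
    "console-openshift-console.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com",
]

for host in HOSTS:
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror as err:
        print(f"{host}: resolution failed ({err})")
        continue
    match = "matches" if addr == EXPECTED_INGRESS_VIP else "does NOT match"
    print(f"{host} -> {addr} ({match} expected ingress VIP {EXPECTED_INGRESS_VIP})")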

Comment 5 Igal Tsoiref 2021-05-04 06:41:55 UTC
@bnemec could this be the problem with the ingress IP not being freed from the first master?
To tell the truth it looks like that is the case, but I am not 100% sure.

Comment 6 Standa Laznicka 2021-05-04 08:24:53 UTC
Moving this to routing as they might be more helpful than me when it comes to ingresses/DNS. This seems like a platform-dependent bug, though.

Comment 7 Stephen Greene 2021-05-04 14:36:58 UTC
Please include a must-gather as well as more details about the cluster's platform. Has this been reproduced on a later 4.7.z?

Comment 8 Ben Nemec 2021-05-04 21:00:35 UTC
This does sound like https://bugzilla.redhat.com/show_bug.cgi?id=1931505. We haven't been seeing that on 4.7, but I'm not aware of any reason it couldn't happen. We had already planned to backport that fix to 4.7 anyway, so I'm going to mark this as a duplicate. Feel free to reopen if it continues to happen after the backport merges.

*** This bug has been marked as a duplicate of bug 1957015 ***

Comment 10 Omri Hochman 2021-05-05 22:10:03 UTC

*** This bug has been marked as a duplicate of bug 1957015 ***