Bug 1945619

Summary: [Assisted-4.7][Staging] Cluster deployment failed Reason: Timeout while waiting for cluster version to be available
Product: OpenShift Container Platform
Reporter: Yuri Obshansky <yobshans>
Component: Networking
Assignee: aos-network-edge-staff <aos-network-edge-staff>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: medium
Priority: unspecified
CC: alazar, aos-bugs, bnemec, itsoiref, mfojtik, ohochman, sgreene, slaznick, sttts
Version: 4.7
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: AI-Team-Core
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-05 22:10:03 UTC
Type: Bug
Attachments:
installation logs (flags: none)
must-gather (flags: none)

Description Yuri Obshansky 2021-04-01 12:56:29 UTC
Created attachment 1768248 [details]
installation logs

Description of problem:
The issue was detected during a performance test run against the Staging service:
6 cluster deployments failed

3/31/2021, 10:13:05 PM	error Host worker-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-1: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host master-2-2: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:13:05 PM	error Host worker-2-0: updated status from "installed" to "error" (Host is part of a cluster that failed to install)
3/31/2021, 10:10:58 PM	critical Failed installing cluster ocp-cluster-f34-h18-2. Reason: Timeout while waiting for cluster version to be available
3/31/2021, 9:10:58 PM	Update cluster installation progress: Cluster version is available: false , message: Unable to apply 4.7.2: the cluster operator console has not yet successfully rolled out
3/31/2021, 9:07:58 PM	Update cluster installation progress: Cluster version is available: false , message: Working towards 4.7.2: 654 of 668 done (97% complete)
3/31/2021, 9:01:59 PM	Update cluster installation progress: Cluster version is available: false , message: Unable to apply 4.7.2: some cluster operators have not yet rolled out
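For reference, the "Cluster version is available" state above comes from the Available condition on the ClusterVersion object, and the "cluster operator console has not yet successfully rolled out" message from the per-operator conditions. A minimal sketch of how one might poll these on the affected cluster (assuming the kubernetes Python client and a kubeconfig for the spoke cluster; this is illustrative, not part of the assisted-service code):

# Sketch: report the ClusterVersion Available condition and any cluster operators
# that have not finished rolling out. Assumes the `kubernetes` Python client and a
# kubeconfig pointing at the affected cluster.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

cv = api.get_cluster_custom_object("config.openshift.io", "v1", "clusterversions", "version")
available = next((c["status"] for c in cv["status"]["conditions"] if c["type"] == "Available"), "Unknown")
print(f"ClusterVersion Available: {available}")

cos = api.list_cluster_custom_object("config.openshift.io", "v1", "clusteroperators")
for co in cos["items"]:
    conds = {c["type"]: c["status"] for c in co["status"].get("conditions", [])}
    if conds.get("Available") != "True" or conds.get("Progressing") == "True":
        print(f"{co['metadata']['name']} not fully rolled out: {conds}")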

Version-Release number of selected component (if applicable):
v1.0.18.1

How reproducible:
https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters/5e46337e-d0c3-4f11-8600-45e8f41671d3

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yuri Obshansky 2021-04-01 12:57:15 UTC
Created attachment 1768249 [details]
must-gather

Comment 2 Igal Tsoiref 2021-05-02 09:02:29 UTC
Console failed to contact the OAuth server:

2021-04-01T02:08:33.467259692Z E0401 02:08:33.467186       1 auth.go:235] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com": dial tcp 192.168.125.10:443: connect: connection refused

Didn't find anything specific to point to.
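As a quick way to repeat the probe the console keeps retrying, a HEAD request against the OAuth route endpoint from the error above can be used (a sketch assuming only the Python requests library; the hostname is copied from the log):

# Sketch: repeat the console's probe of the OAuth route endpoint from the error above.
import requests

OAUTH_URL = "https://oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com"

try:
    resp = requests.head(OAUTH_URL, verify=False, timeout=10)
    print(f"HEAD {OAUTH_URL} -> {resp.status_code}")
except requests.exceptions.ConnectionError as err:
    # "connection refused" here matches the console error: nothing is listening on
    # port 443 at the address the route hostname resolves to.
    print(f"connection failed: {err}")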

Comment 3 Igal Tsoiref 2021-05-02 09:03:09 UTC
@slaznick maybe you have an idea why we got connection refused?

Comment 4 Standa Laznicka 2021-05-03 07:07:14 UTC
The authentication operator is reporting healthy and the pods are running, which means there is a route through which the connection can succeed. The console most probably got routed improperly; I'd propose starting by checking that DNS is correct.
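A minimal sketch of that DNS check (plain Python socket resolution; the OAuth hostname and the expected ingress VIP 192.168.125.10 are taken from the error in comment 2, while the console route hostname follows the standard naming and is an assumption):

# Sketch: resolve the *.apps route hostnames and compare them with the ingress VIP
# seen in the console error (192.168.125.10). Hostnames are illustrative for this cluster.
import socket

EXPECTED_INGRESS_VIP = "192.168.125.10"
HOSTS = [
    "oauth-openshift.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com",
    "console-openshift-console.apps.ocp-cluster-f34-h18-2.rdu2.scalelab.redhat.com",
]

for host in HOSTS:
    try:
        addr = socket.gethostbyname(host)
    except socket.gaierror as err:
        print(f"{host}: resolution failed ({err})")
        continue
    match = "matches" if addr == EXPECTED_INGRESS_VIP else "does NOT match"
    print(f"{host} -> {addr} ({match} expected ingress VIP {EXPECTED_INGRESS_VIP})")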

Comment 5 Igal Tsoiref 2021-05-04 06:41:55 UTC
@bnemec could this be the problem with the ingress IP not being freed from the first master?
To tell the truth it looks like that is the case, but I am not 100% sure.

Comment 6 Standa Laznicka 2021-05-04 08:24:53 UTC
Moving this to routing as they might be more helpful than me when it comes to ingresses/DNS. This seems like a platform-dependent bug, though.

Comment 7 Stephen Greene 2021-05-04 14:36:58 UTC
Please include a must-gather as well as more details about the cluster's platform. Has this been reproduced on a later 4.7.z?

Comment 8 Ben Nemec 2021-05-04 21:00:35 UTC
This does sound like https://bugzilla.redhat.com/show_bug.cgi?id=1931505. We haven't been seeing that on 4.7, but I'm not aware of any reason it couldn't happen. We had already planned to backport that fix to 4.7 anyway, so I'm going to mark this as a duplicate. Feel free to reopen if it continues to happen after the backport merges.

*** This bug has been marked as a duplicate of bug 1957015 ***

Comment 10 Omri Hochman 2021-05-05 22:10:03 UTC

*** This bug has been marked as a duplicate of bug 1957015 ***