Bug 1679791 - OCP 4.0: Install on AWS for 4.0.0-0.nightly-2019-02-20-194410 fails consistently with error: "waiting for Grafana Route to become ready"
Summary: OCP 4.0: Install on AWS for 4.0.0-0.nightly-2019-02-20-194410 fails consistently with error: "waiting for Grafana Route to become ready"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-21 21:09 UTC by Walid A.
Modified: 2019-06-04 10:44 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:44:19 UTC
Target Upstream Version:




Links
Red Hat Product Errata RHBA-2019:0758 (last updated 2019-06-04 10:44:27 UTC)

Description Walid A. 2019-02-21 21:09:32 UTC
Description of problem:
I am getting back-to-back failures trying to install an OCP 4.0 cluster on AWS with payload 4.0.0-0.nightly-2019-02-20-194410, which is marked "Accepted" on https://openshift-release.svc.ci.openshift.org/.

The 3 master and 3 worker nodes are created, but the installer does not complete and bails out with this error:

time="2019-02-21T16:09:38Z" level=debug msg="Destroy complete! Resources: 11 destroyed."
time="2019-02-21T16:09:38Z" level=info msg="Waiting up to 30m0s for the cluster to initialize..."
time="2019-02-21T16:09:38Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:09:51Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:10:06Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:10:21Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:11:36Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:12:36Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:12:51Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:13:06Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:13:21Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:15:34Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:16:06Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: waiting for Grafana Route to become ready failed: waiting for RouteReady of grafana: an error on the server (\"Internal Server Error: \\\"/apis/route.openshift.io/v1/namespaces/openshift-monitoring/routes/grafana\\\": Post https://172.30.0.1:443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews: dial tcp 172.30.0.1:443: connect: connection refused\") has prevented the request from succeeding (get routes.route.openshift.io grafana)"



Version-Release number of the following components:
# ./openshift-install version
./openshift-install v4.0.0-0.177.0.1-dirty
payload:  4.0.0-0.nightly-2019-02-20-194410


How reproducible:
Twice in a row, destroying the cluster after each failed install.

Steps to Reproduce:
1. Create a new OCP 4.0 cluster with the nightly payload:
2. export BUILD_VERSION=4.0.0-0.nightly-2019-02-20-194410
3. oc adm release info --pullspecs registry.svc.ci.openshift.org/ocp/release:${BUILD_VERSION} | grep installer
4. export CONTAINER_ID=$(docker create quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5ccc80b91aad3a03440d784a09f21ea89d7007a73b9f01210d6c5925340b2650)
5. mkdir installer_4.0.0-0.nightly-2019-02-20-194410
6. cd installer_4.0.0-0.nightly-2019-02-20-194410
7. docker cp $CONTAINER_ID:/usr/bin/openshift-install .
8. export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-20-194410
9.  export _OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-20-194410
10. ./openshift-install --dir=$(pwd) create cluster --log-level=debug
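
If step 10 bails out as above, the installer has already written the cluster credentials into the asset directory, so the overall operator state can still be checked. A quick sketch, assuming the default installer layout:

# export KUBECONFIG=$(pwd)/auth/kubeconfig
# oc get clusterversion
# oc get clusteroperators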

Actual results:
The 3 master and 3 worker nodes are created, but the installer does not complete and bails out with the error:
  "Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: waiting for Grafana Route to become ready failed: waiting for RouteReady of grafana: an error on the server .... "

Expected results:
Successful install with KUBECONFIG and login info

Additional info:
Install logs will be provided in the next private comment.

Comment 2 Johnny Liu 2019-02-22 11:24:21 UTC
Seems like this is a component operator issue specific to this payload image. I tried 4.0.0-0.nightly-2019-02-21-034936 and 4.0.0-0.nightly-2019-02-21-215247 and did not hit this issue.

Comment 3 Hongkai Liu 2019-02-22 15:35:29 UTC
Hit the same problem with registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-20-194410

Comment 5 Alex Crawford 2019-02-23 18:41:19 UTC
Moving to the monitoring team since it is their component that is reporting this error.

Comment 6 Junqi Zhao 2019-02-25 01:03:21 UTC
Did not hit the issue with the same payload, 4.0.0-0.nightly-2019-02-20-194410.

Comment 7 Frederic Branczyk 2019-02-25 09:32:12 UTC
This looks like a transient error, as we are just a consumer of this API. Since QE verified that this works in the latest payloads, I suspect something was fixed in the OpenShift apiserver or the router. Moving to the Master component, but this might be a better fit for "Routing".
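
The failing hop can also be probed independently of the monitoring operator. A sketch, assuming cluster-admin credentials; 172.30.0.1:443 is the default in-cluster address of the kubernetes service that the openshift-apiserver was unable to reach.

To hit the same aggregated API path the operator polls:
# oc get --raw /apis/route.openshift.io/v1/namespaces/openshift-monitoring/routes/grafana

To exercise the authorization API that refused the connection (this issues a SelfSubjectAccessReview, so it is a close but not identical code path to the failing SubjectAccessReview POST):
# oc auth can-i get routes.route.openshift.io -n openshift-monitoring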

Comment 9 Junqi Zhao 2019-03-07 10:30:48 UTC
No such issue with 4.0.0-0.nightly-2019-03-06-074438; the Grafana pods could be created successfully.
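
For reference, a sketch of that verification, assuming the default openshift-monitoring layout:

# oc -n openshift-monitoring get deploy grafana
# oc -n openshift-monitoring get pods | grep grafana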

Comment 12 errata-xmlrpc 2019-06-04 10:44:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

