Bug 1679791 - OCP 4.0: Install on AWS for 4.0.0-0.nightly-2019-02-20-194410 fails consistently with error: "waiting for Grafana Route to become ready"
Summary: OCP 4.0: Install on AWS for 4.0.0-0.nightly-2019-02-20-194410 fails consistently with error: "waiting for Grafana Route to become ready"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-21 21:09 UTC by Walid A.
Modified: 2019-06-04 10:44 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:44:19 UTC
Target Upstream Version:




Links
Red Hat Product Errata RHBA-2019:0758 (last updated 2019-06-04 10:44:27 UTC)

Description Walid A. 2019-02-21 21:09:32 UTC
Description of problem:
I am getting back-to-back failures trying to install an OCP 4.0 cluster on AWS with payload 4.0.0-0.nightly-2019-02-20-194410, which is marked "Accepted" on https://openshift-release.svc.ci.openshift.org/.

The 3 master and 3 worker nodes are created, but the installer does not complete and bails out with this error:

time="2019-02-21T16:09:38Z" level=debug msg="Destroy complete! Resources: 11 destroyed."
time="2019-02-21T16:09:38Z" level=info msg="Waiting up to 30m0s for the cluster to initialize..."
time="2019-02-21T16:09:38Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:09:51Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:10:06Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:10:21Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:11:36Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:12:36Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:12:51Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:13:06Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:13:21Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:15:34Z" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-02-21T16:16:06Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: waiting for Grafana Route to become ready failed: waiting for RouteReady of grafana: an error on the server (\"Internal Server Error: \\\"/apis/route.openshift.io/v1/namespaces/openshift-monitoring/routes/grafana\\\": Post https://172.30.0.1:443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews: dial tcp 172.30.0.1:443: connect: connection refused\") has prevented the request from succeeding (get routes.route.openshift.io grafana)"



Version-Release number of the following components:
# ./openshift-install version
./openshift-install v4.0.0-0.177.0.1-dirty
payload:  4.0.0-0.nightly-2019-02-20-194410


How reproducible:
Twice in a row, destroying the cluster after each failed install.

Steps to Reproduce:
1. Create a new OCP 4.0 cluster with the nightly payload:
2. export BUILD_VERSION=4.0.0-0.nightly-2019-02-20-194410
3. oc adm release info --pullspecs registry.svc.ci.openshift.org/ocp/release:${BUILD_VERSION} | grep installer
4. export CONTAINER_ID=$(docker create quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5ccc80b91aad3a03440d784a09f21ea89d7007a73b9f01210d6c5925340b2650)
5. mkdir installer_4.0.0-0.nightly-2019-02-20-194410
6. cd installer_4.0.0-0.nightly-2019-02-20-194410
7. docker cp $CONTAINER_ID:/usr/bin/openshift-install .
8. export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-20-194410
9.  export _OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-20-194410
10. ./openshift-install --dir=$(pwd) create cluster --log-level=debug
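
If step 10 bails out as above, the installer has already written the cluster credentials into the asset directory, so the overall operator state can still be checked. A quick sketch, assuming the default installer layout:

# export KUBECONFIG=$(pwd)/auth/kubeconfig
# oc get clusterversion
# oc get clusteroperators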

Actual results:
The 3 master and 3 worker nodes are created, but the installer does not complete and bails out with the error:
  "Still waiting for the cluster to initialize: Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating Grafana failed: waiting for Grafana Route to become ready failed: waiting for RouteReady of grafana: an error on the server .... "

Expected results:
Successful install with KUBECONFIG and login info

Additional info:
Install logs will be provided in the next private comment.

Comment 2 Johnny Liu 2019-02-22 11:24:21 UTC
Seems like this is a component operator issue specific to this payload image. I tried 4.0.0-0.nightly-2019-02-21-034936 and 4.0.0-0.nightly-2019-02-21-215247 and did not hit this issue.

Comment 3 Hongkai Liu 2019-02-22 15:35:29 UTC
Hit the same problem with registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-20-194410

Comment 5 Alex Crawford 2019-02-23 18:41:19 UTC
Moving to the monitoring team since it is their component that is reporting this error.

Comment 6 Junqi Zhao 2019-02-25 01:03:21 UTC
Did not hit the issue with the same payload, 4.0.0-0.nightly-2019-02-20-194410.

Comment 7 Frederic Branczyk 2019-02-25 09:32:12 UTC
This looks like a transient error, as we are just a consumer of this API. Since QE verified that this works in the latest payloads, I suspect something was fixed in the OpenShift apiserver or the router. Moving to the Master component, but this might be a better fit for "Routing".
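
The failing hop can also be probed independently of the monitoring operator. A sketch, assuming cluster-admin credentials; 172.30.0.1:443 is the default in-cluster address of the kubernetes service that the openshift-apiserver was unable to reach.

To hit the same aggregated API path the operator polls:
# oc get --raw /apis/route.openshift.io/v1/namespaces/openshift-monitoring/routes/grafana

To exercise the authorization API that refused the connection (this issues a SelfSubjectAccessReview, so it is a close but not identical code path to the failing SubjectAccessReview POST):
# oc auth can-i get routes.route.openshift.io -n openshift-monitoring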

Comment 9 Junqi Zhao 2019-03-07 10:30:48 UTC
No such issue with 4.0.0-0.nightly-2019-03-06-074438; the Grafana pods could be created successfully.
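
For reference, a sketch of that verification, assuming the default openshift-monitoring layout:

# oc -n openshift-monitoring get deploy grafana
# oc -n openshift-monitoring get pods | grep grafana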

Comment 12 errata-xmlrpc 2019-06-04 10:44:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

