Bug 1729355 - The console-operator in updating state causes installer to timeout in some CI jobs
Keywords:
Status: CLOSED DUPLICATE of bug 1729356
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Jakub Hadvig
QA Contact: Yadan Pei
URL:
Whiteboard: buildcop
Depends On:
Blocks:
 
Reported: 2019-07-12 03:48 UTC by Nick Hale
Modified: 2019-07-15 21:15 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-13 11:48:48 UTC
Target Upstream Version:
Embargoed:



Description Nick Hale 2019-07-12 03:48:42 UTC
Description of problem:

The console-operator is one of the operators stuck in an updating state, causing the installer to time out in a number of CI jobs:

Installing from release registry.svc.ci.openshift.org/ci-op-38g9c92q/release@sha256:e6d6fa46a8805ee52eaa0e36ec1b9e2a296df6f31725175d74194d5078d0c7d8
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-38g9c92q-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.14.0+696110f up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-38g9c92q-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console"

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1965/pull-ci-openshift-installer-master-e2e-aws/6307/

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/23360/pull-ci-openshift-origin-master-e2e-aws/10939/

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1971/pull-ci-openshift-installer-master-e2e-aws/6344/?log#log

This may be caused by another component, so a duplicate BZ may already exist.

Comment 1 Nick Hale 2019-07-12 04:03:56 UTC
Added BZ for the authentication-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1729356

Comment 2 Samuel Padgett 2019-07-12 20:16:07 UTC
The console will not become ready until the OAuth server is ready.

Nick, did you see any instances where it was just the console pending and not the authentication operator?

Comment 3 Samuel Padgett 2019-07-12 20:18:06 UTC
Currently the console container won't pass its readiness check until it can discover the OAuth metadata, which won't happen until the OAuth server is available.
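For reference, a rough sketch of what that discovery step amounts to (not the console's actual code): fetch the well-known OAuth metadata endpoint and treat any failure as not-ready. The endpoint path and service IP below match the CI logs quoted in comment 6; the TLS handling is a placeholder, since real code would trust the cluster's serving CA.

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Placeholder TLS config for the sketch only; the real console trusts
	// the cluster CA bundle rather than skipping verification.
	client := &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}

	resp, err := client.Get("https://172.30.0.1:443/.well-known/oauth-authorization-server")
	if err != nil {
		// The console keeps retrying and stays not-ready in this case.
		fmt.Println("OAuth metadata discovery failed:", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// e.g. the 404 Not Found seen in the second CI job below.
		fmt.Println("OAuth metadata discovery returned", resp.Status)
		return
	}
	fmt.Println("OAuth metadata available; console can report ready")
}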

Comment 4 Nick Hale 2019-07-12 21:58:02 UTC
> Nick, did you see any instances where it was just the console pending and not the authentication operator?

I did not. It looks like all occurrences of console pending were accompanied by authentication - however, the converse is not true; authentication shows up without console in several other jobs (see the auth sibling BZ listed above).

Comment 6 Samuel Padgett 2019-07-13 11:48:48 UTC
I checked the logs for the 3 linked jobs, and the console is failing to connect to the OAuth server in all of them. This is likely networking related, but again the console won't report ready until it can discover the OAuth metadata, which depends on the OAuth server.

We've talked about delaying OAuth discovery until the first user logs in. That might be a good change, but it would only change the console reporting in this case. It wouldn't actually fix the problem here since you can't use console without being able to log in.

Given the information in the logs, this is a duplicate of the second bug you've opened against auth, bug 1729356, which is blocking console startup.


https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1965/pull-ci-openshift-installer-master-e2e-aws/6307/

2019/07/10 17:07:29 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-56z3m9tz-1d3f3.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-56z3m9tz-1d3f3.origin-ci-int-aws.dev.rhcloud.com: EOF


https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/23360/pull-ci-openshift-origin-master-e2e-aws/10939/

2019/07/11 16:52:47 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found


https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1971/pull-ci-openshift-installer-master-e2e-aws/6344/?log#log

2019/07/11 19:43:04 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-38g9c92q-1d3f3.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-38g9c92q-1d3f3.origin-ci-int-aws.dev.rhcloud.com: EOF

*** This bug has been marked as a duplicate of bug 1729356 ***

Comment 7 Ben Parees 2019-07-13 14:47:25 UTC
Is the console reporting a clear reason on its clusteroperator status conditions that makes it obvious oauth is the issue so we can identify this from telemeter w/o further data gathering?  And/or anything else the console operator could have done to make the analysis you had to do here simpler?

Comment 8 Samuel Padgett 2019-07-15 20:49:31 UTC
(In reply to Ben Parees from comment #7)
> Is the console reporting a clear reason on its clusteroperator status
> conditions that makes it obvious oauth is the issue so we can identify this
> from telemeter w/o further data gathering?  And/or anything else the console
> operator could have done to make the analysis you had to do here simpler?

Great point. Currently, we don't report a clear reason. The problem is that the operator doesn't necessarily know. I'm not sure if there is a good way for console to communicate it back. If you have thoughts on this, I'm definitely open to ideas.

It might be easier to skip the OAuth metadata check before reporting ready and only read OAuth metadata when the user logs in. Then it's not an issue.
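Purely as an illustration of the kind of explicit condition being discussed, a hypothetical sketch using the openshift/api config/v1 types; the Reason and Message strings are invented for this example and are not what the operator reports today.

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical condition shape; only the field types come from the
	// config/v1 API, the Reason and Message are made up for this sketch.
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorProgressing,
		Status:             configv1.ConditionTrue,
		Reason:             "OAuthMetadataUnavailable",
		Message:            "console is waiting for OAuth metadata; see the authentication cluster operator",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%s=%s reason=%s\n", cond.Type, cond.Status, cond.Reason)
}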

Comment 9 Ben Parees 2019-07-15 20:54:21 UTC
The operator could check if the auth clusteroperator is reporting available, right?

Skipping it entirely is also ok as long as you can give a meaningful error to the user (who will probably then have to forward it to their cluster admin) if things fail during login because oauth is not actually available yet.
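A minimal sketch of the check suggested above, assuming the openshift/api config/v1 types and that the operator already has the "authentication" ClusterOperator object from a lister or the config clientset.

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// isOperatorAvailable reports whether a ClusterOperator (for example the one
// named "authentication") currently has its Available condition set to True.
func isOperatorAvailable(co *configv1.ClusterOperator) bool {
	for _, cond := range co.Status.Conditions {
		if cond.Type == configv1.OperatorAvailable {
			return cond.Status == configv1.ConditionTrue
		}
	}
	return false
}

func main() {
	// In a real operator, co would come from a lister or the config
	// clientset; this stub only exercises the helper.
	co := &configv1.ClusterOperator{}
	co.Status.Conditions = []configv1.ClusterOperatorStatusCondition{
		{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue},
	}
	fmt.Println("authentication available:", isOperatorAvailable(co))
}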

Comment 10 Samuel Padgett 2019-07-15 21:11:09 UTC
(In reply to Ben Parees from comment #9)
> The operator could check if the auth clusteroperator is reporting available, right?

We could. It feels weird to be watching the clusteroperator resource for a different operator. The console could also be failing for another unrelated reason when the auth operator happens to be unavailable.

Presumably, we have the auth operator status from telemeter, so it's not giving us new information. (Admittedly, it makes it more obvious, though.)

Comment 11 Ben Parees 2019-07-15 21:15:51 UTC
yeah, I don't feel strongly and I agree watching another operator is slightly weird.

The key goals here should be that:

1) you don't fail unless you need to fail
2) when you fail, it's clear why you're failing (e.g. because of an unmet pre-req in this case).

My concern w/ the current situation is that we see both console + auth operators in degraded or unavailable states.  From telemeter that's not enough information to know if console is failing because of auth, or for its own reasons, so you're going to get pinged a lot.

So anything that addresses that is ok with me.

