Description of problem:

The console-operator is stuck in an updating state, which causes the installer to time out in a number of CI jobs:

Installing from release registry.svc.ci.openshift.org/ci-op-38g9c92q/release@sha256:e6d6fa46a8805ee52eaa0e36ec1b9e2a296df6f31725175d74194d5078d0c7d8
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.ci-op-38g9c92q-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.14.0+696110f up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-38g9c92q-1d3f3.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console"

Affected jobs:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1965/pull-ci-openshift-installer-master-e2e-aws/6307/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/23360/pull-ci-openshift-origin-master-e2e-aws/10939/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1971/pull-ci-openshift-installer-master-e2e-aws/6344/?log#log

This may be caused by another component, so a duplicate BZ may already exist.
Added BZ for the authentication-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1729356
The console will not become ready until the OAuth server is ready. Nick, did you see any instances where it was just the console pending and not the authentication operator?
Currently the console container won't pass its readiness check until it can discover the OAuth metadata, which won't happen until the OAuth server is available.
> Nick, did you see any instances where it was just the console pending and not the authentication operator?

I did not. It looks like all occurrences of console pending were accompanied by authentication; however, the converse is not true: authentication shows up without console in several other jobs (see the auth sibling BZ listed above).
I checked the logs for the 3 linked jobs, and the console is failing to connect to the OAuth server in all of them. This is likely networking related, but again, the console won't report ready until it can discover the OAuth metadata, which depends on the OAuth server.

We've talked about delaying OAuth discovery until the first user logs in. That might be a good change, but here it would only change what the console reports. It wouldn't actually fix the problem, since you can't use the console without being able to log in.

Given the information in the logs, this is a duplicate of the second bug you opened against auth, bug 1729356, which is blocking console startup.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1965/pull-ci-openshift-installer-master-e2e-aws/6307/
2019/07/10 17:07:29 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-56z3m9tz-1d3f3.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-56z3m9tz-1d3f3.origin-ci-int-aws.dev.rhcloud.com: EOF

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/23360/pull-ci-openshift-origin-master-e2e-aws/10939/
2019/07/11 16:52:47 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1971/pull-ci-openshift-installer-master-e2e-aws/6344/?log#log
2019/07/11 19:43:04 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.ci-op-38g9c92q-1d3f3.origin-ci-int-aws.dev.rhcloud.com/oauth/token failed: Head https://oauth-openshift.apps.ci-op-38g9c92q-1d3f3.origin-ci-int-aws.dev.rhcloud.com: EOF

*** This bug has been marked as a duplicate of bug 1729356 ***
Is the console reporting a clear reason on its clusteroperator status conditions that makes it obvious oauth is the issue so we can identify this from telemeter w/o further data gathering? And/or anything else the console operator could have done to make the analysis you had to do here simpler?
(In reply to Ben Parees from comment #7)
> Is the console reporting a clear reason on its clusteroperator status
> conditions that makes it obvious oauth is the issue so we can identify this
> from telemeter w/o further data gathering? And/or anything else the console
> operator could have done to make the analysis you had to do here simpler?

Great point. Currently, we don't report a clear reason. The problem is that the operator doesn't necessarily know the cause. I'm not sure if there is a good way for console to communicate it back. If you have thoughts on this, I'm definitely open to ideas.

It might be easier to skip the OAuth metadata check before reporting ready and only read the OAuth metadata when the user logs in. Then it's not an issue.
The operator could check if the auth clusteroperator is reporting available, right? Skipping it entirely is also ok, as long as you can give a meaningful error to the user (who will probably have to then forward it to their cluster admin) if things fail during login because oauth is not actually available yet.
(In reply to Ben Parees from comment #9)
> The operator could check if the auth clusteroperator is reporting available, right?

We could. It feels weird to be watching the clusteroperator resource for a different operator. The console could also be failing for another, unrelated reason when the auth operator happens to be unavailable. Presumably we already have the auth operator status from telemeter, so it's not giving us new information. (Admittedly, it makes it more obvious, though.)
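For what the cross-check would amount to: the console operator would read the Available condition off the "authentication" ClusterOperator (config.openshift.io/v1). A minimal sketch with simplified local types standing in for the real API; in a real operator the conditions would come from an informer/lister rather than a plain slice:

```go
package main

// coCondition is a simplified stand-in for a ClusterOperator status condition.
type coCondition struct {
	Type   string
	Status string
}

// isAuthAvailable reports whether a (hypothetical, already-fetched) set of
// status conditions from the "authentication" ClusterOperator contains
// Available=True. Missing or False means the OAuth stack isn't usable yet.
func isAuthAvailable(conditions []coCondition) bool {
	for _, c := range conditions {
		if c.Type == "Available" {
			return c.Status == "True"
		}
	}
	return false
}
```

The awkward part noted above is not the check itself but the coupling: the console operator would need a watch on another operator's status object.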
Yeah, I don't feel strongly, and I agree watching another operator is slightly weird. The key goals here should be that:

1) you don't fail unless you need to fail
2) when you fail, it's clear why you're failing (e.g. because of an unmet pre-req, in this case)

My concern with the current situation is that we see both the console and auth operators in degraded or unavailable states. From telemeter that's not enough information to know whether console is failing because of auth or for its own reasons, so you're going to get pinged a lot. So anything that addresses that is ok with me.