Bug 1694184 - Console never becomes available in some installs
Summary: Console never becomes available in some installs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Management Console
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: bpeterse
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-29 17:38 UTC by Clayton Coleman
Modified: 2019-06-04 10:46 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:40 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:46:47 UTC

Description Clayton Coleman 2019-03-29 17:38:07 UTC
The console regresses to being "unavailable" after the CVO completes install.

fail [github.com/openshift/origin/test/extended/operators/operators.go:164]: Mar 28 21:06:31.162: Some cluster operators never became available <nil>/console

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1488/pull-ci-openshift-installer-master-e2e-aws/4766/

Comment 1 Samuel Padgett 2019-03-29 17:53:27 UTC
We can't connect to the .well-known endpoint:

2019/03/28 21:06:03 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found

cc Mo

Comment 2 Samuel Padgett 2019-03-29 17:59:55 UTC
This should be fixed if we create separate readiness and liveness probes for the console.
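
For illustration, a minimal Go sketch (not the actual console manifest; the consoleProbes helper name and the timing values are assumptions) of what separate probes against the console's /health endpoint on 8443 could look like. The idea is that readiness keeps the pod out of rotation until the console can actually serve, while liveness is lenient enough that waiting on the OAuth server doesn't get the container killed.

// Sketch only: separate readiness and liveness probes for the console
// container, targeting the /health endpoint on 8443 that the existing
// readiness probe in the pod events (comment 4) already hits. Helper name
// and timing values are illustrative assumptions. Note: this era's
// k8s.io/api names the embedded field Probe.Handler; newer releases
// rename it to ProbeHandler.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func consoleProbes() (readiness, liveness *corev1.Probe) {
	healthCheck := corev1.Handler{
		HTTPGet: &corev1.HTTPGetAction{
			Path:   "/health",
			Port:   intstr.FromInt(8443),
			Scheme: corev1.URISchemeHTTPS,
		},
	}

	// Readiness: keep the pod out of the service until the console has
	// finished OAuth discovery and can actually serve requests.
	readiness = &corev1.Probe{
		Handler:          healthCheck,
		PeriodSeconds:    10,
		FailureThreshold: 3,
	}

	// Liveness: generous initial delay so waiting on the OAuth server does
	// not get the container killed and restarted in a loop.
	liveness = &corev1.Probe{
		Handler:             healthCheck,
		InitialDelaySeconds: 150,
		PeriodSeconds:       10,
		FailureThreshold:    3,
	}
	return readiness, liveness
}

func main() {
	readiness, liveness := consoleProbes()
	fmt.Printf("readiness: %+v\nliveness: %+v\n", readiness, liveness)
}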

Comment 3 bpeterse 2019-03-29 20:39:51 UTC
After further investigation, this may be in the operator logic; I'll take the bug back.

Comment 4 shahan 2019-04-03 06:29:18 UTC
We also usually hit the following error; at the moment the console cannot be accessed. It seems the console is waiting for the .well-known data from the authentication server to become ready.
The console works again after several minutes.
$ oc logs console-5484f684b9-t985l -n openshift-console
2019/04/2 06:49:07 auth: error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://openshift-authentication-openshift-authentication.apps.zitang-401.qe.devcluster.openshift.com/oauth/token failed: Head https://openshift-authentication-openshift-authentication.apps.zitang-401.qe.devcluster.openshift.com: EOF
The pod's events show: Readiness probe failed: Get https://10.128.0.65:8443/health: dial tcp 10.128.0.65:8443: connect: connection refused

Comment 5 Samuel Padgett 2019-04-03 14:37:30 UTC
(In reply to shahan from comment #4)
> We usually also met following error, atm console cannot be accessed atm. It
> seems that the console is waiting for well-known related stuff ready from
> the authentication server. 
> The console will work well after several minutes.
> $ oc logs console-5484f684b9-t985l -n openshift-console
> 2019/04/2 06:49:07 auth: error contacting auth provider (retrying in 10s):
> request to OAuth issuer endpoint
> https://openshift-authentication-openshift-authentication.apps.zitang-401.qe.
> devcluster.openshift.com/oauth/token failed: Head
> https://openshift-authentication-openshift-authentication.apps.zitang-401.qe.
> devcluster.openshift.com: EOF

This is a normal, transient error during a new install. The console blocks until the OAuth server is ready. Are you seeing it go back into this state after an install completes successfully, or only when first installing?

Comment 6 shahan 2019-04-04 03:02:58 UTC
I hit this issue after an install completed successfully. After a long time (maybe more than ten minutes; I can't remember the exact duration), the console came back. If it is not an issue in the console, it might be in the OAuth server, I think. In short, this is not the expected behavior.

Comment 7 Samuel Padgett 2019-04-04 13:32:20 UTC
OK, you could be seeing a symptom of this bug. I believe the console is reporting Available before it's really ready in some circumstances.

Comment 8 W. Trevor King 2019-04-18 00:01:19 UTC
New instance of this [1].  Reconstructed timeline:

18:56:45.342354 kube-apiserver Available
18:58:36.101073 console Available, but "route ingress not yet ready for console"
18:59:04.395264 console not Available, "Operator sync loop failed to complete" and "No pods available for console deployment"
18:59:14        console-7485df4cdb-kdz85 started
18:59:19        console-7485df4cdb-cpj52 started
19:01:11.071431 authentication not Available, OAuth 404
19:01:13.248047 authentication Available
19:01:14        Install complete
19:02:10        console-7485df4cdb-kdz85 killed, OAuth certificate signed by unknown authority
19:02:10        console-7485df4cdb-kdz85 restarted
19:02:32.179    failing test, console unavailable
19:02:41.006344 console Available
19:02:57        console-7485df4cdb-cpj52 killed, OAuth certificate signed by unknown authority and 404s
19:02:59        console-7485df4cdb-cpj52 restarted
19:03:01.265096 authentication not Available, OAuth 404
19:03:03.252500 authentication Available
19:03:05.312084 authentication not Available, OAuth 404
19:03:07.279044 authentication Available

The CVO stops waiting on an operator during bootstrap once it claims Available and sets a version, so the early flap means console didn't block "Install complete". Of course, fixing the OAuth issues would be nice too, but there's still a bug in the console operator's status reporting.


[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1646/pull-ci-openshift-installer-master-e2e-aws/5422
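
To make "claims Available and sets a version" concrete, here is a hedged Go illustration (not CVO or console-operator source; the field values, including "4.1.0", are assumptions) of the ClusterOperator status shape that, per the comment above, ends the CVO's bootstrap wait, using the github.com/openshift/api/config/v1 types:

// Illustration only: a ClusterOperator status carrying an Available=True
// condition plus a reported version. Per comment 8, once console published
// a status of this shape during its early flap, "Install complete" was no
// longer gated on it. Field values here are illustrative.
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	status := configv1.ClusterOperatorStatus{
		Conditions: []configv1.ClusterOperatorStatusCondition{{
			Type:               configv1.OperatorAvailable,
			Status:             configv1.ConditionTrue,
			Reason:             "AsExpected",
			LastTransitionTime: metav1.Now(),
		}},
		Versions: []configv1.OperandVersion{{Name: "operator", Version: "4.1.0"}},
	}
	fmt.Printf("%+v\n", status)
}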

Comment 9 Samuel Padgett 2019-04-18 00:45:40 UTC
The error seems to be occurring here. We incorrectly sync status twice. The second request fails because the object has been modified. Somewhere Available is defaulted to True even though we're not explicitly setting it.

time="2019-04-17T18:58:36Z" level=info msg=Operator.Status.Conditions
time="2019-04-17T18:58:36Z" level=info msg="Status.Condition.UnsupportedConfigOverridesUpgradeable: True"
time="2019-04-17T18:58:36Z" level=info msg="Status.Condition.Available: Unknown | (SyncError) \"oauth\": route ingress not yet ready for console\n"
time="2019-04-17T18:58:36Z" level=info msg="Status.Condition.Progressing: True | (SyncError) \"oauth\": route ingress not yet ready for console\n"
time="2019-04-17T18:58:36Z" level=info msg="Status.Condition.Failing: True | (SyncError) \"oauth\": route ingress not yet ready for console\n"
I0417 18:58:36.093332       1 option.go:62] Console: handling update /cluster: /apis/operator.openshift.io/v1/consoles/cluster
time="2019-04-17T18:58:36Z" level=info msg=Operator.Status.Conditions
time="2019-04-17T18:58:36Z" level=info msg="Status.Condition.UnsupportedConfigOverridesUpgradeable: True"
time="2019-04-17T18:58:36Z" level=info msg="Status.Condition.Failing: True | (SyncLoopError) Operator sync loop failed to complete."
I0417 18:58:36.093775       1 status_controller.go:159] clusteroperator/console diff {"status":{"conditions":[{"lastTransitionTime":"2019-04-17T18:58:36Z","message":"Failing: \"oauth\": route ingress not yet ready for console\nFailing: ","reason":"AsExpected","status":"False","type":"Failing"},{"lastTransitionTime":"2019-04-17T18:58:36Z","message":"Progressing: \"oauth\": route ingress not yet ready for console\nProgressing: ","reason":"ProgressingSyncError","status":"True","type":"Progressing"},{"lastTransitionTime":"2019-04-17T18:58:36Z","message":"Available: \"oauth\": route ingress not yet ready for console\nAvailable: ","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2019-04-17T18:58:34Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
time="2019-04-17T18:58:36Z" level=error msg="status update error: Operation cannot be fulfilled on consoles.operator.openshift.io \"cluster\": the object has been modified; please apply your changes to the latest version and try again \n"
time="2019-04-17T18:58:36Z" level=info msg="finished syncing operator \"cluster\" (35.247µs) \n\n"
E0417 18:58:36.098020       1 controller.go:130] {🐼 🐼} failed with: route ingress not yet ready for console
I0417 18:58:36.101073       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-console-operator", Name:"console-operator", UID:"bf4421e7-6142-11e9-ad06-1204721f74ca", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator console changed: Failing changed from Unknown to False ("Failing: \"oauth\": route ingress not yet ready for console\nFailing: "),Progressing changed from Unknown to True ("Progressing: \"oauth\": route ingress not yet ready for console\nProgressing: "),Available changed from Unknown to True ("Available: \"oauth\": route ingress not yet ready for console\nAvailable: ")
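
A minimal sketch of the idea in this comment, not the actual change in the fix PR, with a hypothetical setCondition helper: write the Available condition explicitly on every sync (False or Unknown while the route isn't ready) so nothing downstream can default it to True.

// Sketch only (hypothetical helper, not console-operator source): always set
// Available explicitly each sync so a later status write can't fall back to
// a defaulted "True".
package main

import (
	"fmt"

	operatorv1 "github.com/openshift/api/operator/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setCondition inserts or updates a condition by type, keeping the old
// LastTransitionTime when the status value did not actually change.
func setCondition(conditions *[]operatorv1.OperatorCondition, newCond operatorv1.OperatorCondition) {
	newCond.LastTransitionTime = metav1.Now()
	for i := range *conditions {
		if (*conditions)[i].Type == newCond.Type {
			if (*conditions)[i].Status == newCond.Status {
				newCond.LastTransitionTime = (*conditions)[i].LastTransitionTime
			}
			(*conditions)[i] = newCond
			return
		}
	}
	*conditions = append(*conditions, newCond)
}

func main() {
	var conditions []operatorv1.OperatorCondition

	// While the route is not ready, report Available=False explicitly rather
	// than leaving the condition unset.
	setCondition(&conditions, operatorv1.OperatorCondition{
		Type:    "Available",
		Status:  operatorv1.ConditionFalse,
		Reason:  "SyncError",
		Message: `"oauth": route ingress not yet ready for console`,
	})

	for _, c := range conditions {
		fmt.Printf("%s=%s (%s) %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}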

Comment 11 Yanping Zhang 2019-04-25 06:23:17 UTC
4.1.0-0.nightly-2019-04-24-221323
console operator image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e6d7c17581d4c264aa7a9d9c6374e57f1fc4d3d849cff532adc308795a85f582
io.openshift.build.commit.url=https://github.com/openshift/console-operator/commit/58eb2fd3a5938c662c904441ada55c925e31ceec

Checked on an OCP 4.0 env with the above image version; console installation completed successfully and the console could be accessed.
Checked the console pod log and the console operator pod log.
The console pod log shows auth errors:
$ oc logs pod/console-5c74978b4d-j8jp4 -n openshift-console --config=ui25.kubeconfig 
2019/04/25 00:56:32 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found
2019/04/25 00:56:42 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found
2019/04/25 00:56:52 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found
2019/04/25 00:57:02 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found
2019/04/25 00:57:12 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found
2019/04/25 00:57:22 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found
2019/04/25 00:57:32 auth: error contacting auth provider (retrying in 10s): discovery through endpoint https://172.30.0.1:443/.well-known/oauth-authorization-server failed: 404 Not Found

Console operator pod log: http://pastebin.test.redhat.com/758248

@Ben, there are auth errors in the console log; is this expected? The fix PR https://github.com/openshift/console-operator/pull/195 is for the console operator, and the console operator pod log looks normal.

Comment 12 Samuel Padgett 2019-04-25 13:56:58 UTC
The auth error is expected and harmless as long as the console can eventually connect. Several components are starting at once; this simply indicates the OAuth server isn't ready yet.

The key thing to check for this bug is that the console operator doesn't report

  Available: True -> Available: False -> Available: True

during the initial install. Once it becomes available, it should stay that way. (The operator prints its status to its pod log.)
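
One way to spot-check this is a small hypothetical helper (check_available.go is an assumed name) that scans the operator pod log for the "Status.Condition.Available:" lines the operator prints and reports each transition:

// Hypothetical check helper. Pipe the console-operator pod log through it to
// spot Available flapping; it keys off the "Status.Condition.Available:"
// lines shown in the comments above. Example:
//
//   oc logs deployment/console-operator -n openshift-console-operator | go run check_available.go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	const marker = "Status.Condition.Available:"
	last := ""
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		idx := strings.Index(line, marker)
		if idx < 0 {
			continue
		}
		// The value is the first token after the marker, e.g. True, False,
		// or Unknown (trim the trailing quote of the logrus msg field).
		fields := strings.Fields(line[idx+len(marker):])
		if len(fields) == 0 {
			continue
		}
		value := strings.Trim(fields[0], `"`)
		if value != last {
			fmt.Printf("Available: %q -> %q\n", last, value)
			last = value
		}
	}
}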

Comment 13 Yanping Zhang 2019-04-28 03:44:21 UTC
4.1.0-0.nightly-2019-04-25-121505
console operator image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e6d7c17581d4c264aa7a9d9c6374e57f1fc4d3d849cff532adc308795a85f582
io.openshift.build.commit.url=https://github.com/openshift/console-operator/commit/58eb2fd3a5938c662c904441ada55c925e31ceec
Installed an OCP 4.0 env with the above version.
Checked the console operator pod log from beginning to end; the Available status changed from False to True:
time="2019-04-28T02:40:17Z" level=info msg="Status.Condition.Available: False | (NoPodsAvailable) 0 pods available for console deployment"
time="2019-04-28T02:43:01Z" level=info msg="Status.Condition.Available: True"

The bug has been fixed, so moving it to Verified.

Comment 15 errata-xmlrpc 2019-06-04 10:46:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

