Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1625776

Summary: Admin console does not report failures when OAuth discovery fails
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Management Console
Assignee: Samuel Padgett <spadgett>
Status: CLOSED CURRENTRELEASE
QA Contact: Yadan Pei <yapei>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, jokerman, mmccomas, spadgett, yapei
Target Milestone: ---
Keywords: Reopened
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-17 13:13:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Justin Pierce 2018-09-05 20:30:01 UTC
Description of problem:
After upgrading from 3.10 to 3.11, the admin console would partially load and then report a 404 error (it would then reload and repeat this process continuously). The default log level of the console pods was insufficient to diagnose the issue.

[root@starter-ca-central-1-master-692e9 ~]# oc logs console-765498587-8t458
2018/09/4 14:42:30 cmd/main: cookies are secure!
2018/09/4 14:42:36 cmd/main: Binding to 0.0.0.0:8443...
2018/09/4 14:42:36 cmd/main: using TLS
2018/09/4 14:45:34 http: TLS handshake error from 10.129.16.1:51664: EOF
2018/09/4 14:52:24 http: TLS handshake error from 10.131.0.1:54804: EOF
2018/09/4 21:30:09 http: TLS handshake error from 10.129.16.1:35142: EOF
2018/09/4 22:24:55 http: TLS handshake error from 10.129.16.1:53138: EOF
2018/09/4 23:30:50 http: TLS handshake error from 10.128.2.1:46880: EOF
2018/09/5 13:25:54 http: TLS handshake error from 10.128.2.1:47212: EOF
2018/09/5 14:42:30 http: TLS handshake error from 10.128.2.1:41478: EOF
2018/09/5 16:24:38 http: TLS handshake error from 10.128.2.1:48464: EOF
2018/09/5 16:26:07 http: TLS handshake error from 10.128.2.1:37192: EOF
2018/09/5 17:21:57 http: TLS handshake error from 10.128.2.1:41024: EOF
2018/09/5 17:52:35 http: TLS handshake error from 10.128.2.1:58118: EOF

When an attempt was made to increase the logging level, the problem resolved itself. So this BZ is to ensure the default logging is increased to help analyze any future occurrence. 

Version-Release number of selected component (if applicable):
3.11.0-0.21.0

Comment 1 Samuel Padgett 2018-09-05 20:36:33 UTC
We probably want to default to error rather than info to avoid noise in the logs.
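The trade-off between a quiet default and diagnosability can be sketched with a minimal leveled logger. This is illustrative only; `Level`, `Logger`, and the threshold values are made up for the sketch and are not the console's actual logging package:

```go
package main

import "fmt"

// Level orders log severities; lower values are more severe.
type Level int

const (
	ERROR Level = iota // always worth recording
	INFO               // useful for diagnosis, noisy by default
)

// Logger drops any message less severe than its Threshold.
type Logger struct{ Threshold Level }

// log prints msg if it meets the threshold; it reports whether
// the message was emitted.
func (l Logger) log(lv Level, msg string) bool {
	if lv > l.Threshold {
		return false // suppressed to keep default logs quiet
	}
	fmt.Println(msg)
	return true
}

func main() {
	quiet := Logger{Threshold: ERROR} // proposed default: errors only
	quiet.log(INFO, "cmd/main: Binding to 0.0.0.0:8443...")            // suppressed
	quiet.log(ERROR, "auth: error contacting openid connect provider") // printed
}
```

With a default threshold of error, OAuth discovery failures would still surface in `oc logs` while routine startup chatter stays out.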

Comment 2 Samuel Padgett 2018-09-05 23:23:06 UTC
https://github.com/openshift/console/pull/498

Comment 3 Samuel Padgett 2018-09-06 20:16:48 UTC
(In reply to Samuel Padgett from comment #2)
> https://github.com/openshift/console/pull/498

498 has been replaced by https://github.com/openshift/console/pull/506

Comment 4 Yadan Pei 2018-09-10 09:21:04 UTC
How could we mock the failure to see if admin console pods give enough logs?

Comment 5 Samuel Padgett 2018-09-10 15:38:57 UTC
The easiest way I know is to run off-cluster with an invalid `k8s-mode-off-cluster-endpoint` value.

It's harder to test on-cluster since it should not be possible to configure incorrectly, although it would fail if the API server can't be reached from the container.

Comment 6 Yadan Pei 2018-09-11 06:17:43 UTC
1. cd /path/to/console && git pull the latest code
2. export OPENSHIFT_API="https://127.0.0.1:8446"
3. Run bridge in off-cluster mode; we can see the error info printed

$ ./examples/run-bridge.sh
2018/09/11 13:59:44 cmd/main: cookies are not secure because base-address is not https!
2018/09/11 13:59:44 auth: error contacting openid connect provider (retrying in 2s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 13:59:46 auth: error contacting openid connect provider (retrying in 4s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 13:59:50 auth: error contacting openid connect provider (retrying in 8s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 13:59:58 auth: error contacting openid connect provider (retrying in 16s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 14:00:14 auth: error contacting openid connect provider (retrying in 32s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 14:00:46 auth: error contacting openid connect provider: Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 14:00:46 cmd/main: Error initializing OIDC authenticator: Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused


Then I reset to a commit which doesn't contain the fix PR #506
$ git reset --hard 6a062bc7da88d3aa38bd7b44dd1fe9150dfabb1c
HEAD is now at 6a062bc7d Merge pull request #502 from rhamilto/console-54
$ git log --pretty="%h %cd - %s" | grep '#506'  (the only match doesn't contain our #506)
8c767904d Thu Mar 3 09:50:49 2016 -0800 - Merge pull request #506 from coreos-inc/have-curl-around-just-in-case

Repeating steps 2-3 above, I also got the same errors as above.
./examples/run-bridge.sh 
2018/09/11 14:06:19 cmd/main: cookies are not secure because base-address is not https!
2018/09/11 14:06:19 auth: error contacting openid connect provider (retrying in 2s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 14:06:21 auth: error contacting openid connect provider (retrying in 4s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 14:06:25 auth: error contacting openid connect provider (retrying in 8s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 14:06:33 auth: error contacting openid connect provider (retrying in 16s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 14:06:49 auth: error contacting openid connect provider (retrying in 32s): Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 14:07:21 auth: error contacting openid connect provider: Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused
2018/09/11 14:07:21 cmd/main: Error initializing OIDC authenticator: Get https://127.0.0.1:8446/.well-known/oauth-authorization-server: dial tcp 127.0.0.1:8446: connect: connection refused


In both cases, the error info is printed by this line of code:
log.Errorf("error contacting openid connect provider (retrying in %s): %v", backoff, err)

Hi Sam, could you help confirm whether this is what we expect? IMO it should not be.

Comment 7 Samuel Padgett 2018-09-11 13:24:16 UTC
Did you run `./build-backend.sh` after changing commits? You need to rebuild the Go code to test.

Comment 8 Yadan Pei 2018-09-12 03:02:21 UTC
You're right; after resetting the commit and running ./build.sh, the error info is different

$ git reset --hard 2e53f3c102eec5821d94b65fa75a72b0ee50f805
$ ./build.sh
$ export OPENSHIFT_API="https://127.0.0.1:8446"
$ ./examples/run-bridge.sh
2018/09/12 10:47:23 cmd/main: cookies are not secure because base-address is not https!
2018/09/12 10:47:23 cmd/main: Binding to 127.0.0.1:9000...
2018/09/12 10:47:23 cmd/main: not using TLS

The error info is not printed.

So I will verify the bug based on a comparison between the old and current behavior.

Comment 9 Samuel Padgett 2018-09-13 14:31:32 UTC
*** Bug 1628573 has been marked as a duplicate of this bug. ***

Comment 10 Samuel Padgett 2018-09-13 20:04:06 UTC
yapei - can you verify that the admin console login works correctly on a 3.10 -> 3.11 upgrade? The upgrade itself previously wouldn't fail because of this bug, but login didn't work. This was the original problem we were trying to fix, so it would be good to validate that scenario.

See discussion in bug 1628573.

Comment 11 Yadan Pei 2018-09-14 09:38:12 UTC
Sam, upgrade testing is blocked by bug 1628730; I will give it another try once the upgrade blocker is fixed