Bug 1732585

Summary: Kibana shows 500 Internal Server Error after cluster reboot
Product: OpenShift Container Platform
Component: Logging
Version: 4.2.0
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Steven Walter <stwalter>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Anping Li <anli>
Docs Contact:
CC: aos-bugs, clasohm, jcantril, mfuruta, mzali, rmeggins
Flags: mzali: needinfo-
Target Milestone: ---
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Cloned As: 1745182 (view as bug list)
Environment:
Last Closed: 2019-10-16 06:30:48 UTC
Type: Bug
Bug Blocks: 1745182    

Description Steven Walter 2019-07-23 19:34:14 UTC
Description of problem:


# oc logs -n openshift-logging -c kibana-proxy kibana-6688c97646-8kgpj
2019/06/10 21:37:29 oauthproxy.go:646: error redeeming code (client:241.0.6.74:34706): got 400 from "https://oauth.example.com/oauth/token" {"error":"unauthorized_client","error_description":"The client is not authorized to request a token using this method."}
2019/06/10 21:37:29 oauthproxy.go:439: ErrorPage 500 Internal Error Internal Error

# oc logs -n openshift-authentication oauth-openshift-74578fc7d4-tgnqr
E0610 21:37:29.040927       1 access.go:543] osin: error=unauthorized_client, internal_error=<nil> get_client=client check failed, client_id=kibana-proxy


Workaround is to delete the Kibana pod and let it restart.
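
A minimal sketch of that workaround (the pod name below is the one from the log snippet above; in practice, use whatever name `oc get pods` reports):

# oc -n openshift-logging get pods | grep kibana
# oc -n openshift-logging delete pod kibana-6688c97646-8kgpj

The deployment then recreates the pod.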


Version-Release number of selected component (if applicable):
4.1



Steps to Reproduce:
1. Customer shuts down the whole cluster (i.e. shut down all their AWS instances) overnight
2. When everything is booted back up in the morning, all the logging pods come up, but Kibana errors when connecting to ES.





Additional info:
As suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1724053, I opened a new bug to track 4.1, as it might not be the same issue.

Comment 1 Steven Walter 2019-07-23 19:35:11 UTC
Customer notes:
It's happening in every OCP 4.1 cluster where we stop the EC2 instances in the evening.

OCP 4.1 clusters that keep running continuously don't seem to have this problem.

Comment 2 Jeff Cantrill 2019-07-30 14:46:08 UTC
This may be similar to, or the same as, https://bugzilla.redhat.com/show_bug.cgi?id=1724053.  Is it possible to get Kibana back into that condition and compare the oauthclient secret (plain) to the kibana secret entry oauthsecret (base64 encoded)?
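
One way to compare the two values would be something like the following (the secret name and key below are assumptions based on this comment; adjust them to whatever objects actually exist in the cluster):

# oc get oauthclient kibana-proxy -o jsonpath='{.secret}'; echo
# oc -n openshift-logging get secret kibana-proxy -o jsonpath='{.data.oauthsecret}' | base64 -d; echo

If the two outputs differ, the oauthclient and the proxy configuration are out of sync.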

Comment 4 Jeff Cantrill 2019-07-30 19:08:47 UTC
(In reply to Carsten Clasohm from comment #3)
> We have a 4.1.4 cluster where Kibana is in this condition (500 Internal
> Error after login to Kibana).
> 
Can you attach the yaml of the kibana deployment, oauthclient, and kibana secret when you see the issue?  Speculating if maybe the operator regens these objects but the container already loaded the secret and they are no longer in sync.
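
Something along these lines should capture the three objects (the deployment and secret names are assumptions; use whatever names exist in the cluster):

# oc -n openshift-logging get deployment kibana -o yaml > kibana-deployment.yaml
# oc get oauthclient kibana-proxy -o yaml > oauthclient-kibana-proxy.yaml
# oc -n openshift-logging get secret kibana -o yaml > kibana-secret.yaml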

Comment 9 Carsten Clasohm 2019-07-31 14:14:30 UTC
(In reply to Jeff Cantrill from comment #4)
> Can you attach the yaml of the kibana deployment, oauthclient, and kibana
> secret when you see the issue?  Speculating if maybe the operator regens
> these objects but the container already loaded the secret and they are no
> longer in sync.

The private attachments I added were taken after the cluster had been switched off overnight. Kibana logins give us the 500 Internal Error at the moment.

Let me know if you need any information from within the running Kibana pod.

Comment 10 Masaki Furuta ( RH ) 2019-08-06 10:06:06 UTC
(In reply to Carsten Clasohm from comment #9)
> (In reply to Jeff Cantrill from comment #4)
> > Can you attach the yaml of the kibana deployment, oauthclient, and kibana
> > secret when you see the issue?  Speculating if maybe the operator regens
> > these objects but the container already loaded the secret and they are no
> > longer in sync.
> 
> The private attachments I added were taken after the cluster had been
> switched off over night. Kibana logins give us the 500 Internal Error at the
> moment.
> 
> Let me know if you need any information from within the running Kibana pod.

Hi Jeff Cantrill,

Another customer has also reported a similar issue in sfdc#02434370.
Would you like us to gather additional information from them as well?

Because I do not want to confuse you with similar information from a different customer, I will hold off until I receive a response from you.

Thank you for your help and your investigation.

Thank you,

BR,
Masaki

Comment 11 Jeff Cantrill 2019-08-13 17:04:24 UTC
Suggestions to try from the security team as this is related to the oauth-proxy:

* Fresh browser session or even an 'incognito' browser to ensure it's not related to browser caching
* deleting oauth client authorizations:  `oc -n openshift-logging delete oauthclientauthorizations`
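
If the intent is to remove only the authorizations for the kibana-proxy client, something like the following narrows it down (oauthclientauthorization names generally follow the <user>:<client> convention, so the grep below is an assumption based on that):

# oc get oauthclientauthorizations | grep kibana-proxy
# oc delete oauthclientauthorization <name-from-previous-output>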

Given this only occurs when you shut down the cluster at night, we should consider lowering the priority.

Comment 14 Masaki Furuta ( RH ) 2019-08-23 07:06:03 UTC
(In reply to Masaki Furuta from comment #13)

Hi Jeff,

Thank you for your continued help and support.
Please see the customer's feedback in comment #12.

In order to investigate this issue further, could you let me know whether there is anything else I should ask the customer to verify, or anything additional I should ask them to collect?

I am grateful for your help and your suggestion.

Thank you,

BR,
Masaki

Comment 17 Anping Li 2019-09-16 08:37:44 UTC
Verified using v4.2.0-201909151553. Kibana can be accessed after the cluster is started from a stopped state. Note: oauthclient/kibana-proxy has been dropped in 4.2.

Comment 20 Jeff Cantrill 2019-09-27 13:18:49 UTC
(In reply to Muhammad Aizuddin Zali from comment #18)
> I hit the same issue with customer updating 4.1.16 -> 4.1.17 involving node
> reboot. Do we need to tell customer you need to reinstall logging instance
> for now(since I believed no workaround atm) and wait for 4.2?

Note this issue is logged for 4.1 here: https://bugzilla.redhat.com/show_bug.cgi?id=1745182

Comment 21 errata-xmlrpc 2019-10-16 06:30:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922