Bug 1732585 - Kibana shows 500 Internal Server Error after cluster reboot
Summary: Kibana shows 500 Internal Server Error after cluster reboot
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1745182
 
Reported: 2019-07-23 19:34 UTC by Steven Walter
Modified: 2019-12-27 07:39 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1745182
Environment:
Last Closed: 2019-10-16 06:30:48 UTC
Target Upstream Version:
Embargoed:
mzali: needinfo-


Attachments


Links
Github openshift cluster-logging-operator pull 234 (closed): Bug 1732585: Modify kibana-proxy to rely on ServiceAccount as oauthcl… (last updated 2021-01-29 16:07:36 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:31:00 UTC)

Description Steven Walter 2019-07-23 19:34:14 UTC
Description of problem:

After a full cluster reboot, logging in to Kibana returns a 500 Internal Server Error. The kibana-proxy and oauth-openshift logs show:
# oc logs -n openshift-logging -c kibana-proxy kibana-6688c97646-8kgpj
2019/06/10 21:37:29 oauthproxy.go:646: error redeeming code (client:241.0.6.74:34706): got 400 from "https://oauth.example.com/oauth/token" {"error":"unauthorized_client","error_description":"The client is not authorized to request a token using this method."}
2019/06/10 21:37:29 oauthproxy.go:439: ErrorPage 500 Internal Error Internal Error

# oc logs -n openshift-authentication oauth-openshift-74578fc7d4-tgnqr
E0610 21:37:29.040927       1 access.go:543] osin: error=unauthorized_client, internal_error=<nil> get_client=client check failed, client_id=kibana-proxy


The workaround is to delete the Kibana pod and let it restart.
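
A minimal sketch of the workaround (the component=kibana label selector is an assumption; the exact pod name from "oc get pods -n openshift-logging" can be used instead):

# oc -n openshift-logging delete pod -l component=kibana

The deployment recreates the pod, and kibana-proxy re-reads its OAuth secret on startup, which is presumably why this helps.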


Version-Release number of selected component (if applicable):
4.1



Steps to Reproduce:
1. Customer shuts down the whole cluster (i.e. shuts down all their AWS instances) overnight
2. When everything is booted back up in the morning, all the logging pods come up, but Kibana errors when connecting to Elasticsearch.





Additional info:
As suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1724053, I opened a new bug to track 4.1 as it might not be the same issue.

Comment 1 Steven Walter 2019-07-23 19:35:11 UTC
Customer notes:
It's happening in every OCP 4.1 cluster where we stop the EC2 instances in the evening.

OCP 4.1 clusters that keep running continuously don't seem to have this problem.

Comment 2 Jeff Cantrill 2019-07-30 14:46:08 UTC
This may be similar to or the same as https://bugzilla.redhat.com/show_bug.cgi?id=1724053. Is it possible to get Kibana back into that condition and compare the oauthclient secret (plain text) to the kibana secret entry oauthsecret (base64 encoded)?
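
For reference, a sketch of that comparison (the secret name kibana-proxy and the data key oauthsecret are assumptions based on the wording above):

# oc get oauthclient kibana-proxy -o jsonpath='{.secret}'; echo
# oc -n openshift-logging get secret kibana-proxy -o jsonpath='{.data.oauthsecret}' | base64 -d; echo

If the two values differ, the oauthclient and the secret mounted into kibana-proxy are out of sync.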

Comment 4 Jeff Cantrill 2019-07-30 19:08:47 UTC
(In reply to Carsten Clasohm from comment #3)
> We have a 4.1.4 cluster where Kibana is in this condition (500 Internal
> Error after login to Kibana).
> 
Can you attach the yaml of the kibana deployment, oauthclient, and kibana secret when you see the issue?  Speculating if maybe the operator regens these objects but the container already loaded the secret and they are no longer in sync.
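
A sketch of how those objects could be collected (the resource names are assumptions based on the cluster-logging defaults):

# oc -n openshift-logging get deployment kibana -o yaml > kibana-deployment.yaml
# oc get oauthclient kibana-proxy -o yaml > oauthclient.yaml
# oc -n openshift-logging get secret kibana-proxy -o yaml > kibana-secret.yaml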

Comment 9 Carsten Clasohm 2019-07-31 14:14:30 UTC
(In reply to Jeff Cantrill from comment #4)
> Can you attach the yaml of the kibana deployment, oauthclient, and kibana
> secret when you see the issue?  Speculating if maybe the operator regens
> these objects but the container already loaded the secret and they are no
> longer in sync.

The private attachments I added were taken after the cluster had been switched off overnight. Kibana logins give us the 500 Internal Error at the moment.

Let me know if you need any information from within the running Kibana pod.

Comment 10 Masaki Furuta ( RH ) 2019-08-06 10:06:06 UTC
(In reply to Carsten Clasohm from comment #9)
> (In reply to Jeff Cantrill from comment #4)
> > Can you attach the yaml of the kibana deployment, oauthclient, and kibana
> > secret when you see the issue?  Speculating if maybe the operator regens
> > these objects but the container already loaded the secret and they are no
> > longer in sync.
> 
> The private attachments I added were taken after the cluster had been
> switched off over night. Kibana logins give us the 500 Internal Error at the
> moment.
> 
> Let me know if you need any information from within the running Kibana pod.

Hi Jeff Cantrill,

Another customer has also reported a similar issue on sfdc#02434370.
Would you like me to gather information from them as well?

Because I do not want to confuse you with similar information from a different customer, I will hold until receiving a response from you.

Thank you for your help and your investigation.

Thank you,

BR,
Masaki

Comment 11 Jeff Cantrill 2019-08-13 17:04:24 UTC
Suggestions to try from the security team as this is related to the oauth-proxy:

* Fresh browser session or even an 'incognito' browser to ensure its not related to browser caching
* deleting oauth client authorizations:  `oc -n openshift-logging delete oauthclientauthorizations`

Given this only occurs when you shut down the cluster at night, we should consider lowering the priority.
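
For the second suggestion, a possible way to limit the deletion to kibana-proxy (a sketch; oauthclientauthorizations are cluster-scoped objects named <user>:<client>, so no namespace is required and cluster-admin access is assumed):

# oc get oauthclientauthorizations -o name | grep kibana-proxy | xargs oc delete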

Comment 14 Masaki Furuta ( RH ) 2019-08-23 07:06:03 UTC
(In reply to Masaki Furuta from comment #13)

Hi Jeff,

Thank you for your continued help and support.
Would you please review the customer's feedback in comment #12?

In order to investigate this issue further, would you please let me know whether there is anything else I should ask the customer to verify, or what additional information I should ask them to collect?

I am grateful for your help and your suggestion.

Thank you,

BR,
Masaki

Comment 17 Anping Li 2019-09-16 08:37:44 UTC
Verified using v4.2.0-201909151553. Kibana can be accessed after the cluster is started from a stopped state. Note: oauthclient/kibana-proxy has been dropped in 4.2.
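
A quick way to confirm that on 4.2 (a sketch; per the linked cluster-logging-operator pull 234, kibana-proxy now relies on its ServiceAccount as the OAuth client):

# oc get oauthclient kibana-proxy

This should return a NotFound error on 4.2.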

Comment 20 Jeff Cantrill 2019-09-27 13:18:49 UTC
(In reply to Muhammad Aizuddin Zali from comment #18)
> I hit the same issue with a customer updating 4.1.16 -> 4.1.17 involving a node
> reboot. Do we need to tell the customer to reinstall the logging instance
> for now (since I believe there is no workaround at the moment) and wait for 4.2?

Note this issue is logged for 4.1 here: https://bugzilla.redhat.com/show_bug.cgi?id=1745182

Comment 21 errata-xmlrpc 2019-10-16 06:30:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

