Bug 1644008 - kibana presenting blank page or timeout [NEEDINFO]
Summary: kibana presenting blank page or timeout
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1705589 1726433
 
Reported: 2018-10-29 18:01 UTC by Steven Walter
Modified: 2019-07-22 19:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1705589
Environment:
Last Closed: 2019-05-02 13:05:43 UTC
Target Upstream Version:
jcantril: needinfo? (jfoots)



Description Steven Walter 2018-10-29 18:01:43 UTC
Description of problem:
Customer in secured environment is seeing a few strange behaviors in the EFK stack.

In the non-ops kibana, they either get:

Courier Fetch: unhandled courier request error: [security_exception] no permissions for indices:data/read/mget 

Or a gateway timeout with similar/same message as:
https://github.com/elastic/kibana/issues/12707

Which of the two occurs is unpredictable: sometimes they get one message, then after clearing the browser cache or switching to a different browser/client they get the other.
  - They tried increasing timeouts as per the unverified KCS https://access.redhat.com/solutions/3508851 but this had little effect
     oc set env dc/logging-kibana ELASTICSEARCH_REQUESTTIMEOUT=119000 ELASTICSEARCH_SHARDTIMEOUT=119000
  - We considered that Elasticsearch might be too slow to respond -- INSTANCE_RAM and the limit are set to 6gb in one environment and 8gb in another. However, the cluster is essentially empty (only a single app running for test purposes, a single master, only a couple nodes). All the indices are green and healthy, fluentd is not reporting any errors. Also the issue occurs even when trying to view pages like the settings page, so it's not only when contacting the backend (so far as we can tell)
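To confirm the timeout change actually landed on the deployment config, something like the following could be used. This is only a sketch: the helper name check_kibana_timeouts is made up for illustration, and it assumes the output of 'oc set env dc/logging-kibana --list' contains plain KEY=value lines.

```shell
#!/bin/sh
# Sketch only: check_kibana_timeouts is a hypothetical helper, not part of any product.
# It reads KEY=value lines on stdin (assumed format of `oc set env ... --list`)
# and verifies both timeout variables carry the expected value.
check_kibana_timeouts() {
    expected="$1"
    awk -F= -v want="$expected" '
        $1 == "ELASTICSEARCH_REQUESTTIMEOUT" || $1 == "ELASTICSEARCH_SHARDTIMEOUT" {
            seen++
            if ($2 != want) {
                printf "%s is %s, expected %s\n", $1, $2, want
                bad = 1
            }
        }
        END {
            # Both variables must appear at least once.
            if (seen < 2) { print "timeout variables missing"; bad = 1 }
            exit bad
        }
    '
}
```

Usage would be along the lines of: oc set env dc/logging-kibana --list | check_kibana_timeouts 119000. A non-zero exit status means a variable is missing or has a different value.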

In the ops kibana, after authentication step, kibana replies to the client with just an empty response body. From the browser's perspective, it's an empty body but a 200 response message. The kibana logs seem to indicate the same -- a response length of 12 (maybe header?) and a 200.

They set up graphical tools on the master host so they could use a browser *from* the master -- same behavior.


Version-Release number of selected component (if applicable):
3.9

kibana 4.064
build 10229


How reproducible:
Unconfirmed. This affects multiple environments and multiple image versions (they recently updated to the current latest 3.9 images just to be sure), but we can't seem to reproduce it in test environments. I would normally suspect an environment or config issue; however, since even kibana's logs seem to indicate a near-empty response body, I'm not entirely sure.


Notes:
I figure we're missing something else reasonably "obvious" to look at, at least with respect to the timeouts -- but I'm not sure how that would affect the settings menu and similar pages.

Consulting is on site so while we can't get data out (logs, configs, etc) we can interface through them.

Comment 6 Jeff Cantrill 2018-10-31 18:32:41 UTC
Some of the navigational issues from web-console will be resolved by [1].

I would expect 'indices:data/read/mget' error to be resolved by [2] for users who have this permission; I don't see that change in 3.9.  One can dump the access permissions initially after a login attempt with something like [3]:

'oc exec -c elasticsearch $espod -- es_acl get --doc=(roles|rolesmapping|actiongroups)'

Additionally, you can rsh to the pod, manually update $HOME/sgconfig/sg_action_groups.yaml and run 'es_seed_acl' to see if it resolves the permission problem.  We probably need to update the values in 3.9.  You can additionally use [4] to determine what the role name will be for a given user, assuming the kibana index mode is not 'unique'.

It could be this failure is causing the issue.
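
For reference, dumping all three ACL documents in one pass could look roughly like the sketch below. It assumes the '(roles|rolesmapping|actiongroups)' in the command above means "pick one of these values". The function name dump_es_acls and the ES_ACL_RUNNER variable are hypothetical; the latter only exists so the loop can be dry-run without a cluster.

```shell
#!/bin/sh
# Sketch only: dump_es_acls and ES_ACL_RUNNER are hypothetical names.
# Runs `es_acl get` once per ACL document type inside the elasticsearch container.
dump_es_acls() {
    espod="$1"
    for doc in roles rolesmapping actiongroups; do
        echo "== $doc =="
        # Defaults to `oc`; set ES_ACL_RUNNER=echo to print the commands instead.
        ${ES_ACL_RUNNER:-oc} exec -c elasticsearch "$espod" -- es_acl get --doc="$doc"
    done
}
```

Call it as 'dump_es_acls $espod' right after a login attempt, so the dump reflects the permissions seeded for that user.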


[1] https://github.com/openshift/origin-web-console/pull/3088
[2] https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/sgconfig/sg_action_groups.yml#L119
[3] https://github.com/openshift/origin-aggregated-logging/blob/release-3.9/elasticsearch/utils/es_acl
[4] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/kibana-index-name

Comment 7 Jeff Cantrill 2018-11-12 20:22:38 UTC
@Steve,

Can you confirm if updating the action groups to correct the 'indices:data/read/mget' permission and reseeding resolves their error?

Comment 8 Steven Walter 2018-11-14 16:03:50 UTC
Customer has not reported on whether the proposed fixes helped. They did notice their MTU was set to an unexpected value on some of the nodes, and got *some* better results from that (at least, the blank pages issue). They said they were going to install a new cluster to test with, but I have not heard back from them since.
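
In case it helps anyone checking for the same MTU inconsistency, a rough way to spot nodes that disagree is sketched below. The helper name report_mtu_mismatch is made up, and it assumes the per-node values have already been collected as "<node> <mtu>" lines (for example by reading /sys/class/net/<interface>/mtu on each node; which interface matters depends on the environment).

```shell
#!/bin/sh
# Sketch only: report_mtu_mismatch is a hypothetical helper.
# Reads "<node> <mtu>" lines on stdin and groups nodes by MTU value.
# Prints the groups and exits 1 when more than one distinct MTU is seen.
report_mtu_mismatch() {
    awk '
        NF >= 2 { count[$2]++; nodes[$2] = nodes[$2] " " $1 }
        END {
            distinct = 0
            for (m in count) distinct++
            if (distinct > 1) {
                for (m in count) printf "mtu %s:%s\n", m, nodes[m]
                exit 1
            }
        }
    '
}
```

No output and exit status 0 means every listed node reported the same MTU.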

If you'd like we can close this with INSUFFICIENT_DATA for now -- reopening if required.

Comment 12 Jeff Cantrill 2019-05-02 13:05:43 UTC
Closing per original resolution.  Please open or clone an issue as pertinent for the proper version.  Your attachment points to a 3.11 cluster, which is ES5 and completely different from 3.9, which is ES2.

