Description of problem:

Customer in secured environment is seeing a few strange behaviors in the EFK stack.

In the non-ops kibana, they either get:

Courier Fetch: unhandled courier request error: [security_exception] no permissions for indices:data/read/mget

Or a gateway timeout with similar/same message as:
https://github.com/elastic/kibana/issues/12707

These seem relatively unpredictable when it will occur (i.e. sometimes they'll get one message, then after clearing browser cache or using a different browser/client they'll get the other, etc)

- They tried increasing timeouts as per unverified kcs https://access.redhat.com/solutions/3508851 but this didn't do much

oc set env dc/logging-kibana ELASTICSEARCH_REQUESTTIMEOUT=119000 ELASTICSEARCH_SHARDTIMEOUT=119000

- We considered that Elasticsearch might be too slow to respond -- INSTANCE_RAM and the limit are set to 6gb in one environment and 8gb in another. However, the cluster is essentially empty (only a single app running for test purposes, a single master, only a couple nodes). All the indices are green and healthy, fluentd is not reporting any errors. Also the issue occurs even when trying to view pages like the settings page, so it's not only when contacting the backend (so far as we can tell)

In the ops kibana, after authentication step, kibana replies to the client with just an empty response body. From the browser's perspective, it's an empty body but a 200 response message. The kibana logs seem to indicate the same -- a response length of 12 (maybe header?) and a 200. They set up graphical tools on the master host so they could use a browser *from* the master -- same behavior.

Version-Release number of selected component (if applicable):
3.9
kibana 4.064 build 10229

How reproducible:
Unconfirmed. This affects multiple environments, and multiple image versions (they recently updated to the current latest 3.9 images just to be sure) but we can't seem to reproduce in test environments.
I normally would suspect an environment or config issue, however since even kibana's logs seem to indicate a near-empty response body I'm not entirely sure.

Notes:
I figure we're missing something else reasonably "obvious" to look at, at least with respect to the timeouts -- but I'm not sure how that would affect the settings menu and similar pages. Consulting is on site, so while we can't get data out (logs, configs, etc) we can interface through them.
Some of the navigational issues from web-console will be resolved by [1]. I would expect the 'indices:data/read/mget' error to be resolved by [2] for users who have this permission; I don't see that change in 3.9.

One can dump the access permissions initially after a login attempt with something like [3]:

'oc exec -c elasticsearch $espod -- es_acl get --doc=(roles|rolesmapping|actiongroups)'

Additionally, you can rsh to the pod, manually update $HOME/sgconfig/sg_action_groups.yaml and run 'es_seed_acl' to see if it resolves the permission problem. We probably need to update the values in 3.9.

You can additionally use [4] to determine what the role name will be for a given user, assuming the kibana index mode is not 'unique'. It could be this failure is causing the issue.

[1] https://github.com/openshift/origin-web-console/pull/3088
[2] https://github.com/openshift/origin-aggregated-logging/blob/master/elasticsearch/sgconfig/sg_action_groups.yml#L119
[3] https://github.com/openshift/origin-aggregated-logging/blob/release-3.9/elasticsearch/utils/es_acl
[4] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/kibana-index-name
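
Since the '--doc=(roles|rolesmapping|actiongroups)' notation above means "pick one of these", a small loop may be easier for the on-site consultant to run. This is only a sketch: the function name dump_acl_docs and the DRY_RUN switch are made up for illustration, and it assumes $espod holds the name of an Elasticsearch pod and that es_acl accepts each document name exactly as quoted above.

```shell
#!/usr/bin/env bash
# Sketch: dump each ACL document mentioned in this comment, one at a time.
# dump_acl_docs and DRY_RUN are hypothetical helpers, not part of the images.
dump_acl_docs() {
  local espod="$1" doc
  for doc in roles rolesmapping actiongroups; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Only print the command that would be run.
      echo "oc exec -c elasticsearch $espod -- es_acl get --doc=$doc"
    else
      echo "== $doc =="
      oc exec -c elasticsearch "$espod" -- es_acl get --doc="$doc"
    fi
  done
}

# Example (pod name is a placeholder):
#   DRY_RUN=1 dump_acl_docs "$espod"
```

Running it once right after a failed login and once after reseeding with 'es_seed_acl' should make it easier to compare the action groups before and after the change.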
@Steve, Can you confirm if updating the action groups to correct the 'indices:data/read/mget' permission and reseeding resolves their error?
Customer has not reported on whether the proposed fixes helped. They did notice their MTU was set to an unexpected value on some of the nodes, and got *some* better results from that (at least, the blank pages issue). They said they were going to install a new cluster to test with, but I have not heard back from them since. If you'd like we can close this with INSUFFICIENT_DATA for now -- reopening if required.
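
For anyone hitting the same blank pages, a quick way to spot an unexpected MTU on a node might look like the sketch below. It reads the per-interface values Linux exposes under /sys/class/net; the function names and the expected value passed in are assumptions for illustration, since the customer did not say which value they considered correct.

```shell
#!/usr/bin/env bash
# Sketch: list interface MTUs on a node and flag the ones that differ
# from an expected value. list_mtus and mtu_mismatches are hypothetical names.

# Print "<interface> <mtu>" for every interface under the given directory.
list_mtus() {
  local base="${1:-/sys/class/net}" f
  for f in "$base"/*/mtu; do
    [ -r "$f" ] || continue
    printf '%s %s\n' "$(basename "$(dirname "$f")")" "$(cat "$f")"
  done
}

# Print only the interfaces whose MTU is not the expected one.
mtu_mismatches() {
  local expected="$1" base="${2:-/sys/class/net}"
  list_mtus "$base" | awk -v want="$expected" '$2 != want'
}

# Example (1450 is a placeholder, use whatever the environment expects):
#   mtu_mismatches 1450
```

Comparing the output across nodes should show quickly whether only some of them carry the odd value.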
Closing per original resolution. Please open/clone an issue as pertinent for the proper version. Your attachment points to a 3.11 cluster, which is ES5 and completely different from 3.9, which is ES2.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days