Created attachment 1536692 [details]
Description of problem:
Installing or upgrading to the latest image version of EFK (at the time of writing, v3.11.69) ends up with Kibana constantly returning 504 Gateway Timeout in the browser.
Also, from inside the kibana container, curl against the Elasticsearch API errors out or hangs:
$ oc exec -c kibana <some-kibana-pod-name> -- \
curl --cacert /etc/kibana/keys/ca \
--cert /etc/kibana/keys/cert --key /etc/kibana/keys/key \
From kibana-proxy we are seeing a lot of "http: proxy error: context canceled" errors.
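Those proxy errors can be checked directly on the sidecar; a quick sketch (the pod name is illustrative, substitute one from your own listing):

```shell
# List the Kibana pods in the logging project
oc get pods -l component=kibana

# Tail the kibana-proxy sidecar logs and filter for the proxy errors
# (pod name below is illustrative)
oc logs -c kibana-proxy logging-kibana-1-abc12 | grep "proxy error"
```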
Version-Release number of selected component (if applicable):
Logging images v3.11.69
How reproducible:
For some customers, every time.
Steps to Reproduce:
1. Install logging stack with the v3.11.69 tag
2. Check status from inside elasticsearch pod(s)
3. Curl es-api from inside kibana and access kibana URL
kibana URL gives 504 Gateway Timeout response
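Steps 2 and 3 above can be sketched as follows; the pod names are illustrative, and `es_util` is the query helper shipped in the OpenShift Elasticsearch image:

```shell
# Step 2: cluster status from inside an Elasticsearch pod
oc exec -c elasticsearch logging-es-data-master-abc123-1-xyz12 -- \
  es_util --query=_cluster/health?pretty

# Step 3: curl the ES API from inside the Kibana pod using its client certs
oc exec -c kibana logging-kibana-1-abc12 -- \
  curl -s --cacert /etc/kibana/keys/ca \
       --cert /etc/kibana/keys/cert --key /etc/kibana/keys/key \
       https://logging-es:9200/_cat/indices?v
```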
Can you confirm you are not generally experiencing networking issues? Kibana has an unconfigurable request timeout of 3 seconds.
Created attachment 1536897 [details]
Created attachment 1536898 [details]
Created attachment 1536899 [details]
Created attachment 1536900 [details]
Could we get the output of ? I'm interested in understanding what resources are allocated to Elasticsearch.
(In reply to Andre Costa from comment #8)
> Hi Jeff,
> I don't have the logging-dump from my customer, but I have been working with
> him on several approaches to try to understand this behaviour of Kibana and
> Elasticsearch. We have upgraded the logging stack to the latest 3.11 image
> version, which solved the Kibana gateway-timeout issue for a brief moment.
> Apart from that, the Elasticsearch pod continues to be very unstable and is
> constantly being restarted.
Likely because of the readiness probe.
> oc describe pod logging-es-data-master-hwq893ny-2-wkd7x
> Name: logging-es-data-master-hwq893ny-2-wkd7x
> Controlled By: ReplicationController/logging-es-data-master-hwq893ny-2
> memory: 8Gi
> cpu: 1
> memory: 8Gi
> memory: 64Mi
> cpu: 100m
> memory: 64Mi
Your cluster is starved for memory, I imagine. Our out-of-the-box recommended minimum is 16G, and you really should bump it to 32G, or 64G if it's available. Elasticsearch is a resource hog and more is better. Note that this amount is split in half between the heap and the temp space made available to the container, because of how Elasticsearch utilizes memory. The maximum operational heap in your example is 4G, which is not much at all if there is any significant load on the cluster.
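As a sketch, the memory limit could be raised on the deploymentconfig directly (the DC name is taken from the describe output above; the values are examples, not a tested recommendation):

```shell
# Bump both limit and request; the image sizes the heap to half the
# container memory, so a 32Gi limit yields roughly a 16G heap
oc set resources dc/logging-es-data-master-hwq893ny \
  -c elasticsearch --limits=memory=32Gi --requests=memory=32Gi
```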
> sh-4.2$ curl --cacert /etc/kibana/keys/ca --cert /etc/kibana/keys/cert --key
> /etc/kibana/keys/key -XGET https://logging-es:9200/_cat/indices?v
> permissions for [indices:monitor/stats] and User
> roles=]"}],"type":"security_exception","reason":"no permissions for
> [indices:monitor/stats] and User
If you were to look at the Elasticsearch logs (e.g. oc exec -c elasticsearch $pod -- logs), I imagine the ACL seeding is failing. This is probably also why the pods are restarting: seeding is part of what determines the success or failure of the readiness probe. You could remove the readiness probe to ensure the pods don't get prematurely restarted by the platform, and then we can correct the issue after the nodes form a cluster.
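A minimal sketch of that workaround, assuming the DC and pod names from the describe output above:

```shell
# Remove the readiness probe so the platform stops cycling the pods
oc set probe dc/logging-es-data-master-hwq893ny \
  -c elasticsearch --readiness --remove

# Watch the Elasticsearch logs for ACL seeding errors while the nodes join
oc logs -f -c elasticsearch logging-es-data-master-hwq893ny-2-wkd7x
```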
A few items regarding troubleshooting may be of interest to you, along with some scripts that may be of use.
3.11 PR https://github.com/openshift/origin-aggregated-logging/pull/1558
This fix does not directly resolve the reported issues; as per #c8, I believe more memory needs to be given to the cluster. It will, however, resolve the issue fixed in Kibana 6.x where the pingTimeout was hard-coded to 3000ms.
Tested in logging-kibana5-v3.11.98-2
elasticsearch.pingTimeout can be set by:
oc set env dc/logging-kibana ELASTICSEARCH_PINGTIMEOUT=5000
# oc exec -c kibana logging-kibana-2-72z7r env |grep PING
Moving this bug to VERIFIED.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.