Created attachment 1536692 [details]
kibana-proxy-logs

Description of problem:
Installing or upgrading to the latest image version of EFK (at the moment of this writing v3.11.69) ends up with Kibana giving a constant 504 Gateway Timeout in the browser. Also, from inside the kibana container, curling the ES API errors out or hangs:

$ oc exec -c kibana <some-kibana-pod-name> -- \
    curl --cacert /etc/kibana/keys/ca \
    --cert /etc/kibana/keys/cert --key /etc/kibana/keys/key \
    -XGET https://logging-es:9200/_cat/indices?v

From kibana-proxy we are seeing a lot of "http: proxy error: context canceled" errors.

Version-Release number of selected component (if applicable):
OCP 3.11.69
Logging images v3.11.69

How reproducible:
On some customers, every time

Steps to Reproduce:
1. Install the logging stack with the v3.11.69 tag
2. Check status from inside the elasticsearch pod(s)
3. Curl the ES API from inside kibana and access the kibana URL

Actual results:
The kibana URL gives a 504 Gateway Timeout response
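Additional info:
A minimal set of checks for triaging the symptom (a rough sketch; the "logging" namespace and the pod names are placeholders, and the certificate paths reflect the usual 3.11 image layout, so verify them in your pods):

# Pod status and restart counts in the logging project
oc -n logging get pods -o wide

# Look for the proxy errors in the kibana-proxy sidecar
oc -n logging logs -c kibana-proxy <some-kibana-pod-name> | grep "proxy error"

# Cluster health as seen from inside an Elasticsearch pod
oc -n logging exec -c elasticsearch <some-es-pod-name> -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
         --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         "https://localhost:9200/_cluster/health?pretty"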
Can you confirm you are not generally experiencing networking issues [1]? Kibana has an unconfigurable request timeout of 3 seconds.

[1] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-kibana-to-es-connectivity
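One quick way to see whether requests from Kibana are bumping into that 3 second ceiling is to time a request from inside the kibana container (a sketch along the lines of the check script in [1]; the pod name is a placeholder):

# Anything approaching 3 seconds will trip Kibana's request timeout
oc exec -c kibana <some-kibana-pod-name> -- \
    curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
    --cacert /etc/kibana/keys/ca \
    --cert /etc/kibana/keys/cert --key /etc/kibana/keys/key \
    https://logging-es:9200/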
Created attachment 1536897 [details] elasticsearch_logs_blkbkum001
Created attachment 1536898 [details] es_proxy_logs_blkbkum001
Created attachment 1536899 [details] kibana_logs_blkbkum001
Created attachment 1536900 [details] kibana_proxy_logs_blkbkum001
Could we get the output of [1]? I'm interested in understanding what resources are allocated to Elasticsearch.

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/hack/logging-dump.sh
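If running the full dump is not practical, the allocation alone can be pulled directly with oc (a sketch; the namespace and the component=es label are assumptions based on a default 3.11 install):

# Memory requests/limits on the Elasticsearch deployment configs
oc -n logging get dc -l component=es -o custom-columns=\
NAME:.metadata.name,\
REQ_MEM:.spec.template.spec.containers[0].resources.requests.memory,\
LIM_MEM:.spec.template.spec.containers[0].resources.limits.memory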
(In reply to Andre Costa from comment #8)
> Hi Jeff,
>
> I don't have the logging-dump from my customer, but I have been working with him
> on several approaches to try to understand this behaviour from kibana and
> elasticsearch. We have upgraded the logging stack to the latest 3.11 image
> version, which solved the kibana gateway timeout issue for a brief moment.
> Apart from that, the elasticsearch pod continues to be very unstable and is
> constantly being restarted. Likely because of the readiness probe.
>
> oc describe pod logging-es-data-master-hwq893ny-2-wkd7x
> Name:           logging-es-data-master-hwq893ny-2-wkd7x
> Controlled By:  ReplicationController/logging-es-data-master-hwq893ny-2
> Containers:
>   elasticsearch:
>     Limits:
>       memory:  8Gi
>     Requests:
>       cpu:     1
>       memory:  8Gi
>   proxy:
>     Limits:
>       memory:  64Mi
>     Requests:
>       cpu:     100m
>       memory:  64Mi

Your cluster is starved for memory, I imagine. Our OOTB recommended minimum is 16G, and you really should bump it to probably 32G, or 64G if it's available. Elasticsearch is a resource hog and more is better. Note this amount is split in half because of how Elasticsearch utilizes memory and the temp space made available to the container. Your max operational heap in your example is 4G, which is not much at all if there is any significant load on the cluster.

> sh-4.2$ curl --cacert /etc/kibana/keys/ca --cert /etc/kibana/keys/cert --key
> /etc/kibana/keys/key -XGET https://logging-es:9200/_cat/indices?v
> {"error":{"root_cause":[{"type":"security_exception","reason":"no
> permissions for [indices:monitor/stats] and User
> [name=CN=system.logging.kibana,OU=OpenShift,O=Logging,
> roles=[]]"}],"type":"security_exception","reason":"no permissions for
> [indices:monitor/stats] and User
> [name=CN=system.logging.kibana,OU=OpenShift,O=L

If you were to look at the Elasticsearch logs (e.g. oc exec -c elasticsearch $pod -- logs), I imagine the ACL seeding is failing. This is probably also the reason the pods are restarting; seeding is part of what determines success/failure of the readiness probes. You could remove the readiness probes to ensure the pods don't get prematurely restarted by the platform, and then we could correct things after the nodes form a cluster.

A few items regarding troubleshooting that may be of interest to you [1], and some scripts that may be of use to you [2].

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/docs/troubleshooting.md#elasticsearch
[2] https://github.com/jcantrill/cluster-logging-tools/tree/master/scripts
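To make the suggestions above concrete, a sketch only (the DC name is derived from the pod/RC names in your output, and on an ansible-managed install the inventory variable openshift_logging_es_memory_limit is the supported way to size memory rather than editing the DC directly):

# Check whether ACL seeding is failing (the 'logs' helper is part of the ES image)
oc exec -c elasticsearch logging-es-data-master-hwq893ny-2-wkd7x -- logs | grep -iE "acl|seed|exception"

# Temporarily drop the readiness probe so the platform stops restarting the pod
oc set probe dc/logging-es-data-master-hwq893ny --readiness --remove

# Raise memory; the heap is set to roughly half of this, so 16Gi gives ~8G of heap
oc set resources dc/logging-es-data-master-hwq893ny \
    --requests=memory=16Gi --limits=memory=16Gi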
3.11 PR https://github.com/openshift/origin-aggregated-logging/pull/1558
This fix does not directly resolve the reported issues; as per #c8, I believe more memory needs to be given to the cluster. It will, however, resolve the issue fixed in 6.x of Kibana, where the pingTimeout was hard-coded to 3000ms.
Tested in logging-kibana5-v3.11.98-2.

elasticsearch.pingTimeout can be set by:
oc set env dc/logging-kibana ELASTICSEARCH_PINGTIMEOUT=5000

# oc exec -c kibana logging-kibana-2-72z7r env | grep PING
ELASTICSEARCH_PINGTIMEOUT=5000

Move this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0636