Bug 1679159
Summary: | [EFK] Kibana responds with Gateway Timeout | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Andre Costa <andcosta>
Component: | Logging | Assignee: | Jeff Cantrill <jcantril>
Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 3.11.0 | CC: | andcosta, aos-bugs, fhirtz, jcantril, qitang, rmeggins, travi
Target Milestone: | --- | |
Target Release: | 3.11.z | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2019-04-11 05:38:34 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |

Doc Text:
Cause: High network latency between Kibana and Elasticsearch, due either to network issues or to under-allocated memory for Elasticsearch.
Consequence: Kibana is unusable because of the gateway timeout.
Fix: Backport changes from Kibana 6.x that allow the ping timeout to be modified. Admins are now able to override the default pingTimeout of 3000ms by setting the ELASTICSEARCH_REQUESTTIMEOUT environment variable.
Result: Kibana remains functional until the underlying conditions can be resolved.
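For reference, a minimal sketch of applying the override described in the Doc Text above on a 3.11 logging stack. This is an illustration only: the command and value mirror the verification comment later in this bug, ELASTICSEARCH_PINGTIMEOUT is the variable confirmed there, and `dc/logging-kibana` is the stack's default deployment config name.

```sh
# Sketch: override Kibana's default 3000ms Elasticsearch ping timeout.
# Variable name and value are taken from the verification comment in this bug.
oc set env dc/logging-kibana ELASTICSEARCH_PINGTIMEOUT=5000

# After the pod redeploys, confirm the variable is present
# (replace <kibana-pod> with the actual pod name):
oc exec -c kibana <kibana-pod> -- env | grep PINGTIMEOUT
```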
Description (Andre Costa, 2019-02-20 13:37:50 UTC)
Can you confirm you are not generally experiencing networking issues [1]? Kibana has an unconfigurable request timeout set to 3 sec.

[1] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-kibana-to-es-connectivity

Created attachment 1536897 [details]: elasticsearch_logs_blkbkum001
Created attachment 1536898 [details]: es_proxy_logs_blkbkum001
Created attachment 1536899 [details]: kibana_logs_blkbkum001
Created attachment 1536900 [details]: kibana_proxy_logs_blkbkum001
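Regarding the connectivity question above, a minimal manual latency check might look like the following. This is a sketch only: it assumes the openshift-logging project, the component=kibana pod label, and the client certificate paths that appear in the curl example quoted later in this bug.

```sh
# Sketch: time a request from the Kibana container to Elasticsearch.
# Round-trips well above ~3s will trip the hard-coded 3000ms ping timeout
# discussed in this bug. Namespace, pod label, and cert paths are assumptions.
oc project openshift-logging
KIBANA_POD=$(oc get pods -l component=kibana -o jsonpath='{.items[0].metadata.name}')

oc exec -c kibana "$KIBANA_POD" -- curl -s -o /dev/null \
  -w 'HTTP %{http_code} in %{time_total}s\n' \
  --cacert /etc/kibana/keys/ca \
  --cert /etc/kibana/keys/cert \
  --key /etc/kibana/keys/key \
  https://logging-es:9200/
```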
Could we get the output of [1]? I'm interested in understanding what resources are allocated to Elasticsearch.

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/hack/logging-dump.sh

(In reply to Andre Costa from comment #8)
> Hi Jeff,
>
> I don't have the logging-dump from my customer, but I have been working with him
> on several approaches to try to understand this behaviour from Kibana and
> Elasticsearch. We have upgraded the logging stack to the latest 3.11 image
> version, which solved the Kibana gateway timeout issue for a brief moment.
> Apart from that, the Elasticsearch pod continues to be very unstable and is
> constantly being restarted.

Likely because of the readiness probe.

> oc describe pod logging-es-data-master-hwq893ny-2-wkd7x
> Name: logging-es-data-master-hwq893ny-2-wkd7x
> Controlled By: ReplicationController/logging-es-data-master-hwq893ny-2
> Containers:
>   elasticsearch:
>     Limits:
>       memory: 8Gi
>     Requests:
>       cpu: 1
>       memory: 8Gi
>   proxy:
>     Limits:
>       memory: 64Mi
>     Requests:
>       cpu: 100m
>       memory: 64Mi

Your cluster is starved for memory, I imagine. Our out-of-the-box recommended minimum is 16G, and you really should bump it to probably 32G, or 64G if it is available. Elasticsearch is a resource hog and more is better. Note this amount is split in half because of how Elasticsearch utilizes memory and the temp space made available to the container. Your max operational heap in your example is 4G, which is not much at all if there is any significant load on the cluster.

> sh-4.2$ curl --cacert /etc/kibana/keys/ca --cert /etc/kibana/keys/cert --key /etc/kibana/keys/key -XGET https://logging-es:9200/_cat/indices?v
> {"error":{"root_cause":[{"type":"security_exception","reason":"no permissions for [indices:monitor/stats] and User [name=CN=system.logging.kibana,OU=OpenShift,O=Logging, roles=[]]"}],"type":"security_exception","reason":"no permissions for [indices:monitor/stats] and User [name=CN=system.logging.kibana,OU=OpenShift,O=L

If you were to look at the Elasticsearch logs (e.g. oc exec -c elasticsearch $pod -- logs), I imagine the ACL seeding is failing. This is probably also the reason the pods are restarting; seeding is part of what determines the success or failure of the readiness probes. You could remove the readiness probes to ensure the pods don't get prematurely restarted by the platform, and then we can correct things once the nodes form a cluster.

A few items regarding troubleshooting that may be of interest to you [1], and some scripts that may be of use to you [2].

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/docs/troubleshooting.md#elasticsearch
[2] https://github.com/jcantrill/cluster-logging-tools/tree/master/scripts

This fix does not directly resolve the reported issues; as per #c8, I believe more memory needs to be given to the cluster. It will, however, resolve the issue fixed in Kibana 6.x, where the pingTimeout was hard-coded to 3000ms.

Tested in logging-kibana5-v3.11.98-2. elasticsearch.pingTimeout can be set by:

oc set env dc/logging-kibana ELASTICSEARCH_PINGTIMEOUT=5000

# oc exec -c kibana logging-kibana-2-72z7r env | grep PING
ELASTICSEARCH_PINGTIMEOUT=5000

Moving this bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0636
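As a footnote to the sizing guidance in the comments above (the Elasticsearch heap ends up at roughly half of the container memory, with 16G as the recommended minimum), here is a hypothetical sketch of raising the allocation. The dc name comes from the describe output quoted above; on an installer-managed 3.11 cluster the persistent way to change this is the openshift_logging_es_memory_limit inventory variable rather than editing the dc directly.

```sh
# Hypothetical sketch: give the Elasticsearch container at least 16Gi
# (the operational heap is derived as roughly half of this), using the dc
# name from the describe output quoted above. The ES pods will redeploy.
oc set resources dc/logging-es-data-master-hwq893ny \
  --requests=memory=16Gi,cpu=1 --limits=memory=16Gi

# Check what the node actually came up with (label and env names may vary by image):
ES_POD=$(oc get pods -l component=es -o jsonpath='{.items[0].metadata.name}')
oc exec -c elasticsearch "$ES_POD" -- env | grep -i -E 'instance_ram|heap'
```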