Bug 1679159 - [EFK] Kibana responds with Gateway Timeout
Summary: [EFK] Kibana responds with Gateway Timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.11.0
Hardware: x86_64
OS: Linux
Severity: high
Priority: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-20 13:37 UTC by Andre Costa
Modified: 2020-01-27 13:55 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: High network latency between Kibana and Elasticsearch, due to either network issues or under-allocated memory for Elasticsearch.
Consequence: Kibana is unusable because of the gateway timeout.
Fix: Backport changes from Kibana 6.x that allow modification of the ping timeout. Admins are now able to override the default pingTimeout of 3000ms by setting the ELASTICSEARCH_PINGTIMEOUT environment variable.
Result: Kibana is functional until the underlying conditions can be resolved.
Clone Of:
Environment:
Last Closed: 2019-04-11 05:38:34 UTC
Target Upstream Version:


Attachments
kibana-proxy-logs (4.46 KB, text/plain)
2019-02-20 13:37 UTC, Andre Costa
elasticsearch_logs_blkbkum001 (10.14 KB, text/plain)
2019-02-21 06:24 UTC, Ravi Trivedi
es_proxy_logs_blkbkum001 (136.83 KB, text/plain)
2019-02-21 06:25 UTC, Ravi Trivedi
kibana_logs_blkbkum001 (3.00 MB, text/plain)
2019-02-21 06:26 UTC, Ravi Trivedi
kibana_proxy_logs_blkbkum001 (3.59 KB, text/plain)
2019-02-21 06:26 UTC, Ravi Trivedi


Links
System ID Priority Status Summary Last Updated
Github openshift origin-aggregated-logging pull 1558 None closed [release-3.11] bug 1679159. Allow setting of pingTimeout 2020-02-19 09:14:50 UTC
Red Hat Product Errata RHBA-2019:0636 None None None 2019-04-11 05:38:43 UTC

Description Andre Costa 2019-02-20 13:37:50 UTC
Created attachment 1536692 [details]
kibana-proxy-logs

Description of problem:
Installing or upgrading to the latest EFK image version (v3.11.69 at the time of this writing) ends up with Kibana constantly returning 504 Gateway Timeout in the browser.
Curling the Elasticsearch API from inside the Kibana container also errors out or hangs:

 $ oc exec -c kibana <some-kibana-pod-name> -- \
  curl --cacert /etc/kibana/keys/ca \
   --cert /etc/kibana/keys/cert --key /etc/kibana/keys/key \
   -XGET https://logging-es:9200/_cat/indices?v

From kibana-proxy we are seeing a lot of "http: proxy error: context canceled" errors.

Version-Release number of selected component (if applicable):
OCP 3.11.69
Logging images v3.11.69

How reproducible:
In some customer environments, every time.

Steps to Reproduce:
1. Install logging stack with the v3.11.69 tag
2. Check status from inside elasticsearch pod(s)
3. Curl es-api from inside kibana and access kibana URL

Actual results:
kibana URL gives 504 Gateway Timeout response

Comment 1 Jeff Cantrill 2019-02-20 14:32:24 UTC
Can you confirm you are not generally experiencing networking issues [1]? Kibana has an unconfigurable request timeout set to 3 seconds.

[1] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-kibana-to-es-connectivity
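As a rough manual alternative to the script in [1], a sketch of the same check (the pod name is a placeholder; the certificate paths are taken from the curl command in the bug description) that times a few requests from the Kibana container to the Elasticsearch service. Totals consistently near or above 3 seconds would explain the hard 3-second timeout expiring:

```shell
# Time five requests from the Kibana container to the logging-es service.
# Replace <kibana-pod> with an actual pod name from `oc get pods`.
for i in 1 2 3 4 5; do
  oc exec -c kibana <kibana-pod> -- \
    curl -s -o /dev/null -w '%{time_total}s\n' \
      --cacert /etc/kibana/keys/ca \
      --cert /etc/kibana/keys/cert \
      --key /etc/kibana/keys/key \
      https://logging-es:9200/
done
```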

Comment 3 Ravi Trivedi 2019-02-21 06:24:30 UTC
Created attachment 1536897 [details]
elasticsearch_logs_blkbkum001

Comment 4 Ravi Trivedi 2019-02-21 06:25:29 UTC
Created attachment 1536898 [details]
es_proxy_logs_blkbkum001

Comment 5 Ravi Trivedi 2019-02-21 06:26:06 UTC
Created attachment 1536899 [details]
kibana_logs_blkbkum001

Comment 6 Ravi Trivedi 2019-02-21 06:26:41 UTC
Created attachment 1536900 [details]
kibana_proxy_logs_blkbkum001

Comment 7 Jeff Cantrill 2019-03-05 20:45:00 UTC
Could we get the output of [1]? I'm interested in understanding what resources are allocated to Elasticsearch

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/hack/logging-dump.sh

Comment 9 Jeff Cantrill 2019-03-12 13:52:55 UTC
(In reply to Andre Costa from comment #8)
> Hi Jeff,
> 
> I don't have the logging-dump from my customer, but I have been working with him
> on several approaches to try to understand this behaviour from Kibana and
> Elasticsearch. We have upgraded the logging stack to the latest 3.11 image
> version, which solved the Kibana gateway timeout issue for a brief moment.
> Apart from that, the Elasticsearch pod continues to be very unstable and is
> constantly being restarted.

Likely because of the readiness probe.

> 
> oc describe pod logging-es-data-master-hwq893ny-2-wkd7x
> Name:               logging-es-data-master-hwq893ny-2-wkd7x

> Controlled By:      ReplicationController/logging-es-data-master-hwq893ny-2
> Containers:
>   elasticsearch:
>     Limits:
>       memory:  8Gi
>     Requests:
>       cpu:      1
>       memory:   8Gi
>   proxy:
>     Limits:
>       memory:  64Mi
>     Requests:
>       cpu:        100m
>       memory:     64Mi

Your cluster is starved for memory, I imagine.  Our out-of-the-box recommended minimum is 16G, and you really should bump it to probably 32G, or 64G if it's available.  Elasticsearch is a resource hog, and more is better.  Note that this amount is effectively split in half between the heap and the temp space made available to the container, because of how Elasticsearch utilizes memory.  Your max operational heap in this example is 4G, which is not much at all if there is any significant load on the cluster.
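One way to apply that advice directly, sketched against the deployment config shown above (a live edit like this may be reverted the next time the logging installer playbook runs, so the same value should also be reflected in the installer inventory):

```shell
# Raise the elasticsearch container's memory request and limit to 16Gi
# on the DC from the `oc describe pod` output above. With the stated
# half-split, this yields roughly an 8G operational heap.
oc set resources dc/logging-es-data-master-hwq893ny \
  -c elasticsearch \
  --requests=memory=16Gi --limits=memory=16Gi
```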

> 
> sh-4.2$ curl --cacert /etc/kibana/keys/ca --cert /etc/kibana/keys/cert --key
> /etc/kibana/keys/key -XGET https://logging-es:9200/_cat/indices?v
> {"error":{"root_cause":[{"type":"security_exception","reason":"no
> permissions for [indices:monitor/stats] and User
> [name=CN=system.logging.kibana,OU=OpenShift,O=Logging,
> roles=[]]"}],"type":"security_exception","reason":"no permissions for
> [indices:monitor/stats] and User
> [name=CN=system.logging.kibana,OU=OpenShift,O=L

If you were to look at the Elasticsearch logs (e.g. oc exec -c elasticsearch $pod -- logs), I imagine the ACL seeding is failing.  This is probably also the reason the pods are restarting: seeding is part of what determines the success or failure of the readiness probe.  You could remove the readiness probe to ensure the pods don't get prematurely restarted by the platform, and then we could correct things once the nodes form a cluster.
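Removing the readiness probe as suggested above could look like this (DC name taken from the earlier output; remember to restore the probe once the cluster is healthy):

```shell
# Temporarily drop the readiness probe on the elasticsearch container so
# the platform stops restarting the pod while ACL seeding is sorted out.
oc set probe dc/logging-es-data-master-hwq893ny \
  -c elasticsearch --readiness --remove
```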


A few troubleshooting items that may be of interest to you [1], and some scripts that may be of use [2]:

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/docs/troubleshooting.md#elasticsearch
[2] https://github.com/jcantrill/cluster-logging-tools/tree/master/scripts

Comment 11 Jeff Cantrill 2019-03-14 12:09:29 UTC
This fix does not directly resolve the reported issue; as per #c8, I believe more memory needs to be given to the cluster.  It will, however, resolve the issue fixed in Kibana 6.x, where the pingTimeout was hard-coded to 3000ms.

Comment 13 Qiaoling Tang 2019-03-25 03:07:07 UTC
Tested in logging-kibana5-v3.11.98-2

elasticsearch.pingTimeout can be set by:
oc set env dc/logging-kibana ELASTICSEARCH_PINGTIMEOUT=5000

# oc exec -c kibana logging-kibana-2-72z7r env |grep PING
ELASTICSEARCH_PINGTIMEOUT=5000


Moving this bug to VERIFIED.

Comment 15 errata-xmlrpc 2019-04-11 05:38:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0636

