Bug 1679159
Summary: | [EFK] Kibana responds with Gateway Timeout | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Andre Costa <andcosta>
Component: | Logging | Assignee: | Jeff Cantrill <jcantril>
Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 3.11.0 | CC: | andcosta, aos-bugs, fhirtz, jcantril, qitang, rmeggins, travi
Target Milestone: | --- | |
Target Release: | 3.11.z | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2019-04-11 05:38:34 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |

Doc Text:
Cause: High network latency between Kibana and Elasticsearch, due either to network issues or to under-allocated memory for Elasticsearch.
Consequence: Kibana is unusable because of the gateway timeout.
Fix: Backport changes from Kibana 6.x that allow the ping timeout to be modified. Admins are now able to override the default pingTimeout of 3000ms by setting the ELASTICSEARCH_REQUESTTIMEOUT environment variable.
Result: Kibana remains functional until the underlying conditions can be resolved.
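For reference, a minimal sketch of applying the override described in the Doc Text above on a 3.11 logging stack. This is an illustration only: the command and value mirror the verification comment later in this bug, ELASTICSEARCH_PINGTIMEOUT is the variable confirmed there, and `dc/logging-kibana` is the stack's default deployment config name.

```sh
# Sketch: override Kibana's default 3000ms Elasticsearch ping timeout.
# Variable name and value are taken from the verification comment in this bug.
oc set env dc/logging-kibana ELASTICSEARCH_PINGTIMEOUT=5000

# After the pod redeploys, confirm the variable is present
# (replace <kibana-pod> with the actual pod name):
oc exec -c kibana <kibana-pod> -- env | grep PINGTIMEOUT
```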
Description (Andre Costa, 2019-02-20 13:37:50 UTC)
Can you confirm you are not generally experiencing networking issues [1]? Kibana has an unconfigurable request timeout set to 3 sec.

[1] https://github.com/jcantrill/cluster-logging-tools/blob/master/scripts/check-kibana-to-es-connectivity

Created attachment 1536897 [details]: elasticsearch_logs_blkbkum001
Created attachment 1536898 [details]: es_proxy_logs_blkbkum001
Created attachment 1536899 [details]: kibana_logs_blkbkum001
Created attachment 1536900 [details]: kibana_proxy_logs_blkbkum001
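Regarding the connectivity question above, a minimal manual latency check might look like the following. This is a sketch only: it assumes the openshift-logging project, the component=kibana pod label, and the client certificate paths that appear in the curl example quoted later in this bug.

```sh
# Sketch: time a request from the Kibana container to Elasticsearch.
# Round-trips well above ~3s will trip the hard-coded 3000ms ping timeout
# discussed in this bug. Namespace, pod label, and cert paths are assumptions.
oc project openshift-logging
KIBANA_POD=$(oc get pods -l component=kibana -o jsonpath='{.items[0].metadata.name}')

oc exec -c kibana "$KIBANA_POD" -- curl -s -o /dev/null \
  -w 'HTTP %{http_code} in %{time_total}s\n' \
  --cacert /etc/kibana/keys/ca \
  --cert /etc/kibana/keys/cert \
  --key /etc/kibana/keys/key \
  https://logging-es:9200/
```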
Could we get the output of [1]? I'm interested in understanding what resources are allocated to Elasticsearch.

[1] https://github.com/openshift/origin-aggregated-logging/blob/master/hack/logging-dump.sh

(In reply to Andre Costa from comment #8)
> Hi Jeff,
>
> I don't have the logging-dump from my customer, but I have been working with him
> on several approaches to try to understand this behaviour from Kibana and
> Elasticsearch. We have upgraded the logging stack to the latest 3.11 image
> version, which solved the Kibana gateway timeout issue for a brief moment.
> Apart from that, the Elasticsearch pod continues to be very unstable and is
> constantly being restarted.

Likely because of the readiness probe.

> oc describe pod logging-es-data-master-hwq893ny-2-wkd7x
> Name: logging-es-data-master-hwq893ny-2-wkd7x
> Controlled By: ReplicationController/logging-es-data-master-hwq893ny-2
> Containers:
>   elasticsearch:
>     Limits:
>       memory: 8Gi
>     Requests:
>       cpu: 1
>       memory: 8Gi
>   proxy:
>     Limits:
>       memory: 64Mi
>     Requests:
>       cpu: 100m
>       memory: 64Mi

Your cluster is starved for memory, I imagine. Our out-of-the-box recommended minimum is 16G, and you really should bump it to probably 32G, or 64G if it is available. Elasticsearch is a resource hog and more is better. Note this amount is split in half because of how Elasticsearch utilizes memory and the temp space made available to the container. Your max operational heap in your example is 4G, which is not much at all if there is any significant load on the cluster.

> sh-4.2$ curl --cacert /etc/kibana/keys/ca --cert /etc/kibana/keys/cert --key /etc/kibana/keys/key -XGET https://logging-es:9200/_cat/indices?v
> {"error":{"root_cause":[{"type":"security_exception","reason":"no permissions for [indices:monitor/stats] and User [name=CN=system.logging.kibana,OU=OpenShift,O=Logging, roles=[]]"}],"type":"security_exception","reason":"no permissions for [indices:monitor/stats] and User [name=CN=system.logging.kibana,OU=OpenShift,O=L

If you were to look at the Elasticsearch logs (e.g. oc exec -c elasticsearch $pod -- logs), I imagine the ACL seeding is failing. This is probably also the reason the pods are restarting; seeding is part of what determines the success or failure of the readiness probes. You could remove the readiness probes to ensure the pods don't get prematurely restarted by the platform, and then we can correct things once the nodes form a cluster.

A few items regarding troubleshooting that may be of interest to you [1], and some scripts that may be of use to you [2].

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/docs/troubleshooting.md#elasticsearch
[2] https://github.com/jcantrill/cluster-logging-tools/tree/master/scripts

This fix does not directly resolve the reported issues; as per #c8, I believe more memory needs to be given to the cluster. It will, however, resolve the issue fixed in Kibana 6.x, where the pingTimeout was hard-coded to 3000ms.

Tested in logging-kibana5-v3.11.98-2. elasticsearch.pingTimeout can be set by:

oc set env dc/logging-kibana ELASTICSEARCH_PINGTIMEOUT=5000

# oc exec -c kibana logging-kibana-2-72z7r env | grep PING
ELASTICSEARCH_PINGTIMEOUT=5000

Moving this bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0636
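As a footnote to the sizing guidance in the comments above (the Elasticsearch heap ends up at roughly half of the container memory, with 16G as the recommended minimum), here is a hypothetical sketch of raising the allocation. The dc name comes from the describe output quoted above; on an installer-managed 3.11 cluster the persistent way to change this is the openshift_logging_es_memory_limit inventory variable rather than editing the dc directly.

```sh
# Hypothetical sketch: give the Elasticsearch container at least 16Gi
# (the operational heap is derived as roughly half of this), using the dc
# name from the describe output quoted above. The ES pods will redeploy.
oc set resources dc/logging-es-data-master-hwq893ny \
  --requests=memory=16Gi,cpu=1 --limits=memory=16Gi

# Check what the node actually came up with (label and env names may vary by image):
ES_POD=$(oc get pods -l component=es -o jsonpath='{.items[0].metadata.name}')
oc exec -c elasticsearch "$ES_POD" -- env | grep -i -E 'instance_ram|heap'
```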