Bug 1874934 - Kibana gateway timeout and elasticsearch cluster health unknown
Summary: Kibana gateway timeout and elasticsearch cluster health unknown
Keywords:
Status: CLOSED DUPLICATE of bug 1883357
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Periklis Tsirakidis
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-09-02 15:40 UTC by Steven Walter
Modified: 2023-09-14 06:08 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-30 14:17:30 UTC
Target Upstream Version:
Embargoed:



Description Steven Walter 2020-09-02 15:40:13 UTC
Description of problem:
In the customer environment, the Elasticsearch cluster health flaps between "Unknown" and "green":


$ oc get Elasticsearch  -n openshift-logging -o yaml | egrep -i "health|status"
  status:
      status: cluster health unknown              <-----------
    clusterHealth: ""
$ oc get Elasticsearch  -n openshift-logging -o yaml | egrep -i "health|status"
  status:
      status: green                                    <-----------
    clusterHealth: ""

Kibana shows intermittent 504 gateway timeouts.
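For reference, the transitions can be captured over time with a simple poll loop (a sketch reusing the same filter as above; the 30-second interval is arbitrary):

$ while true; do date -u; oc get Elasticsearch -n openshift-logging -o yaml | egrep -i "health|status"; sleep 30; done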


Version-Release number of selected component (if applicable):
BUILD_RELEASE=202008210157.p0
BUILD_VERSION=v4.4.0

How reproducible:
Unconfirmed



Additional info:
We see a lot of messages like the following in the ES logs:

[2020-09-01T13:55:33,619][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][165054] overhead, spent [286ms] collecting in the last [1s]
[2020-09-01T13:55:38,726][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][165059] overhead, spent [445ms] collecting in the last [1s]
[2020-09-01T13:55:39,780][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][165060] overhead, spent [356ms] collecting in the last [1s]

We attempted increasing the RAM just in case, but the issue persists. Also, the heap is not overloaded:

ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.129.0.16           31          95  30    3.76    4.18     3.93 mdi       -      elasticsearch-cdm-953io34j-3
10.131.0.9            58          80  28    2.20    2.02     1.84 mdi       *      elasticsearch-cdm-953io34j-1
10.131.8.13           66          99  26    2.43    2.69     2.64 mdi       -      elasticsearch-cdm-953io34j-2
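For the record, node stats like the above can be pulled with something like the following (a sketch, assuming the component=elasticsearch pod label and the es_util helper shipped in the OpenShift Elasticsearch image):

$ es_pod=$(oc -n openshift-logging get pods -l component=elasticsearch -o name | head -1)
$ oc -n openshift-logging exec -c elasticsearch "$es_pod" -- es_util --query="_cat/nodes?v&h=ip,heap.percent,ram.percent,cpu,load_1m,load_5m,load_15m,node.role,master,name"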

And I am seeing that new log indices are being created:
green  open   project.example.3d12f389-1c5d-469d-94e8-90c6989c0708.2020.09.01                  61qdomAZQZyfesYlbhJRjw   3   1      19531            0         52             26

I will upload 2 logging dumps

Comment 8 Jeff Cantrill 2020-09-07 10:23:06 UTC
Moving to 4.7 as this does not seem to be a blocker. The EO constantly updates the status, and one possible outcome is that the status will flap if the EO has trouble consistently connecting to the cluster.
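If the flapping is indeed the EO losing its connection to the cluster, the operator logs should show it; a sketch, assuming the operator runs in the usual openshift-operators-redhat namespace:

$ oc -n openshift-operators-redhat logs deployment/elasticsearch-operator | grep -iE "health|timeout|error"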

Comment 9 Periklis Tsirakidis 2020-09-08 11:08:25 UTC
@steven walter

The logging dump tarballs are all broken when unarchiving them. Could you please take a proper dump again? I suggest using our new must-gather for the logging stack instead of the logging-dump script.

https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
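For completeness, a typical invocation looks something like this (a sketch; the image reference is resolved from the installed cluster-logging-operator deployment and may differ per version):

$ oc adm must-gather --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator -o jsonpath='{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}')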

Comment 12 Jeff Cantrill 2020-09-11 20:18:39 UTC
Moving to UpcomingRelease

Comment 13 Periklis Tsirakidis 2020-09-24 14:53:42 UTC
@Steven / @Nicolas

Taking a look at the logs, I can see a lot of messages like this one:

[2020-09-02T13:36:55,373][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28597] overhead, spent [455ms] collecting in the last [1s]                             
[2020-09-02T13:36:56,378][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28598] overhead, spent [539ms] collecting in the last [1s]                             
[2020-09-02T13:36:57,415][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28599] overhead, spent [304ms] collecting in the last [1s]                             
[2020-09-02T13:36:59,416][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28601] overhead, spent [252ms] collecting in the last [1s]                             
[2020-09-02T13:37:02,416][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28604] overhead, spent [382ms] collecting in the last [1s]

The deployment spec tells me that this cluster may deserve a little more CPU. Depending on the load on the cluster nodes, the cluster may slip towards unresponsiveness, which in turn results in proxy log entries like this one:

2020-09-08T16:30:04.572154112+04:00 2020/09/08 12:30:04 reverseproxy.go:437: http: proxy error: context canceled

I suggest tuning the CPU/memory until ES is more responsive. The current resource settings are:

resources:
  limits:
    memory: 32Gi
  requests:
    cpu: 100m
    memory: 32Gi
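As a reference for bumping the CPU request, the Elasticsearch resources can be adjusted through the ClusterLogging custom resource; a sketch, assuming the stock instance named "instance" and an illustrative request of 1 CPU (pick values that fit the nodes):

$ oc -n openshift-logging patch clusterlogging/instance --type merge \
    -p '{"spec":{"logStore":{"elasticsearch":{"resources":{"limits":{"memory":"32Gi"},"requests":{"cpu":"1","memory":"32Gi"}}}}}}'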

Comment 14 Periklis Tsirakidis 2020-09-30 14:17:30 UTC

*** This bug has been marked as a duplicate of bug 1883357 ***

Comment 15 Nicolas Nosenzo 2020-10-01 06:41:08 UTC
@Periklis, the change didn't improve the situation. Anyway, I see this has been marked as a duplicate of bug 1883357; we will keep monitoring that BZ.

Comment 16 Periklis Tsirakidis 2020-10-01 07:05:22 UTC
(In reply to Nicolas Nosenzo from comment #15)
> @Periklis, the change didn't improve the situation. Anyway, I see this has
> been marked as a duplicate of bug 1883357; we will keep monitoring that BZ.

Which change do you refer to? [1] or [2]

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1874934#c13
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1883357#c2

Comment 17 Nicolas Nosenzo 2020-10-01 08:46:02 UTC
I meant [1]. I will have a look at the comments on bz 1883357.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1874934#c13

Comment 18 Red Hat Bugzilla 2023-09-14 06:08:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

