
Bug 1874934

Summary: Kibana gateway timeout and elasticsearch cluster health unknown
Product: OpenShift Container Platform
Component: Logging
Version: 4.4
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Reporter: Steven Walter <stwalter>
Assignee: Periklis Tsirakidis <periklis>
QA Contact: Anping Li <anli>
CC: aos-bugs, nnosenzo, periklis
Last Closed: 2020-09-30 14:17:30 UTC
Type: Bug

Description Steven Walter 2020-09-02 15:40:13 UTC
Description of problem:
In the customer's environment, the Elasticsearch cluster health flaps between "unknown" and "green":


$ oc get Elasticsearch  -n openshift-logging -o yaml | egrep -i "health|status"
  status:
      status: cluster health unknown              <-----------
    clusterHealth: ""
$ oc get Elasticsearch  -n openshift-logging -o yaml | egrep -i "health|status"
  status:
      status: green                                    <-----------
    clusterHealth: ""
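A sketch for observing the flapping over time (assuming the default CR name "elasticsearch" and the nested status layout suggested by the egrep output above):

# Poll the reported cluster health every 10s to catch the transitions
while true; do
  oc get elasticsearch elasticsearch -n openshift-logging \
    -o jsonpath='{.status.cluster.status}{"\n"}'
  sleep 10
done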

Kibana shows intermittent 504 gateway timeouts.


Version-Release number of selected component (if applicable):
BUILD_RELEASE=202008210157.p0
BUILD_VERSION=v4.4.0

How reproducible:
Unconfirmed



Additional info:
We see a lot of messages like the following in the ES logs:

[2020-09-01T13:55:33,619][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][165054] overhead, spent [286ms] collecting in the last [1s]
[2020-09-01T13:55:38,726][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][165059] overhead, spent [445ms] collecting in the last [1s]
[2020-09-01T13:55:39,780][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][165060] overhead, spent [356ms] collecting in the last [1s]
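For reference, the heap and GC numbers can be pulled directly from an ES pod; a minimal sketch, assuming the usual openshift-logging pod label and the es_util query helper shipped in the ES image:

# Grab any ES pod and query JVM heap/GC stats across the cluster
POD=$(oc get pods -n openshift-logging -l component=elasticsearch -o name | head -1)
oc exec -n openshift-logging -c elasticsearch $POD -- \
  es_util --query="_nodes/stats/jvm?pretty"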

We attempted increasing the RAM just in case, but the issue persists. Also, the heap is not overloaded:

ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.129.0.16           31          95  30    3.76    4.18     3.93 mdi       -      elasticsearch-cdm-953io34j-3
10.131.0.9            58          80  28    2.20    2.02     1.84 mdi       *      elasticsearch-cdm-953io34j-1
10.131.8.13           66          99  26    2.43    2.69     2.64 mdi       -      elasticsearch-cdm-953io34j-2
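(The table above is the default "_cat/nodes?v" output; with the $POD variable from the sketch above it can be reproduced via:)

oc exec -n openshift-logging -c elasticsearch $POD -- es_util --query="_cat/nodes?v"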

And I can see that new indices are still being created:
green  open   project.example.3d12f389-1c5d-469d-94e8-90c6989c0708.2020.09.01                  61qdomAZQZyfesYlbhJRjw   3   1      19531            0         52             26

I will upload 2 logging dumps.

Comment 8 Jeff Cantrill 2020-09-07 10:23:06 UTC
Moving to 4.7 as this does not seem to be a blocker. The EO constantly updates the status, and one possible outcome is that it will flap if the EO has recurring issues connecting to the cluster.

Comment 9 Periklis Tsirakidis 2020-09-08 11:08:25 UTC
@Steven Walter

The logging dump tarballs are all broken when unarchiving them. Could you please take a proper dump again? I suggest using our new must-gather for the logging stack instead of the logging-dump script:

https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
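Roughly, that means running something like the following (a sketch based on the documented pattern of pointing must-gather at the cluster-logging-operator image):

oc adm must-gather --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator \
  -o jsonpath='{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}')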

Comment 12 Jeff Cantrill 2020-09-11 20:18:39 UTC
Moving to UpcomingRelease

Comment 13 Periklis Tsirakidis 2020-09-24 14:53:42 UTC
@Steven / @Nicolas

Taking a look at the logs, I can see a lot of messages like these:

[2020-09-02T13:36:55,373][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28597] overhead, spent [455ms] collecting in the last [1s]                             
[2020-09-02T13:36:56,378][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28598] overhead, spent [539ms] collecting in the last [1s]                             
[2020-09-02T13:36:57,415][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28599] overhead, spent [304ms] collecting in the last [1s]                             
[2020-09-02T13:36:59,416][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28601] overhead, spent [252ms] collecting in the last [1s]                             
[2020-09-02T13:37:02,416][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28604] overhead, spent [382ms] collecting in the last [1s]

The deployment spec tells me that this cluster may deserve a little more CPU. Depending on the load on the cluster nodes, the cluster may slip towards unresponsiveness, which in turn results in proxy logs like this one:

2020-09-08T16:30:04.572154112+04:00 2020/09/08 12:30:04 reverseproxy.go:437: http: proxy error: context canceled

I suggest tuning the CPU/memory until ES is more responsive. The current resources are:

resources:
  limits:
    memory: 32Gi
  requests:
    cpu: 100m
    memory: 32Gi
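Note the CPU request is only 100m, with no CPU limit. Raising it via the relevant portion of the ClusterLogging CR would look roughly like this (a sketch; the values are illustrative, not a tested recommendation):

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  logStore:
    type: elasticsearch
    elasticsearch:
      resources:
        limits:
          memory: 32Gi
        requests:
          cpu: "1"        # illustrative bump from the current 100m
          memory: 32Gi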

Comment 14 Periklis Tsirakidis 2020-09-30 14:17:30 UTC

*** This bug has been marked as a duplicate of bug 1883357 ***

Comment 15 Nicolas Nosenzo 2020-10-01 06:41:08 UTC
@Periklis, the change didn't improve the scenario. Anyway, I see this has been marked as a duplicate of bug 1883357; we will keep monitoring that BZ.

Comment 16 Periklis Tsirakidis 2020-10-01 07:05:22 UTC
(In reply to Nicolas Nosenzo from comment #15)
> @Periklis, the change didn't improve the scenario. Anyway, I see this has
> been marked as a duplicate of bug 1883357; we will keep monitoring that BZ.

Which change do you refer to, [1] or [2]?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1874934#c13
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1883357#c2

Comment 17 Nicolas Nosenzo 2020-10-01 08:46:02 UTC
I meant [1]. I will have a look at the comments on bz 1883357.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1874934#c13

Comment 18 Red Hat Bugzilla 2023-09-14 06:08:06 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days