Description of problem:

In the customer environment, the Elasticsearch cluster health flaps between "Unknown" and "green":

$ oc get Elasticsearch -n openshift-logging -o yaml | egrep -i "health|status"
  status:
    status: cluster health unknown   <-----------
    clusterHealth: ""

$ oc get Elasticsearch -n openshift-logging -o yaml | egrep -i "health|status"
  status:
    status: green   <-----------
    clusterHealth: ""

Kibana shows intermittent 504 gateway timeouts.

Version-Release number of selected component (if applicable):
BUILD_RELEASE=202008210157.p0
BUILD_VERSION=v4.4.0

How reproducible:
Unconfirmed

Additional info:

We see a lot of messages like the below in the ES logs:

[2020-09-01T13:55:33,619][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][165054] overhead, spent [286ms] collecting in the last [1s]
[2020-09-01T13:55:38,726][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][165059] overhead, spent [445ms] collecting in the last [1s]
[2020-09-01T13:55:39,780][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][165060] overhead, spent [356ms] collecting in the last [1s]

We attempted increasing the RAM just in case, but the issue persists. Also, the heap is not overloaded:

ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.129.0.16 31           95          30  3.76    4.18    3.93     mdi       -      elasticsearch-cdm-953io34j-3
10.131.0.9  58           80          28  2.20    2.02    1.84     mdi       *      elasticsearch-cdm-953io34j-1
10.131.8.13 66           99          26  2.43    2.69    2.64     mdi       -      elasticsearch-cdm-953io34j-2

And I can see that new indices are still being created:

green open project.example.3d12f389-1c5d-469d-94e8-90c6989c0708.2020.09.01 61qdomAZQZyfesYlbhJRjw 3 1 19531 0 52 26

I will upload 2 logging dumps.
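For cross-checking, the health the operator reports can be compared with what the cluster itself returns. A rough sketch, assuming the es_util helper shipped in the OpenShift Elasticsearch image is available in the pod (pod name taken from the node listing above):

# Ask Elasticsearch directly for its cluster health, bypassing the operator status
$ oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-953io34j-1 -- \
    es_util --query="_cluster/health?pretty"

# Node-level heap/CPU view, equivalent to the table above
$ oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-953io34j-1 -- \
    es_util --query="_cat/nodes?v"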
Moving to 4.7 as this does not seem to be a blocker. The EO constantly updates the status, and one possible outcome is that it will flap if the EO has trouble consistently connecting to the cluster.
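If it helps to confirm where the flapping comes from, the CR status and the operator's own logs can be watched side by side. A rough sketch (the openshift-operators-redhat namespace is the usual install location for the elasticsearch-operator and is an assumption here):

# Watch the health/status fields the EO keeps rewriting
$ oc get elasticsearch -n openshift-logging -w

# Look for connection errors or timeouts in the EO logs while the status flaps
$ oc logs -n openshift-operators-redhat deployment/elasticsearch-operator | grep -iE "timeout|connection|health"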
@steven walter The logging dump tarballs are all broken when unarchiving them. Could you please take a proper dump again? I suggest using our new must-gather for the logging stack instead of the logging-dump script: https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
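For reference, the invocation from the linked README looks roughly like this (a sketch; the jsonpath assumes the default cluster-logging-operator deployment name in openshift-logging):

# Run the logging must-gather using the image of the installed cluster-logging-operator
$ oc adm must-gather --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator \
    -o jsonpath='{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}')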
Moving to UpcomingRelease
@Steven / @Nicolas

Taking a look at the logs, I can see a lot of messages like this one:

[2020-09-02T13:36:55,373][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28597] overhead, spent [455ms] collecting in the last [1s]
[2020-09-02T13:36:56,378][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28598] overhead, spent [539ms] collecting in the last [1s]
[2020-09-02T13:36:57,415][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28599] overhead, spent [304ms] collecting in the last [1s]
[2020-09-02T13:36:59,416][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28601] overhead, spent [252ms] collecting in the last [1s]
[2020-09-02T13:37:02,416][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-cdm-953io34j-1] [gc][28604] overhead, spent [382ms] collecting in the last [1s]

The deployment spec tells me that this cluster may deserve a little more CPU attention. Depending on the load on the cluster nodes, the cluster may slip towards unresponsiveness, which in turn results in proxy logs like this one:

2020-09-08T16:30:04.572154112+04:00 2020/09/08 12:30:04 reverseproxy.go:437: http: proxy error: context canceled

I suggest tuning the CPU/memory until ES is more responsive. The current deployment spec is:

resources:
  limits:
    memory: 32Gi
  requests:
    cpu: 100m
    memory: 32Gi
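A minimal sketch of what that tuning could look like, assuming the cluster is managed through the ClusterLogging "instance" CR (the values are illustrative only, not a sizing recommendation):

$ oc edit clusterlogging instance -n openshift-logging

  spec:
    logStore:
      elasticsearch:
        resources:
          limits:
            memory: 32Gi
          requests:
            cpu: "4"        # illustrative: raise from 100m so ES is not CPU-starved under GC pressure
            memory: 32Gi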
*** This bug has been marked as a duplicate of bug 1883357 ***
@Periklis, the change didn't improve the scenario. Anyway, I see this has been marked as a duplicate of bug 1883357; we will keep monitoring that BZ.
(In reply to Nicolas Nosenzo from comment #15)
> @Periklis, the change didn't improve the scenario. Anyway, I see this has
> been marked as a duplicate of bug 1883357; we will keep monitoring that BZ.

Which change do you refer to, [1] or [2]?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1874934#c13
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1883357#c2
I meant [1]. I will have a look at the comments on bz 1883357.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1874934#c13
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days