Bug 1896770

Summary: Increased memory consumption by the FluentD pods in Cluster Logging instance on OCP 4.6.1 / s390x
Product: OpenShift Container Platform
Reporter: Lakshmi Ravichandran <lakshmi.ravichandran1>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED WORKSFORME
QA Contact: Anping Li <anli>
Severity: medium
Priority: unspecified
Version: 4.6.z
CC: aos-bugs, danijel.soldo, danili, Holger.Wolf, periklis, wvoesch
Target Release: 4.7.0
Hardware: s390x
OS: Linux
Whiteboard: logging-core
Last Closed: 2021-01-20 15:44:55 UTC
Type: Bug
Bug Blocks: 1881153

Description Lakshmi Ravichandran 2020-11-11 14:11:35 UTC
Description of problem:
Recent performance measurements on an OCP 4.6.1 / s390x cluster running a Cluster Logging instance show increased CPU consumption of about:
- 5 CPU cores for Fluentd across 6 nodes (3 masters + 3 workers)
- 1 CPU core for Elasticsearch across 3 worker nodes


Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.6.0-rc.4
Server Version: 4.6.1
Kubernetes Version: v1.19.0+d59ce34

How reproducible:
Every time

Steps to Reproduce:
1. Install an OCP 4.6.1 cluster on s390x
2. Install the Elasticsearch, Cluster Logging, and Local Storage operators from the console
3. Make local PVs available using LSO 
4. Deploy the cluster logging instance to the cluster (definition below)
5. Measure the CPU consumption of the Fluentd and Elasticsearch processes with the top command on each cluster node (a rough sketch using oc adm top is shown after this list)
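
For reference, a minimal sketch of step 5 using oc adm top instead of node-level top; it assumes the cluster metrics API is available and that the collector and Elasticsearch pods carry the component labels usually set by the operators (the node name below is a placeholder):

# Per-pod CPU/memory usage as reported by the metrics API (assumed labels):
oc -n openshift-logging adm top pods -l component=fluentd
oc -n openshift-logging adm top pods -l component=elasticsearch

# Or inspect a single node directly, as done in the original measurement:
oc debug node/<node-name> -- chroot /host top -b -n 1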

The ClusterLogging instance definition referenced in step 4 is as follows:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    elasticsearch:
      nodeCount: 3
      storage:
        storageClassName: "local-sc"
        size: 7043Mi
      redundancyPolicy: "ZeroRedundancy"
      resources:
        requests:
          memory: 2Gi
  visualization:
    type: "kibana"
    kibana:
      replicas: 1
  curation:
    type: "curator"
    curator:
      schedule: "30 3 * * *"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}
      

Actual results:
Increased CPU consumption as described above.

Expected results:
What would be the reason for the increased CPU consumption by the Fluentd pods, and is there a way to reduce it?
What would be the recommended CPU requests/limits for the Elasticsearch and Fluentd pods?
Is there any performance report or profiling data available for the cluster logging components (Fluentd, Elasticsearch)?


Additional info:
The cluster’s resource spec is:
- master nodes - 4 CPU / 16G,
- worker nodes 01, 02 - 10 CPU / 32G (memory increased as needed for the ES pods),
- worker node 03 - 4 CPU / 16G.
In the ClusterLogging instance definition above, no CPU requests or limits are specified for the Fluentd or Elasticsearch pods.
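
Should recommended values become available, below is a hedged sketch of how CPU/memory requests and limits could be set through the same ClusterLogging CR. It assumes the resources stanzas under collection.logs.fluentd and logStore.elasticsearch are honored as in the ClusterLogging v1 API; the numbers are placeholders for illustration, not recommendations.

# Placeholder values only - not recommended settings.
oc -n openshift-logging patch clusterlogging/instance --type=merge -p '
spec:
  collection:
    logs:
      fluentd:
        resources:
          requests:
            cpu: 200m
            memory: 736Mi
          limits:
            memory: 736Mi
  logStore:
    elasticsearch:
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          memory: 2Gi
'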

As the performance measurements were taken with an internally developed tool, the raw results cannot be shared here.
Please let me know which other logs would be of interest; I can provide them.

Comment 1 wvoesch 2020-12-07 13:51:25 UTC
We expect this behavior to also occur on x86. Could someone please check this on x86? Thank you.

Comment 3 Lakshmi Ravichandran 2021-01-20 15:44:55 UTC
The fix for bug https://bugzilla.redhat.com/show_bug.cgi?id=1895385 was followed up and verified on OCP 4.6.0-0.nightly-s390x-2021-01-18-070324; hence, closing this bug.