Bug 1896770

Summary: Increased memory consumption by the FluentD pods in Cluster Logging instance on OCP 4.6.1 / s390x
Product: OpenShift Container Platform
Reporter: Lakshmi Ravichandran <lakshmi.ravichandran1>
Component: Logging
Assignee: Jeff Cantrill <jcantril>
Status: CLOSED WORKSFORME
QA Contact: Anping Li <anli>
Severity: medium
Priority: unspecified
Version: 4.6.z
CC: aos-bugs, danijel.soldo, danili, Holger.Wolf, periklis, wvoesch
Target Release: 4.7.0
Hardware: s390x
OS: Linux
Whiteboard: logging-core
Last Closed: 2021-01-20 15:44:55 UTC
Type: Bug
Bug Blocks: 1881153

Description Lakshmi Ravichandran 2020-11-11 14:11:35 UTC
Description of problem:
Recent performance measurements on an OCP 4.6.1 / s390x cluster running a Cluster Logging instance show increased CPU consumption of about:
- 5 CPU cores for Fluentd across 6 nodes (3 masters + 3 workers)
- 1 CPU core for Elasticsearch across 3 worker nodes


Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.6.0-rc.4
Server Version: 4.6.1
Kubernetes Version: v1.19.0+d59ce34

How reproducible:
Every time

Steps to Reproduce:
1. Install an OCP 4.6.1 cluster on s390x
2. Install the Elasticsearch, Cluster Logging, and Local Storage operators from the console
3. Make local PVs available using LSO 
4. Deploy the cluster logging instance to the cluster (definition below)
5. Measure the CPU consumption of the Fluentd and Elasticsearch processes with the top command on each cluster node (a rough sketch using oc adm top is shown after this list)
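
For reference, a minimal sketch of step 5 using oc adm top instead of node-level top; it assumes the cluster metrics API is available and that the collector and Elasticsearch pods carry the component labels usually set by the operators (the node name below is a placeholder):

# Per-pod CPU/memory usage as reported by the metrics API (assumed labels):
oc -n openshift-logging adm top pods -l component=fluentd
oc -n openshift-logging adm top pods -l component=elasticsearch

# Or inspect a single node directly, as done in the original measurement:
oc debug node/<node-name> -- chroot /host top -b -n 1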

The ClusterLogging instance definition referenced in step 4 is as follows:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    elasticsearch:
      nodeCount: 3
      storage:
        storageClassName: "local-sc"
        size: 7043Mi
      redundancyPolicy: "ZeroRedundancy"
      resources:
        requests:
          memory: 2Gi
  visualization:
    type: "kibana"
    kibana:
      replicas: 1
  curation:
    type: "curator"
    curator:
      schedule: "30 3 * * *"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}
      

Actual results:
Increased CPU consumption as described above.

Expected results:
What would be the reason for the increased CPU consumption by the Fluentd pods, and is there a way to reduce it?
What would be the recommended CPU requests/limits for the Elasticsearch and Fluentd pods?
Is there any performance report or profiling data available for the cluster logging components (Fluentd, Elasticsearch)?


Additional info:
The cluster’s resource spec is:
- master nodes - 4 CPU / 16G,
- worker nodes 01, 02 - 10 CPU / 32G (memory increased as needed for the ES pods),
- worker node 03 - 4 CPU / 16G.
In the ClusterLogging instance definition above, no CPU requests or limits are specified for the Fluentd or Elasticsearch pods.
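
Should recommended values become available, below is a hedged sketch of how CPU/memory requests and limits could be set through the same ClusterLogging CR. It assumes the resources stanzas under collection.logs.fluentd and logStore.elasticsearch are honored as in the ClusterLogging v1 API; the numbers are placeholders for illustration, not recommendations.

# Placeholder values only - not recommended settings.
oc -n openshift-logging patch clusterlogging/instance --type=merge -p '
spec:
  collection:
    logs:
      fluentd:
        resources:
          requests:
            cpu: 200m
            memory: 736Mi
          limits:
            memory: 736Mi
  logStore:
    elasticsearch:
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          memory: 2Gi
'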

As the performance measurements were taken with an internally developed tool, the raw results cannot be shared here.
Please let me know which other logs would be of interest; I can provide them.

Comment 1 wvoesch 2020-12-07 13:51:25 UTC
We expect this behavior to also occur on x86. Could someone please check this on x86? Thank you.

Comment 3 Lakshmi Ravichandran 2021-01-20 15:44:55 UTC
The fix for bug https://bugzilla.redhat.com/show_bug.cgi?id=1895385 was followed up and verified on OCP 4.6.0-0.nightly-s390x-2021-01-18-070324; hence, closing this bug.