Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1710404

Summary: Suggestion to adjust Elasticsearch OOTB cpu limit and memory request
Product: OpenShift Container Platform
Component: Logging
Version: 4.1.0
Target Release: 4.2.0
Reporter: Mike Fiedler <mifiedle>
Assignee: ewolinet
QA Contact: Mike Fiedler <mifiedle>
CC: anli, aos-bugs, ewolinet, jcantril, pweil, qitang, rmeggins
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: No Doc Update
Last Closed: 2019-10-16 06:28:53 UTC

Description Mike Fiedler 2019-05-15 13:55:22 UTC
Let me know if you want to break this into 2 bugs.  The default requests/limits for ES in 4.1 have two problems.

1.  On the default AWS IPI instance size (m4.xlarge: 4 vCPU and 16 GiB memory), the ES pods will not schedule due to their memory request of 16Gi.  Larger instances have to be added to the cluster via a machineset.  I understand there is a good reason for setting this value, but I think having the pods not schedule by default is going to be a surprise and cause issues.  Can we consider lowering the request to 8Gi, with a strong recommendation in the docs to run on larger instances with a higher request?

2.  The default CPU limit of 1 is too low.  An ES pod limited to 1 CPU cannot handle much of a message load at all.  On a 250-node scale cluster, even operations logs were buffering in fluentd due to bulk rejects.  I think we need to set the OOTB limit to at least 2 or remove the limit altogether (a sketch of both adjustments follows below).
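
For illustration, here is a minimal sketch of overriding both values through the ClusterLogging custom resource.  The `spec.logStore.elasticsearch.resources` path and the surrounding fields are assumed from the 4.x ClusterLogging API, and the numbers are only examples of the lower request / absent cpu limit suggested above, not recommended settings:

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    metadata:
      name: instance
      namespace: openshift-logging
    spec:
      logStore:
        type: elasticsearch
        elasticsearch:
          nodeCount: 3
          resources:
            limits:
              # no cpu limit, per the suggestion to drop it entirely
              memory: 8Gi      # illustrative: lower than the 16Gi default
            requests:
              cpu: "1"
              memory: 8Gi      # illustrative: small enough to fit on an m4.xlarge worker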

Comment 1 Jeff Cantrill 2019-05-15 14:40:49 UTC
None of our pods should have cpu limits, only memory.  This came out of pportante's experience of pods getting kicked off nodes.  Having a cpu limit at all should be considered a bug.  I hesitate to drop the memory limit, as we explicitly set it to 16G in the 3.x releases because of capacity issues.  I defer to PM on their preference here, as we now have an alternate user experience which is similarly less than desirable.

Comment 2 Rich Megginson 2019-05-15 16:05:53 UTC
(In reply to Jeff Cantrill from comment #1)
> None of our pods should have cpu limits, only memory.  This came out of
> pportante's experience of pods getting kicked off nodes.

Was that setting the `requests` or the `limits`?

      resources:
        limits:
          cpu: 500m
          memory: 4Gi 
        requests:
          cpu: 500m
          memory: 4Gi 

I would expect the requests settings to cause issues with the scheduler if there is no node that can spare 500m cpu or 4Gi of RAM.
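
As a rough illustration of that point (hypothetical numbers, not taken from this bug): the scheduler compares a pod's requests against each node's allocatable resources, which are already smaller than raw capacity because of system and kubelet reservations, so a memory request equal to the instance's full RAM can never be satisfied.

      # Hypothetical status excerpt for a 16 GiB worker node (values illustrative only)
      status:
        capacity:
          cpu: "4"
          memory: 16Gi        # what the instance has
        allocatable:
          cpu: 3500m
          memory: 15Gi        # what pods can actually request, after reservations
      # A pod requesting memory: 16Gi exceeds allocatable on every such node, so it stays Pending.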

> Having a cpu limit at all should
> be considered a bug.  I hesitate to drop the memory limit, as we explicitly
> set it to 16G in the 3.x releases because of capacity issues.  I defer to PM
> on their preference here, as we now have an alternate user experience which
> is similarly less than desirable.

Right.  The alternative to Elasticsearch not being scheduled at all is Elasticsearch having poor performance (and the subsequent support calls about log records not showing up in Kibana, or Kibana not working at all).

Comment 3 Mike Fiedler 2019-05-16 01:29:22 UTC
I opened bug 1710657 for the ES CPU limit issue.   The gory details are in that bz, but here is the tl;dr request/limit excerpt from the Elasticsearch deployment YAML:


        resources:
          limits:
            cpu: "1"
            memory: 16Gi
          requests:
            cpu: "1"
            memory: 16Gi

Comment 5 Anping Li 2019-07-11 10:35:13 UTC
Mike, is the result acceptable?

The default resources deployed by the CLO (cluster-logging-operator):
    resources:
      limits:
        memory: 16Gi
      requests:
        cpu: "1"
        memory: 16Gi

The default resources deployed by the EO (elasticsearch-operator):
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 1Gi
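
For context, the smaller EO defaults above would presumably apply when an Elasticsearch custom resource is created without explicit resources; a rough sketch follows (the nodeSpec/nodes field names are recalled from the 4.x elasticsearch-operator API and should be treated as an assumption, not taken from this bug):

    apiVersion: logging.openshift.io/v1
    kind: Elasticsearch
    metadata:
      name: elasticsearch
      namespace: openshift-logging
    spec:
      managementState: Managed
      nodeSpec: {}             # no resources set, so the EO defaults above apply
      nodes:
      - nodeCount: 3
        roles:
        - client
        - data
        - master
      redundancyPolicy: SingleRedundancy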

Comment 8 Mike Fiedler 2019-07-17 17:26:44 UTC
Verified on the cluster-logging-operator image from 4.2.0-0.nightly-2019-07-17-165351.  Values are set as in comment 5.

Comment 9 errata-xmlrpc 2019-10-16 06:28:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922