Bug 1710404 - Suggestion to adjust Elasticsearch OOTB cpu limit and memory request
Summary: Suggestion to adjust Elasticsearch OOTB cpu limit and memory request
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: ewolinet
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-15 13:55 UTC by Mike Fiedler
Modified: 2019-10-22 02:57 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:53 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 180 0 None closed Bug 1710404: No longer set CPU Limit for default ES resource requirements 2021-01-12 15:38:03 UTC
Github openshift elasticsearch-operator pull 145 0 None closed Bug 1710404: Better OOTB resource request values 2020-02-03 08:47:36 UTC
Github openshift elasticsearch-operator pull 160 0 None closed Bug 1710404 - Removing setting a default cpu limit 2020-02-03 08:47:36 UTC
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:29:04 UTC

Description Mike Fiedler 2019-05-15 13:55:22 UTC
Let me know if you want to break this into two bugs.  The default requests/limits for ES in 4.1 have two problems.

1.  On the default AWS IPI instance size (m4.xlarge - 4 vCPU and 16 GiB memory), the ES pods will not schedule due to their memory request of 16Gi.  Larger instances have to be added to the cluster via a machineset.  I understand there is a good reason for setting this value, but I think having the pods fail to schedule by default is going to be a surprise and cause issues.  Can we consider lowering the request to 8Gi, with a strong recommendation in the docs to run on larger instances with a higher request?

2. The default CPU limit of 1 is too low.  An ES pod limited to 1 CPU cannot handle much of a message load at all.  On a 250-node scale cluster, even operations logs were buffering in fluentd due to bulk rejects.  I think we need to set the OOTB limit to at least 2, or remove the limit altogether.
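For illustration, a minimal sketch of what both suggestions could look like as an override in the ClusterLogging custom resource (field names assume the cluster-logging-operator's logging.openshift.io/v1 API; the values only mirror the proposal above, they are not the shipped defaults):

    # Illustration only - values mirror the suggestions above, not the shipped defaults.
    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    metadata:
      name: instance
      namespace: openshift-logging
    spec:
      logStore:
        type: elasticsearch
        elasticsearch:
          nodeCount: 3
          resources:
            limits:
              memory: 16Gi    # memory cap kept as-is
            requests:
              cpu: "1"
              memory: 8Gi     # lowered so the pods fit on m4.xlarge worker nodes
            # no cpu limit, per point 2

The exact schema may differ between releases; the point is simply that the memory request drops below a worker's allocatable memory while the memory limit stays at 16Gi.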

Comment 1 Jeff Cantrill 2019-05-15 14:40:49 UTC
None of our pods should have cpu limits, only memory.  This came out of pportante's experience of pods getting kicked off nodes.  Setting a cpu limit should be treated as a bug.  I hesitate to drop the memory limit, as we explicitly set it to 16G in the 3.x releases because of capacity issues.  I defer to PM on their preference here, since we now have an alternate user experience which is similarly less than desirable.

Comment 2 Rich Megginson 2019-05-15 16:05:53 UTC
(In reply to Jeff Cantrill from comment #1)
> None of our pods should have cpu limits, only memory.  This came out of
> pportante's experience of pods getting kicked off nodes.

Was that setting the `requests` or the `limits`?

      resources:
        limits:
          cpu: 500m
          memory: 4Gi 
        requests:
          cpu: 500m
          memory: 4Gi 

I would expect the requests settings to cause issues with the scheduler if there is not a node which can spare 500m CPU or 4Gi of RAM.

> Setting a cpu limit should
> be treated as a bug.  I hesitate to drop the memory limit, as we explicitly
> set it to 16G in the 3.x releases because of capacity issues.  I defer to PM
> on their preference here, since we now have an alternate user experience
> which is similarly less than desirable.

Right.  The alternative to Elasticsearch not being scheduled at all is Elasticsearch having poor performance (and the subsequent support calls about log records not showing up in Kibana, or Kibana not working at all).
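
To make the trade-off concrete, here is a hypothetical resources stanza (illustrative values, not the shipped defaults) annotated with how the scheduler and the kubelet treat each field:

      resources:
        limits:
          memory: 16Gi   # enforced at runtime only; the scheduler ignores limits
        requests:
          cpu: 500m      # the scheduler must find a node with 500m CPU unreserved
          memory: 8Gi    # the scheduler must find a node with 8Gi of memory unreserved
        # with no cpu limit the container can use idle CPU beyond its request
        # instead of being throttled at a hard cap

In other words, scheduling failures come from the requests, while throttling and OOM kills come from the limits.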

Comment 3 Mike Fiedler 2019-05-16 01:29:22 UTC
I opened bug 1710657 for the ES CPU limit issue.   The gory details are in that bz, but here is the tl;dr request/limit excerpt from the Elasticsearch deployment YAML:


        resources:
          limits:
            cpu: "1"
            memory: 16Gi
          requests:
            cpu: "1"
            memory: 16Gi

Comment 5 Anping Li 2019-07-11 10:35:13 UTC
Mike, is the result acceptable?

The default resources deployed by CLO:
    resources:
      limits:
        memory: 16Gi
      requests:
        cpu: "1"
        memory: 16Gi

The default resources deployed by EO:
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 1Gi

Comment 8 Mike Fiedler 2019-07-17 17:26:44 UTC
Verified on the cluster-logging-operator image from 4.2.0-0.nightly-2019-07-17-165351.  Values are set as in comment 5.

Comment 9 errata-xmlrpc 2019-10-16 06:28:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

