Bug 1710404 - Suggestion to adjust Elasticsearch OOTB cpu limit and memory request
Summary: Suggestion to adjust Elasticsearch OOTB cpu limit and memory request
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: ewolinet
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-15 13:55 UTC by Mike Fiedler
Modified: 2019-10-22 02:57 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:28:53 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 180 0 None closed Bug 1710404: No longer set CPU Limit for default ES resource requirements 2021-01-12 15:38:03 UTC
Github openshift elasticsearch-operator pull 145 0 None closed Bug 1710404: Better OOTB resource request values 2020-02-03 08:47:36 UTC
Github openshift elasticsearch-operator pull 160 0 None closed Bug 1710404 - Removing setting a default cpu limit 2020-02-03 08:47:36 UTC
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:29:04 UTC

Description Mike Fiedler 2019-05-15 13:55:22 UTC
Let me know if you want to break this into two bugs.  The default requests/limits for ES in 4.1 have two problems.

1.  On the default AWS IPI instance size (m4.xlarge - 4 vCPU and 16 GiB memory), the ES pods will not schedule due to their memory request of 16Gi.  Larger instances have to be added to the cluster via a machineset.  I understand there is a good reason for setting this value, but I think having the pods fail to schedule by default is going to be a surprise and cause issues.  Can we consider lowering the request to 8Gi, with a strong recommendation in the docs to run on larger instances with a higher request?

2. The default CPU limit of 1 is too low.  An ES pod limited to 1 CPU cannot handle much of a message load at all.  On a 250-node scale cluster, even operations logs were buffering in fluentd due to bulk rejects.  I think we need to set the OOTB limit to at least 2, or remove the limit altogether.
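For illustration, a minimal sketch of what both suggestions could look like as an override in the ClusterLogging custom resource (field names assume the cluster-logging-operator's logging.openshift.io/v1 API; the values only mirror the proposal above, they are not the shipped defaults):

    # Illustration only - values mirror the suggestions above, not the shipped defaults.
    apiVersion: logging.openshift.io/v1
    kind: ClusterLogging
    metadata:
      name: instance
      namespace: openshift-logging
    spec:
      logStore:
        type: elasticsearch
        elasticsearch:
          nodeCount: 3
          resources:
            limits:
              memory: 16Gi    # memory cap kept as-is
            requests:
              cpu: "1"
              memory: 8Gi     # lowered so the pods fit on m4.xlarge worker nodes
            # no cpu limit, per point 2

The exact schema may differ between releases; the point is simply that the memory request drops below a worker's allocatable memory while the memory limit stays at 16Gi.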

Comment 1 Jeff Cantrill 2019-05-15 14:40:49 UTC
None of our pods should have cpu limits, only memory.  This came out of pportante's experience of pods getting kicked off nodes.  Setting a cpu limit should be treated as a bug.  I hesitate to drop the memory limit, as we explicitly set it to 16G in the 3.x releases because of capacity issues.  I defer to PM on their preference here, since we now have an alternate user experience which is similarly less than desirable.

Comment 2 Rich Megginson 2019-05-15 16:05:53 UTC
(In reply to Jeff Cantrill from comment #1)
> None of our pods should have cpu limits, only memory.  This came out of
> pportante's experience of pods getting kicked off nodes.

Was that setting the `requests` or the `limits`?

      resources:
        limits:
          cpu: 500m
          memory: 4Gi 
        requests:
          cpu: 500m
          memory: 4Gi 

I would expect the requests settings to cause issues with the scheduler if there is not a node which can spare 500m CPU or 4Gi of RAM.

> Setting a cpu limit should
> be treated as a bug.  I hesitate to drop the memory limit, as we explicitly
> set it to 16G in the 3.x releases because of capacity issues.  I defer to PM
> on their preference here, since we now have an alternate user experience
> which is similarly less than desirable.

Right.  The alternative to Elasticsearch not being scheduled at all is Elasticsearch having poor performance (and the subsequent support calls about log records not showing up in Kibana, or Kibana not working at all).
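
To make the trade-off concrete, here is a hypothetical resources stanza (illustrative values, not the shipped defaults) annotated with how the scheduler and the kubelet treat each field:

      resources:
        limits:
          memory: 16Gi   # enforced at runtime only; the scheduler ignores limits
        requests:
          cpu: 500m      # the scheduler must find a node with 500m CPU unreserved
          memory: 8Gi    # the scheduler must find a node with 8Gi of memory unreserved
        # with no cpu limit the container can use idle CPU beyond its request
        # instead of being throttled at a hard cap

In other words, scheduling failures come from the requests, while throttling and OOM kills come from the limits.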

Comment 3 Mike Fiedler 2019-05-16 01:29:22 UTC
I opened bug 1710657 for the ES CPU limit issue.   The gory details are in that bz, but here is the tl;dr request/limit excerpt from the Elasticsearch deployment YAML:


        resources:
          limits:
            cpu: "1"
            memory: 16Gi
          requests:
            cpu: "1"
            memory: 16Gi

Comment 5 Anping Li 2019-07-11 10:35:13 UTC
Mike, is the result acceptable?

The default resources deployed by CLO:
    resources:
      limits:
        memory: 16Gi
      requests:
        cpu: "1"
        memory: 16Gi

The default resources deployed by EO:
    resources:
      limits:
        memory: 4Gi
      requests:
        cpu: 100m
        memory: 1Gi

Comment 8 Mike Fiedler 2019-07-17 17:26:44 UTC
Verified on the cluster-logging-operator image from 4.2.0-0.nightly-2019-07-17-165351.  Values are set as in comment 5.

Comment 9 errata-xmlrpc 2019-10-16 06:28:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

