Let me know if you want to break this into 2 bugs. The default requests/limits for ES in 4.1 have two problems:

1. On the default AWS IPI instance size (m4.xlarge - 4 vCPU and 16GB memory), the ES pods will not schedule due to their memory request of 16Gi. Larger instances have to be added to the cluster via a machineset. I understand there is a good reason for setting this limit, but having the pods fail to schedule by default is going to be a surprise and cause issues. Can we consider lowering the request to 8Gi, with a strong recommendation in the docs to run on larger instances with a higher request? (An override sketch follows below.)

2. The default CPU limit of 1 is too low. An ES pod limited to 1 CPU cannot handle much of a message load at all. On a 250-node scale cluster, even operations logs were buffering in fluentd due to bulk rejects. I think we need to set the OOTB limit to at least 2, or remove the limit altogether.
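For anyone hitting point 1 before a fix lands, a possible per-cluster workaround: the ClusterLogging custom resource should let an administrator set the Elasticsearch resources directly. The field path (spec.logStore.elasticsearch.resources) is my reading of the 4.x CR schema, so treat this as an illustrative sketch rather than a tested config:

  apiVersion: logging.openshift.io/v1
  kind: ClusterLogging
  metadata:
    name: instance
    namespace: openshift-logging
  spec:
    logStore:
      type: elasticsearch
      elasticsearch:
        resources:
          limits:
            memory: 16Gi
          requests:
            cpu: "1"
            memory: 8Gi    # lowered from the 16Gi default so pods fit on m4.xlarge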
None of our pods should have CPU limits, only memory limits. This came out of pportante's experience of pods getting kicked off nodes. The CPU limits should be a separate bug. I hesitate to drop the memory limit, as we explicitly set it to 16G in the 3.x releases because of capacity issues. I defer to PM on their desires here, as we now have an alternate user experience which is similarly less than desirable.
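Concretely, the pattern being proposed (memory limit kept, no CPU limit; values illustrative, not a committed default) would look like:

  resources:
    limits:
      memory: 16Gi     # memory limit retained to protect node capacity
    requests:
      cpu: "1"         # CPU request for scheduling, but no CPU limit
      memory: 16Gi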
(In reply to Jeff Cantrill from comment #1)
> None of our pods should have CPU limits, only memory limits. This came out
> of pportante's experience of pods getting kicked off nodes.

Was that setting the `requests` or the `limits`?

  resources:
    limits:
      cpu: 500m
      memory: 4Gi
    requests:
      cpu: 500m
      memory: 4Gi

I would expect the requests settings to cause issues with the scheduler if there is not a node which can spare 500m CPU or 4Gi RAM.

> The CPU limits should be a separate bug. I hesitate to drop the memory
> limit, as we explicitly set it to 16G in the 3.x releases because of
> capacity issues. I defer to PM on their desires here, as we now have an
> alternate user experience which is similarly less than desirable.

Right. The alternative to Elasticsearch not being scheduled at all is Elasticsearch having poor performance (and the subsequent support calls about log records not showing up in Kibana, or Kibana not working at all).
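To make the distinction explicit (standard Kubernetes semantics, using the values quoted above):

  resources:
    requests:        # used by the scheduler: the pod stays Pending if no
      cpu: 500m      # node has this much unreserved capacity
      memory: 4Gi
    limits:          # enforced at runtime: CPU usage over the limit is
      cpu: 500m      # throttled; memory usage over the limit is OOM-killed
      memory: 4Gi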
I opened bug 1710657 for the ES CPU limit issue. The gory details are in that bz, but here is the tl;dr request/limit excerpt from the Elasticsearch deployment YAML:

  resources:
    limits:
      cpu: "1"
      memory: 16Gi
    requests:
      cpu: "1"
      memory: 16Gi
Mike, is the result acceptable? The default resources deployed by CLO:

  resources:
    limits:
      memory: 16Gi
    requests:
      cpu: "1"
      memory: 16Gi

The default resources deployed by EO:

  resources:
    limits:
      memory: 4Gi
    requests:
      cpu: 100m
      memory: 1Gi
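For completeness: when the Elasticsearch operator is used standalone, the EO defaults above should be overridable on the Elasticsearch custom resource. The field path (spec.nodeSpec.resources) is my reading of the EO CRD, so treat this as a sketch rather than a verified config:

  apiVersion: logging.openshift.io/v1
  kind: Elasticsearch
  metadata:
    name: elasticsearch
    namespace: openshift-logging
  spec:
    nodeSpec:
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: 100m
          memory: 1Gi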
Verified on the cluster-logging-operator image from 4.2.0-0.nightly-2019-07-17-165351. Values are set as in comment 5.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922