Bug 1977380

Summary: [RFE] optimize osd_memory_target_cgroup_limit_ratio defaults for different deployments
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ocs-operator
Version: 4.8
Reporter: Ben England <bengland>
Assignee: Malay Kumar Parida <mparida>
QA Contact: Joy John Pinto <jopinto>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Keywords: FutureFeature, Performance
Target Release: ODF 4.12.0
Flags: mbukatov: needinfo?
Hardware: Unspecified
OS: Unspecified
Doc Type: Enhancement
Type: Bug
CC: bkunal, bniver, ebenahar, jdurgin, kramdoss, mbukatov, muagarwa, ocs-bugs, odf-bz-bot, orit.was, owasserm, rperiyas, shan, shberry, sostapov, tnielsen
Clones: 1991458 (view as bug list)
Bug Blocks: 1991458
Last Closed: 2023-02-08 14:06:28 UTC

Description Ben England 2021-06-29 14:54:36 UTC
Description of problem (please be as detailed as possible and provide log snippets):
-----------------------------------------------------------------------

The OCS operator defaults the parameter osd_memory_target_cgroup_limit_ratio to 0.5 (1/2 of the memory CGroup limit).   This effectively wastes 50% of the memory allocated to the OSD pod by Kubernetes, and prevents effective bluestore caching (where cache hits are in-process, rather than from Linux buffer cache).  I know it can't be too aggressive, but below is an idea for how it could be done better.


Version of all relevant components (if applicable):
-----------------------------------------------------------------------

OCS 4.8
OCP 4.7

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
-----------------------------------------------------------------------

It makes baremetal configs work poorly, since OSDs are forced to rely on the kernel buffer cache for caching, defeating the whole point of the BlueStore design, which is to bypass the kernel buffer cache.


Is there any workaround available to the best of your knowledge?
-----------------------------------------------------------------------

I have a workaround, but it is not one we would want customers to use.

from toolbox: ceph config set global osd_memory_target_cgroup_limit_ratio 0.8

If I understood Josh Durgin's e-mail correctly, OCS OSDs get their parameters from the monitor, not from /etc/ceph/ceph.conf.
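
For completeness, a minimal way to confirm the override from the toolbox (assuming 12 OSDs, as in the output shown further down; running OSDs may not pick up the new ratio until they are restarted):

ceph config set global osd_memory_target_cgroup_limit_ratio 0.8
for n in `seq 0 11` ; do ceph config show osd.$n osd_memory_target_cgroup_limit_ratio ; done
# each OSD should report 0.800000 once the new value is in effect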


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
-----------------------------------------------------------------------

Any OCS install.


Is this issue reproducible?
-----------------------------------------------------------------------

Yes, every time.


Can this issue be reproduced from the UI?
-----------------------------------------------------------------------

Yes, every time.


If this is a regression, please provide more details to justify this:
-----------------------------------------------------------------------

rook-ceph used to allow OSD overrides via the rook-ceph-overrides configmap, but that isn't a good design for an opinionated ocs-operator. Still, we can't make Ceph perform well without giving OSDs some local memory.

I am collecting data now, using a baremetal configuration, to justify this statement.


Steps to Reproduce:
-----------------------------------------------------------------------

1. install OCS
2. run fio
3. observe memory consumption


Actual results:
-----------------------------------------------------------------------

Normally OCS OSDs are limited to 2.5 GB of RAM, since the default CGroup memory limit for an OSD is 5 GB and the above parameter caps osd_memory_target at 2.5 GB.

from ceph toolbox:

sh-4.4# for n in `seq 0 11` ; do ceph config show osd.$n osd_memory_target_cgroup_limit_ratio ; done
0.500000
...
0.500000

However, osd_memory_target is nowhere near what I would expect:

sh-4.4# for n in `seq 0 11` ; do ceph config show osd.$n osd_memory_target ; done
42949672960
...

sh-4.4# ceph config show-with-defaults osd.2
...
osd_memory_target                                          42949672960                                                                                                   

override (env[42949672960]),(default[42949672960])         
osd_memory_target_cgroup_limit_ratio                       0.500000   
                                                                          

But the CGroup limits for the OSD pods are (from oc describe pod):

    Limits:
      cpu:     10
      memory:  80Gi
    Requests:
      cpu:        2
      memory:     80Gi

So how is this computed?



Expected results:
-----------------------------------------------------------------------

I would expect that for a large-memory configuration more of the memory could be used by the Ceph OSDs, more like osd_memory_target_cgroup_limit_ratio = 0.9. We have to reserve enough memory to defend against transitory memory allocation spikes that could result in OOM kills. This could be done with a formula in the OSD like:

osd_memory_target = (cgroup-limit * osd_memory_target_cgroup_limit_ratio) - 1.5 GiB
if osd_memory_target < 2 GiB: error


For small-memory configs (i.e. 5 GiB/OSD), this formula reserves roughly the same amount of memory as before for memory allocation spikes. But for large-memory configurations it would give much better memory utilization. For example, if the CGroup limit is 80 GiB, this would result in an osd_memory_target of 70.5 GiB, with plenty of room to spare for transitory memory spikes.
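
To spell out the proposed rule with concrete numbers, here is an illustrative shell helper (the helper name is made up; the 0.9 ratio and the 2 GiB floor are just the values suggested above):

osd_memory_target_gib() {
  # proposed rule: target = cgroup_limit * ratio - 1.5 GiB, error below a 2 GiB floor
  awk -v limit="$1" -v ratio="${2:-0.9}" 'BEGIN {
    t = limit * ratio - 1.5
    if (t < 2) { print "error: osd_memory_target would be below 2 GiB"; exit 1 }
    printf "%.1f GiB\n", t
  }'
}
osd_memory_target_gib 5     # small-memory config: 3.0 GiB
osd_memory_target_gib 80    # large-memory config: 70.5 GiB, matching the 80 GiB example above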


Additional info:
-----------------------------------------------------------------------

see https://github.com/openshift/ocs-operator/blob/master/controllers/storagecluster/reconcile.go#L47

Comment 2 Jose A. Rivera 2021-06-29 16:37:25 UTC
This sounds fair enough, though it is purely an optimization. My hunch is that this would not be a trivial implementation, but not untenable. Still, moving this to ODF 4.9.

We also need more Ceph-focused discussion to determine the right path here. Travis, Orit, could either of you weigh in here?

Comment 3 Travis Nielsen 2021-06-29 18:16:26 UTC
I'm thinking we should do the following:
- The OCS operator should not be setting osd_memory_target_cgroup_limit_ratio in the config overrides
- Rook's storageClassDeviceSets should have a setting, let's call it osdMemoryLimitRatio, to correspond to Ceph's osd_memory_target_cgroup_limit_ratio (see the sketch after this list). If not set, Rook would default it to the suggested ratio depending on the memory limits: 0.5 for smaller OSDs, 0.9 for larger OSDs, etc.
- The OCS StorageCluster CR could also expose the setting if the default computation needs to be overridden.
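
Purely as a sketch of what that might look like in the CephCluster CR (the osdMemoryLimitRatio field is hypothetical and does not exist in Rook today; the surrounding storageClassDeviceSets layout is the usual one):

    storageClassDeviceSets:
    - name: ocs-deviceset
      count: 3
      resources:
        limits:
          memory: 80Gi
        requests:
          memory: 80Gi
      osdMemoryLimitRatio: 0.9   # hypothetical field; Rook would default it (e.g. 0.5 small, 0.9 large) if unset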

Comment 5 Travis Nielsen 2021-07-22 22:39:36 UTC
Taking another look at this, Rook really doesn't seem like the right place to calculate a smart default. Exposing such a detailed setting in Rook should also be avoided if possible, since it gets into the internals of Ceph.

What about the following?
1. The OCS operator should allow the value of osd_memory_target_cgroup_limit_ratio to be overridden, whether this is by allowing the configmap values to not be reconciled (preferred), or by exposing a new setting in the StorageCluster CR. This would give flexibility for perf testing and advanced customers.
2. The OSDs would compute a smart default themselves if the osd_memory_target_cgroup_limit_ratio value is not explicitly set. Ben, is there any reason Ceph can't do this computation itself?

Comment 6 Ben England 2021-07-26 12:05:23 UTC
Let's ask them, cc'ing Josh Durgin and Sebastien Han. Rook is imposing the CGroup limit, but RHCS 5 (container-based) does this too, so it seems like in this time frame we could make Ceph do it when it detects a CGroup memory limit. I don't care who does it as long as it gets done; that's for the engineering groups to decide.

Comment 7 Orit Wasserman 2021-07-26 13:42:20 UTC
(In reply to Travis Nielsen from comment #5)
> Taking another look at this, Rook really doesn't seem like the right place
> to calculate a smart default. Exposing this detailed of a setting in Rook
> should also be avoided if possible since it's getting into the internals of
> ceph.
> 
> What about the following?
> 1. The OCS operator should allow the value of
> osd_memory_target_cgroup_limit_ratio to be overridden, whether this is by
> allowing the configmap values to not be reconciled (preferred), or by
> exposing a new setting in the StorageCluster CR. This would give flexibility
> for perf testing and advanced customers.

+1

> 2. The OSDs themselves would compute a smart default themselves if the
> osd_memory_target_cgroup_limit_ratio value is not explicitly set. Ben is
> there any reason Ceph can't do this computation itself?

@Josh?

Comment 8 Sébastien Han 2021-07-27 10:35:58 UTC
cgroup limits are set via pod resources from the ocs-op.
Ceph detects the variables passed by Rook, respectively POD_MEMORY_LIMIT and POD_MEMORY_REQUEST.

Ben when you said:

>However, osd_memory_target is nowhere near what I would expect:
>
>sh-4.4# for n in `seq 0 11` ; do ceph config show osd.$n osd_memory_target ; done
>42949672960

It's expected. Ceph currently looks up the value of osd_memory_target_cgroup_limit_ratio and multiplies it by the POD_MEMORY_LIMIT value. So in your case:
osd_memory_target_cgroup_limit_ratio (0.5) * POD_MEMORY_LIMIT (85899345920) = 42949672960
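
The numbers line up with the pod spec quoted in the description: 80Gi is 80 * 1024^3 = 85899345920 bytes, and half of that is the 42949672960 reported by ceph config show. One way to confirm what Ceph saw (the pod and container names below are just examples) would be something like:

oc -n openshift-storage exec rook-ceph-osd-2-86dcbc5fbb-wfvnp -c osd -- env | grep POD_MEMORY
# expect POD_MEMORY_LIMIT=85899345920 and POD_MEMORY_REQUEST=85899345920 for an 80Gi limit/request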

Now for "big" configuration, the default value for osd_memory_target_cgroup_limit_ratio is already 0.8 which seems high enough.

Now for "small" configuration, if I read you correctly, Ceph should lower the default value of osd_memory_target_cgroup_limit_ratio when POD_MEMORY_LIMIT is around 5GB which seems acceptable to me if this is what the performance results are showing.
When it's done we can remove ocs-op default's value for osd_memory_target_cgroup_limit_ratio.
Now ocs-op probably sets osd_memory_target_cgroup_limit_ratio to 0.5 because most envs we deploy on don't have that much memory.
Recently, in https://bugzilla.redhat.com/show_bug.cgi?id=1914475, ocs-op allowed overriding the rook-config-override CM when a specific deployment strategy is used.
So I believe you can already set osd_memory_target_cgroup_limit_ratio to 0.9 if you want.
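
For anyone trying that today, the override would look roughly like this (a sketch; the exact ConfigMap layout should be checked against the Rook/OCS docs for the release in use):

oc -n openshift-storage edit configmap rook-config-override
# then, under data.config, add a ceph.conf-style section such as:
#   [osd]
#   osd_memory_target_cgroup_limit_ratio = 0.9
# OSDs would pick up the new value once they are restarted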


Ben, does that help?

Comment 9 Mudit Agarwal 2021-08-09 08:26:37 UTC
Created a clone for 4.8.z https://bugzilla.redhat.com/show_bug.cgi?id=1991458

Comment 10 Ben England 2021-09-16 15:18:22 UTC
If you want IBM Cloud to utilize ODF at large scale, we had better get our efficiency act together. IBM Cloud offers 16-TB volumes as OSDs (ask Elvir Kuric), and a Ceph OSD requires more than 2.5 GB of RAM to service a volume that big, just to cache metadata! Ask Mark Nelson or Josh Durgin. You can have all the functionality in the world, but if ODF runs really slowly, no one will want it for anything more than a toy configuration.

As for bz 1914475, that is an unsupported, undocumented backdoor. So if your primary way to support large ODF sites is this, that defeats the whole point of ODF, which was to make it easy to deploy Ceph.

We need to adjust osd_memory_target in an opinionated way, and I'm suggesting above how you could do that safely and effectively. Perhaps the initial post wasn't clear enough. I want to ensure that there is at least 2 GB separating osd_memory_target and the CGroup limit (= K8S memory:limit), to prevent short-term memory spikes from triggering an OSD OOM. With this formula, for a CGroup limit of 5 GB (the current default), osd_memory_target should be 3 GB, not that different from the current default of 2.5 GB. For a CGroup limit of 10 GB, osd_memory_target should be 8 GB. So by doubling your memory:limit, you get 3x the amount of *usable* memory for each OSD, which should have a very beneficial effect on metadata caching in the OSD. That's why this is important for sites with large OSD volumes.

Comment 11 Ben England 2021-09-16 15:31:55 UTC
Sorry, posted prematurely ;-) I meant "ensure that there is 2.5 GB separating osd_memory_target and the CGroup limit". So for a CGroup limit of 5 GB, osd_memory_target should be 2.5 GB, the same as the current default. For a doubled CGroup limit of 10 GB, osd_memory_target would be 7.5 GB, 3 times what it was before, and 75% efficient. As the CGroup limit climbs, efficiency climbs. The admin can adjust the CGroup limit today via the StorageDeviceSet in the CR.
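
Spelled out for a few CGroup limits (just arithmetic on the corrected rule of reserving 2.5 GB of headroom):

for limit in 5 10 20 40 80 ; do
  awk -v l="$limit" 'BEGIN { t = l - 2.5; printf "limit %2d GB -> osd_memory_target %4.1f GB (%.0f%% usable)\n", l, t, 100*t/l }'
done
# 5 -> 2.5 (50%), 10 -> 7.5 (75%), 20 -> 17.5 (88%), 40 -> 37.5 (94%), 80 -> 77.5 (97%)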

Sorry for the late response.

Comment 13 Jose A. Rivera 2021-10-11 16:00:47 UTC
We do agree that this optimization would be very beneficial to the product. That said, we certainly don't have time to do it for ODF 4.9, so moving it to ODF 4.10 for further consideration.

I'm also renaming this BZ as an RFE. Eran, please weigh in on how much priority this should take in our planning.

Comment 17 Ben England 2021-11-18 19:24:40 UTC
Lowering the queue depth from 16 to 2 for my 9-pod test results in stable numbers for 16-KiB randwrites, while still getting reasonably good throughput. We need to retest across the whole range of parameters to verify, but I am getting excellent stability from the tests this way.

Now for the actual regression numbers, from here:

https://docs.google.com/spreadsheets/d/1hnq-N4eEEvpxg82JYAFTsRSHMSMHn3OGLAWTAqrEQsU/edit#gid=1049516327

It shows that for 16-KiB size (drum roll....):

randread throughput increased by 10% (yay!)
randwrite throughput decreased by 30% (boo!)

These numbers are very consistent and the percent deviation of the samples is VERY LOW. But why? I see that more reads are happening during the randwrite test with ODF 4.9:

with OCS 4.8:  http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-16-14-32/Screenshot%20from%202021-11-16%2017-13-20.png
with ODF 4.9:  http://nfs-storage01.scalelab.redhat.com/bengland/public/openshift/ripsaw/fio-logs/2021-11-17-17-45/iops-over-time.png

To get these PromQL graphs, go to the OpenShift console and replace everything to the right of the hostname in the URL with:

monitoring/query-browser?query0=irate%28node_disk_writes_completed_total%7Bdevice%3D~"nvme%5B0-9%5Dn1"%7D%5B1m%5D%29&query1=irate%28node_disk_reads_completed_total%7Bdevice%3D~"nvme%5B0-9%5Dn1"%7D%5B1m%5D%29
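
URL-decoded, those two queries are:

irate(node_disk_writes_completed_total{device=~"nvme[0-9]n1"}[1m])
irate(node_disk_reads_completed_total{device=~"nvme[0-9]n1"}[1m])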

It looks like more reads are happening during the randwrite test with ODF 4.9 than happened with 4.8. If so, this could explain the regression.

The next step is to run this test with higher memory limits and see if this difference goes away. If so, the solution might be to just give OSDs a little more memory to account for changes in RHCS 5 (Ceph Pacific), which wouldn't be so bad. We can investigate why this might have happened, but the important thing for now is to stop the regression so we can release 4.9. Right now I've altered the ocs-storagecluster CR as shown here:

[bengland@localhost ripsaw]$ ocos get storagecluster ocs-storagecluster -o json | jq .spec.storageDeviceSets[0].resources
{
  "limits": {
    "cpu": "2",
    "memory": "10Gi"
  },
  "requests": {
    "cpu": "2",
    "memory": "10Gi"
  }
}

So now I get OSDs with more memory and CPU:

[bengland@localhost ripsaw]$ oc -n openshift-storage get pod rook-ceph-osd-2-86dcbc5fbb-wfvnp -o json | jq .spec.containers[0].resources
{
  "limits": {
    "cpu": "2",
    "memory": "10Gi"
  },
  "requests": {
    "cpu": "2",
    "memory": "10Gi"
  }
}
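
If the mechanism described in comment 8 holds, the effective osd_memory_target after this change can be re-checked from the toolbox; with a 10Gi limit and the 0.5 ratio it should come out to about 5368709120 bytes (10 * 1024^3 * 0.5):

ceph config show osd.2 osd_memory_target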

Rerunning the test now. If this works, I'll experiment with different amounts of memory to see how much is actually needed. Shekhar Berry has already shed light on this with his bz:

https://bugzilla.redhat.com/show_bug.cgi?id=2012082

Comment 18 Ben England 2021-11-18 19:27:31 UTC
Ignore the last reply to this bz; it was intended for a different bz!

Comment 19 Orit Wasserman 2021-11-23 13:56:28 UTC
We can start by setting osd_memory_target_cgroup_limit_ratio to 0 to disable the feature. We set the memory requests and limits to the same value in OCS/ODF. It should be the simplest fix.

Comment 20 Orit Wasserman 2021-11-23 13:59:59 UTC
@Ben, can you try setting osd_memory_target_cgroup_limit_ratio to 0?
In ODF there is no reason to use it, as we set the same memory requests and limits on the OSD pod.

Comment 21 Shekhar Berry 2021-11-24 05:27:52 UTC
The reason we have left some buffer in the OSD by setting osd_memory_target_cgroup_limit_ratio to 0.5 is to prevent OSDs from getting OOM-killed. This can happen when a burst of high IO causes transitory spikes in OSD memory usage, causing K8s to kill the OSD pod.

So, IMO, if we set osd_memory_target_cgroup_limit_ratio to 0, the possibility of an OSD getting OOM-killed increases significantly.

On the other hand, for large OSD memory sizes (especially in baremetal environments), we would waste a lot of memory if osd_memory_target_cgroup_limit_ratio is set to 0.5, as Ben explained in comment 0.

I am OK with the solution proposed by Ben in comment 0; it would work well for both large and small memory sizes.

Just to repeat what Ben says:

Set osd_memory_target_cgroup_limit_ratio to 0.9

osd_memory_target = (OSD_limit * osd_memory_target_cgroup_limit_ratio) - 1.5 GB

What do you think of this solution Orit?

Comment 22 Orit Wasserman 2021-11-25 17:45:34 UTC
After consulting with Josh Durgin (RADOS):
Using the Ceph default of 0.8 with the OCS default of 5G leaves a 1G buffer, which would be more than enough, and there will be no risk of OOM.
I recommend just using the Ceph default, as it will most likely be updated if needed, and it is the simplest option.

Comment 23 Mark Nelson 2021-11-29 16:29:45 UTC
@owasserm A 20% buffer is a good default starting point, but it in no way guarantees no risk of OOM. Rather, it was enough to prevent OOM in the cases we tested upstream with the default 4GB osd_memory_target, using the default tcmalloc memory allocator. It's possible that there are corner cases causing extreme memory fragmentation that could still push a given OSD over that limit (even temporarily), and a hard 20% cap per OSD could then result in OOM. It may be rare, but it could still happen.

This is one area where containers with hard memory limits suffer versus containers with no limits (or non-container installs). Ceph's memory autotuning works well to keep daemons within a reasonable average memory limit over time, but it's too CPU intensive to have it react to extreme transient memory spikes (this is all configurable via config parameters, btw). Having it react too quickly also means we thrash the caches every time something not controlled by the priority cache uses too much memory. When memory limits are shared across OSDs, these transient memory spikes are much less likely to cause a global OOM event than when implementing per-container or per-OSD quotas.

Comment 24 Mark Nelson 2021-11-29 19:28:33 UTC
To illustrate the point, here's a look at mapped OSD memory from Ceph master (20210923) during interleaved RBD 4K random read/write and RGW 4K object puts/gets/deletes.


https://docs.google.com/spreadsheets/d/1lSp2cLzYmRfPILDCyLMXciIfdf0OvSFngwXukQFXIqQ/edit#gid=1834043199


In most cases we keep the mapped* OSD memory overage well below 20%. There is a single point during a transition from RGW deletes to RBD 4K random reads (around the 1680-second mark) where the overage spikes to nearly the full 20%. Very small RGW object workloads tend to stress memory allocation far more than RBD, so this isn't totally unexpected. That kind of extreme workload transition, especially combined with very high IO depth, is when we might see the overage potentially go beyond 20%. Having said that, this test was designed specifically to stress memory allocation behavior, so in practice even large spikes may be rarer on properly configured** systems than this test indicates.

* Mapped memory is heap memory minus unmapped memory, but it's not the same as the RSS memory reported by, say, top. Even if we tell the kernel we don't want some region of memory anymore, there's no guarantee that the kernel will (or even can) reclaim it.

** Meaning huge pages are not being improperly used, tcmalloc is used with enough thread cache, the OSD memory target is 4GB+, RocksDB is not throttling, etc.

Comment 25 Orit Wasserman 2021-12-01 14:44:45 UTC
We always use tcmalloc in ODF, so that is good news. Also, the workloads used on ODF are not heavy on I/O, and the disk size we support per OSD is up to 4T.
All of this makes OOM very unlikely.

Comment 26 Jose A. Rivera 2021-12-01 18:20:49 UTC
I feel comfortable taking this in for ODF 4.9.z; the development work should be fairly trivial. It would come down to QE how much they're willing to take on for validation.

Comment 27 Yaniv Kaul 2021-12-14 12:19:01 UTC
(In reply to Jose A. Rivera from comment #26)
> I feel comfortable taking this in for ODF 4.9.z, the development work should
> be fairly trivial. It would come down to QE how much they're willing to take
> on the validation.

QE?

Comment 33 Martin Bukatovic 2022-05-10 13:11:00 UTC
Clarification on the status of this bug:

- Jose agreed (comment 26) to implement the changes summarized in comment 21; does that still hold?
- Do we plan to have performance testing done by a perf team for this change?
- Are we comfortable not having this change covered by performance tests?

The QE team would cover this with standard regression runs, but on top of that, I assume we should
try additional standard storage-node memory sizes, making sure that nothing is OOM-killed
by mistake.

Comment 36 Mudit Agarwal 2022-10-19 03:18:43 UTC
Fixed as part of this dev preview epic.
https://issues.redhat.com/browse/RHSTOR-2517

Comment 37 Mudit Agarwal 2022-11-16 02:52:47 UTC
The original request was to automatically tune the configuration.
Instead, we have provided a way to configure it manually, as per https://bugzilla.redhat.com/show_bug.cgi?id=1977380#c30

Requesting qa_ack on that basis, as the above feature is already part of 4.12.