Bug 1297502

Summary: [RFE] Add support for modifying the TCMalloc thread cache
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Kyle Bader <kbader>
Component: Build Assignee: Samuel Just <sjust>
Status: CLOSED ERRATA QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: medium Docs Contact: Bara Ancincova <bancinco>
Priority: unspecified    
Version: 1.3.2 CC: bhubbard, ceph-eng-bugs, dzafman, flucifre, gmeno, hnallurv, kbader, kchai, kdreyer, mnelson, racpatel, sjust, vumrao
Target Milestone: rc Keywords: FutureFeature
Target Release: 1.3.2   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHEL: ceph-0.94.5-4.el7cp Ubuntu: ceph_0.94.5-3redhat1trusty Doc Type: Enhancement
Doc Text:
.TCMalloc thread cache is now configurable
With Red Hat Ceph Storage 1.3.2, support for modifying the size of the `TCMalloc` thread cache has been added. Increasing the thread cache size significantly improves Ceph cluster performance. To set the thread cache size, edit the value of the `TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES` parameter in the Ceph system configuration file, that is `/etc/sysconfig/ceph` for Red Hat Enterprise Linux and `/etc/default/ceph` for Ubuntu. In addition, the default value of `TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES` has been changed from 32 MB to 128 MB.
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-02-29 14:44:42 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1299303    

Description Kyle Bader 2016-01-11 17:09:39 UTC
Description of problem:

TCMalloc supports changing the size of the thread cache through the environment variable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES. This variable was not honored with TCMalloc 2.1 due to a bug, and that was the version previously provided by the Ceph repos. RHEL 7.2 has picked up TCMalloc 2.4, and the thread cache bug is resolved in this version. Increasing the TCMalloc thread cache to 128M can improve performance 4-5x. It would be great to have a way of setting the TCMalloc thread cache to 128M, instead of the default 32M. The Ceph init script should probably handle this, pulling a thread cache number from ceph.conf, /etc/default/ceph, or something similar.

There is a related ticket in the upstream tracker here:

http://tracker.ceph.com/issues/12513

Comment 2 Federico Lucifredi 2016-01-18 16:48:37 UTC
Gregory, it would be great if we could have this in 1.3.2 — can this be done?

Comment 3 Ken Dreyer (Red Hat) 2016-01-19 03:21:45 UTC
Maybe backport https://github.com/ceph/ceph/pull/6732 , which would give us the ability to set this in /etc/sysconfig/ceph (RHEL).

For Ubuntu, that PR doesn't touch the upstart files in src/upstart, so we'd need to add something like

[ -f /etc/default/ceph ] && . /etc/default/ceph

...to each upstart script, and possibly export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES as well.
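The idea above could be sketched roughly as follows. This is a minimal illustration, not the actual upstart change: the function name `load_ceph_defaults` and its optional file argument are hypothetical, and it assumes the defaults file is a plain shell fragment that assigns the variable.

```shell
# Hypothetical helper: source a defaults file if it exists, then export
# the thread cache setting so that child processes (the daemon) inherit it.
load_ceph_defaults() {
    defaults_file="${1:-/etc/default/ceph}"
    # Same guard as the one-liner above: only source the file if present.
    [ -f "$defaults_file" ] && . "$defaults_file"
    # Export only when the variable ended up set; if it is unset,
    # nothing is passed on to the daemon's environment.
    if [ -n "${TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES:-}" ]; then
        export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
    fi
    return 0
}
```

Calling `load_ceph_defaults` before the daemon is exec'd would be enough for the daemon to see the variable; without the `export`, the assignment would stay local to the script's shell.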

Comment 4 Ken Dreyer (Red Hat) 2016-01-19 15:08:05 UTC
Re-targeting to 1.3.2; let's try to get this into the RHEL packaging if we can.

Comment 5 Ken Dreyer (Red Hat) 2016-01-19 15:20:16 UTC
Are you sure TCMalloc defaults to 32MB when the user specifies nothing? http://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html seems to indicate it's 16MB.

Should we set a default value in /etc/sysconfig/ceph, or leave a commented-out line there for users to uncomment?

Comment 6 Ken Dreyer (Red Hat) 2016-01-19 18:40:19 UTC
Mark (or anyone), how can I empirically verify that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is taking effect?

Comment 7 Ken Dreyer (Red Hat) 2016-01-19 20:29:29 UTC
(In reply to Ken Dreyer (Red Hat) from comment #5)
> Are you sure TCMalloc defaults to 32MB when the user specifies nothing?
> http://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html seems to
> indicate it's 16MB.

I see, "The default cache size is 32M, the tcmalloc documentation is outdated" https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23575.html

Comment 8 Ken Dreyer (Red Hat) 2016-01-19 20:53:21 UTC
James Page @ Ubuntu has cherry-picked the patch that makes TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES effective. This is in google-perftools 2.1-2ubuntu1.1. So in theory we can implement a solution for both RHEL 7 and Ubuntu Trusty.

Still need to know the following:

1) Do we want to choose a default value (greater than 32MB), or let the user decide?

2) How can I empirically verify that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is taking effect?

Comment 9 Kyle Bader 2016-01-19 21:05:44 UTC
For 2, it doesn't look like we can use the existing memory profiling code to determine the total thread cache size:

http://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/

We should probably add an admin socket command to inspect the tcmalloc thread cache size, a la:

MallocExtension::instance()->GetNumericProperty("tcmalloc.current_total_thread_cache_bytes", &value);

https://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html#Sizing_Thread_Cache_Free_Lists

Comment 10 Ken Dreyer (Red Hat) 2016-01-19 21:26:38 UTC
Do we want to report tcmalloc.current_total_thread_cache_bytes or tcmalloc.max_total_thread_cache_bytes? Or both?

Who can add that functionality to the admin socket?

Comment 11 Kyle Bader 2016-01-20 03:37:36 UTC
Yeah, you're right. We want tcmalloc.max_total_thread_cache_bytes.

I've verified that you can inspect the thread cache size, and interestingly, you can also set it at runtime. This means that we could potentially have the daemon set its own value, based on something in ceph.conf.

Example:

https://gist.github.com/mmgaggle/a5818d4e8528d3681534

Comment 12 Ken Dreyer (Red Hat) 2016-01-20 15:50:55 UTC
We need a patch to Ceph upstream for this. (Mark, if you're not the best assignee, please re-assign as appropriate)

Comment 13 Samuel Just 2016-01-20 21:36:24 UTC
Working on https://github.com/athanatos/ceph/tree/wip-admin-malloc

Comment 14 Ken Dreyer (Red Hat) 2016-01-21 05:22:40 UTC
proposed init systems change: https://github.com/ceph/ceph/pull/7304

Comment 15 Federico Lucifredi 2016-01-22 22:31:41 UTC
After discussions with Kyle, Brent, Mark, Neil and many others, we all agree that the thread cache should default to 128MB.

Please change the default setting. I will take care of release notes and doc bugs associated.

Comment 16 Harish NV Rao 2016-01-25 10:21:47 UTC
Hi Federico,

I have a few questions:

1) It's expected that this fix will improve the performance by 4-5x. Is there a need in 1.3.2 to support this by running performance tests? If yes, then we may have to coordinate with Ben Turner and Mark Nelson.

2) As per comment 15, the default thread cache would be 128MB by default. Do we allow users to change it? If yes, please share the steps to do so on both RHEL and Ubuntu. 

3) How do we make sure that the value (or default value) we have set for the thread cache has taken effect on both RHEL and Ubuntu clusters? Need steps/instructions for this.

4) We would be running automated tests on RHEL and some manual tests on Ubuntu with this fix in place. Is there anything else that needs to be tested apart from these (Ken, can you please confirm here?)?

I feel the scope of testing this fix for now would be to test 2) and/or 3) [with 4) being regression tests] above. Please let me know your opinion.

Thanks,
Harish

Comment 17 Ken Dreyer (Red Hat) 2016-01-25 16:03:26 UTC
(In reply to Harish NV Rao from comment #16)
> 2) As per comment 15, the default thread cache would be 128MB by default. Do
> we allow users to change it? If yes, please share the steps to do so on both
> RHEL and Ubuntu. 

Yes, we will add a "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128M" setting in /etc/sysconfig/ceph (RHEL) and /etc/default/ceph (Ubuntu). Users will be allowed to edit this setting to "64M", for example, if they wish.
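The variable name suggests a byte count, and this thread does not confirm whether the allocator interprets a suffixed value such as "128M" or only a plain integer. As a cautious sketch, a suffixed size can be converted to plain bytes before it is written to the file. The helper name `to_bytes` is hypothetical, and treating K/M/G as powers of 1024 is an assumption.

```shell
# Hypothetical helper: convert a size such as 128M or 64MB to bytes.
# Assumption: K, M and G mean powers of 1024.
to_bytes() {
    size="${1:-}"
    size="${size%[Bb]}"          # accept "64MB" as well as "64M"
    case "$size" in
        *[Kk]) num="${size%?}"; mult=1024 ;;
        *[Mm]) num="${size%?}"; mult=1048576 ;;
        *[Gg]) num="${size%?}"; mult=1073741824 ;;
        *)     num="$size";     mult=1 ;;
    esac
    # Reject anything that is not a plain non-negative integer.
    case "$num" in
        ''|*[!0-9]*) echo "invalid size: ${1:-}" >&2; return 1 ;;
    esac
    echo $(( $num * $mult ))
}

to_bytes 128M    # prints 134217728
```

The printed number can then be used as the value of the setting in place of the suffixed form.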

> 3) How do we make sure that the value (or default value) we have set for the
> thread cache has taken effect on both RHEL and Ubuntu clusters? Need
> steps/instructions for this.

On your OSDs, check the output of "ps e -p <ceph-osd-pid>". For example, this checks all the OSD pids on a system:

  ps e -p $(pgrep ceph-osd) | grep --color=auto TCMALLOC

It may be a big wall of text that is hard to read, so "--color" helps there.

If "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128M" is in the output there, you will know that it is in effect.
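To avoid scanning the wall of text by eye, the value can be pulled out of the `ps e` output with a small filter. This is only a sketch: the function name `thread_cache_setting` is hypothetical, and it assumes the environment entries in the output are separated by spaces.

```shell
# Hypothetical filter: read `ps e` style text on stdin and print the
# value of every TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=... entry found.
thread_cache_setting() {
    # Put each space-separated word on its own line, then keep only the
    # matching entries with the "NAME=" prefix stripped.
    tr ' ' '\n' | sed -n 's/^TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=//p'
}

# Intended use, reusing the command above:
#   ps e -p $(pgrep ceph-osd) | thread_cache_setting
```

One value is printed per process that carries the variable; a process without it contributes no line.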

> 4) We would be running automated tests on RHEL and some manual tests on
> Ubuntu with this fix in place. Is there anything else that needs to be tested
> apart from these (Ken, can you please confirm here?)?

Not that I can think of.

Comment 18 Kyle Bader 2016-01-25 18:10:19 UTC
If it's not already in the regression tests for gperftools, we might want to use this test to ensure the allocator is honoring the environment variable:

https://launchpadlibrarian.net/202635014/gperftest.c

Comment 21 Ken Dreyer (Red Hat) 2016-02-03 16:42:54 UTC
To be clear to QE, things to check with this bug:

1. After installing the ceph-osd packages, verify /etc/default/ceph (Ubuntu) or /etc/sysconfig/ceph (RHEL) contains a TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES setting of 128M out of the box.

2. After starting up the OSD service, verify that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is part of the OSD pid's environment. Run "ps e -p <ceph-osd-pid>". For example, this checks all the OSD pids on a system:

  ps e -p $(pgrep ceph-osd) | grep --color=auto TCMALLOC

It may be a big wall of text that is hard to read, so "--color" helps there.

If "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128M" is in the output there, you will know that it is in effect.

3. Change the value to something else (eg 64M), restart the daemons, and check again with "ps" that the environment variable reflects the new "64M" value.
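The file checks in steps 1 and 3 could be done without sourcing the file, for example with a helper like the one below. It is only a sketch: `configured_thread_cache` is a hypothetical name, and it assumes the setting appears as an uncommented `NAME=value` line, optionally preceded by `export`.

```shell
# Hypothetical helper: print the configured thread cache value from a
# sysconfig/defaults style file, ignoring commented-out lines.
configured_thread_cache() {
    conf_file="$1"
    # Match an optional "export", strip everything up to "=", keep the
    # last assignment in the file, and drop surrounding quotes.
    sed -n 's/^[[:space:]]*\(export[[:space:]][[:space:]]*\)\{0,1\}TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=//p' "$conf_file" \
        | tail -n 1 | tr -d "\"'"
}

# Intended use:
#   configured_thread_cache /etc/sysconfig/ceph    # RHEL
#   configured_thread_cache /etc/default/ceph      # Ubuntu
```

Comparing this output with the value seen in the running daemon's environment shows whether a restart has picked up an edited setting.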

Comment 22 Rachana Patel 2016-02-04 18:47:17 UTC
Verified as described in comment 21 on a RHEL machine. The default value is 128 MB; changed it to 64 MB, 32 MB, and back to 128 MB. Working as expected, hence moving to VERIFIED.

Version:
ceph-osd-0.94.5-4.el7cp.x86_64

Comment 25 errata-xmlrpc 2016-02-29 14:44:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0313

Comment 26 Ken Dreyer (Red Hat) 2016-03-04 17:35:50 UTC
upstream change to 128MB by default: https://github.com/ceph/ceph/pull/7934