Description of problem: TCMalloc supports changing the size of the thread cache through the environment variable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES. Due to a bug, this variable was not honored in TCMalloc 2.1, which was the version previously provided by the Ceph repos. RHEL 7.2 has picked up TCMalloc 2.4, in which the thread cache bug is resolved. Increasing the TCMalloc thread cache to 128M can improve performance 4-5x. It would be great to have a way of setting the TCMalloc thread cache to 128M instead of the default 32M. The Ceph init script should probably handle this, pulling a thread cache size from ceph.conf, /etc/default/ceph, or something similar. There is a related ticket in the upstream tracker here: http://tracker.ceph.com/issues/12513
Gregory, it would be great if we could have this in 1.3.2. Can this be done?
Maybe backport https://github.com/ceph/ceph/pull/6732, which would give us the ability to set this in /etc/sysconfig/ceph (RHEL). For Ubuntu, that PR doesn't touch the upstart files in src/upstart, so we'd need to add something like `[ -f /etc/default/ceph ] && . /etc/default/ceph` to each upstart script, and possibly export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES as well.
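A minimal sketch of what each upstart script would need to do, written as a standalone POSIX shell function so it can be tried outside upstart. `run_with_ceph_defaults` is a hypothetical name, not part of Ceph's packaging. The export matters because a bare `VAR=value` line in the sourced file only creates a shell variable, which the exec'd daemon would not inherit. In a real upstart job the equivalent lines would go in the `script` stanza right before the daemon is exec'd, since upstart runs `pre-start` and the main script in separate shells.

```sh
# Hypothetical helper: source a defaults file (e.g. /etc/default/ceph), then
# exec the given command with TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES exported.
run_with_ceph_defaults() {
    defaults_file=$1
    shift
    (
        # Same idea as: [ -f /etc/default/ceph ] && . /etc/default/ceph
        if [ -f "$defaults_file" ]; then
            . "$defaults_file"
        fi
        # A bare VAR=value line in the sourced file only sets a shell
        # variable; export it so the exec'd daemon inherits it. If the file
        # did not set it, export adds nothing to the environment.
        export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES
        exec "$@"
    )
}
```

For example (illustrative invocation): `run_with_ceph_defaults /etc/default/ceph /usr/bin/ceph-osd --cluster ceph -i 0 -f`.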
Re-targeting to 1.3.2; let's try to get this into the RHEL packaging if we can.
Are you sure TCMalloc defaults to 32MB when the user specifies nothing? http://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html seems to indicate it's 16MB. Should we set a default value in /etc/sysconfig/ceph, or leave a commented-out line there for users to uncomment?
Mark (or anyone), how can I empirically verify that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is taking effect?
(In reply to Ken Dreyer (Red Hat) from comment #5) > Are you sure TCMalloc defaults to 32MB when the user specifies nothing? > http://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html seems to > indicate it's 16MB. I see, "The default cache size is 32M, the tcmalloc documentation is outdated" https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23575.html
James Page @ Ubuntu has cherry-picked the patch that makes TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES effective. This is in google-perftools 2.1-2ubuntu1.1. So in theory we can implement a solution for both RHEL 7 and Ubuntu Trusty. We still need to know the following:
1) Do we want to choose a default value (greater than 32MB), or let the user decide?
2) How can I empirically verify that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is taking effect?
For 2, it doesn't look like we can use the existing memory profiling code to determine the total thread cache size: http://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/ We should probably add an admin socket command to inspect the tcmalloc thread cache size, along the lines of: MallocExtension::instance()->GetNumericProperty("tcmalloc.current_total_thread_cache_bytes", &value); https://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html#Sizing_Thread_Cache_Free_Lists
Do we want to report tcmalloc.current_total_thread_cache_bytes or tcmalloc.max_total_thread_cache_bytes? Or both? Who can add that functionality to the admin socket?
Yeah, you're right. We want tcmalloc.max_total_thread_cache_bytes. I've verified that you can inspect the thread cache size, and interestingly, you can also set it at runtime. This means that we could potentially have the daemon set its own value, based on a setting in ceph.conf. Example: https://gist.github.com/mmgaggle/a5818d4e8528d3681534
We need a patch to Ceph upstream for this. (Mark, if you're not the best assignee, please re-assign as appropriate)
Working on https://github.com/athanatos/ceph/tree/wip-admin-malloc
proposed init systems change: https://github.com/ceph/ceph/pull/7304
After discussions with Kyle, Brent, Mark, Neil and many others, we all agree that the thread cache should default to 128MB. Please change the default setting. I will take care of the associated release notes and doc bugs.
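The sizes in this thread (16MB, 32MB, 64MB, 128MB) are easiest to compare as byte counts. A minimal sketch of converting them, assuming binary multiples (1M = 1024 × 1024 bytes). It is also an assumption here, not something established in this thread, that gperftools reads TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES as a plain integer number of bytes; if so, a suffixed value would need converting before it goes into the defaults file. `size_to_bytes` is a hypothetical helper.

```sh
# Hypothetical helper: convert a size such as 128M or 64MB to a byte count.
# Accepts an optional K/M/G (or KB/MB/GB) suffix, using binary multiples.
size_to_bytes() {
    case $1 in
        *[Kk][Bb]) num=${1%??}; mult=1024 ;;
        *[Mm][Bb]) num=${1%??}; mult=1048576 ;;
        *[Gg][Bb]) num=${1%??}; mult=1073741824 ;;
        *[Kk])     num=${1%?};  mult=1024 ;;
        *[Mm])     num=${1%?};  mult=1048576 ;;
        *[Gg])     num=${1%?};  mult=1073741824 ;;
        *)         num=$1;      mult=1 ;;
    esac
    # Whatever is left must be a non-empty string of digits.
    case $num in
        ''|*[!0-9]*) echo "size_to_bytes: invalid size '$1'" >&2; return 1 ;;
    esac
    echo $((num * mult))
}
```

For example, `size_to_bytes 128M` prints 134217728 and `size_to_bytes 32M` prints 33554432.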
Hi Federico, I have a few questions:
1) It's expected that this fix will improve the performance by 4-5x. Is there a need in 1.3.2 to verify this by running performance tests? If yes, then we may have to coordinate with Ben Turner and Mark Nelson.
2) As per comment 15, the thread cache would be 128MB by default. Do we allow users to change it? If yes, please share the steps to do so on both RHEL and Ubuntu.
3) How do we make sure that whatever value (or default value) we have set for the thread cache has taken effect on both RHEL and Ubuntu clusters? We need steps/instructions for this.
4) We will be running automated tests on RHEL and some manual tests on Ubuntu with this fix in place. Is there anything else that needs to be tested apart from these (Ken, can you please confirm here?)?
I feel the scope of testing this fix for now would be to test 2) and/or 3), with 4) being regression tests. Please let me know your opinion.
Thanks,
Harish
(In reply to Harish NV Rao from comment #16)
> 2) As per comment 15, the default thread cache would be 128MB by default. Do
> we allow users to change it? If yes, please share the steps to do so on both
> RHEL and Ubuntu.
Yes, we will add a "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128M" setting in /etc/sysconfig/ceph (RHEL) and /etc/default/ceph (Ubuntu). Users can edit this setting, for example to "64MB", if they wish.
> 3) How to make sure that whatever the value(or default value) we have set
> for thread cache has taken into effect on both RHEL and Ubuntu clusters?
> Need steps/instructions for this.
On your OSDs, check the output of "ps e -p <ceph-osd-pid>". For example, this checks all the OSD pids on a system:
ps e -p $(pgrep ceph-osd) | grep --color=auto TCMALLOC
The output can be a big wall of text that is hard to read, so "--color" helps there. If "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128M" appears in the output, you will know that the OSD was started with that setting in its environment.
> 4) We would be running automated tests on the RHEL and some manual tests on
> Ubuntu with this fix in place. Is there anything else that need to be tested
> apart from these (Ken, can you please confirm here?) ?
Not that I can think of.
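The `ps e` check can also be done by reading /proc/<pid>/environ directly, which prints only the matching variables instead of the whole environment. A minimal, Linux-only sketch; `show_tcmalloc_env` is a hypothetical helper name. Like `ps e`, it shows the environment the process was started with, so it confirms the setting reached the daemon, not that tcmalloc applied it.

```sh
# Hypothetical helper: print the TCMALLOC_* variables each given PID was
# started with, as "PID: VAR=value" lines. Reading another user's process
# (such as ceph-osd) generally requires root.
show_tcmalloc_env() {
    status=0
    for pid in "$@"; do
        if [ ! -r "/proc/$pid/environ" ]; then
            echo "show_tcmalloc_env: cannot read /proc/$pid/environ" >&2
            status=1
            continue
        fi
        # environ is NUL-separated; turn it into one variable per line.
        tr '\0' '\n' < "/proc/$pid/environ" | grep '^TCMALLOC_' | sed "s/^/$pid: /"
    done
    return $status
}
```

For example, as root: `show_tcmalloc_env $(pgrep ceph-osd)`.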
If it's not already in the regression tests for gperftools, we might want to use this test to ensure the allocator is honoring the environment variable: https://launchpadlibrarian.net/202635014/gperftest.c
To be clear for QE, here is what to check with this bug:
1. After installing the ceph-osd packages, verify that /etc/default/ceph (Ubuntu) or /etc/sysconfig/ceph (RHEL) contains a TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES setting of 128M out of the box.
2. After starting the OSD service, verify that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is part of the OSD process's environment by running "ps e -p <ceph-osd-pid>". For example, this checks all the OSD pids on a system:
   ps e -p $(pgrep ceph-osd) | grep --color=auto TCMALLOC
   The output can be a big wall of text that is hard to read, so "--color" helps there. If "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128M" appears in the output, you will know that the OSD was started with that setting.
3. Change the value to something else (e.g. 64M), restart the daemons, and check again with "ps" that the environment variable reflects the new "64M" value.
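Step 3 means editing the defaults file between restarts. A minimal sketch of doing that edit in place, assuming GNU sed (for `sed -i`) as shipped on RHEL and Ubuntu; `set_tcmalloc_cache_bytes` is a hypothetical helper. It writes the value exactly as given and only restricts it to letters and digits so it can be spliced safely into the sed expression.

```sh
# Hypothetical helper: set TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a
# defaults file such as /etc/sysconfig/ceph or /etc/default/ceph.
set_tcmalloc_cache_bytes() {
    file=$1
    value=$2
    # Keep the value simple so it is safe inside the sed replacement.
    case $value in
        ''|*[!0-9A-Za-z]*)
            echo "set_tcmalloc_cache_bytes: invalid value '$value'" >&2
            return 1 ;;
    esac
    if grep -q '^TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=' "$file"; then
        # Rewrite the existing, uncommented assignment in place.
        sed -i "s/^TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=.*/TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$value/" "$file"
    else
        # No active setting yet (absent or commented out): append one.
        printf 'TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=%s\n' "$value" >> "$file"
    fi
}
```

After editing, restart the OSDs with whichever init system is in use and repeat the `ps` check from step 2.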
Verified on a RHEL machine as described in comment 21. The default value is 128 MB; I changed it to 64MB, 32MB, and back to 128MB, and it works as expected, hence moving to VERIFIED. Version: ceph-osd-0.94.5-4.el7cp.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:0313
upstream change to 128MB by default: https://github.com/ceph/ceph/pull/7934