Bug 1717414
| Summary: | glibc: glibc malloc vs tcmalloc performance on fio using Ceph RBD | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jason Dillaman <jdillama> |
| Component: | glibc | Assignee: | glibc team <glibc-bugzilla> |
| Status: | CLOSED UPSTREAM | QA Contact: | qe-baseos-tools-bugs |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.2 | CC: | ashankar, chayang, codonell, coli, commodorekappa+redhat, dj, fweimer, jinzhao, juzhang, mnewsome, pbonzini, pfrankli, qzhang, rbalakri, sipoyare, virt-maint, yama |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-01 07:31:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1496871 | | |
| Bug Blocks: | | | |
Description
Jason Dillaman
2019-06-05 12:43:22 UTC

Description of problem:
RHEL 8 removed the use of tcmalloc within QEMU under the assumption that the new glibc thread cache allocator improvements would eliminate the need for tcmalloc. Performance testing under high IOPS workloads shows that this is not the case.

Version-Release number of selected component (if applicable):
RHEL 8.0

How reproducible:
100%

Steps to Reproduce:
1. Use "fio" within a VM to benchmark a fast Ceph RBD virtual disk
2. Re-run after starting QEMU under "LD_PRELOAD=/usr/lib64/libtcmalloc.so"

Actual results:
RHEL 8.0 QEMU is around 20% slower when not using tcmalloc as compared to an instance of QEMU that is using tcmalloc.
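Step 2 of the reproduction boils down to launching the guest's QEMU process with tcmalloc preloaded. A minimal sketch of such an invocation is below; the machine settings, disk paths, and the Ceph pool/image/client names are all placeholders rather than details from this bug, and a libvirt-managed guest would need the preload injected differently since libvirt builds the QEMU command line itself:

# Hypothetical direct QEMU invocation with the guest's data disk on RBD,
# run once as-is (glibc malloc) and once with tcmalloc preloaded.
$ /usr/libexec/qemu-kvm \
    -machine q35,accel=kvm -m 4096 -smp 4 \
    -drive file=/var/lib/images/guest.qcow2,format=qcow2,if=virtio \
    -drive file=rbd:rbd/image1:id=admin,format=raw,if=virtio,cache=none

$ LD_PRELOAD=/usr/lib64/libtcmalloc.so /usr/libexec/qemu-kvm \
    -machine q35,accel=kvm -m 4096 -smp 4 \
    -drive file=/var/lib/images/guest.qcow2,format=qcow2,if=virtio \
    -drive file=rbd:rbd/image1:id=admin,format=raw,if=virtio,cache=none

Inside the guest, fio is then run against the virtio disk backed by the RBD image, as described in the steps above.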
(In reply to Jason Dillaman from comment #0)
> Description of problem:
> RHEL 8 removed the use of tcmalloc within QEMU under the assumption that the
> new glibc thread cache allocator improvements would eliminate the need for
> tcmalloc. Performance testing under high IOPS workloads shows that this is
> not the case.
>
> Version-Release number of selected component (if applicable):
> RHEL 8.0
>
> How reproducible:
> 100%
>
> Steps to Reproduce:
> 1. Use "fio" within a VM to benchmark a fast Ceph RBD virtual disk
> 2. Re-run after starting QEMU under "LD_PRELOAD=/usr/lib64/libtcmalloc.so"
>
> Actual results:
> RHEL 8.0 QEMU is around 20% slower when not using tcmalloc as compared to an
> instance of QEMU that is using tcmalloc.

Paolo pushed for this change via bug 1496871, so reassigning to him.

Is it 20% slower with non-RBD disks too? Anyway, we need to gather a trace and send it over to the glibc guys.

Also, RBD has support in fio; it would be interesting to see if enabling tcmalloc via LD_LIBRARY_PATH provides a speedup on the host. That would make it much simpler to gather traces for glibc.

(In reply to Paolo Bonzini from comment #3)
> Jason, can you look at comment 2?

I don't know if it's slower w/ non-RBD disks. My focus is RBD land and trying to avoid a major performance regression on RHEL 8.

(In reply to Paolo Bonzini from comment #4)
> Also, RBD has support in fio; it would be interesting to see if enabling
> tcmalloc via LD_LIBRARY_PATH provides a speedup on the host. That would make
> it much simpler to gather traces for glibc.

You mean w/ LD_PRELOAD? If so, the answer is yes, and now (as a result) fio will link w/ tcmalloc automatically if it's available [1].

[1] https://github.com/axboe/fio/commit/01fe773df4bc4a35450ce3ef50c8075b3bf55cd0#diff-e2d5a00791bce9a01f99bc6fd613a39d
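For anyone reproducing the host-side comparison suggested above, a minimal fio job against an RBD image might look like the sketch below. The pool, image, and client names are placeholders (not taken from this bug), and the image must already exist:

# rbd-4k-randwrite.fio -- hypothetical job file; pool/image/client names are placeholders
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=image1
rw=randwrite
bs=4k
iodepth=16
runtime=60
time_based

[rbd-4k-randwrite]

# Run once with glibc malloc and once with tcmalloc preloaded:
$ fio rbd-4k-randwrite.fio
$ LD_PRELOAD=/usr/lib64/libtcmalloc.so fio rbd-4k-randwrite.fio

Note that fio builds containing the commit linked above already link against tcmalloc when it is available at build time, so the LD_PRELOAD comparison is only meaningful against a build configured without it.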
QEMU has been recently split into sub-components and, as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Since the upstream commit for fio (https://github.com/axboe/fio/commit/01fe773df4bc4a35450ce3ef50c8075b3bf55cd0) says that it is reproducible without QEMU, reassigning to glibc.

(In reply to Paolo Bonzini from comment #9)
> Since the upstream commit for fio
> (https://github.com/axboe/fio/commit/01fe773df4bc4a35450ce3ef50c8075b3bf55cd0)
> says that it is reproducible without QEMU, reassigning to glibc.

Could you please provide detailed steps to reproduce this, or if you can, access to a system that reproduces it?

I don't know how to set up a Ceph RBD that is appropriate for a high IOPS workload as described in the original description.

It would be easiest if we had access to systems you consider sufficiently correctly configured.

(In reply to Carlos O'Donell from comment #10)
> Could you please provide detailed steps to reproduce this, or if you can,
> access to a system that reproduces it?
>
> I don't know how to set up a Ceph RBD that is appropriate for a high IOPS
> workload as described in the original description.
>
> It would be easiest if we had access to systems you consider sufficiently
> correctly configured.

I can't provide a system since I really only have my personal development box. I can, however, perhaps just provide a reproducer in the new year (since I won't have time to work on it next week) or maybe a test build of QEMU w/ a dummy RBD driver.

The big change was that QEMU used to link against tcmalloc for this exact reason, but it was dropped in RHEL 8 because glibc stated it improved its allocator to the point where tcmalloc was no longer necessary. However, that does not appear to be the actual case.

(In reply to Jason Dillaman from comment #11)
> I can't provide a system since I really only have my personal development
> box. I can, however, perhaps just provide a reproducer in the new year
> (since I won't have time to work on it next week) or maybe a test build of
> QEMU w/ a dummy RBD driver.
>
> The big change was that QEMU used to link against tcmalloc for this exact
> reason, but it was dropped in RHEL 8 because glibc stated it improved its
> allocator to the point where tcmalloc was no longer necessary. However,
> that does not appear to be the actual case.

The glibc malloc allocator is improved (thread local cache), but different workloads will see different benefits. I look forward to getting a reproducer that we can use to test the performance.

Just for clarity, will this impact current Ceph users?

(In reply to Carlos O'Donell from comment #12)
> Just for clarity, will this impact current Ceph users?

IO is 20% slower (see the problem description) as compared to the same software build running under tcmalloc (via LD_PRELOAD).

With the dummy librbd / rbd CLI tool, you can see the effect of the glibc memory allocator:

# GLIBC MEMORY ALLOCATOR
$ rbd bench --io-type write --io-pattern rand --io-size 4K --io-total 5G image1
bench type write io_size 4096 io_threads 16 bytes 5368709120 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1    101200    101418   396 MiB/s
    2    202960    101589   397 MiB/s
    3    303488    101235   395 MiB/s
    4    404160    101094   395 MiB/s
    5    506112    101265   396 MiB/s
    6    611936    102146   399 MiB/s
    7    718464    103100   403 MiB/s
    8    817680    102838   402 MiB/s
    9    918384    102844   402 MiB/s
   10   1024544    103686   405 MiB/s
   11   1125680    102748   401 MiB/s
   12   1228208    101948   398 MiB/s
elapsed: 12  ops: 1310720  ops/sec: 102359  bytes/sec: 400 MiB/s

# TCMALLOC MEMORY ALLOCATOR
$ LD_PRELOAD=/usr/lib64/libtcmalloc.so rbd bench --io-type write --io-pattern rand --io-size 4K --io-total 5G image1
bench type write io_size 4096 io_threads 16 bytes 5368709120 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1    128288    128818   503 MiB/s
    2    267456    134003   523 MiB/s
    3    408640    136400   533 MiB/s
    4    550352    137729   538 MiB/s
    5    692208    138555   541 MiB/s
    6    833984    141138   551 MiB/s
    7    975232    141554   553 MiB/s
    8   1117152    141701   554 MiB/s
    9   1258992    141727   554 MiB/s
elapsed: 9  ops: 1310720  ops/sec: 139973  bytes/sec: 547 MiB/s

The results show that the mocked IO benchmark under librbd is >25% slower when using glibc's memory allocator as compared to tcmalloc.

My original goal was not to pressure the glibc team into matching tcmalloc's performance, though. I really just wanted the fast/easy path of having RHEL 8's QEMU linked against tcmalloc again like it was under RHEL 7 (since that linkage was removed because glibc would match tcmalloc performance and/or due to BaseOS vs AppStream repos).
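A side note on the thread-local cache mentioned above: glibc's malloc exposes its tunables through the GLIBC_TUNABLES environment variable, so the same rbd bench run can be repeated with, say, a deeper per-thread cache or a capped arena count to see whether either moves the numbers. This is only a sketch of an experiment, not something that was tried in this bug, and the values below are arbitrary starting points rather than recommendations:

# Baseline: default glibc malloc
$ rbd bench --io-type write --io-pattern rand --io-size 4K --io-total 5G image1

# Same run with a deeper per-thread cache (the default tcache_count is 7)
# and the number of malloc arenas capped; the tunable names are real glibc
# malloc tunables, the values are guesses.
$ GLIBC_TUNABLES=glibc.malloc.tcache_count=32:glibc.malloc.arena_max=8 \
    rbd bench --io-type write --io-pattern rand --io-size 4K --io-total 5G image1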
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

(In reply to Jason Dillaman from comment #15)
> With the dummy librbd / rbd CLI tool, you can see the effect of the glibc
> memory allocator:
>
> [rbd bench output quoted in full; see comment #15 above]
>
> The results show that the mocked IO benchmark under librbd is >25% slower
> when using glibc's memory allocator as compared to tcmalloc.
>
> My original goal was not to pressure the glibc team into matching tcmalloc's
> performance, though. I really just wanted the fast/easy path of having RHEL
> 8's QEMU linked against tcmalloc again like it was under RHEL 7 (since that
> linkage was removed because glibc would match tcmalloc performance and/or
> due to BaseOS vs AppStream repos).

This is going to be a longer term project to review performance again, but having examples that are problematic is important for the glibc team. This issue is auto-closed because we aren't going to get this fixed in, say, RHEL 8.5/8.6. I'll discuss this with the glibc team and we'll see what we can do.

I'm marking this bug CLOSED/UPSTREAM and we'll track this upstream with this bug: https://sourceware.org/bugzilla/show_bug.cgi?id=28050

This means that if we make progress upstream (independent project, or other partners work on it), we can come back here and review the bug for inclusion in RHEL.