Bug 1717414
| Summary: | glibc: glibc malloc vs tcmalloc performance on fio using Ceph RBD | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jason Dillaman <jdillama> |
| Component: | glibc | Assignee: | glibc team <glibc-bugzilla> |
| Status: | CLOSED UPSTREAM | QA Contact: | qe-baseos-tools-bugs |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.2 | CC: | ashankar, chayang, codonell, coli, commodorekappa+redhat, dj, fweimer, jinzhao, juzhang, mnewsome, pbonzini, pfrankli, qzhang, rbalakri, sipoyare, virt-maint, yama |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-01 07:31:30 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1496871 | | |
| Bug Blocks: | | | |
Description
Jason Dillaman
2019-06-05 12:43:22 UTC

Description of problem:
RHEL 8 removed the use of tcmalloc within QEMU under the assumption that the new glibc thread cache allocator improvements would eliminate the need for tcmalloc. Performance testing under high IOPS workloads shows that this is not the case.

Version-Release number of selected component (if applicable):
RHEL 8.0

How reproducible:
100%

Steps to Reproduce:
1. Use "fio" within a VM to benchmark a fast Ceph RBD virtual disk
2. Re-run after starting QEMU under "LD_PRELOAD=/usr/lib64/libtcmalloc.so"

Actual results:
RHEL 8.0 QEMU is around 20% slower when not using tcmalloc as compared to an instance of QEMU that is using tcmalloc.
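Step 2 of the reproduction boils down to launching the guest's QEMU process with tcmalloc preloaded. A minimal sketch of such an invocation is below; the machine settings, disk paths, and the Ceph pool/image/client names are all placeholders rather than details from this bug, and a libvirt-managed guest would need the preload injected differently since libvirt builds the QEMU command line itself:

# Hypothetical direct QEMU invocation with the guest's data disk on RBD,
# run once as-is (glibc malloc) and once with tcmalloc preloaded.
$ /usr/libexec/qemu-kvm \
    -machine q35,accel=kvm -m 4096 -smp 4 \
    -drive file=/var/lib/images/guest.qcow2,format=qcow2,if=virtio \
    -drive file=rbd:rbd/image1:id=admin,format=raw,if=virtio,cache=none

$ LD_PRELOAD=/usr/lib64/libtcmalloc.so /usr/libexec/qemu-kvm \
    -machine q35,accel=kvm -m 4096 -smp 4 \
    -drive file=/var/lib/images/guest.qcow2,format=qcow2,if=virtio \
    -drive file=rbd:rbd/image1:id=admin,format=raw,if=virtio,cache=none

Inside the guest, fio is then run against the virtio disk backed by the RBD image, as described in the steps above.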
(In reply to Jason Dillaman from comment #0)
> Description of problem:
> RHEL 8 removed the use of tcmalloc within QEMU under the assumption that the
> new glibc thread cache allocator improvements would eliminate the need for
> tcmalloc. Performance testing under high IOPS workloads shows that this is
> not the case.
>
> Version-Release number of selected component (if applicable):
> RHEL 8.0
>
> How reproducible:
> 100%
>
> Steps to Reproduce:
> 1. Use "fio" within a VM to benchmark a fast Ceph RBD virtual disk
> 2. Re-run after starting QEMU under "LD_PRELOAD=/usr/lib64/libtcmalloc.so"
>
> Actual results:
> RHEL 8.0 QEMU is around 20% slower when not using tcmalloc as compared to an
> instance of QEMU that is using tcmalloc.

Paolo pushed for this change via bug 1496871, so reassigning to him.

Is it 20% slower with non-RBD disks too? Anyway, we need to gather a trace and send it over to the glibc guys.

Also, RBD has support in fio; it would be interesting to see if enabling tcmalloc via LD_LIBRARY_PATH provides a speedup on the host. That would make it much simpler to gather traces for glibc.

(In reply to Paolo Bonzini from comment #3)
> Jason, can you look at comment 2?

I don't know if it's slower w/ non-RBD disks. My focus is RBD land and trying to avoid a major performance regression on RHEL 8.

(In reply to Paolo Bonzini from comment #4)
> Also, RBD has support in fio; it would be interesting to see if enabling
> tcmalloc via LD_LIBRARY_PATH provides a speedup on the host. That would make
> it much simpler to gather traces for glibc.

You mean w/ LD_PRELOAD? If so, the answer is yes, and now (as a result) fio will link w/ tcmalloc automatically if it's available [1].

[1] https://github.com/axboe/fio/commit/01fe773df4bc4a35450ce3ef50c8075b3bf55cd0#diff-e2d5a00791bce9a01f99bc6fd613a39d
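For anyone reproducing the host-side comparison suggested above, a minimal fio job against an RBD image might look like the sketch below. The pool, image, and client names are placeholders (not taken from this bug), and the image must already exist:

# rbd-4k-randwrite.fio -- hypothetical job file; pool/image/client names are placeholders
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=image1
rw=randwrite
bs=4k
iodepth=16
runtime=60
time_based

[rbd-4k-randwrite]

# Run once with glibc malloc and once with tcmalloc preloaded:
$ fio rbd-4k-randwrite.fio
$ LD_PRELOAD=/usr/lib64/libtcmalloc.so fio rbd-4k-randwrite.fio

Note that fio builds containing the commit linked above already link against tcmalloc when it is available at build time, so the LD_PRELOAD comparison is only meaningful against a build configured without it.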
QEMU has been recently split into sub-components and, as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Since the upstream commit for fio (https://github.com/axboe/fio/commit/01fe773df4bc4a35450ce3ef50c8075b3bf55cd0) says that it is reproducible without QEMU, reassigning to glibc.

(In reply to Paolo Bonzini from comment #9)
> Since the upstream commit for fio
> (https://github.com/axboe/fio/commit/01fe773df4bc4a35450ce3ef50c8075b3bf55cd0)
> says that it is reproducible without QEMU, reassigning to glibc.

Could you please provide detailed steps to reproduce this, or if you can, access to a system that reproduces it?

I don't know how to set up a Ceph RBD that is appropriate for a high IOPS workload as described in the original description.

It would be easiest if we had access to systems you consider sufficiently correctly configured.

(In reply to Carlos O'Donell from comment #10)
> Could you please provide detailed steps to reproduce this, or if you can,
> access to a system that reproduces it?
>
> I don't know how to set up a Ceph RBD that is appropriate for a high IOPS
> workload as described in the original description.
>
> It would be easiest if we had access to systems you consider sufficiently
> correctly configured.

I can't provide a system since I really only have my personal development box. I can, however, perhaps just provide a reproducer in the new year (since I won't have time to work on it next week) or maybe a test build of QEMU w/ a dummy RBD driver.

The big change was that QEMU used to link against tcmalloc for this exact reason, but it was dropped in RHEL 8 because glibc stated it improved its allocator to the point where tcmalloc was no longer necessary. However, that does not appear to be the actual case.

(In reply to Jason Dillaman from comment #11)
> I can't provide a system since I really only have my personal development
> box. I can, however, perhaps just provide a reproducer in the new year
> (since I won't have time to work on it next week) or maybe a test build of
> QEMU w/ a dummy RBD driver.
>
> The big change was that QEMU used to link against tcmalloc for this exact
> reason, but it was dropped in RHEL 8 because glibc stated it improved its
> allocator to the point where tcmalloc was no longer necessary. However,
> that does not appear to be the actual case.

The glibc malloc allocator is improved (thread local cache), but different workloads will see different benefits. I look forward to getting a reproducer that we can use to test the performance.

Just for clarity, will this impact current Ceph users?

(In reply to Carlos O'Donell from comment #12)
> Just for clarity, will this impact current Ceph users?

IO is 20% slower (see the problem description) as compared to the same software build running under tcmalloc (via LD_PRELOAD).

With the dummy librbd / rbd CLI tool, you can see the effect of the glibc memory allocator:

# GLIBC MEMORY ALLOCATOR
$ rbd bench --io-type write --io-pattern rand --io-size 4K --io-total 5G image1
bench type write io_size 4096 io_threads 16 bytes 5368709120 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1    101200    101418   396 MiB/s
    2    202960    101589   397 MiB/s
    3    303488    101235   395 MiB/s
    4    404160    101094   395 MiB/s
    5    506112    101265   396 MiB/s
    6    611936    102146   399 MiB/s
    7    718464    103100   403 MiB/s
    8    817680    102838   402 MiB/s
    9    918384    102844   402 MiB/s
   10   1024544    103686   405 MiB/s
   11   1125680    102748   401 MiB/s
   12   1228208    101948   398 MiB/s
elapsed: 12  ops: 1310720  ops/sec: 102359  bytes/sec: 400 MiB/s

# TCMALLOC MEMORY ALLOCATOR
$ LD_PRELOAD=/usr/lib64/libtcmalloc.so rbd bench --io-type write --io-pattern rand --io-size 4K --io-total 5G image1
bench type write io_size 4096 io_threads 16 bytes 5368709120 pattern random
  SEC       OPS   OPS/SEC   BYTES/SEC
    1    128288    128818   503 MiB/s
    2    267456    134003   523 MiB/s
    3    408640    136400   533 MiB/s
    4    550352    137729   538 MiB/s
    5    692208    138555   541 MiB/s
    6    833984    141138   551 MiB/s
    7    975232    141554   553 MiB/s
    8   1117152    141701   554 MiB/s
    9   1258992    141727   554 MiB/s
elapsed: 9  ops: 1310720  ops/sec: 139973  bytes/sec: 547 MiB/s

The results show that the mocked IO benchmark under librbd is >25% slower when using glibc's memory allocator as compared to tcmalloc.

My original goal was not to pressure the glibc team into matching tcmalloc's performance, though. I really just wanted the fast/easy path of having RHEL 8's QEMU linked against tcmalloc again like it was under RHEL 7 (since that linkage was removed because glibc would match tcmalloc performance and/or due to BaseOS vs AppStream repos).
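A side note on the thread-local cache mentioned above: glibc's malloc exposes its tunables through the GLIBC_TUNABLES environment variable, so the same rbd bench run can be repeated with, say, a deeper per-thread cache or a capped arena count to see whether either moves the numbers. This is only a sketch of an experiment, not something that was tried in this bug, and the values below are arbitrary starting points rather than recommendations:

# Baseline: default glibc malloc
$ rbd bench --io-type write --io-pattern rand --io-size 4K --io-total 5G image1

# Same run with a deeper per-thread cache (the default tcache_count is 7)
# and the number of malloc arenas capped; the tunable names are real glibc
# malloc tunables, the values are guesses.
$ GLIBC_TUNABLES=glibc.malloc.tcache_count=32:glibc.malloc.arena_max=8 \
    rbd bench --io-type write --io-pattern rand --io-size 4K --io-total 5G image1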
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

(In reply to Jason Dillaman from comment #15)
> With the dummy librbd / rbd CLI tool, you can see the effect of the glibc
> memory allocator:
>
> [rbd bench output quoted in full; see comment #15 above]
>
> The results show that the mocked IO benchmark under librbd is >25% slower
> when using glibc's memory allocator as compared to tcmalloc.
>
> My original goal was not to pressure the glibc team into matching tcmalloc's
> performance, though. I really just wanted the fast/easy path of having RHEL
> 8's QEMU linked against tcmalloc again like it was under RHEL 7 (since that
> linkage was removed because glibc would match tcmalloc performance and/or
> due to BaseOS vs AppStream repos).

This is going to be a longer term project to review performance again, but having examples that are problematic is important for the glibc team. This issue is auto-closed because we aren't going to get this fixed in, say, RHEL 8.5/8.6. I'll discuss this with the glibc team and we'll see what we can do.

I'm marking this bug CLOSED/UPSTREAM and we'll track this upstream with this bug: https://sourceware.org/bugzilla/show_bug.cgi?id=28050

This means that if we make progress upstream (independent project, or other partners work on it), we can come back here and review the bug for inclusion in RHEL.