Bug 1275472

Summary: Glibc's malloc is slower for virtio-blk than tcmalloc
Product: Fedora
Component: glibc
Version: 27
Hardware: Unspecified
OS: Unspecified
Severity: unspecified
Priority: unspecified
Status: CLOSED NEXTRELEASE
Reporter: Fam Zheng <famz>
Assignee: DJ Delorie <dj>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: amit, amit.shah, arjun.is, armbru, codonell, dj, eblake, famz, fweimer, jakub, jen, jkurik, law, marcandre.lureau, mfabian, mjw, pbonzini, pfrankli, siddhesh, tgummels, virt-maint, woodard, yama
Type: Bug
Doc Type: Bug Fix
Bug Blocks: 975551
Last Closed: 2017-10-11 12:18:04 UTC
Attachments:
  perf.conf.new
  regression.new.py
  test malloc - per-thread cache
  test malloc - per-thread cache

Description Fam Zheng 2015-10-27 01:56:46 UTC
Using tcmalloc, QEMU sees higher performance when doing I/O on a virtio-blk device. See the steps and numbers below.

host versions:

glibc-2.21-5.fc22.x86_64
kernel-4.2.3-200.fc22.x86_64
gperftools-libs-2.4-1.fc22.x86_64

QEMU is locally compiled from qemu-kvm-rhev-2.3.0-31.el7 src with below configure options:

./configure --enable-trace-backend=nop --enable-debug --target-list=x86_64-softmmu --extra-ldflags=-lrt --prefix=/home/fam/build/install --disable-gtk --extra-cflags=-Wno-error=deprecated-declarations

guest versions:

kernel-4.0.4-301.fc22.x86_64
fio-2.2.4-1.fc22.x86_64

How reproducible: can reproduce reliably.

Steps to Reproduce:
1. Start QEMU, booting a Fedora 22 guest with a ramdisk (/dev/ram0) attached as a virtio-blk-pci device:

LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 \
qemu-system-x86_64  \
  -enable-kvm  \
  -name EU4OKS45  \
  -pidfile /tmp/qsh/EU4OKS45/pid  \
  -qmp unix:/tmp/qsh/EU4OKS45/qmp.sock,server,nowait  \
  -m 1024  \
  -vnc :0  \
  -device virtio-scsi-pci,id=virtio-scsi-bus-0  \
  -drive file=/home/fam/work/qsh/guest.qcow2,id=system-disk-drive,if=none,cache=writeback  \
  -device ide-drive,drive=system-disk-drive,id=system-disk,bootindex=1  \
  -sdl  \
  -serial file:/tmp/qsh/EU4OKS45/serial.out  \
  -netdev user,id=virtio-nat-0,hostfwd=:0.0.0.0:10022-:22  \
  -device virtio-net-pci,id=virtio-net-pci-virtio-nat-0,netdev=virtio-nat-0  \
  -drive file=/dev/ram0,id=virtio-blk-disk-0,if=none,cache=none,aio=native  \
  -device virtio-blk-pci,drive=virtio-blk-disk-0,id=virtio-blk-0,serial=virtio-blk-device-0,ioeventfd=on

2. Run fio benchmark (4k seq read with iodepth=8 and 16 concurrent jobs) against /dev/vda in guest:

fio --rw=read --bs=4k --iodepth=8 --runtime=30 --filename=/dev/vda --numjobs=16 --direct=1 --group_reporting --thread --name=fio-test-job --ioengine=libaio --time_based --size=1G

3. Shut down the VM, restart QEMU without the LD_PRELOAD= prefix, and rerun the same benchmark in the guest.

Actual results:

Using tcmalloc yields ~15% higher performance than glibc:

case            bw (MB/s)  iops  
---------------------------------
tcmalloc        414        106068
glibc           354        90676

Comment 1 Fam Zheng 2015-10-27 07:49:37 UTC
The ramdisk is initialized with:
  modprobe brd rd_nr=1 rd_size=1024000

The host machine is my working laptop (Lenovo T430s) with Fedora 22 on it.

$cat /proc/cpuinfo

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
stepping        : 9
microcode       : 0x1b
cpu MHz         : 1292.652
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs            :
bogomips        : 5786.95
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
stepping        : 9
microcode       : 0x1b
cpu MHz         : 1260.027
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs            :
bogomips        : 5786.95
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
stepping        : 9
microcode       : 0x1b
cpu MHz         : 1261.160
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs            :
bogomips        : 5786.95
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
stepping        : 9
microcode       : 0x1b
cpu MHz         : 1208.371
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs            :
bogomips        : 5786.95
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

$ cat /proc/meminfo 
MemTotal:        7869740 kB
MemFree:         5790784 kB
MemAvailable:    6754028 kB
Buffers:           74284 kB
Cached:           976952 kB
SwapCached:            0 kB
Active:          1283392 kB
Inactive:         566244 kB
Active(anon):     804632 kB
Inactive(anon):   112976 kB
Active(file):     478760 kB
Inactive(file):   453268 kB
Unevictable:          16 kB
Mlocked:              16 kB
SwapTotal:      17272828 kB
SwapFree:       17272828 kB
Dirty:               120 kB
Writeback:             0 kB
AnonPages:        798580 kB
Mapped:           434236 kB
Shmem:            119224 kB
Slab:             117352 kB
SReclaimable:      72964 kB
SUnreclaim:        44388 kB
KernelStack:        5904 kB
PageTables:        25308 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    21207696 kB
Committed_AS:    2958772 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      381660 kB
VmallocChunk:   34358947836 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      150720 kB
DirectMap2M:     7927808 kB

Comment 2 Florian Weimer 2015-10-27 10:37:21 UTC
Thanks, this is very useful information.

I tried to reproduce your findings with stock qemu-kvm-2.3.1-6.fc22.x86_64 from Fedora 22.  Is this a valid test?

Can you provide a qemu invocation which can run in single user mode (e.g., instructions to set up a serial console to the VM, or networking)?  I currently have to run the reproducer under X, and this might contribute to the relatively high variance I see.

I extracted the performance numbers from the / read : / line in the fio output from 20 runs each (within the same VM, after one warm-up run), with glibc malloc and tcmalloc, using:

awk -F '[=, KB/s]+' '/ read : /{print $7}' # bw
awk -F '[=, KB/s]+' '/ read : /{print $9}' # iops

> tcmalloc = read.table("tcmalloc.bw")
> glibc = read.table("glibc.bw")
> t.test(tcmalloc, glibc)

	Welch Two Sample t-test

data:  tcmalloc and glibc
t = -2.5096, df = 31.914, p-value = 0.01736
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -38050.417  -3953.683
sample estimates:
mean of x mean of y 
 654117.1  675119.2 

> tcmalloc = read.table("tcmalloc.iops")
> glibc = read.table("glibc.iops")
> t.test(tcmalloc, glibc)

	Welch Two Sample t-test

data:  tcmalloc and glibc
t = -2.5096, df = 31.914, p-value = 0.01736
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9512.5416  -988.3584
sample estimates:
mean of x mean of y 
 163528.8  168779.2 

I think this shows that glibc malloc is actually faster than tcmalloc.

Comment 3 Florian Weimer 2015-10-27 11:49:07 UTC
Comparison between glibc malloc and jemalloc follows.

> glibc = read.table("glibc.bw")
> jemalloc = read.table("jemalloc.bw")
> t.test(glibc, jemalloc)

	Welch Two Sample t-test

data:  glibc and jemalloc
t = 37.816, df = 23.405, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 167845.0 187251.5
sample estimates:
mean of x mean of y 
 675119.2  497570.9 

> glibc = read.table("glibc.iops")
> jemalloc = read.table("jemalloc.iops")
> t.test(glibc, jemalloc)

	Welch Two Sample t-test

data:  glibc and jemalloc
t = 37.816, df = 23.406, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 41961.15 46812.75
sample estimates:
mean of x mean of y 
 168779.2  124392.3

Comment 4 Florian Weimer 2015-10-27 13:50:04 UTC
With 2.3.0-31.el7, I can reproduce.

> glibc = read.table("q2-glibc.bw")
> tcmalloc = read.table("q2-tcmalloc.bw")
> t.test(glibc, tcmalloc)

	Welch Two Sample t-test

data:  glibc and tcmalloc
t = -8.789, df = 29.725, p-value = 9.152e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -59104.08 -36808.52
sample estimates:
mean of x mean of y 
 382007.7  429964.0 

> glibc = read.table("q2-glibc.iops")
> tcmalloc = read.table("q2-tcmalloc.iops")
> t.test(glibc, tcmalloc)

	Welch Two Sample t-test

data:  glibc and tcmalloc
t = -8.789, df = 29.726, p-value = 9.15e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -14776.096  -9202.204
sample estimates:
mean of x mean of y 
  95501.4  107490.6 

jemalloc is even slower on this test:

> glibc = read.table("q2-glibc.bw")
> jemalloc = read.table("q2-jemalloc.bw")
> t.test(glibc, jemalloc)

	Welch Two Sample t-test

data:  glibc and jemalloc
t = 8.0144, df = 34.077, p-value = 2.393e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 34861.18 58544.42
sample estimates:
mean of x mean of y 
 382007.7  335304.8 

> glibc = read.table("q2-glibc.iops")
> jemalloc = read.table("q2-jemalloc.iops")
> t.test(glibc, jemalloc)

	Welch Two Sample t-test

data:  glibc and jemalloc
t = 8.0144, df = 34.077, p-value = 2.393e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  8715.295 14636.105
sample estimates:
mean of x mean of y 
  95501.4   83825.7 


The next step is to measure without --enable-debug; apparently, that option disables optimization and source fortification.

Comment 5 Florian Weimer 2015-10-27 14:55:01 UTC
Now without --enable-debug.  The difference between glibc malloc and tcmalloc is no longer statistically significant.

> glibc = read.table("q2O-glibc.bw")
> tcmalloc = read.table("q2O-tcmalloc.bw")
> t.test(glibc, tcmalloc)

	Welch Two Sample t-test

data:  glibc and tcmalloc
t = 1.6035, df = 37.897, p-value = 0.1171
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4242.287 36550.887
sample estimates:
mean of x mean of y 
 377884.9  361730.6 

> glibc = read.table("q2O-glibc.iops")
> tcmalloc = read.table("q2O-tcmalloc.iops")
> t.test(glibc, tcmalloc)

	Welch Two Sample t-test

data:  glibc and tcmalloc
t = 1.6036, df = 37.898, p-value = 0.1171
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1060.345  9137.945
sample estimates:
mean of x mean of y 
 94470.85  90432.05 

It seems that the tcmalloc bandwidth distribution is just broader (results are less predictable):

> summary(glibc)
       V1        
 Min.   :330813  
 1st Qu.:352874  
 Median :371779  
 Mean   :377885  
 3rd Qu.:392586  
 Max.   :437157  
> summary(tcmalloc)
       V1        
 Min.   :325622  
 1st Qu.:339592  
 Median :349546  
 Mean   :361731  
 3rd Qu.:373303  
 Max.   :443457  
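
One way to quantify that spread rather than eyeballing the quartiles (a sketch, assuming the q2O-*.bw tables are loaded as glibc and tcmalloc as above; note that var.test's F test itself assumes roughly normal data):

> sd(glibc$V1); sd(tcmalloc$V1)      # sample standard deviations
> IQR(glibc$V1); IQR(tcmalloc$V1)    # interquartile ranges
> var.test(glibc$V1, tcmalloc$V1)    # F test for equality of variances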

But the jemalloc results are now much better than both tcmalloc and glibc malloc.

> glibc = read.table("q2O-glibc.bw")
> jemalloc = read.table("q2O-jemalloc.bw")
> t.test(glibc, jemalloc)

	Welch Two Sample t-test

data:  glibc and jemalloc
t = -15.873, df = 37.386, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -174130.9 -134720.4
sample estimates:
mean of x mean of y 
 377884.9  532310.6 

> glibc = read.table("q2O-glibc.iops")
> jemalloc = read.table("q2O-jemalloc.iops")
> t.test(glibc, jemalloc)

	Welch Two Sample t-test

data:  glibc and jemalloc
t = -15.873, df = 37.386, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -43532.49 -33679.91
sample estimates:
mean of x mean of y 
 94470.85 133077.05 

I need to double-check this because this result is suspicious.

Comment 6 Carlos O'Donell 2015-10-28 03:13:59 UTC
I did a double check of the results just for peer review.

The only high-level comment I have is that we should collect more samples for testing.

(In reply to Florian Weimer from comment #2)
> I think this shows that glibc malloc is actually faster than tcmalloc.

Agreed. It does for that configuration.

(In reply to Florian Weimer from comment #3)
> Comparison between glibc malloc and jemalloc follows.

Agreed, it shows a statistically significant difference, namely that glibc malloc is faster than jemalloc in that configuration.

(In reply to Florian Weimer from comment #4)
> With 2.3.0-31.el7, I can reproduce.

Agreed. Looks like performance ranking (best to worst) is: tcmalloc, glibc, jemalloc. Something is certainly odd there. Analysis required.

(In reply to Florian Weimer from comment #5)
> Now without --enable-debug.  The difference between glibc malloc and
> tcmalloc is no longer statistically significant.

Agreed (p-value > 0.05).
 
> It seems that the tcmalloc bandwidth distribution is just broader (results
> are less predictable):

Agreed.

> But the jemalloc results are now much better than both tcmalloc and glibc malloc.

Agreed. If we can find out why, we might be able to copy the technique.
 
> I need to double-check this because this result is suspicious.

Agreed.

Comment 7 Fam Zheng 2015-10-28 03:21:06 UTC
(In reply to Florian Weimer from comment #2) 
> Can you provide a qemu invocation which can run in single user mode (e.g.,
> instructions to set up a serial console to the VM, or networking)?  I
> currently have to run the reproducer under X, and this might contribute to
> the relatively high variance I see.

My guest doesn't have X, it is a minimal F22 installation.

My command line has "-vnc :0" and "-sdl" so you can access the vm tty from either vncviewer or the SDL window that is created. Also there is "-netdev user,id=virtio-nat-0,hostfwd=:0.0.0.0:10022-:22" option that forwards host port 10022 to guest port 22, so I can ssh to the vm from host with "ssh -p 10022 root@localhost".

If you want serial console, add "console=tty0 console=ttyS0" to guest kernel boot options and add "-serial stdio" to the command line.

Fam

Comment 8 Fam Zheng 2015-10-28 03:33:09 UTC
I did another round of testing, comparing bandwidth (MB/s).

tcmalloc vs. glibc

qemu-kvm-rhev-2.3.0-31.el7 without --enable-debug:
760 vs 638

qemu-kvm-2.3.1-6.fc22.x86_64 from Fedora 22 repo:
725 vs 694

qemu.git without --enable-debug:
418 vs 376

The qemu.git absolute numbers are very suspicious and I haven't looked into them yet, but the performance advantage of tcmalloc is consistent across all four pairs.

Comment 9 Carlos O'Donell 2015-10-28 03:36:05 UTC
(In reply to Fam Zheng from comment #8)
> I did another round of testing, comparing bandwidth (MB/s).
> 
> tcmalloc vs. glibc
> 
> qemu-kvm-rhev-2.3.0-31.el7 without --enable-debug:
> 760 vs 638
> 
> qemu-kvm-2.3.1-6.fc22.x86_64 from Fedora 22 repo:
> 725 vs 694
> 
> qemu.git without --enable-debug:
> 418 vs 376
> 
> The qemu.git absolute numbers are very suspicious and I haven't looked into
> them yet, but the performance advantage of tcmalloc is consistent across all
> four pairs.

In order to make these differences statistically significant you need to run them multiple times, and then carry out something like the Welch's t-test (non-paired) like Florian used. How many times did you run the test? Are your reported values the mean results?

Comment 10 Fam Zheng 2015-10-28 03:55:52 UTC
(In reply to Carlos O'Donell from comment #9)
> In order to make these differences statistically significant you need to run
> them multiple times, and then carry out something like the Welch's t-test
> (non-paired) like Florian used. How many times did you run the test? Are
> your reported values the mean results?

I didn't do t-test, but each value is the mean of 16 repetitions. I'm just trying to reproduce the formal benchmarking done in BZ1213882#c5 and this configuration is where glibc was seen slower, in the case "[11] single disk + virtio_blk", where Student's t-test was actually carried out.

Comment 11 Carlos O'Donell 2015-10-28 03:58:40 UTC
(In reply to Carlos O'Donell from comment #9)
> (In reply to Fam Zheng from comment #8)
> > I did another round of testing, comparing bandwidth (MB/s).
> > 
> > tcmalloc vs. glibc
> > 
> > qemu-kvm-rhev-2.3.0-31.el7 without --enable-debug:
> > 760 vs 638
> > 
> > qemu-kvm-2.3.1-6.fc22.x86_64 from Fedora 22 repo:
> > 725 vs 694
> > 
> > qemu.git without --enable-debug:
> > 418 vs 376
> > 
> > The qemu.git absolute numbers are very suspicious and I haven't looked into
> > them yet, but the performance advantage of tcmalloc is consistent across all
> > four pairs.
> 
> In order to make these differences statistically significant you need to run
> them multiple times, and then carry out something like the Welch's t-test
> (non-paired) like Florian used. How many times did you run the test? Are
> your reported values the mean results?

As an example:

sudo yum install R
# put your results in text files, one value per line:
# one set for glibc, e.g. glibc.iops, glibc.bw
# and one for tcmalloc, e.g. tcmalloc.iops, tcmalloc.bw
# start R
R
glibc = read.table("glibc.iops")
tcmalloc = read.table("tcmalloc.iops")
t.test(glibc, tcmalloc)

If the p-value is greater than 0.05 then there is no statistically significant difference between the means for those values; that is to say, the iops achieved under glibc and tcmalloc are the same within the noise (roughly).

With small p-values like 2.2e-16, there is a significant difference between the means of populations (iops or bw) and that difference needs to be understood by the glibc team in order to implement a solution.

In truth we should do a power calculation based on our estimate of the differences we're trying to detect and that will tell us roughly how many test runs we need to do to detect such a difference.

Secondly, if the samples are not normal, then we will again likely need more samples for a given effect size to determine whether there is a real difference. Theory says they will be normal, but the rule of thumb is that we need 20-30 runs minimum.
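
As a sketch of both points in R (the delta and sd below are placeholders rather than measured values, and shapiro.test is just one quick normality check):

> # how many runs per configuration to detect a ~10 MB/s difference, assuming sd ~ 8 MB/s
> power.t.test(delta = 10, sd = 8, sig.level = 0.05, power = 0.8)
> # rough normality check on one set of per-run bandwidth samples
> glibc = read.table("glibc.bw")
> shapiro.test(glibc$V1)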

Comment 12 Carlos O'Donell 2015-10-28 04:00:57 UTC
(In reply to Fam Zheng from comment #10)
> (In reply to Carlos O'Donell from comment #9)
> > In order to make these differences statistically significant you need to run
> > them multiple times, and then carry out something like the Welch's t-test
> > (non-paired) like Florian used. How many times did you run the test? Are
> > your reported values the mean results?
> 
> I didn't do t-test, but each value is the mean of 16 repetitions. I'm just
> trying to reproduce the formal benchmarking done in BZ1213882#c5 and this
> configuration is where glibc was seen slower, in the case "[11] single disk
> + virtio_blk", where Student's t-test was actually carried out.

Please have a look at R's t.test, which performs Welch's t-test by default; Welch's is basically always better than Student's t-test for this kind of data.

If with Welch's t-test you can show a difference, then that's good, and we can look into that.
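
For reference, a minimal sketch of the difference, reusing the files from the earlier example (t.test defaults to Welch; var.equal = TRUE gives Student's):

> glibc = read.table("glibc.iops")
> tcmalloc = read.table("tcmalloc.iops")
> t.test(glibc$V1, tcmalloc$V1)                    # Welch's t-test (unequal variances, the default)
> t.test(glibc$V1, tcmalloc$V1, var.equal = TRUE)  # Student's t-test (assumes equal variances)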

I assume your testing is on your own box? The i7-3520M/8GB RAM?

Comment 13 Carlos O'Donell 2015-10-28 04:13:59 UTC
(In reply to Fam Zheng from comment #10)
> (In reply to Carlos O'Donell from comment #9)
> > In order to make these differences statistically significant you need to run
> > them multiple times, and then carry out something like the Welch's t-test
> > (non-paired) like Florian used. How many times did you run the test? Are
> > your reported values the mean results?
> 
> I didn't do t-test, but each value is the mean of 16 repetitions. I'm just
> trying to reproduce the formal benchmarking done in BZ1213882#c5 and this
> configuration is where glibc was seen slower, in the case "[11] single disk
> + virtio_blk", where Student's t-test was actually carried out.

For reference:
http://kvm-perf.englab.nay.redhat.com/results/regression/2015-w32/ramdisk/fio_raw_virtio_blk.html

The tcmalloc gains were made in fio read, at about 4%, which is what we're looking to reproduce here.

It looks like tcmalloc also had a 1-11% regression in fio randrw tests?

Is the "read" test more important than "randrw" (random read/write)?

Comment 14 Carlos O'Donell 2015-10-28 04:17:14 UTC
(In reply to Carlos O'Donell from comment #13)
> (In reply to Fam Zheng from comment #10)
> > (In reply to Carlos O'Donell from comment #9)
> > > In order to make these differences statistically significant you need to run
> > > them multiple times, and then carry out something like the Welch's t-test
> > > (non-paired) like Florian used. How many times did you run the test? Are
> > > your reported values the mean results?
> > 
> > I didn't do t-test, but each value is the mean of 16 repetitions. I'm just
> > trying to reproduce the formal benchmarking done in BZ1213882#c5 and this
> > configuration is where glibc was seen slower, in the case "[11] single disk
> > + virtio_blk", where Student's t-test was actually carried out.
> 
> For reference:
> http://kvm-perf.englab.nay.redhat.com/results/regression/2015-w32/ramdisk/
> fio_raw_virtio_blk.html
> 
> The tcmalloc gains were made in fio read, at about 4%, which is what we're
> looking to reproduce here.
> 
> It looks like tcmalloc also had a 1-11% regression in fio randrw tests?
> 
> Is the "read" test more important than "randrw" (random read/write)?

Is "[2] raw+ virtio_blk" also another feasible test to run?

It had a ~9% gain in random write testing, which is different from test "[11] single disk + virtio_blk".

Comment 15 Carlos O'Donell 2015-10-28 05:02:47 UTC
Note that we might want to use a Wilcoxon rank-sum test under the assumption that the means are not normal.

Also note that the sample size of means for the official virt tests is only 4. Despite the test running for 60 seconds, it still yields only 4 mean values for comparison. While that doesn't mean anything per se, if normality is violated I would expect you to need many more than 4 samples for Student's or Welch's t-test to reject the null hypothesis (as power is reduced).
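
A minimal sketch of that test in R, assuming per-run bandwidth values collected into files as before:

> glibc = read.table("glibc.bw")
> tcmalloc = read.table("tcmalloc.bw")
> wilcox.test(glibc$V1, tcmalloc$V1)   # Wilcoxon rank-sum (Mann-Whitney), no normality assumption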

Comment 16 Florian Weimer 2015-10-28 08:40:30 UTC
(In reply to Fam Zheng from comment #7)
> (In reply to Florian Weimer from comment #2) 
> > Can you provide a qemu invocation which can run in single user mode (e.g.,
> > instructions to set up a serial console to the VM, or networking)?  I
> > currently have to run the reproducer under X, and this might contribute to
> > the relatively high variance I see.
> 
> My guest doesn't have X, it is a minimal F22 installation.

I meant the host.  I want to run without the full desktop environment, in an attempt to bring down the variance between the test runs.

> My command line has "-vnc :0" and "-sdl" so you can access the vm tty from
> either vncviewer or the SDL window that is created. Also there is "-netdev
> user,id=virtio-nat-0,hostfwd=:0.0.0.0:10022-:22" option that forwards host
> port 10022 to guest port 22, so I can ssh to the vm from host with "ssh -p
> 10022 root@localhost".
> 
> If you want serial console, add "console=tty0 console=ttyS0" to guest kernel
> boot options and add "-serial stdio" to the command line.

Thanks, I will try that.

Comment 17 Fam Zheng 2015-10-28 10:28:15 UTC
On a virtlab server that has no X, I reran the tests 16 times and ran R's t.test:

$ cat read.glibc.out 
409
420
428
420
421
410
413
404
427
425
429
430
431
417
430
404
$ cat read.tcmalloc.out
405
433
429
435
435
431
433
430
424
424
427
423
427
428
432
444

> glibc = read.table("read.glibc.out")
> tcmalloc = read.table("read.tcmalloc.out")
> t.test(glibc, tcmalloc)

        Welch Two Sample t-test

data:  glibc and tcmalloc
t = -2.8394, df = 29.456, p-value = 0.008111
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -15.263413  -2.486587
sample estimates:
mean of x mean of y 
  419.875   428.750

Comment 19 Fam Zheng 2015-10-28 10:41:43 UTC
(In reply to Carlos O'Donell from comment #13)
> It looks like tcmalloc also had a 1-11% regression in fio randrw tests?

Couldn't reproduce this, as I don't see a significant difference in this test.

> 
> Is the "read" test more important than "randrw" (random read/write)?

I cannot say that, they're just different workloads. The actual importance depends on what users do with the system. But at any rate pure sequential I/O is far less common than mixed or random workload.

Comment 22 Carlos O'Donell 2015-10-30 16:32:11 UTC
(In reply to Fam Zheng from comment #19)
> (In reply to Carlos O'Donell from comment #13)
> > It looks like tcmalloc also had a 1-11% regression in fio randrw tests?
> 
> Couldn't reproduce this, as I don't see a significant difference in this
> test.

OK. The original test showed a significant difference.

Can I get access to the code that generates those tables please?

> > Is the "read" test more important than "randrw" (random read/write)?
> 
> I cannot say that, they're just different workloads. The actual importance
> depends on what users do with the system. But at any rate pure sequential
> I/O is far less common than mixed or random workload.

So the "read" improvement of 4% would likely not be "worth" as much as the loss of 11% in "randrw" (random read write)?

Comment 23 Fam Zheng 2015-11-02 02:16:33 UTC
(In reply to Carlos O'Donell from comment #22)
> (In reply to Fam Zheng from comment #19)
> > (In reply to Carlos O'Donell from comment #13)
> > > It looks like tcmalloc also had a 1-11% regression in fio randrw tests?
> > 
> > Couldn't reproduce this, as I don't see a significant difference in this
> > test.
> 
> OK. The original test showed a significant difference.
> 
> Can I get access to the code that generates those tables please?

The tests belong to QE.

Yanhui?

Comment 24 Yanhui Ma 2015-11-02 07:10:51 UTC
Created attachment 1088482 [details]
perf.conf.new

Comment 25 Yanhui Ma 2015-11-02 07:11:46 UTC
Created attachment 1088483 [details]
regression.new.py

Comment 26 Yanhui Ma 2015-11-02 07:13:34 UTC
(In reply to Carlos O'Donell from comment #22)
> (In reply to Fam Zheng from comment #19)
> > (In reply to Carlos O'Donell from comment #13)
> > > It looks like tcmalloc also had a 1-11% regression in fio randrw tests?
> > 
> > Couldn't reproduce this, as I don't see a significant difference in this
> > test.
> 
> OK. The original test showed a significant difference.
> 
> Can I get access to the code that generates those tables please?

Please see the attachments (perf.conf.new, regression.new.py)

> 
> > > Is the "read" test more important than "randrw" (random read/write)?
> > 
> > I cannot say that, they're just different workloads. The actual importance
> > depends on what users do with the system. But at any rate pure sequential
> > I/O is far less common than mixed or random workload.
> 
> So the "read" improvement of 4% would likely not be "worth" as much as the
> loss of 11% in "randrw" (random read write)?

Comment 27 Paolo Bonzini 2015-11-02 20:57:40 UTC
It's worth testing with G_SLICE=always-malloc in the environment.  This will match the original experiments more closely, and it will also match QEMU 2.5 which removes the g_slice_* allocator in favor of regular malloc.

Comment 28 Carlos O'Donell 2015-11-02 22:08:21 UTC
(In reply to Yanhui Ma from comment #26)
> > Can I get access to the code that generates those tables please?
> 
> Please see the attachments (perf.conf.new, regression.new.py)

Thanks! I see you're using scipy.stats.ttest_*, which helps us make sure we are also computing similar values when we look at the final performance numbers.

Comment 29 Paolo Bonzini 2015-12-22 13:06:54 UTC
Running the latest QEMU, I see (with perf) a lot of L1-dcache-load-misses in malloc that go away with tcmalloc.

Comment 30 Carlos O'Donell 2015-12-22 17:41:30 UTC
(In reply to Paolo Bonzini from comment #29)
> Running the latest QEMU, I see (with perf) a lot of L1-dcache-load-misses in
> malloc that go away with tcmalloc.

We've started to make progress on this issue.

DJ Delorie from the tools team is working on glibc's malloc and has added an experimental hybrid cache to it.

As in tcmalloc and jemalloc, DJ has added a per-thread cache (making it a hybrid of per-thread/per-cpu caching), which can fetch from a local pool without any locking and thereby reduce the number of cycles required to get a block of memory. Refilling the pool adds latency, since you have to go back to the per-cpu cache to get memory, and eventually all the way back to the OS (mmap) if the pressure is high enough.

To reiterate, we are making progress here and the numbers are quite good (200% speedup in some <1024 byte allocations) so far in our testing of effectively the same approach as tcmalloc and jemalloc.

Any win for glibc's malloc is a win for the entire system.

Comment 31 DJ Delorie 2016-01-06 23:57:46 UTC
Created attachment 1112295 [details]
test malloc - per-thread cache

Here is a version of glibc's malloc to test, which has a new per-thread cache.  It should work on Fedora 20+ or RHEL 7+.  Use LD_PRELOAD=djmalloc.so as usual to test it.

Note that since this version is split out from glibc's so, there might be some features that require integration with glibc to work correctly (i.e. there might be memory leaks due to thread exits not being cross-registered with the pthreads library).  I'm providing this solely for testing performance :-)

The primary boost in this new version is that when small (<1024 byte) allocations happen more than once, a shorter path can be taken, which is significantly faster thanks to a small per-thread cache.

Comment 32 Ben Woodard 2016-01-13 23:55:46 UTC
I am asking LLNL to do some testing to verify that the work you did adding a per-thread cache helps address their most pressing performance issue. Other issues that they point out, which also affect them, are:
1) Problems with growth in the virtual memory address space allocated to a process. Because they run diskless, any dirty memory becomes resident in RAM and can't get paged out, so reclaiming arenas and mmapped regions rather than abandoning them becomes more important.
2) They also seem to have problems with glibc's malloc fragmenting memory. Since they have been taught to do mallocs in the context of the thread that they plan to use the memory from, DJ's work may already address this.

Comment 33 DJ Delorie 2016-02-05 19:27:21 UTC
Created attachment 1121501 [details]
test malloc - per-thread cache

Fixes a bug in the previous version

Comment 34 Jan Kurik 2016-02-24 13:52:38 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle.
Changing version to '24'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase

Comment 35 Jan Kurik 2017-08-15 09:48:20 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle.
Changing version to '27'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Releases/27/HouseKeeping#Rawhide_Rebase

Comment 36 DJ Delorie 2017-08-30 21:30:59 UTC
The per-thread cache was released in glibc 2.26 and is available in rawhide.  Could you please repeat your original tests to see if the performance difference is still significant?

Comment 37 Fam Zheng 2017-09-04 14:11:58 UTC
I don't see a significant difference on rawhide now:

Numbers in kIOPS:

tcmalloc 204
jemalloc 217
glibc    211