Bug 912939 - [QEMU/KVM-RHS] Terrible performance on OS installation on 6x2 volume.
Summary: [QEMU/KVM-RHS] Terrible performance on OS installation on 6x2 volume.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: shishir gowda
QA Contact: Sachidananda Urs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-02-20 03:18 UTC by Gowrishankar Rajaiyan
Modified: 2013-12-09 01:35 UTC (History)
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-04 22:18:27 UTC
Embargoed:


Attachments
sosreport from server1 (3.65 MB, application/octet-stream), 2013-02-20 04:10 UTC, Sachidananda Urs
sosreport from server2 (3.65 MB, application/octet-stream), 2013-02-20 04:34 UTC, Sachidananda Urs
sosreport from server3 (11.75 MB, application/octet-stream), 2013-02-20 05:33 UTC, Sachidananda Urs
sosreport from server3-1 (11.75 MB, application/octet-stream), 2013-02-20 05:43 UTC, Sachidananda Urs
sosreport from server4 (11.86 MB, application/octet-stream), 2013-02-20 05:54 UTC, Sachidananda Urs
top read and write performance results (4.96 KB, application/x-gzip), 2013-02-25 14:18 UTC, Sachidananda Urs
Profile info for the volume (34.68 KB, application/octet-stream), 2013-02-26 06:56 UTC, Sachidananda Urs
Adding xml dump of the vms (1.57 KB, application/x-bzip), 2013-02-28 05:19 UTC, Sachidananda Urs

Description Gowrishankar Rajaiyan 2013-02-20 03:18:19 UTC
Description of problem:
Coming from the dog-food setup: Ben Turner & Jonathan Holloway reported that they are seeing terrible performance on OS installations. OS installations are taking hours. Package installs were taking multiple minutes per package and then would all of a sudden speed up and slow down again.

Though this sounds like a network issue, I am still filing it since all the machines are located in the same lab and such terrible performance should not be seen every time.

Version-Release number of selected component (if applicable):
glusterfs-3.3.0.5rhs-42.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL6.3 virtualization host.
2. Install 4 RHS VMs on the /home partition.
3. Attach disks (lvm+xfs) from the host (DAS) to these RHS VMs.
4. Create a 6x2 dist-repl volume[1] (example commands below, after this list).
5. Mount (vmstore) this volume on the same hypervisor as in step 1.
6. Install a few VMs on vmstore.
7. Install RHEL6.4 virtualization host.
8. Mount volume from step 4.
9. Install VMs using PXE.
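
For reference, steps 4 and 5 correspond roughly to the following sketch (brick hosts and paths are taken from the volume info in [1] below; the /vmstore mount point is only an example):

# create the 6x2 distributed-replicate volume from one of the RHS VMs
gluster volume create vmstore0 replica 2 \
  rhsvm1:/data/store-0 rhsvm2:/data/store-0 \
  rhsvm1:/data/store-1 rhsvm2:/data/store-1 \
  rhsvm1:/data/store-2 rhsvm2:/data/store-2 \
  rhsvm3:/data/store-0 rhsvm4:/data/store-0 \
  rhsvm3:/data/store-1 rhsvm4:/data/store-1 \
  rhsvm3:/data/store-2 rhsvm4:/data/store-2
gluster volume start vmstore0

# mount the volume on the hypervisor
mount -t glusterfs rhsvm1:/vmstore0 /vmstore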
  
Actual results:
OS installations are taking hours.

Expected results:
OS installation time should be acceptable.

Additional info:
[1]
[root@rhsvm1 ~]# gluster vol info
 
Volume Name: vmstore0
Type: Distributed-Replicate
Volume ID: 2a718178-98a6-4cdf-90e5-f97632cc32fc
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: rhsvm1:/data/store-0
Brick2: rhsvm2:/data/store-0
Brick3: rhsvm1:/data/store-1
Brick4: rhsvm2:/data/store-1
Brick5: rhsvm1:/data/store-2
Brick6: rhsvm2:/data/store-2
Brick7: rhsvm3:/data/store-0
Brick8: rhsvm4:/data/store-0
Brick9: rhsvm3:/data/store-1
Brick10: rhsvm4:/data/store-1
Brick11: rhsvm3:/data/store-2
Brick12: rhsvm4:/data/store-2
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: on
storage.owner-uid: 36
storage.owner-gid: 36
[root@rhsvm1 ~]#
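
For reference, the options listed under "Options Reconfigured" would have been applied with commands along the lines of this sketch:

gluster volume set vmstore0 performance.quick-read off
gluster volume set vmstore0 performance.read-ahead off
gluster volume set vmstore0 performance.io-cache off
gluster volume set vmstore0 performance.stat-prefetch off
gluster volume set vmstore0 cluster.eager-lock enable
gluster volume set vmstore0 network.remote-dio on
gluster volume set vmstore0 storage.owner-uid 36
gluster volume set vmstore0 storage.owner-gid 36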

Comment 3 Sachidananda Urs 2013-02-20 04:06:40 UTC
============= MEMORY USAGE ==========

Tasks: 168 total,   1 running, 167 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  1.1%sy,  0.0%ni, 98.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8059900k total,  7913248k used,   146652k free,    67184k buffers
Swap: 10289144k total,        0k used, 10289144k free,  7228328k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                          
11468 root      20   0  300m  78m 1924 S  0.0  1.0   0:14.76 glusterfs                                                        
11474 root      20   0  336m  66m 1716 S  0.0  0.8   1:27.65 glusterfs                                                        
 3651 root      20   0 1501m  27m 1492 S  0.0  0.3 301:47.84 glusterfsd                                                       
 3656 root      20   0  839m  27m 1492 S  0.0  0.3 191:04.06 glusterfsd                                                       
 3662 root      20   0  796m  25m 1496 S  0.0  0.3 144:54.12 glusterfsd                                                       
11461 root      20   0  234m  10m 1960 S  0.0  0.1   0:14.12 glusterd

Comment 4 Sachidananda Urs 2013-02-20 04:10:53 UTC
Created attachment 699833 [details]
sosreport from server1

Comment 5 Sachidananda Urs 2013-02-20 04:34:07 UTC
Created attachment 699845 [details]
sosreport from server2

Comment 6 Sachidananda Urs 2013-02-20 05:33:29 UTC
Created attachment 699856 [details]
sosreport from server3

Comment 7 Sachidananda Urs 2013-02-20 05:43:23 UTC
Created attachment 699857 [details]
sosreport from server3-1

Comment 8 Sachidananda Urs 2013-02-20 05:54:05 UTC
Created attachment 699858 [details]
sosreport from server4

Comment 9 Amar Tumballi 2013-02-25 06:58:39 UTC
Shishir, it would be great to go through the logs once and see what the issues are. Let's see which module is causing this bottleneck.

Comment 10 shishir gowda 2013-02-25 07:22:51 UTC
I have gone through the logs, and there are a lot of repeated getxattr failures (for SELinux) on the bricks and the FUSE client, which might be one of the reasons for the performance degradation.

bricks:
[2013-02-20 09:33:25.917931] E [posix.c:2600:posix_getxattr] 0-vmstore0-posix: getxattr failed on /data/store-0/rhsqe-repo/rhsqe-repo.raw: security.capability (No data available)
[2013-02-20 09:33:25.938996] E [posix.c:2600:posix_getxattr] 0-vmstore0-posix: getxattr failed on /data/store-0/rhsqe-repo/rhsqe-repo.raw: security.capability (No data available)

client:
[2013-02-19 15:00:37.933648] W [fuse-bridge.c:2840:fuse_xattr_cbk] 0-glusterfs-fuse: 324156: GETXATTR(security.capability) <gfid:d8c95dfd-ce4a-4721-a3b3-c72d4621ce40> => -1 (No such file or directory)
[2013-02-19 15:00:37.936414] W [fuse-bridge.c:2840:fuse_xattr_cbk] 0-glusterfs-fuse: 324159: GETXATTR(security.capability) <gfid:d8c95dfd-ce4a-4721-a3b3-c72d4621ce40> => -1 (No such file or directory)

One more thing that would be helpful in such scenarios is the read/write performance stats of the backends involved, and from the mount point.

Please run the following command from the CLI:
'gluster volume top <volname> read-perf|write-perf bs <size> count <cnt>'

and provide the dd output from one of the client mount points.

Also, turning on profiling with "gluster volume profile <volname> start" would help.

After an interval, running "gluster volume profile <volname> info" will provide some more stats.
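
Put together, the requested data collection would look something like the following sketch (the block size, count, and /vmstore mount path are placeholders):

# per-brick backend throughput
gluster volume top vmstore0 read-perf bs 1048576 count 1000
gluster volume top vmstore0 write-perf bs 1048576 count 1000

# dd from a client mount point
dd if=/dev/zero of=/vmstore/ddtest.img bs=1M count=1000

# enable profiling, then collect stats after an interval
gluster volume profile vmstore0 start
gluster volume profile vmstore0 info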

Comment 11 Sachidananda Urs 2013-02-25 14:18:28 UTC
Created attachment 702365 [details]
top read and write performance results

Comment 12 Amar Tumballi 2013-02-25 14:29:43 UTC
Look at the numbers in the above attachment. With backends like these, we are destined to be slow :-o

Brick: rhsvm4:/data/store-1
Throughput 1.25 MBps time 0.3279 secs

Brick: rhsvm4:/data/store-0
Throughput 2.86 MBps time 0.1432 secs

Brick: rhsvm4:/data/store-2
Throughput 3.87 MBps time 0.1057 secs

Brick: rhsvm1:/data/store-1
Throughput 14.01 MBps time 0.0292 secs

Brick: rhsvm1:/data/store-0
Throughput 16.60 MBps time 0.0247 secs

Brick: rhsvm1:/data/store-2
Throughput 355.86 MBps time 0.0012 secs

Ideally, the performance on every brick should have looked like the numbers below:

Brick: rhsvm2:/data/store-0
Throughput 611.34 MBps time 0.0007 secs

Brick: rhsvm2:/data/store-2
Throughput 675.91 MBps time 0.0006 secs

Brick: rhsvm2:/data/store-1
Throughput 410.42 MBps time 0.0010 secs

Comment 13 Ben Turner 2013-02-25 23:42:10 UTC
I just went on the brick itself and tested:

[root@rhsvm1 store-0]# time `dd if=/dev/zero of=./dd-ben1.img bs=1024k count=1000; sync` 
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 354.812 s, 3.0 MB/s

real	5m56.611s
user	0m0.031s
sys	5m52.400s

So on the brick itself we are getting ~2.80 MB/sec, no good there.  Now to check a read:

[root@rhsvm1 store-0]# echo 3 > /proc/sys/vm/drop_caches 
[root@rhsvm1 store-0]# dd if=./dd-ben1.img of=/dev/null bs=1024k
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 6.56717 s, 160 MB/s

Wow, on reads we are getting 160 MB/sec! Let's try the write again:

[root@rhsvm1 store-0]# time `dd if=/dev/zero of=./dd-ben1.img bs=1024k count=1000; sync` 
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.03621 s, 1.0 GB/s

real	0m5.939s
user	0m0.001s
sys	0m1.200s

That time the write plus sync took about 6 seconds, which works out to ~169 MB/sec.

I spent some time testing different things after this and concluded that when I dropped caches before both reads and writes, I saw full speed. These full-speed reads/writes would continue until RAM filled up again with dirty pages. Once RAM was filled with dirty pages, I would hit slow performance again.
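
For anyone reproducing this, a quick way to correlate the slowdown with dirty-page buildup is to watch the dirty counters between runs (a sketch; these are the standard kernel interfaces):

# how much dirty data is waiting to be written back
grep -E 'Dirty|Writeback' /proc/meminfo

# flush and drop caches before the next timed run
sync
echo 3 > /proc/sys/vm/drop_caches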

I thought this might be caused by something on the hypervisor, so I ran the following SystemTap script against the PID of rhsvm1:

/usr/share/doc/systemtap-client-1.8/examples/profiling/latencytap.stp --all-modules -v -x 16429

I did a run with dirty pages and one without.  The clean run looked like:

Reason                                  Count  Average(us)  Maximum(us) Percent%
                                            1           -8           -8       0%
synchronous write                           3           -8           -8       0%
                                            5           -8           -8       0%
                                            6           -8           -8       0%
synchronous write                           8           -8           -8       0%
synchronous write                          23           -8           -8       0%
Userspace lock contention                 561           -8           -8       7%
Waiting for event (select)               2037           -8           -8      27%
KVM ioctl                                4897           -8           -8      64%

The dirty run had:

Reason                                  Count  Average(us)  Maximum(us) Percent%
                                            1           -8           -8       0%
KVM ioctl                                   1           -8           -8       0%
Page fault                                  1           -8           -8       0%
synchronous write                           2           -8           -8       0%
Page fault                                  2           -8           -8       0%
synchronous write                           5           -8           -8       0%
                                            6           -8           -8       0%
                                            6           -8           -8       0%
synchronous write                          19           -8           -8       0%
Userspace lock contention                 659           -8           -8       5%
Page fault                               2120           -8           -8      18%
Waiting for event (select)               2270           -8           -8      19%
KVM ioctl                                6521           -8           -8      56%

The big difference looked to be the page faults on the dirty run (I am following up with the systemtap maintainer on why the averages and maximums were -8; that seems to be a bug in the stap script).

I don't have anything conclusive on the RCA here yet, but I do think this has something to do with allocating memory on/to the guest. A couple of things to look at:

- Why are the disks getting presented to the guest as IDE? (See the check after this list.)
- Are we using any sort of memory saving features on the HV? ksmtuned? Look at BZ 672216.
- Look at more stap scripts: ioblktime.stp? pfaults.stp?
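
For the first point, the disk bus is visible in the libvirt XML; something like this sketch would confirm whether the data disks are attached as IDE instead of virtio (the domain name is just an example):

virsh dumpxml rhsvm1 | grep -E "<driver|<target dev"
# IDE-attached disks show up as bus='ide'; virtio disks show bus='virtio' (dev='vdX').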

Comment 14 Sachidananda Urs 2013-02-26 03:28:32 UTC
dd on the mount point:


[root@matrix vmstore]# dd if=/dev/zero of=dd.out bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 2403.95 s, 893 kB/s
[root@matrix vmstore]#

Comment 16 Gowrishankar Rajaiyan 2013-02-26 06:07:31 UTC
-Are we using any sort of memory saving features on the HV?  ksmtuned?

I have started them now.
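
For the record, on RHEL 6 the KSM pieces are the standard ksm/ksmtuned services; enabling them looks roughly like this sketch:

service ksm start
service ksmtuned start
chkconfig ksm on
chkconfig ksmtuned on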


============= MEMORY USAGE ==========
top - 11:00:23 up 13 days, 18:12,  3 users,  load average: 0.23, 0.12, 0.09
Tasks: 172 total,   1 running, 171 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.4%us,  2.3%sy,  0.0%ni, 97.0%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.7%us,  0.7%sy,  0.0%ni, 98.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.3%us,  1.4%sy,  0.0%ni, 97.9%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  1.8%sy,  0.0%ni, 97.9%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8059900k total,  6590896k used,  1469004k free,    37264k buffers
Swap: 10289144k total,        0k used, 10289144k free,  5975068k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                          
20950 root      20   0  300m  77m 1924 S  0.3  1.0   1:36.84 glusterfs                                                        
12077 root      20   0  402m  72m 1260 S  0.3  0.9  16:37.24 glusterfs                                                        
 4040 root      20   0 1558m  29m 1732 S 39.5  0.4 420:11.01 glusterfsd                                                       
 4045 root      20   0 1519m  27m 1720 S  0.0  0.4 907:08.12 glusterfsd                                                       
 4051 root      20   0 1240m  24m 1724 S  0.0  0.3 196:49.63 glusterfsd                                                       
12064 root      20   0  235m  13m 1508 S  0.3  0.2   6:41.03 glusterd                                         


[root@matrix vmstore]# dd if=/dev/zero of=dd.out bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 1183.97 s, 1.8 MB/s
[root@matrix vmstore]# 

[root@rhsvm1 store-0]# time `dd if=/dev/zero of=./dd-shanks1.img bs=1024k count=1000; sync` 
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 6.51697 s, 161 MB/s

real	0m10.136s
user	0m0.002s
sys	0m6.470s
[root@rhsvm1 store-0]# 


[root@rhsvm1 store-0]# echo 3 > /proc/sys/vm/drop_caches
[root@rhsvm1 store-0]# time `dd if=/dev/zero of=./dd-shanks2.img bs=1024k count=1000; sync` 
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.28284 s, 245 MB/s

real	0m8.236s
user	0m0.004s
sys	0m4.263s
[root@rhsvm1 store-0]#

Comment 17 Sachidananda Urs 2013-02-26 06:56:01 UTC
Created attachment 702646 [details]
Profile info for the volume

Comment 18 Ben England 2013-02-27 15:54:53 UTC
- Are guests using virtio-blk?
- Are guests using io=threads?
- Are you using RHEL 6.3 kernel 2.6.32-279.22.1.el6, available in the RHEL 6.3 z-stream? It has the scatter-gather patch.
- Are you using KVM cache=none (direct I/O)? If not, expect that I/O will start out fast and then degrade when dirty pages fill up on the host.
- Are you using 10-GbE with jumbo frames?
- How is the storage configured, and does it have writeback caching?
- Is the network link used by Gluster dedicated to Gluster?

Please post either the KVM virsh dumpxml output or a "ps awux | grep qemu" so we can see how the guest is configured.

If you are using 1-GbE, you can try bonding mode 6 on the KVM hosts to try to get more throughput from the network, but if the network is an order of magnitude slower than the storage, you can expect a problem. Gluster is totally dependent on the underlying network for performance; it always has been.
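
One way to pull most of those settings out of a running guest without the full XML is to inspect the qemu-kvm command line and the host NIC (a sketch; eth0 is only an example interface name):

# how each drive is attached (file, bus, cache mode, aio mode)
ps awux | grep [q]emu-kvm | tr ',' '\n' | grep -E 'file=|if=|cache=|aio='

# MTU on the interface Gluster uses (jumbo frames would show mtu 9000)
ip link show eth0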

Comment 19 Gowrishankar Rajaiyan 2013-02-28 04:54:13 UTC
qemu     15914  6.8 24.8 13273912 8179112 ?    Sl   Feb12 1565:26 /usr/libexec/qemu-kvm -S -M rhel6.3.0 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 -name rhsvm4 -uuid 068bbfc5-1ec2-53d5-df9a-05e12083ef2f -nographic -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhsvm4.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/images/rhs/rhsvm4/rhsvm4.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/sdk,if=none,id=drive-ide0-0-0,format=raw,cache=none,aio=native -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=/dev/sdl,if=none,id=drive-ide0-0-1,format=raw,cache=none,aio=native -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -drive file=/dev/sdm,if=none,id=drive-ide0-1-1,format=raw,cache=none,aio=native -device ide-drive,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -netdev tap,fd=27,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:57:c0:c5,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


[root@matrix ~]# uname -r
2.6.32-279.el6.x86_64
[root@matrix ~]# 


I am updating the kernel and shall keep you posted.

Comment 20 Sachidananda Urs 2013-02-28 05:19:41 UTC
Created attachment 703794 [details]
Adding xml dump of the vms

Comment 21 Ben England 2013-03-01 17:37:08 UTC
You are using aio=native; please try aio=threads, and consider using kernel 2.6.32-279.22.1.el6 or later on your KVM hypervisor host. We know there is a problem with io_submit() system call performance for Gluster; using aio=threads in the KVM guest configuration avoids it. A fix for this problem is planned in a future RHEL release (Brian, which one?).

If you do a Wireshark trace of the Gluster traffic, I bet you will see really small I/O requests for your write tests.
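
In the libvirt domain XML this maps to the io= attribute on each disk <driver> element; a minimal sketch of checking and changing it (the domain name is an example):

# see the current aio setting for each disk
virsh dumpxml rhsvm3 | grep "io="
# then change io='native' to io='threads' on the <driver .../> lines with:
virsh edit rhsvm3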

Comment 22 Gowrishankar Rajaiyan 2013-03-03 05:38:39 UTC
qemu      8815  165 25.2 8978068 8311936 ?     Sl   Mar01 3743:00 /usr/libexec/qemu-kvm -S -M rhel6.3.0 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 -name rhsvm3 -uuid b89166f6-475b-4415-cc30-dd5e090156cc -nographic -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhsvm3.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/images/rhs/rhsvm3/rhsvm3.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/sdh,if=none,id=drive-ide0-0-0,format=raw,cache=writeback,aio=threads -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=/dev/sdi,if=none,id=drive-ide0-0-1,format=raw,cache=writeback,aio=threads -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -drive file=/dev/sdj,if=none,id=drive-ide0-1-1,format=raw,cache=writeback,aio=threads -device ide-drive,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -netdev tap,fd=23,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:73:79:8c,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


# uname -r
2.6.32-279.22.1.el6.x86_64


We are seeing a major performance improvement. Thanks bene.

I think we should document these qemu/kvm performance tuning options in our Quick Start Guide.

Comment 23 Ben Turner 2013-03-04 22:18:27 UTC
After turning on the ksm services I saw much better performance. Upon inspection, the 4 RHS virts running on this HV each used 8 GB of RAM, and we only had 32 GB on the host, with other guests running as well. We relocated all the other VMs off the HV, turned on the ksm services, and implemented the tuning suggested by bene, and things are running OK now. I am going to close this as not a bug.

