Description of problem:
Coming from the dogfood setup - Ben Turner & Jonathan Holloway reported that they are seeing terrible performance on OS installations. OS installations are taking hours. Package installs were taking multiple minutes per package and then all of a sudden would speed up and slow down again. Though this sounds like a network issue, still filing it since all the machines are located in the same lab and such terrible performance should not be seen every time.

Version-Release number of selected component (if applicable):
glusterfs-3.3.0.5rhs-42.el6rhs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install a RHEL6.3 virtualization host.
2. Install 4 RHS VMs on the /home partition.
3. Attach disks (lvm+xfs) from the host (DAS) to these RHS VMs.
4. Create a 6x2 dist-repl volume[1].
5. Mount (vmstore) this volume on the same hypervisor as in step 1.
6. Install a few VMs on vmstore.
7. Install a RHEL6.4 virtualization host.
8. Mount the volume from step 4.
9. Install VMs using PXE.

Actual results:
OS installations are taking hours.

Expected results:
OS installation time should be acceptable.

Additional info:
[1]
[root@rhsvm1 ~]# gluster vol info

Volume Name: vmstore0
Type: Distributed-Replicate
Volume ID: 2a718178-98a6-4cdf-90e5-f97632cc32fc
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: rhsvm1:/data/store-0
Brick2: rhsvm2:/data/store-0
Brick3: rhsvm1:/data/store-1
Brick4: rhsvm2:/data/store-1
Brick5: rhsvm1:/data/store-2
Brick6: rhsvm2:/data/store-2
Brick7: rhsvm3:/data/store-0
Brick8: rhsvm4:/data/store-0
Brick9: rhsvm3:/data/store-1
Brick10: rhsvm4:/data/store-1
Brick11: rhsvm3:/data/store-2
Brick12: rhsvm4:/data/store-2
Options Reconfigured:
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: on
storage.owner-uid: 36
storage.owner-gid: 36
[root@rhsvm1 ~]#
============= MEMORY USAGE ==========
Tasks: 168 total,   1 running, 167 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  1.1%sy,  0.0%ni, 98.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8059900k total,  7913248k used,   146652k free,    67184k buffers
Swap: 10289144k total,        0k used, 10289144k free,  7228328k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11468 root      20   0  300m  78m 1924 S  0.0  1.0   0:14.76 glusterfs
11474 root      20   0  336m  66m 1716 S  0.0  0.8   1:27.65 glusterfs
 3651 root      20   0 1501m  27m 1492 S  0.0  0.3 301:47.84 glusterfsd
 3656 root      20   0  839m  27m 1492 S  0.0  0.3 191:04.06 glusterfsd
 3662 root      20   0  796m  25m 1496 S  0.0  0.3 144:54.12 glusterfsd
11461 root      20   0  234m  10m 1960 S  0.0  0.1   0:14.12 glusterd
Created attachment 699833 [details] sosreport from server1
Created attachment 699845 [details] sosreport from server2
Created attachment 699856 [details] sosreport from server3
Created attachment 699857 [details] sosreport from server3-1
Created attachment 699858 [details] sosreport from server4
Shishir, it would be great to go through the logs once and see what the issues are. Let's see which module is causing this bottleneck.
Have gone through the logs, and there are a lot of repeating getxattr failures (for selinux) on the bricks and the fuse client, which might be one of the reasons for the degradation of performance.

bricks:
[2013-02-20 09:33:25.917931] E [posix.c:2600:posix_getxattr] 0-vmstore0-posix: getxattr failed on /data/store-0/rhsqe-repo/rhsqe-repo.raw: security.capability (No data available)
[2013-02-20 09:33:25.938996] E [posix.c:2600:posix_getxattr] 0-vmstore0-posix: getxattr failed on /data/store-0/rhsqe-repo/rhsqe-repo.raw: security.capability (No data available)

client:
[2013-02-19 15:00:37.933648] W [fuse-bridge.c:2840:fuse_xattr_cbk] 0-glusterfs-fuse: 324156: GETXATTR(security.capability) <gfid:d8c95dfd-ce4a-4721-a3b3-c72d4621ce40> => -1 (No such file or directory)
[2013-02-19 15:00:37.936414] W [fuse-bridge.c:2840:fuse_xattr_cbk] 0-glusterfs-fuse: 324159: GETXATTR(security.capability) <gfid:d8c95dfd-ce4a-4721-a3b3-c72d4621ce40> => -1 (No such file or directory)

One more thing that would be helpful in such scenarios is the read/write performance stats of the backends involved, and from the mount point. Please run the following command from the cli:

'gluster volume top <volname> read-perf|write-perf bs <size> count <cnt>'

and provide the dd output from one of the client mount points. Also, turning on profiling with "gluster volume profile <volname> start" would help. After an interval, running "gluster volume profile <volname> info" will provide some more stats.
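For reference, a concrete invocation of the commands above might look like this. This is only a sketch: the volume name vmstore0 and the 1 MB block size are taken from this setup, the 60-second interval is arbitrary, and the guard is there so the script degrades gracefully on a box without the gluster CLI.

```shell
# Volume name from this bug; substitute your own.
VOL=vmstore0

# Guard: only run on a node that actually has the gluster CLI.
if command -v gluster >/dev/null 2>&1; then
    # Per-brick backend throughput (1024 x 1 MB blocks);
    # these numbers are what expose slow bricks.
    gluster volume top "$VOL" read-perf  bs 1048576 count 1024
    gluster volume top "$VOL" write-perf bs 1048576 count 1024

    # Latency/fop profiling: start, wait an interval, then dump stats.
    gluster volume profile "$VOL" start
    sleep 60
    gluster volume profile "$VOL" info
else
    echo "gluster CLI not found; run this on an RHS node"
fi
```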
Created attachment 702365 [details] top read and write performance results
Look at the numbers in the above attachment. With such backends, we are destined to be slow :-o

Brick: rhsvm4:/data/store-1
Throughput 1.25 MBps time 0.3279 secs
Brick: rhsvm4:/data/store-0
Throughput 2.86 MBps time 0.1432 secs
Brick: rhsvm4:/data/store-2
Throughput 3.87 MBps time 0.1057 secs
Brick: rhsvm1:/data/store-1
Throughput 14.01 MBps time 0.0292 secs
Brick: rhsvm1:/data/store-0
Throughput 16.60 MBps time 0.0247 secs
Brick: rhsvm1:/data/store-2
Throughput 355.86 MBps time 0.0012 secs

Ideally, the performance should have been like below:

Brick: rhsvm2:/data/store-0
Throughput 611.34 MBps time 0.0007 secs
Brick: rhsvm2:/data/store-2
Throughput 675.91 MBps time 0.0006 secs
Brick: rhsvm2:/data/store-1
Throughput 410.42 MBps time 0.0010 secs
I just went on the brick itself and tested:

[root@rhsvm1 store-0]# time `dd if=/dev/zero of=./dd-ben1.img bs=1024k count=1000; sync`
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 354.812 s, 3.0 MB/s

real    5m56.611s
user    0m0.031s
sys     5m52.400s

So on the brick itself we are getting ~2.80 MB/sec; no good there. Now to check a read:

[root@rhsvm1 store-0]# echo 3 > /proc/sys/vm/drop_caches
[root@rhsvm1 store-0]# dd if=./dd-ben1.img of=/dev/null bs=1024k
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 6.56717 s, 160 MB/s

Wow, on reads we are getting 160 MB/sec! Let's try the write again:

[root@rhsvm1 store-0]# time `dd if=/dev/zero of=./dd-ben1.img bs=1024k count=1000; sync`
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.03621 s, 1.0 GB/s

real    0m5.939s
user    0m0.001s
sys     0m1.200s

That time the write took under 6 seconds, which works out to ~169 MB/sec. I spent some time testing different things after this and came to the conclusion that when I dropped caches before both reads and writes I saw full speed. These full-speed reads/writes would continue until I filled RAM up again with dirty pages. Once RAM was filled with dirty pages I would hit slow performance again. I thought that this might be caused by something on the hypervisor, so I ran the following SystemTap script on the PID of rhsvm1:

/usr/share/doc/systemtap-client-1.8/examples/profiling/latencytap.stp --all-modules -v -x 16429

I did a run with dirty pages and one without.
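As an aside, the cache effect above is easy to misread with plain dd. A minimal, less cache-sensitive write test could look like the sketch below; the scratch path is hypothetical, and `conv=fsync` makes dd flush before reporting, so the MB/s figure includes writing dirty pages to disk rather than relying on a trailing `sync`.

```shell
# Hypothetical scratch file; point this at the brick filesystem under test.
SCRATCH="${TMPDIR:-/tmp}/dd-bench.img"

# Drop the page cache first so the run measures the disk, not RAM
# (needs root; skipped silently when not permitted).
if [ -w /proc/sys/vm/drop_caches ]; then
    sync
    echo 3 > /proc/sys/vm/drop_caches
fi

# conv=fsync: dd fsyncs the output file before printing its throughput,
# so the reported rate cannot be inflated by dirty pages left in RAM.
dd if=/dev/zero of="$SCRATCH" bs=1M count=64 conv=fsync

rm -f "$SCRATCH"
```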
The clean run looked like:

Reason                       Count  Average(us)  Maximum(us)  Percent%
                                 1           -8           -8        0%
synchronous write                3           -8           -8        0%
                                 5           -8           -8        0%
                                 6           -8           -8        0%
synchronous write                8           -8           -8        0%
synchronous write               23           -8           -8        0%
Userspace lock contention      561           -8           -8        7%
Waiting for event (select)    2037           -8           -8       27%
KVM ioctl                     4897           -8           -8       64%

The dirty run had:

Reason                       Count  Average(us)  Maximum(us)  Percent%
                                 1           -8           -8        0%
KVM ioctl                        1           -8           -8        0%
Page fault                       1           -8           -8        0%
synchronous write                2           -8           -8        0%
Page fault                       2           -8           -8        0%
synchronous write                5           -8           -8        0%
                                 6           -8           -8        0%
                                 6           -8           -8        0%
synchronous write               19           -8           -8        0%
Userspace lock contention      659           -8           -8        5%
Page fault                    2120           -8           -8       18%
Waiting for event (select)    2270           -8           -8       19%
KVM ioctl                     6521           -8           -8       56%

The big difference looked to be the page faults on the dirty run (I am following up with the SystemTap maintainer on why the averages and maximums were -8s; that seems to be a bug with the stap script). I don't have anything conclusive on RCA here yet, but I do think this has something to do with allocating memory on/to the guest. A couple of things to look at:
- Why are the disks getting presented to the guest as IDE?
- Are we using any sort of memory-saving features on the HV? ksmtuned? Look at BZ 672216.
- Look at more stap scripts: ioblktime.stp? pfaults.stp?
dd on the mount point:

[root@matrix vmstore]# dd if=/dev/zero of=dd.out bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 2403.95 s, 893 kB/s
[root@matrix vmstore]#
- Are we using any sort of memory saving features on the HV? ksmtuned?

I have started them now.

============= MEMORY USAGE ==========
top - 11:00:23 up 13 days, 18:12,  3 users,  load average: 0.23, 0.12, 0.09
Tasks: 172 total,   1 running, 171 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.4%us,  2.3%sy,  0.0%ni, 97.0%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.7%us,  0.7%sy,  0.0%ni, 98.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.3%us,  1.4%sy,  0.0%ni, 97.9%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  1.8%sy,  0.0%ni, 97.9%id,  0.4%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8059900k total,  6590896k used,  1469004k free,    37264k buffers
Swap: 10289144k total,        0k used, 10289144k free,  5975068k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20950 root      20   0  300m  77m 1924 S  0.3  1.0   1:36.84 glusterfs
12077 root      20   0  402m  72m 1260 S  0.3  0.9  16:37.24 glusterfs
 4040 root      20   0 1558m  29m 1732 S 39.5  0.4 420:11.01 glusterfsd
 4045 root      20   0 1519m  27m 1720 S  0.0  0.4 907:08.12 glusterfsd
 4051 root      20   0 1240m  24m 1724 S  0.0  0.3 196:49.63 glusterfsd
12064 root      20   0  235m  13m 1508 S  0.3  0.2   6:41.03 glusterd

[root@matrix vmstore]# dd if=/dev/zero of=dd.out bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 1183.97 s, 1.8 MB/s
[root@matrix vmstore]#

[root@rhsvm1 store-0]# time `dd if=/dev/zero of=./dd-shanks1.img bs=1024k count=1000; sync`
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 6.51697 s, 161 MB/s

real    0m10.136s
user    0m0.002s
sys     0m6.470s
[root@rhsvm1 store-0]#

[root@rhsvm1 store-0]# echo 3 > /proc/sys/vm/drop_caches
[root@rhsvm1 store-0]# time `dd if=/dev/zero of=./dd-shanks2.img bs=1024k count=1000; sync`
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.28284 s, 245 MB/s

real    0m8.236s
user    0m0.004s
sys     0m4.263s
[root@rhsvm1 store-0]#
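For the record, enabling KSM on a RHEL 6 hypervisor and verifying it is doing something looks roughly like this. A sketch only: it needs root, the init-script names are the RHEL 6 ones, and the sysfs counter only moves once qemu processes with duplicate pages are actually running.

```shell
# Start KSM and its tuning daemon and make them persistent (RHEL 6
# service names). Errors are swallowed so the snippet is harmless on
# hosts without these init scripts.
if command -v service >/dev/null 2>&1; then
    service ksm start        2>/dev/null || true
    service ksmtuned start   2>/dev/null || true
    chkconfig ksm on         2>/dev/null || true
    chkconfig ksmtuned on    2>/dev/null || true
fi

# Verify sharing: pages_sharing > 0 means KSM is merging guest pages.
KSM_DIR=/sys/kernel/mm/ksm
if [ -r "$KSM_DIR/pages_sharing" ]; then
    cat "$KSM_DIR/pages_sharing"
else
    echo "KSM sysfs interface not present on this host"
fi
```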
Created attachment 702646 [details] Profile info for the volume
- Are guests using virtio-blk?
- Are guests using io=threads?
- Are you using RHEL6.3 kernel 2.6.32-279.22.1.el6, available in the RHEL6.3 z-stream? This has the scatter-gather patch.
- Are you using KVM cache=none (direct I/O)? If not, expect that I/O will start out fast and then degrade when dirty pages fill up on the host.
- Are you using 10-GbE with jumbo frames?
- How is the storage configured, and does it have writeback caching?
- Is the network link used by Gluster dedicated to Gluster?

Please post either the KVM virsh dumpxml output or a "ps awux | grep qemu" so we can see how the guest is configured.

If you are using 1-GbE, you can try bonding mode 6 on the KVM hosts to try and get more throughput from the network, but if the network is an order of magnitude slower than the storage then you can expect a problem. Gluster is totally dependent on the underlying network for performance; it always has been.
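The checklist above can mostly be answered in one pass with something like the following sketch. The domain name rhsvm1 is taken from this setup but is otherwise an assumption; virsh is guarded since it only exists on the hypervisor.

```shell
# Guest name from this bug; substitute your own libvirt domain.
DOMAIN=rhsvm1

# Disk bus, cache= and io= (aio) settings per disk, straight from libvirt.
if command -v virsh >/dev/null 2>&1; then
    virsh dumpxml "$DOMAIN" | grep -E '<(disk|driver|target)' || true
fi

# Fallback: the raw qemu-kvm command line shows the same
# if=/cache=/aio= values. The [q] trick keeps grep out of its own output.
ps awux | grep '[q]emu-kvm' || true

# Jumbo frames check: MTU should read 9000 on a 10-GbE Gluster link.
ip link show 2>/dev/null | grep -o 'mtu [0-9]*' || true
```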
qemu 15914 6.8 24.8 13273912 8179112 ? Sl Feb12 1565:26 /usr/libexec/qemu-kvm -S -M rhel6.3.0 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 -name rhsvm4 -uuid 068bbfc5-1ec2-53d5-df9a-05e12083ef2f -nographic -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhsvm4.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/images/rhs/rhsvm4/rhsvm4.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/sdk,if=none,id=drive-ide0-0-0,format=raw,cache=none,aio=native -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=/dev/sdl,if=none,id=drive-ide0-0-1,format=raw,cache=none,aio=native -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -drive file=/dev/sdm,if=none,id=drive-ide0-1-1,format=raw,cache=none,aio=native -device ide-drive,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -netdev tap,fd=27,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:57:c0:c5,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

[root@matrix ~]# uname -r
2.6.32-279.el6.x86_64
[root@matrix ~]#

I am updating the kernel and shall keep you posted.
Created attachment 703794 [details] Adding xml dump of the vms
You are using aio=native; please try aio=threads, and consider using kernel 2.6.32-279.22.1.el6 or later on your KVM hypervisor host. We know there is a problem with io_submit() system call performance for Gluster; using aio=threads for the KVM guest avoids this problem. A fix for this problem is planned in a future RHEL release (Brian, which one?). If you do a wireshark trace of Gluster traffic, I bet you will see really small I/O requests for your write tests.
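In libvirt terms, the suggested settings correspond to a disk definition along these lines. A sketch only: the source device /dev/sdk is taken from the ps output above, the target dev name is illustrative, and the key attributes are bus='virtio', cache='none' and io='threads' (libvirt's io= maps to qemu's aio=).

```xml
<!-- Illustrative libvirt disk stanza: virtio bus, direct I/O, thread-based AIO -->
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='threads'/>
  <source dev='/dev/sdk'/>
  <target dev='vdb' bus='virtio'/>
</disk>
```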
qemu 8815 165 25.2 8978068 8311936 ? Sl Mar01 3743:00 /usr/libexec/qemu-kvm -S -M rhel6.3.0 -enable-kvm -m 8192 -smp 4,sockets=4,cores=1,threads=1 -name rhsvm3 -uuid b89166f6-475b-4415-cc30-dd5e090156cc -nographic -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhsvm3.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/images/rhs/rhsvm3/rhsvm3.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/dev/sdh,if=none,id=drive-ide0-0-0,format=raw,cache=writeback,aio=threads -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=/dev/sdi,if=none,id=drive-ide0-0-1,format=raw,cache=writeback,aio=threads -device ide-drive,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -drive file=/dev/sdj,if=none,id=drive-ide0-1-1,format=raw,cache=writeback,aio=threads -device ide-drive,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -netdev tap,fd=23,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:73:79:8c,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

# uname -r
2.6.32-279.22.1.el6.x86_64

We are seeing a major performance improvement. Thanks, bene. I think we should document these qemu/kvm performance tuning options in our Quick Start Guide.
After turning on the KSM services I saw much better performance. Upon inspection, the 4 RHS virts running on this HV each used 8 GB of RAM, and we only had 32 GB on the host, with other guests running as well. We relocated all the other VMs off the HV, enabled KSM, and implemented the tuning suggested by bene, and things are running OK now. I am going to close this as not a bug.