Bug 1007866

Summary: sequential read performance not optimized for libgfapi
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: glusterd
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Ben England <bengland>
Assignee: Anand Avati <aavati>
QA Contact: Ben England <bengland>
CC: amarts, bturner, chrisw, grajaiya, kparthas, shaines, ssaha, vagarwal, vbellur
Keywords: ZStream
Type: Bug
Doc Type: Bug Fix
Fixed In Version: glusterfs-3.4.0.34rhs
Clones: 1009134 (view as bug list)
Bug Blocks: 1009134
Last Closed: 2013-11-27 15:37:54 UTC

Description Ben England 2013-09-13 13:17:48 UTC
Description of problem:

When running KVM guests on top of libgfapi, sequential read performance is significantly slower than it was with FUSE (see below), by as much as 50%.

The problem is traceable in part to stat-prefetch=off in /var/lib/glusterd/groups/virt, which results in an FSTAT round trip to the server for every READ RPC.  Setting stat-prefetch to on eliminates the FSTAT RPCs and brings libgfapi sequential read performance up to FUSE speed.
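
For reference, a minimal sketch of toggling and checking this option with the gluster CLI (the volume name my-vol is a placeholder):

# gluster volume set my-vol performance.stat-prefetch on
# gluster volume info my-vol | grep -i stat-prefetch

The second command should list the reconfigured option once it has been set.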

But after that optimization, the qemu-kvm thread doing reads reaches 100% utilization of one core, which means qemu-kvm is CPU-bottlenecked.  Avati has a patch (http://review.gluster.org/5897) that removes this bottleneck, allowing libgfapi to reach line speed.

Version-Release number of selected component (if applicable):

server: RHS 2.1 iso dated Sep 9

client (KVM host):
http://download.devel.redhat.com/rel-eng/RHEL-6.5-Alpha-1.1/6.5/Server/x86_64/os/
+
glusterfs-libs-3.4.0.33rhs-1.el6_4.x86_64
glusterfs-3.4.0.33rhs-1.el6_4.x86_64
glusterfs-fuse-3.4.0.33rhs-1.el6_4.x86_64
glusterfs-api-3.4.0.33rhs-1.el6_4.x86_64

How reproducible:

every time

Steps to Reproduce:
1. Create a Gluster-FUSE-backed guest using virt-manager and add 4 virtual block devices to it, each associated with a 30-GB file in the Gluster volume.
2. After setting this up correctly, manually clone the guest and configure the new guest to use libgfapi for its 4 virtual block devices, as shown below.
3. On the Gluster FUSE guest and the gfapi guest, create 4 ext4 filesystems with mountpoints /mnt/vbd-{b,c,d,e}, add them to fstab, and mount them.
4. Run this command to create a file:

# iozone -w -c -e -i 0 -+n -r 64k -s 4g -f /mnt/vbd-b/f.tmp

5. Enable prefetching in the guest: install tuned ("yum install tuned"), activate the virtual-guest profile ("tuned-adm profile virtual-guest"), and add this line to /etc/rc.local on the guest (a sanity check is shown after step 6):

for n in /sys/block/vd? ; do echo 8192 > $n/queue/read_ahead_kb ; done

6. Run this command to test speed:

# iozone -w -c -e -i 1 -+n -r 64k -s 4g -f /mnt/vbd-b/f.tmp
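
Sanity check (a suggested addition, not part of the original procedure): before timing the read, confirm that the readahead setting from step 5 took effect on the guest devices; each should report 8192.

# grep -H . /sys/block/vd?/queue/read_ahead_kb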



Actual results:

You should see something like 450 MB/s on the libgfapi guest and 650 MB/s on the FUSE guest with a single thread.  If you expand the test to 4 virtual block devices using

iozone -w -c -e -i 0 -+n -r 64k -s 4g -t 4 -F /mnt/vbd-{b,c,d,e}/f.tmp
iozone -w -c -e -i 1 -+n -r 64k -s 4g -t 4 -F /mnt/vbd-{b,c,d,e}/f.tmp

you should see an even bigger difference.

Expected results:


Additional info:

How to manually clone the guest:

- do "virsh dumpxml gluster-fuse > gfapi.xml" while guest is running
- Edit the XML by hand (no virt-manager support yet) to give the clone a different UUID, name, and MAC addresses, and to point its backing system disk image at "/home/kvm_images/gfapi".
- copy the original system disk image out of /var/lib/libvirt/images to /home/kvm_images/gfapi 
- convert the 4 virtual block devices used for testing to use libgfapi.  The XML must look like this for each one -- only the "disk" and "source" tags need to change.

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source protocol='gluster' name='/v-b1/gfapi_disk1'>
        <host name='gprfs047-10ge' port='24007'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>

- boot the gfapi guest to single-user mode, go to /etc/sysconfig/network-scripts, edit the MAC addresses in ifcfg-eth0 and ifcfg-eth1, and edit the static IP address for the 10-GbE NIC, then remove /etc/udev/rules.d/70-persistent-net.rules.  Change HOSTNAME in /etc/sysconfig/network to gfapi.  Halt and restart the gfapi guest; it should now come up clean and be accessible over the network.
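
To register and boot the clone from the edited XML, something like the following should work (a sketch; the guest name gfapi and the file gfapi.xml come from the steps above):

# virsh define gfapi.xml
# virsh start gfapi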

Comment 2 Ben England 2013-09-16 12:53:05 UTC
I have collected data showing the impact of these changes in a spreadsheet at below URL:

http://perf1.perf.lab.eng.bos.redhat.com/bengland/public/rhs/virt/kvm-libgfapi.ods

This shows roughly a 60% improvement on reads from setting the stat-prefetch volume parameter to on, and then, for reads with multiple virtual block devices (1 iozone thread per device), another 40% improvement, allowing us to attain 10-GbE LINE SPEED with KVM/libgfapi on sequential reads with 2 virtual block devices, far better than FUSE.  There is little variance in the read results.

Avati's patch lowers the amount of CPU consumed by QEMU I/O threads.  We observed that without the patch, single-threaded reads caused one of the QEMU threads, presumably an I/O thread, to consume 95% of a CPU core, so there was a "hot thread" CPU bottleneck caused by libgfapi (Avati verified this using gdb).  After the patch there was still a CPU bottleneck in the single-thread case, but it moved from memset_sse() to memcpy(), a more reasonable bottleneck associated with actually moving data.

I also measured write performance, but there was run-to-run variance in the writes, possibly due to NUMA, power-management, and scheduler effects, so the small differences graphed should not be considered statistically significant.

Comment 3 Amar Tumballi 2013-09-24 07:30:07 UTC
Upstream patch @ : http://review.gluster.org/5897

Comment 5 Gowrishankar Rajaiyan 2013-10-08 08:39:31 UTC
Please fill in the "Fixed In Version" field.

Comment 6 Ben England 2013-10-29 19:17:22 UTC
I think RHS 2.1 U2 contains the fix; I re-ran the same tests that I ran before and got the same improvement with libgfapi.  Again I used "group virt" tuning, the rhs-virtualization tuned profile, stat-prefetch on, and 8 MB of guest readahead.

WARNING: I'm seeing a major self-heal operation when I run this test, as described in bz 1007948.  I wish you had fixed that; it will cause alarm among sysadmins.  However, it pulls itself together.  This seems to affect writes more than reads, so I'm turning off the self-heal daemon for now so I can see what write performance would be if that were fixed.
  
I used the RHS 2.1 U2 build dated Oct 27.  The top utility with the "H" option shows that there are no longer any hot threads blocking libgfapi read performance.  With 4 virtual block devices being read, I can see glusterfsd on server gprfs047 approaching 4.5 cores used, while server gprfs048, the other member of the replica pair, is idle.  This can be corrected with "gluster volume set my-vol read-hash-mode 2".
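
For reference, a quick way to watch for a hot qemu-kvm thread on the KVM host, matching the "top with H" check above (a sketch; it assumes a single qemu-kvm process is running):

# top -H -p $(pgrep -f qemu-kvm | head -n 1)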

I'm running rpms from build at

baseurl=http://download.lab.bos.redhat.com/nightly/RHSS-2.1u2-20131027.n.0/2.1u2/RHS/x86_64/os

on client:

[root@gprfc093 ~]# rpm -qa | grep glusterfs
glusterfs-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.35.1u2rhs-1.el6rhs.x86_64

and on server:

[root@gprfc093 ~]# ssh gprfs048 rpm -qa | grep glusterfs
glusterfs-geo-replication-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-3.4.0.35.1u2rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.35.1u2rhs-1.el6rhs.x86_64

Since libgfapi's glfs.h is no longer in the -devel RPM, I have to pull it from the source RPM to compile my test program, which hasn't changed.
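
For reference, a compile line along these lines should work once glfs.h has been pulled out of the source RPM (a sketch; test_gfapi.c is a placeholder for the test source, the -I path is wherever the extracted header lives, and -lgfapi links against the library shipped in glusterfs-api):

# gcc -o test_gfapi test_gfapi.c -I/path/to/extracted/glfs-headers -lgfapi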

Comment 8 Ben England 2013-11-18 21:17:08 UTC
What does comment 7 mean?

Comment 9 Vivek Agarwal 2013-11-19 05:52:05 UTC
That was dropped by mistake; it was already added to the errata a few days back.

Comment 10 Ben Turner 2013-11-21 04:00:40 UTC
I tested this with the u1 RC (3.4.0.44) bits on the gqas server.  I set up two different VMs, one backed by FUSE and one by libgfapi.  In my tests libgfapi outperformed FUSE for single-threaded sequential I/O, and I didn't see any hot threads.  Marking verified.

Comment 11 errata-xmlrpc 2013-11-27 15:37:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1769.html