Red Hat Bugzilla – Bug 861806
some of parallel qemu-img convert processes fail to write output file
Last modified: 2015-05-13 11:41:05 EDT
Description of problem:
When many instances of the qemu-img utility are invoked to clone a VM image to qcow2 format, errors of this form:
qemu-img: Could not open '/mnt/glusterfs/test-gprfc027-9'
are observed after 256 KB of the target image are written. This might be similar to bz 846968.
Workaround: retrying the qemu-img command later succeeds, so this is a transient error condition. But it points to something more serious: a failure of glusterfs under high load. It also limits scalability testing of KVM/RHS.
Version-Release number of selected component (if applicable):
rhsvirt1-6 (an early build of RHS 2.0+).
RHS 2.0 GA was PXE-installed, then the gluster RPMs were upgraded to rhsvirt1-6.
How reproducible:
Every time, with sufficiently many concurrent qemu-img processes.
Steps to Reproduce:
1. configure Gluster volume on 8 servers with following settings
[root@gprfs025 ~]# gluster volume info
Volume Name: kvmfs
Volume ID: f1e65f36-224b-4e93-aa55-8acaaa899b6e
Number of Bricks: 4 x 2 = 8
2. mount gluster volume from 8 separate clients using command:
# mount -t glusterfs -o background-qlen=64 gprfs025-10ge:/kvmfs /mnt/glusterfs
3. execute parallel qemu-img workload for n concurrent processes/client as follows:
for n in 16 8 4 2 1 ; do par-for-all.sh clients.list "bash /tmp/fire-n.sh $n 0.3 /mnt/glusterfs /mnt/glusterfs/virt/rhs-vm.img" | tee virt/parallel-qemu-mtpt-$n.log ; done
where KVM guest master image is /mnt/glusterfs/virt/rhs-vm.img and fire-n.sh script is:
[root@gprfc032-10ge gluster_test]# more fire-n.sh
threads=$1 ; delay=$2 ; dir=$3 ; src=$4   # args as inferred from the invocation above
rm -f $dir/test-`hostname -s`-*
for n in `seq 1 $threads` ; do
    eval "qemu-img convert -f raw -O qcow2 $src $dir/test-`hostname -s`-$n &"
    sleep $delay
done
[root@gprfc032-10ge gluster_test]# grep qemu-img: virt/parallel-qemu-*log
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc018-12'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc019-13'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc019-14'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc020-10'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc020-12'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc021-14'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc027-9'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc027-11'
It appears that the first client to launch its qemu-img processes, gprfc017, completes them 2 or 3 times faster than the other clients, and its processes do not seem to hit the errors that the others do, suggesting that the first client was given a disproportionate share of resources.
Since gluster is supposed to be scalable, we should not get an error like this no matter how many qemu-img processes we run. It's one thing to have some performance issues with this workload; it's another thing to have it fail outright.
The network is 10-GbE with MTU=9000 jumbo frames. Storage is a RAID6 volume: /dev/sdb is partitioned in half, a multipath device is used in place of /dev/sdb, the first partition is made into an LVM PV, and an LVM LV is allocated from it.
-- gprfs025 --
/dev/mapper/vg_brick0-lv on /mnt/brick0 type xfs (rw,nobarrier,context="system_u:object_r:usr_t:s0",user_xattr)
> # mount -t glusterfs -o background-qlen=64 gprfs025-10ge:/kvmfs /mnt/glusterfs
background-qlen is 64 by default, can you check if making it 128 helps?
Created attachment 619926
script that runs a command on specified list of hosts in parallel
this script is used to fire up qemu-img processes on all 8 clients in parallel.
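For reference, a minimal sketch of what such a parallel launcher might look like (the function name and ssh options here are my guesses, not necessarily what the attached script does):

```shell
# par_for_all HOSTLIST CMD...: run CMD on every host named in the file
# HOSTLIST, in parallel over ssh, then wait for all of them to finish
# (hypothetical sketch of the attached par-for-all.sh)
par_for_all() {
    local hostlist=$1 h ; shift
    while read -r h ; do
        ssh -o BatchMode=yes "$h" "$@" &    # one background ssh per host
    done < "$hostlist"
    wait                                    # block until every host is done
}
```

e.g. par_for_all clients.list "bash /tmp/fire-n.sh 16 0.3 /mnt/glusterfs /mnt/glusterfs/virt/rhs-vm.img"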
Here are two traces of a qemu-img process failing under heavy gluster load; the traces were generated outside the above workload using:
rm -fv /mnt/glusterfs/junk.tmp* ; strace -ttT -f qemu-img convert -f raw -O qcow2 /mnt/glusterfs/virt/rhs-vm.img /mnt/glusterfs/junk.img8 2>&1 | tee r2.log
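One quick way to narrow such a trace down is to pull out only the syscalls that returned an error (a generic strace-filtering sketch; the helper name is made up):

```shell
# strace_errors LOG: print the syscalls in an strace log that returned -1
# with an errno, i.e. the failing calls (the open() that fails should
# show up here along with its errno)
strace_errors() {
    grep -E '= -1 E[A-Z]+' "$1" | tail -n 20
}
```

e.g. strace_errors r2.log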
Maybe someone with KVM expertise can help explain what happened inside qemu-img.
I tried both background-qlen=16 and background-qlen=256, neither one helped.
If we could understand what qemu-img was doing with the filesystem when it saw an error, we could at least come up with an easier reproducer that would help us isolate the problem.
I'll just have my qemu-img scripts retry when the failure occurs as a workaround.
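The retry wrapper is along these lines (a sketch; the real scripts may differ, and the names MAX_TRIES/RETRY_DELAY are made up):

```shell
# retry CMD...: re-run CMD until it succeeds, up to MAX_TRIES attempts,
# sleeping RETRY_DELAY seconds between attempts
retry() {
    local tries=${MAX_TRIES:-10} delay=${RETRY_DELAY:-20} rc=1 i
    for i in `seq 1 $tries` ; do
        "$@" && return 0        # success: stop retrying
        rc=$?                   # remember the failure code
        sleep "$delay"          # transient error: back off, then try again
    done
    return $rc                  # still failing after all attempts
}
```

e.g. retry qemu-img convert -f raw -O qcow2 $src $dir/test-`hostname -s`-$n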
Even with 10 retries and a 20-second delay between retries, I still see some failures when using 32 clients with 1 qemu-img per client, so even with the workaround the throughput is still pretty bad. I think this is a scalability issue, perhaps because with qcow2 all of the read pressure lands on just one or two servers. I'll see whether this happens with raw format or with Gluster/NFS.
I was using qemu-img wrong: I was not creating an image backed by the master image, but copying from the master image instead. When you do this right, the cloned VM image is only 256 KB, and it grows after you boot it, as the VM writes data to its disk image.
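For the record, the backing-file style of cloning looks roughly like this (a sketch; the helper name is mine, and -F/backing-format handling varies across qemu-img versions):

```shell
# make_clone MASTER CLONE: create CLONE as a qcow2 overlay backed by the
# raw MASTER image, instead of copying the whole image with "convert";
# -F declares the backing file's format explicitly (hypothetical helper)
make_clone() {
    qemu-img create -f qcow2 -b "$1" -F raw "$2"
}
```

e.g. make_clone /mnt/glusterfs/virt/rhs-vm.img /mnt/glusterfs/test-`hostname -s`-1 -- the clone starts out tiny (just qcow2 metadata) and only grows as the guest writes.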