Red Hat Bugzilla – Bug 861806
some of parallel qemu-img convert processes fail to write output file
Last modified: 2015-05-13 11:41:05 EDT
Description of problem:
When many instances of the qemu-img utility are invoked to clone a VM image to qcow2 format, errors of this form:
qemu-img: Could not open '/mnt/glusterfs/test-gprfc027-9'
are observed after 256 KB of the target image are written. This might be similar to bz 846968.
Workaround: retrying the qemu-img command later succeeds, so this is a transient error condition. But it points to something more serious: a failure of glusterfs under high load. It also limits scalability testing of KVM/RHS.
Version-Release number of selected component (if applicable):
rhsvirt1-6 (an early build of RHS 2.0+).
RHS 2.0 GA was PXE-installed, then the gluster RPMs were upgraded to rhsvirt1-6.
How reproducible:
Every time, with sufficiently many concurrent qemu-img processes.
Steps to Reproduce:
1. configure Gluster volume on 8 servers with following settings
[root@gprfs025 ~]# gluster volume info
Volume Name: kvmfs
Volume ID: f1e65f36-224b-4e93-aa55-8acaaa899b6e
Number of Bricks: 4 x 2 = 8
2. mount gluster volume from 8 separate clients using command:
# mount -t glusterfs -o background-qlen=64 gprfs025-10ge:/kvmfs /mnt/glusterfs
3. execute parallel qemu-img workload for n concurrent processes/client as follows:
for n in 16 8 4 2 1 ; do par-for-all.sh clients.list "bash /tmp/fire-n.sh $n 0.3 /mnt/glusterfs /mnt/glusterfs/virt/rhs-vm.img" | tee virt/parallel-qemu-mtpt-$n.log ; done
where KVM guest master image is /mnt/glusterfs/virt/rhs-vm.img and fire-n.sh script is:
[root@gprfc032-10ge gluster_test]# more fire-n.sh
threads=$1 ; delay=$2 ; dir=$3 ; src=$4   # args as inferred from the invocation above
rm -f $dir/test-`hostname -s`-*
for n in `seq 1 $threads` ; do
    eval "qemu-img convert -f raw -O qcow2 $src $dir/test-`hostname -s`-$n &"
    sleep $delay
done
[root@gprfc032-10ge gluster_test]# grep qemu-img: virt/parallel-qemu-*log
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc018-12'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc019-13'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc019-14'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc020-10'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc020-12'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc021-14'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc027-9'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc027-11'
It appears that the first client to launch its qemu-img processes, gprfc017, completes them 2 or 3 times faster than the other clients, and its processes do not seem to hit the errors that the others do, suggesting that the first client was given a disproportionate share of resources.
Since gluster is supposed to be scalable, we should not get an error like this no matter how many qemu-img processes we run. It's one thing to have some performance issues with this workload; it's another thing to have it fail outright.
The network is 10-GbE with MTU=9000 jumbo frames. Storage is a RAID6 volume: /dev/sdb is partitioned in half, a multipath device is used in place of /dev/sdb, the first partition is made into an LVM PV, and an LVM LV is allocated from it.
-- gprfs025 --
/dev/mapper/vg_brick0-lv on /mnt/brick0 type xfs (rw,nobarrier,context="system_u:object_r:usr_t:s0",user_xattr)
> # mount -t glusterfs -o background-qlen=64 gprfs025-10ge:/kvmfs /mnt/glusterfs
background-qlen is 64 by default, can you check if making it 128 helps?
Created attachment 619926
script that runs a command on specified list of hosts in parallel
this script is used to fire up qemu-img processes on all 8 clients in parallel.
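For reference, a minimal sketch of what such a parallel launcher might look like (the function name and ssh options here are my guesses, not necessarily what the attached script does):

```shell
# par_for_all HOSTLIST CMD...: run CMD on every host named in the file
# HOSTLIST, in parallel over ssh, then wait for all of them to finish
# (hypothetical sketch of the attached par-for-all.sh)
par_for_all() {
    local hostlist=$1 h ; shift
    while read -r h ; do
        ssh -o BatchMode=yes "$h" "$@" &    # one background ssh per host
    done < "$hostlist"
    wait                                    # block until every host is done
}
```

e.g. par_for_all clients.list "bash /tmp/fire-n.sh 16 0.3 /mnt/glusterfs /mnt/glusterfs/virt/rhs-vm.img"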
Here are two traces of a qemu-img process failing under heavy gluster load; the traces were generated outside the above workload using:
rm -fv /mnt/glusterfs/junk.tmp* ; strace -ttT -f qemu-img convert -f raw -O qcow2 /mnt/glusterfs/virt/rhs-vm.img /mnt/glusterfs/junk.img8 2>&1 | tee r2.log
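One quick way to narrow such a trace down is to pull out only the syscalls that returned an error (a generic strace-filtering sketch; the helper name is made up):

```shell
# strace_errors LOG: print the syscalls in an strace log that returned -1
# with an errno, i.e. the failing calls (the open() that fails should
# show up here along with its errno)
strace_errors() {
    grep -E '= -1 E[A-Z]+' "$1" | tail -n 20
}
```

e.g. strace_errors r2.log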
Maybe someone with KVM expertise can help explain what happened inside qemu-img.
I tried both background-qlen=16 and background-qlen=256, neither one helped.
If we could understand what qemu-img was doing with the filesystem when it saw an error, we could at least come up with an easier reproducer that would help us isolate the problem.
I'll just have my qemu-img scripts retry when the failure occurs as a workaround.
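The retry wrapper is along these lines (a sketch; the real scripts may differ, and the names MAX_TRIES/RETRY_DELAY are made up):

```shell
# retry CMD...: re-run CMD until it succeeds, up to MAX_TRIES attempts,
# sleeping RETRY_DELAY seconds between attempts
retry() {
    local tries=${MAX_TRIES:-10} delay=${RETRY_DELAY:-20} rc=1 i
    for i in `seq 1 $tries` ; do
        "$@" && return 0        # success: stop retrying
        rc=$?                   # remember the failure code
        sleep "$delay"          # transient error: back off, then try again
    done
    return $rc                  # still failing after all attempts
}
```

e.g. retry qemu-img convert -f raw -O qcow2 $src $dir/test-`hostname -s`-$n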
Even with 10 retries and a 20-second delay between retries, I still see some failures when using 32 clients with 1 qemu-img per client, so even with the workaround the throughput is still pretty bad. I think this is a scalability issue, perhaps because with qcow2 all of the read pressure lands on just one or two servers. I'll see whether this happens with raw format or with Gluster/NFS.
I was using qemu-img wrong: I was not creating an image backed by the master image, but copying from the master image instead. When you do this right, the cloned VM image is only 256 KB, and it grows after you boot it, as the VM writes data to its disk image.
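For the record, the backing-file style of cloning looks roughly like this (a sketch; the helper name is mine, and -F/backing-format handling varies across qemu-img versions):

```shell
# make_clone MASTER CLONE: create CLONE as a qcow2 overlay backed by the
# raw MASTER image, instead of copying the whole image with "convert";
# -F declares the backing file's format explicitly (hypothetical helper)
make_clone() {
    qemu-img create -f qcow2 -b "$1" -F raw "$2"
}
```

e.g. make_clone /mnt/glusterfs/virt/rhs-vm.img /mnt/glusterfs/test-`hostname -s`-1 -- the clone starts out tiny (just qcow2 metadata) and only grows as the guest writes.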