Bug 861806

Summary:

Some parallel qemu-img convert processes fail to write their output file

Product:

[Red Hat Storage] Red Hat Gluster Storage

Reporter:

Ben England <bengland>

Component:

glusterfs

Assignee:

Amar Tumballi <amarts>

Status:

CLOSED NOTABUG

QA Contact:

Sudhir D <sdharane>

Severity:

high

Priority:

high

Version:

2.0

CC:

perfbz, rhs-bugs, vbellur, vraman

Hardware:

x86_64

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2012-10-09 17:55:12 UTC

Type:

Bug

Attachments:

Description	Flags
script that runs a command on specified list of hosts in parallel	none

Description Ben England 2012-09-30 20:52:50 UTC

Description of problem:

When many instances of the qemu-img utility are invoked to clone a VM image to qcow2 format, errors of this form are observed after 256 KB of the target image has been written:

qemu-img: Could not open '/mnt/glusterfs/test-gprfc027-9'

This might be similar to bz 846968.

Workaround: retrying the qemu-img command later succeeds, so this is a transient error condition. But it points to something more serious: a failure of glusterfs under high load. It also limits scalability testing of KVM/RHS.

Version-Release number of selected component (if applicable):

client machines:
rhsvirt1-6 (early version of RHS 2.0+).
RHEL6.3

server machines:
RHS 2.0 GA, PXE-installed, then gluster RPMs upgraded to rhsvirt1-6

How reproducible:

Every time, given enough concurrent qemu-img processes.


Steps to Reproduce:

1. Configure a Gluster volume on 8 servers with the following settings:

[root@gprfs025 ~]# gluster volume info
 
Volume Name: kvmfs
Type: Distributed-Replicate
Volume ID: f1e65f36-224b-4e93-aa55-8acaaa899b6e
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: gprfs025-10ge:/mnt/brick0
Brick2: gprfs026-10ge:/mnt/brick0
Brick3: gprfs027-10ge:/mnt/brick0
Brick4: gprfs028-10ge:/mnt/brick0
Brick5: gprfs029-10ge:/mnt/brick0
Brick6: gprfs030-10ge:/mnt/brick0
Brick7: gprfs015-10ge:/mnt/brick0
Brick8: gprfs032-10ge:/mnt/brick0
Options Reconfigured:
performance.write-behind-window-size: 1048576
performance.quick-read: off
performance.io-cache: off
performance.stat-prefetch: off
performance.read-ahead: off
performance.write-behind: on
cluster.eager-lock: on

2. Mount the Gluster volume from 8 separate clients using the command:

# mount -t glusterfs -o background-qlen=64 gprfs025-10ge:/kvmfs /mnt/glusterfs

3. Execute the parallel qemu-img workload with n concurrent processes per client as follows:

for n in 16 8 4 2 1 ; do par-for-all.sh clients.list "bash /tmp/fire-n.sh $n 0.3 /mnt/glusterfs /mnt/glusterfs/virt/rhs-vm.img" | tee virt/parallel-qemu-mtpt-$n.log  ; done

where the KVM guest master image is /mnt/glusterfs/virt/rhs-vm.img and the fire-n.sh script is:

[root@gprfc032-10ge gluster_test]# more fire-n.sh 
threads=$1
delay=$2
dir=$3
src=$4
rm -f $dir/test-`hostname -s`-*
for n in `seq 1 $threads` ; do 
  eval "qemu-img convert -f raw -O qcow2 $src $dir/test-`hostname -s`-$n &" 
  pace.py $delay
done
time wait
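
Note that a bare `wait`, as used in the script above, discards each background job's exit status, so failures only show up as messages in the log. A minimal sketch of one way to count failed jobs per client follows. run_parallel and the job-function convention are hypothetical names introduced here for illustration; they are not part of the original fire-n.sh.

```shell
#!/bin/bash
# run_parallel N JOB: start "JOB 1" .. "JOB N" in the background,
# wait for each one individually, and print how many exited non-zero.
# (Hypothetical helper, not part of the original fire-n.sh.)
run_parallel() {
  local count=$1 job=$2
  local pids=() failed=0 i pid
  for ((i = 1; i <= count; i++)); do
    "$job" "$i" &          # launch job number i in the background
    pids+=("$!")           # remember its PID so its status can be collected
  done
  for pid in "${pids[@]}"; do
    # "wait PID" returns that job's exit status, unlike a bare "wait"
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$failed"
}

# Example job (assumes src and dir are set as in fire-n.sh):
# convert_one() { qemu-img convert -f raw -O qcow2 "$src" "$dir/test-$(hostname -s)-$1"; }
# echo "failed jobs: $(run_parallel 16 convert_one)"
```

The pacing delay between launches is left out here to keep the sketch short.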


Actual results:

[root@gprfc032-10ge gluster_test]# grep qemu-img: virt/parallel-qemu-*log

virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc018-12'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc019-13'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc019-14'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc020-10'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc020-12'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc021-14'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc027-9'
virt/parallel-qemu-mtpt-16.log:qemu-img: Could not open '/mnt/glusterfs/test-gprfc027-11'

It appears that the first client to launch the qemu-img processes, gprfc017, completes its qemu-img processes 2 or 3 times faster than the other clients, and its processes do not seem to get the errors that the others do, suggesting that excessive resources were given to the first client.

Expected results:

Since gluster is supposed to be scalable, we should not get an error like this no matter how many qemu-img processes we run. It's one thing to have some performance issues with this workload; it's another to have it fail outright.


Additional info:

Network is 10-GbE with jumbo frames (MTU=9000). Storage is a RAID6 volume: /dev/sdb is partitioned in half (a multipath device is used instead of /dev/sdb), the first partition is made into an LVM PV, and an LVM LV is allocated from it.
 -- gprfs025 -- 
/dev/mapper/vg_brick0-lv on /mnt/brick0 type xfs (rw,nobarrier,context="system_u:object_r:usr_t:s0",user_xattr)

Comment 2 Amar Tumballi 2012-10-01 03:16:35 UTC

> # mount -t glusterfs -o background-qlen=64 gprfs025-10ge:/kvmfs /mnt/glusterfs

background-qlen is 64 by default. Can you check whether setting it to 128 helps?

Comment 3 Ben England 2012-10-01 18:02:30 UTC

Created attachment 619926
script that runs a command on specified list of hosts in parallel

This script is used to fire up qemu-img processes on all 8 clients in parallel.

Comment 4 Ben England 2012-10-01 18:05:43 UTC

Here are two traces of a qemu-img process failing under heavy gluster load. The command used to generate the traces was run outside the above workload:

rm -fv /mnt/glusterfs/junk.tmp* ; strace -ttT -f qemu-img convert -f raw -O qcow2 /mnt/glusterfs/virt/rhs-vm.img /mnt/glusterfs/junk.img8 2>&1 | tee r2.log

http://perf1.lab.bos.redhat.com/bengland/laptop/matte/virt/qemu-img-fail1.log
http://perf1.lab.bos.redhat.com/bengland/laptop/matte/virt/qemu-img-fail2.log

Maybe someone with KVM expertise can help explain what happened inside qemu-img.

Comment 5 Ben England 2012-10-01 18:51:50 UTC

I tried both background-qlen=16 and background-qlen=256; neither helped.

If we could understand what qemu-img was doing with the filesystem when it saw an error, we could at least come up with an easier reproducer that would help us isolate the problem.

I'll just have my qemu-img scripts retry when the failure occurs as a workaround.
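
A minimal sketch of such a retry wrapper, for reference. The function name retry and its argument order are hypothetical, and the retry count and delay in the usage comment are illustrative values only.

```shell
#!/bin/bash
# retry ATTEMPTS DELAY CMD [ARGS...]
# Run CMD until it succeeds, at most ATTEMPTS times, sleeping DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
# (Hypothetical helper; not taken from the attached scripts.)
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    if ((i < attempts)); then
      sleep "$delay"       # no sleep after the final attempt
    fi
  done
  return 1
}

# Example (assumes src, dir and n are set as in fire-n.sh):
# retry 5 10 qemu-img convert -f raw -O qcow2 "$src" "$dir/test-$(hostname -s)-$n"
```

A wrapper like this only hides the transient failure; it does not address whatever causes the open to fail under load.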

Comment 6 Ben England 2012-10-08 14:16:51 UTC

Even with 10 retries and a 20-second delay between retries, I still see some failures when using 32 clients with 1 qemu-img per client. As a result, even with the workaround the throughput is still pretty bad. I think this is a scalability issue, perhaps because with qcow2 all of the read pressure is on one or two servers. I will see whether this happens with raw format or with Gluster/NFS.

Comment 7 Ben England 2012-10-09 17:55:12 UTC

I was using qemu-img incorrectly: instead of creating an image backed by the master image, I was copying the master image. When done correctly, the cloned VM image is only 256 KB at first, and it grows after boot as the VM writes data to the disk image.
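
For reference, a backed (copy-on-write) clone is normally made with `qemu-img create` and a backing file rather than with `qemu-img convert`. A minimal sketch, assuming the raw master image from the steps above: clone_image and DRY_RUN are hypothetical names introduced here, and whether `-F` (backing file format) is required or accepted depends on the qemu-img version in use.

```shell
#!/bin/bash
# clone_image MASTER CLONE
# Create a qcow2 image at CLONE that is backed by the raw image MASTER,
# so the clone initially stores only metadata and later writes.
# With DRY_RUN set, print the command instead of running it.
# (Hypothetical helper; the exact command used by the reporter is not in this bug.)
clone_image() {
  local master=$1 clone=$2
  local cmd=(qemu-img create -f qcow2 -b "$master" -F raw "$clone")
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Example:
# clone_image /mnt/glusterfs/virt/rhs-vm.img "/mnt/glusterfs/test-$(hostname -s)-1"
```

Because each clone reads unmodified blocks from the shared master image, the master must not be modified while clones that depend on it are in use.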