Bug 1535914

Summary:	Disabling I/O throttling for one member disk of a throttle group during I/O causes I/O on the other member disk to hang
Product:	Red Hat Enterprise Linux 7	Reporter:	Gu Nini <ngu>
Component:	qemu-kvm-rhev	Assignee:	Stefan Hajnoczi <stefanha>
Status:	CLOSED ERRATA	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.5	CC:	jen, knoel, michen, mrezanin, qzhang, stefanha, virt-maint, yhong
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	qemu-kvm-rhev-2.12.0-8.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-11-01 11:04:08 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Gu Nini 2018-01-18 09:15:37 UTC

Description of problem:


Version-Release number of selected component (if applicable):
Host kernel: 3.10.0-826.el7.x86_64
Qemu-kvm-rhev: qemu-kvm-rhev-2.10.0-17.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a guest with 2 I/O-throttled data disks in the same throttle group:

/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -vga std \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado1,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -device nec-usb-xhci,id=usbtest \
    -device virtio-scsi-pci,id=virtio_scsi_pci0 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/rhel75-64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/hd1,iops=100,group=foo \
    -device virtio-blk-pci,id=image2,drive=drive_image2 \
    -drive id=drive_image3,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/hd2,iops=10,group=foo \
    -device virtio-blk-pci,id=image3,drive=drive_image3 \
    -device usb-tablet,id=tablet1 \
    -device virtio-net-pci,mac=9a:2e:2f:30:31:23,id=idIQCTOk,netdev=iduYytLv \
    -netdev tap,id=iduYytLv,vhost=on \
    -m 2048  \
    -smp 2,maxcpus=2,cores=2,threads=1,sockets=1 \
    -vnc :0 \
    -monitor stdio

2. Run fio against each of the 2 data disks in a while loop.
3. Disable I/O throttling for the 1st disk via QMP:

# { "execute": "block_set_io_throttle", "arguments": { "device": "drive_image2","bps": 0,"bps_rd": 0,"bps_wr": 0,"iops": 0,"iops_rd": 0,"iops_wr": 0} }
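For anyone scripting step 3, here is a minimal Python sketch of the same QMP exchange. It assumes the QMP Unix socket from the command line above (/var/tmp/avocado1); the helper names build_disable_throttle and qmp_send are made up for illustration and are not part of QEMU. QMP sends a greeting on connect and expects qmp_capabilities before any other command, so the sketch does that handshake first.

```python
import json
import socket


def build_disable_throttle(device):
    """Build a block_set_io_throttle command that sets every limit to 0."""
    limits = ("bps", "bps_rd", "bps_wr", "iops", "iops_rd", "iops_wr")
    arguments = {"device": device}
    arguments.update({name: 0 for name in limits})
    return {"execute": "block_set_io_throttle", "arguments": arguments}


def qmp_send(sock_path, command):
    """Send one command over a QMP Unix socket and return its reply.

    Hypothetical helper: it reads the greeting, negotiates capabilities,
    then sends the command, skipping any asynchronous event messages.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        reader = sock.makefile("r", encoding="utf-8")
        json.loads(reader.readline())  # QMP greeting banner
        reply = None
        for msg in ({"execute": "qmp_capabilities"}, command):
            sock.sendall((json.dumps(msg) + "\r\n").encode("utf-8"))
            reply = json.loads(reader.readline())
            while "event" in reply:  # events may arrive before the reply
                reply = json.loads(reader.readline())
        return reply


if __name__ == "__main__":
    # Only prints the command; call qmp_send("/var/tmp/avocado1", cmd)
    # on a host where the guest from step 1 is actually running.
    cmd = build_disable_throttle("drive_image2")
    print(json.dumps(cmd))
```

The command is built separately from the socket code so it can be inspected or pasted into a QMP monitor without a running guest.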


Actual results:
I/O on the 2nd disk hangs as shown below. It recovers only after I/O throttling is also disabled for the 2nd disk, in the same way as for the 1st disk in step 3.

fio-2.1.10
Starting 1 process
^Cbs: 1 (f=1): [R] [1.1% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 02h:19m:40s]
fio: terminating on signal 2
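
The hang shows up as a status line stuck at [0/0/0 iops]. A rough Python sketch for flagging such lines when fio output is captured to a log follows. The function names are hypothetical, and the pattern only covers the plain-integer status format shown in this report (fio-2.1.10); a single zero sample could also be a quiet interval, so treat a match as a hint rather than proof of a hang.

```python
import re

# Matches the "[read/write/trim iops]" field of a fio status line.
# Assumes plain integers, as in the fio-2.1.10 output in this report.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)/(\d+) iops\]")


def parse_iops(status_line):
    """Return (read, write, trim) iops from a status line, or None."""
    match = STATUS_RE.search(status_line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def looks_stalled(status_line):
    """True if the line reports zero iops in every direction."""
    iops = parse_iops(status_line)
    return iops is not None and sum(iops) == 0


if __name__ == "__main__":
    line = "Jobs: 1 (f=1): [m] [1.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 05h:20m:28s]"
    print(looks_stalled(line))
```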


Expected results:
I/O on the 2nd disk does not hang after throttling is disabled for the 1st disk.

Additional info:

Comment 4 Stefan Hajnoczi 2018-06-11 09:37:25 UTC

Reproduces with qemu-kvm-rhev-2.10.0-21.el7_5.3.

Works for me with qemu-kvm-rhev-2.12.0-3.el7.

Therefore this is RHEL 7.5-only.  Customers will get the fix when they upgrade to 7.6.  The z-stream flag has been dropped so there is nothing we need to do here.

Please verify that RHEL 7.6 is fixed.  Thanks!

Comment 7 cliao 2018-06-12 09:10:49 UTC

Reproduces with qemu-kvm-rhev-2.12.0-3.el7.

version:
qemu-kvm-rhev: qemu-kvm-rhev-2.12.0-3.el7
host kernel: kernel-3.10.0-897.el7.x86_64
guest kernel: kernel-3.10.0-901.el7.x86_64

Comment 8 cliao 2018-06-19 05:21:44 UTC

(In reply to cliao from comment #7)
> Reproduces with qemu-kvm-rhev-2.12.0-3.el7.
> 
> version:
> qemu-kvm-rhev: qemu-kvm-rhev-2.12.0-3.el7
> host kernel: kernel-3.10.0-897.el7.x86_64
> guest kernel: kernel-3.10.0-901.el7.x86_64

I reproduced this problem with qemu-kvm-rhev-2.12.0-3.el7.

After disabling I/O throttling for the 1st disk, the 2nd disk hangs:

fio --filename=/dev/vdb --direct=1 --rw=randrw --bs=256k --size=1000M --name=test --iodepth=1 --runtime=180
test: (g=0): rw=randrw, bs=256K-256K/256K-256K/256K-256K, ioengine=sync, iodepth=1
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [m] [1.3% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 05h:20m:28s]
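
The report says fio was run in a while loop but does not give the loop itself. The Python sketch below is one way to drive the fio invocation above repeatedly; the loop shape, the iteration count, and the helper names are assumptions, while the fio options are copied from the command line in this comment. Note that it writes directly to the block device, so it should only be pointed at scratch disks.

```python
import subprocess


def fio_argv(device, runtime_s=180, fio_bin="fio"):
    """Build the fio argument list used in this report for one device."""
    return [
        fio_bin,
        f"--filename={device}",
        "--direct=1",
        "--rw=randrw",
        "--bs=256k",
        "--size=1000M",
        "--name=test",
        "--iodepth=1",
        f"--runtime={runtime_s}",
    ]


def run_fio_loop(device, iterations):
    """Run fio back to back; raises CalledProcessError if a run fails.

    WARNING: randrw on a raw device destroys its contents.
    """
    for _ in range(iterations):
        subprocess.run(fio_argv(device), check=True)


if __name__ == "__main__":
    # Print the command only; call run_fio_loop("/dev/vdb", n) inside
    # the guest to actually generate load.
    print(" ".join(fio_argv("/dev/vdb")))
```

Running one such loop per data disk (for example in two guest shells) keeps I/O in flight on both group members while the QMP command is issued.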

Comment 10 Stefan Hajnoczi 2018-07-05 09:49:41 UTC

Thanks for catching the bug with qemu-kvm-rhev-2.12.0-3.el7.  I have confirmed that it also reproduces with qemu.git/master.  It's non-deterministic so I missed it the first time.

A patch has been sent upstream:
https://patchwork.ozlabs.org/patch/939417/

Comment 13 Miroslav Rezanina 2018-07-24 14:13:21 UTC

Fix included in qemu-kvm-rhev-2.12.0-8.el7

Comment 15 cliao 2018-07-25 05:22:37 UTC

It works with qemu-kvm-rhev-2.12.0-8.el7.

1. Boot the guest:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -vga std \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado1,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -device nec-usb-xhci,id=usbtest \
    -device virtio-scsi-pci,id=virtio_scsi_pci0 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/nfsdir/rhel76-64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/disk.qcow2,iops=100,group=foo \
    -device virtio-blk-pci,id=image2,drive=drive_image2 \
    -drive id=drive_image3,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/disk1.qcow2,iops=10,group=foo \
    -device virtio-blk-pci,id=image3,drive=drive_image3 \
    -device usb-tablet,id=tablet1 \
    -device virtio-net-pci,mac=9a:2e:2f:30:31:23,id=idIQCTOk,netdev=iduYytLv \
    -netdev tap,id=iduYytLv,vhost=on \
    -m 2048  \
    -smp 2,maxcpus=2,cores=2,threads=1,sockets=1 \
    -vnc :0 \
    -monitor stdio -qmp tcp:0:4444,server,nowait

2. Run fio, then disable I/O throttling for the 1st disk via QMP:
{ "execute": "block_set_io_throttle", "arguments": { "device": "drive_image2","bps": 0,"bps_rd": 0,"bps_wr": 0,"iops": 0,"iops_rd": 0,"iops_wr": 0} }

3. fio results:
./fio --filename=/dev/vdb --direct=1 --rw=randrw --bs=256k --size=1000M --name=test --iodepth=1 --runtime=180
test: (g=0): rw=randrw, bs=256K-256K/256K-256K/256K-256K, ioengine=sync, iodepth=1
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [m] [100.0% done] [767KB/1790KB/0KB /s] [2/6/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2492: Wed Jul 25 13:16:43 2018
  read : io=211712KB, bw=1176.2KB/s, iops=4, runt=180001msec
 .......
  write: io=195584KB, bw=1086.6KB/s, iops=4, runt=180001msec
 ......

Run status group 0 (all jobs):
   READ: io=211712KB, aggrb=1176KB/s, minb=1176KB/s, maxb=1176KB/s, mint=180001msec, maxt=180001msec
  WRITE: io=195584KB, aggrb=1086KB/s, minb=1086KB/s, maxb=1086KB/s, mint=180001msec, maxt=180001msec

Disk stats (read/write):
  vdb: ios=869/763, merge=0/0, ticks=120097/85731, in_queue=205860, util=100.00%

./fio --filename=/dev/vda --direct=1 --rw=randrw --bs=256k --size=1000M --name=test --iodepth=1 --runtime=180
test: (g=0): rw=randrw, bs=256K-256K/256K-256K/256K-256K, ioengine=sync, iodepth=1
fio-2.1.10
Starting 1 process
Jobs: 1 (f=1): [m] [100.0% done] [29922KB/30177KB/0KB /s] [116/117/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=2488: Wed Jul 25 13:14:33 2018
  read : io=522240KB, bw=10203KB/s, iops=39, runt= 51187msec
  ......
  write: io=501760KB, bw=9802.5KB/s, iops=38, runt= 51187msec
  ......

Run status group 0 (all jobs):
   READ: io=522240KB, aggrb=10202KB/s, minb=10202KB/s, maxb=10202KB/s, mint=51187msec, maxt=51187msec
  WRITE: io=501760KB, aggrb=9802KB/s, minb=9802KB/s, maxb=9802KB/s, mint=51187msec, maxt=51187msec

Disk stats (read/write):
  vda: ios=2047/1933, merge=0/0, ticks=44575/29206, in_queue=73781, util=99.43%
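
As a sanity check on these numbers, the average request rate can be derived from the "Disk stats" line and the runtime. The Python sketch below does that arithmetic; the helper name is hypothetical. It assumes that vdb is the disk still throttled with iops=10 (drive_image3) and that guest-visible request counts roughly track the requests the host throttles, neither of which is stated in this report.

```python
import re

# Matches the "ios=reads/writes" field of a fio "Disk stats" line.
DISK_RE = re.compile(r"ios=(\d+)/(\d+)")


def average_iops(disk_stats_line, runtime_ms):
    """Average completed requests per second over the whole run."""
    match = DISK_RE.search(disk_stats_line)
    if match is None:
        raise ValueError("no ios=R/W field found")
    reads, writes = int(match.group(1)), int(match.group(2))
    return (reads + writes) / (runtime_ms / 1000.0)


if __name__ == "__main__":
    vdb = average_iops("vdb: ios=869/763, merge=0/0", 180001)
    vda = average_iops("vda: ios=2047/1933, merge=0/0", 51187)
    print(f"vdb: {vdb:.2f} iops, vda: {vda:.2f} iops")
```

Under those assumptions, vdb averages a little over 9 requests per second, below the configured limit of 10, while vda averages far more. That matches the expected outcome: the remaining group member keeps making progress at its throttled rate instead of hanging.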

Comment 17 errata-xmlrpc 2018-11-01 11:04:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443