Bug 1492295

Summary: Guest hits a call trace with I/O throttling (iops) after a stop-to-cont status change during I/O testing
Product: Red Hat Enterprise Linux 7
Reporter: Yongxue Hong <yhong>
Component: qemu-kvm-rhev
Assignee: Stefan Hajnoczi <stefanha>
Status: CLOSED ERRATA
QA Contact: Gu Nini <ngu>
Severity: medium
Docs Contact:
Priority: medium
Version: 7.4-Alt
CC: chayang, coli, juzhang, knoel, lmiksik, michen, ngu, qzhang, rbalakri, virt-maint
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.10.0-8.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-11 00:36:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Yongxue Hong 2017-09-16 08:22:01 UTC
Description of problem:
The guest hits a call trace with I/O throttling (iops) after the status changes from stop to cont during I/O testing on (DD2) POWER9.

Version-Release number of selected component (if applicable):
Host: 4.11.0-32.el7a.ppc64le
Guest: 4.11.0-32.el7a.ppc64le
Qemu-kvm: qemu-kvm-2.9.0-22.el7a.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. Boot a guest with I/O throttling, e.g.:
[root@virt4 command]# cat rhel-alt74-8381-bug.sh
/usr/libexec/qemu-kvm \
-name 'yhong-rhel-alt-74-8381' \
-machine pseries-rhel7.4.0alt \
-m 4G \
-nodefaults \
-smp 4,cores=4,threads=1,sockets=1 \
-boot order=cdn,once=d,menu=on,strict=on  \
-device nec-usb-xhci,id=xhci \
-device usb-tablet,id=usb-tablet0 \
-device usb-kbd,id=usb-kbd0 \
-enable-kvm \
-drive file=/home/yhong/image/RHEL-ALT-7.4-20170830.0-throttling-virtio-blk-20G.raw,format=raw,if=none,cache=none,media=disk,werror=stop,rerror=stop,id=drive-0 \
-device virtio-blk-pci,bus=pci.0,addr=0x04,drive=drive-0,id=virtio-blk-0,bootindex=0 \
-drive file=/home/yhong/image/data-disk1-11G.raw,format=raw,if=none,cache=none,media=disk,werror=stop,rerror=stop,id=drive-1,iops=100 \
-device virtio-blk-pci,bus=pci.0,addr=0x05,drive=drive-1,id=virtio-blk-1 \
-netdev tap,id=hostnet0,script=/etc/qemu-ifup \
-device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=40:f2:e9:5d:9c:90 \
-qmp tcp:0:4000,server,nowait \
-chardev socket,id=serial_id_serial,path=/var/tmp/serial-yhong-throttling-virtio-blk-0,server,nowait \
-device spapr-vty,reg=0x30000000,chardev=serial_id_serial \
-monitor stdio

2. Run dd tests against the data disk, e.g.:
[root@localhost home]# cat debug-auto-dd.sh
#!/bin/bash
for BS in 512b 1K 4K 8K 16K
do
		echo "dd if=/dev/zero of=/dev/vdb bs=$BS count=1000 oflag=direct status=progress" 
		dd if=/dev/zero of=/dev/vdb bs=$BS count=1000 oflag=direct status=progress
done
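Since the throttle is an iops limit, it caps the number of requests per second rather than bytes. A minimal sketch (not part of the original report) of the expected minimum duration of each dd pass under `iops=100`, ignoring any burst allowance:

```shell
# Each dd pass issues `count` direct-I/O writes, i.e. `count` requests,
# so with a request-rate cap the pass takes at least count/iops seconds
# regardless of block size.
count=1000
iops=100
echo "expected minimum duration per dd pass: $((count / iops))s"
```

This is why each iteration of the loop above should take roughly 10 seconds or more while the throttle is active.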

3. Stop the guest:
{"execute": "stop"}

4. Continue the guest after a while (60 s to 120 s):
{"execute": "cont"}
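The stop/cont sequence above can be scripted over the QMP socket opened by `-qmp tcp:0:4000` in step 1. A minimal sketch, assuming `nc` is available on the host; note that QMP requires the `qmp_capabilities` handshake before it accepts any other command:

```shell
# Emit the QMP session for steps 3-4: handshake, stop, pause, cont.
qmp_stop_cont() {
    pause="${1:-90}"                          # pause 60-120 s, as in step 4
    printf '{"execute": "qmp_capabilities"}\n'
    printf '{"execute": "stop"}\n'
    sleep "$pause"
    printf '{"execute": "cont"}\n'
}

# Usage against a running guest (hypothetical host/port):
#   qmp_stop_cont | nc localhost 4000
qmp_stop_cont 0
```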

Actual results:
The test hangs and the guest hits a call trace.
[root@localhost ~]# [  243.693446] INFO: task dd:1454 blocked for more than 120 seconds.
[  243.693552]       Not tainted 4.11.0-32.el7a.ppc64le #1
[  243.693693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.693836] dd              D    0  1454   1453 0x00040080
[  243.693888] Call Trace:
[  243.693917] [c0000000f3b2f6c0] [c0000000f3ec84a8] 0xc0000000f3ec84a8 (unreliable)
[  243.693990] [c0000000f3b2f890] [c00000000001cfb0] __switch_to+0x310/0x480
[  243.694052] [c0000000f3b2f8f0] [c000000000beb354] __schedule+0x3b4/0xb90
[  243.694111] [c0000000f3b2f9c0] [c000000000170494] io_schedule+0x84/0x120
[  243.694171] [c0000000f3b2f9f0] [c00000000048da8c] __blkdev_direct_IO_simple+0x19c/0x400
[  243.694243] [c0000000f3b2fb20] [c00000000048e150] blkdev_direct_IO+0x460/0x4a0
[  243.694319] [c0000000f3b2fbd0] [c0000000002feae8] generic_file_direct_write+0x298/0x3a0
[  243.694393] [c0000000f3b2fc60] [c0000000002fed4c] __generic_file_write_iter+0x15c/0x240
[  243.694463] [c0000000f3b2fcc0] [c00000000048f474] blkdev_write_iter+0x94/0x140
[  243.694549] [c0000000f3b2fd30] [c000000000414a28] __vfs_write+0x178/0x1f0
[  243.694611] [c0000000f3b2fdd0] [c00000000041929c] SyS_write+0x16c/0x340
[  243.694675] [c0000000f3b2fe30] [c00000000000b004] system_call+0x38/0xe0

Expected results:
The test keeps running and no call trace appears.

Additional info:
It is also reproducible on x86 and POWER8.

X86:
Host: 3.10.0-693.2.1.el7.x86_64
Guest: 3.10.0-693.2.1.el7.x86_64
Qemu-kvm: qemu-kvm-rhev-2.9.0-16.el7_4.6.x86_64

Sep 16 03:41:26 localhost kernel: INFO: task dd:1406 blocked for more than 120 seconds.
Sep 16 03:41:26 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 16 03:41:26 localhost kernel: dd              D ffff880136388000     0  1406   1404 0x00000080
Sep 16 03:41:26 localhost kernel: ffff88003663b920 0000000000000082 ffff8800364a4f10 ffff88003663bfd8
Sep 16 03:41:26 localhost kernel: ffff88003663bfd8 ffff88003663bfd8 ffff8800364a4f10 ffff88013fd16cc0
Sep 16 03:41:26 localhost kernel: 0000000000000000 7fffffffffffffff 0000000000000000 ffff880136388000
Sep 16 03:41:26 localhost kernel: Call Trace:
Sep 16 03:41:26 localhost kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
Sep 16 03:41:26 localhost kernel: [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
Sep 16 03:41:26 localhost kernel: [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
Sep 16 03:41:26 localhost kernel: [<ffffffff816a8b4d>] io_schedule_timeout+0xad/0x130
Sep 16 03:41:26 localhost kernel: [<ffffffff816a8be8>] io_schedule+0x18/0x20
Sep 16 03:41:26 localhost kernel: [<ffffffff812421d5>] do_blockdev_direct_IO+0x1c45/0x2020
Sep 16 03:41:26 localhost kernel: [<ffffffff8123cb30>] ? I_BDEV+0x10/0x10
Sep 16 03:41:26 localhost kernel: [<ffffffff81242605>] __blockdev_direct_IO+0x55/0x60
Sep 16 03:41:26 localhost kernel: [<ffffffff8123cb30>] ? I_BDEV+0x10/0x10
Sep 16 03:41:26 localhost kernel: [<ffffffff8123d420>] blkdev_direct_IO+0x60/0xa0
Sep 16 03:41:26 localhost kernel: [<ffffffff8123cb30>] ? I_BDEV+0x10/0x10
Sep 16 03:41:26 localhost kernel: [<ffffffff8118464d>] generic_file_direct_write+0xcd/0x190
Sep 16 03:41:26 localhost kernel: [<ffffffff81184947>] __generic_file_aio_write+0x237/0x400
Sep 16 03:41:26 localhost kernel: [<ffffffff8123de16>] blkdev_aio_write+0x56/0xb0
Sep 16 03:41:26 localhost kernel: [<ffffffff8120026d>] do_sync_write+0x8d/0xd0
Sep 16 03:41:26 localhost kernel: [<ffffffff81200d2d>] vfs_write+0xbd/0x1e0
Sep 16 03:41:26 localhost kernel: [<ffffffff81201b3f>] SyS_write+0x7f/0xe0
Sep 16 03:41:26 localhost kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b

POWER8:
Host: 3.10.0-693.el7.ppc64le
Guest: 3.10.0-693.el7.ppc64le
Qemu-kvm: qemu-kvm-rhev-2.9.0-16.el7_4.3.ppc64le

[root@dhcp71-168 ~]# [10800.790048] INFO: task fio:23344 blocked for more than 120 seconds.
[10800.790149] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10800.790197] fio             D 00003fffb78e47bc     0 23344  23342 0x00040080
[10800.790249] Call Trace:
[10800.790372] [c0000000a12d74a0] [c0000000a12d75f0] 0xc0000000a12d75f0 (unreliable)
[10800.790464] [c0000000a12d7670] [c000000000019dbc] __switch_to+0x25c/0x470
[10800.790529] [c0000000a12d76d0] [c0000000009c98e8] __schedule+0x438/0xb00
[10800.790580] [c0000000a12d77b0] [c0000000009c59a8] schedule_timeout+0x398/0x460
[10800.790638] [c0000000a12d78b0] [c0000000009c93fc] io_schedule+0xcc/0x180
[10800.790695] [c0000000a12d78f0] [c000000000399bac] __blockdev_direct_IO+0xf7c/0x2170
[10800.790755] [c0000000a12d7ae0] [c000000000392590] blkdev_direct_IO+0x60/0xc0
[10800.790806] [c0000000a12d7b60] [c00000000025ab38] generic_file_aio_read+0x548/0xd80
[10800.790862] [c0000000a12d7c60] [c0000000003930bc] blkdev_aio_read+0x5c/0xb0
[10800.790914] [c0000000a12d7ca0] [c000000000324fb4] do_sync_read+0x94/0x110
[10800.790965] [c0000000a12d7da0] [c000000000325f50] vfs_read+0x110/0x290
[10800.791015] [c0000000a12d7de0] [c000000000327f70] SyS_pread64+0xb0/0xd0
[10800.791065] [c0000000a12d7e30] [c00000000000a184] system_call+0x38/0xb4

Comment 1 Stefan Hajnoczi 2017-09-19 15:59:29 UTC
Patch posted upstream:
https://patchwork.ozlabs.org/patch/815627/

Comment 2 Gu Nini 2017-11-03 08:42:04 UTC
The bug also occurred on the following qemu-kvm-rhev versions:

qemu-kvm-rhev-2.9.0-16.el7.x86_64 (RHEL7.4 released)
qemu-kvm-rhev-2.10.0-4.el7.x86_64 (Latest RHEL7.5)

It's not a regression.

Comment 4 Miroslav Rezanina 2017-11-27 14:46:52 UTC
Fix included in qemu-kvm-rhev-2.10.0-8.el7

Comment 5 Gu Nini 2017-11-30 06:11:36 UTC
Verified the bug with the same steps as in the bug report, using:

host kernel: 3.10.0-799.el7.x86_64
guest kernel: 3.10.0-768.el7.x86_64
qemu: qemu-kvm-rhev-2.10.0-9.el7.x86_64

Comment 7 errata-xmlrpc 2018-04-11 00:36:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104