Bug 1492295 - Guest hits call trace with I/O throttling (iops) after status changes from stop to cont during I/O testing
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.4-Alt
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Stefan Hajnoczi
QA Contact: Gu Nini
Docs Contact:
Depends On:
Blocks:
Reported: 2017-09-16 04:22 EDT by Yongxue Hong
Modified: 2018-04-10 20:37 EDT
CC List: 10 users

See Also:
Fixed In Version: qemu-kvm-rhev-2.10.0-8.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-10 20:36:03 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:1104 None None None 2018-04-10 20:37 EDT

Description Yongxue Hong 2017-09-16 04:22:01 EDT
Description of problem:
Guest hits a call trace with I/O throttling (iops) after the status changes from stop to cont during I/O testing on (DD2) POWER9.

Version-Release number of selected component (if applicable):
Host: 4.11.0-32.el7a.ppc64le
Guest: 4.11.0-32.el7a.ppc64le
Qemu-kvm: qemu-kvm-2.9.0-22.el7a.ppc64le

How reproducible:
100%

Steps to Reproduce:
1. Boot a guest with I/O throttling enabled.
e.g.:
[root@virt4 command]# cat rhel-alt74-8381-bug.sh
/usr/libexec/qemu-kvm \
-name 'yhong-rhel-alt-74-8381' \
-machine pseries-rhel7.4.0alt \
-m 4G \
-nodefaults \
-smp 4,cores=4,threads=1,sockets=1 \
-boot order=cdn,once=d,menu=on,strict=on  \
-device nec-usb-xhci,id=xhci \
-device usb-tablet,id=usb-tablet0 \
-device usb-kbd,id=usb-kbd0 \
-enable-kvm \
-drive file=/home/yhong/image/RHEL-ALT-7.4-20170830.0-throttling-virtio-blk-20G.raw,format=raw,if=none,cache=none,media=disk,werror=stop,rerror=stop,id=drive-0 \
-device virtio-blk-pci,bus=pci.0,addr=0x04,drive=drive-0,id=virtio-blk-0,bootindex=0 \
-drive file=/home/yhong/image/data-disk1-11G.raw,format=raw,if=none,cache=none,media=disk,werror=stop,rerror=stop,id=drive-1,iops=100 \
-device virtio-blk-pci,bus=pci.0,addr=0x05,drive=drive-1,id=virtio-blk-1 \
-netdev tap,id=hostnet0,script=/etc/qemu-ifup \
-device virtio-net-pci,netdev=hostnet0,id=virtio-net-pci0,mac=40:f2:e9:5d:9c:90 \
-qmp tcp:0:4000,server,nowait \
-chardev socket,id=serial_id_serial,path=/var/tmp/serial-yhong-throttling-virtio-blk-0,server,nowait \
-device spapr-vty,reg=0x30000000,chardev=serial_id_serial \
-monitor stdio
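The iops=100 limit on drive-1 above is set once on the command line. As a side note, qemu also lets you retune the throttle on a running guest through the QMP block_set_io_throttle command. The sketch below is illustrative only (the host/port match the tcp:0:4000 socket in the command line above, but the helper names and values are assumptions, not part of this report):

```python
# Sketch (not from the report): retune the iops limit on one drive at runtime
# through the QMP socket the guest was started with (tcp:0:4000 above).
import json
import socket


def qmp_line(execute, arguments=None):
    """Serialize one QMP command as a newline-terminated JSON line."""
    cmd = {"execute": execute}
    if arguments is not None:
        cmd["arguments"] = arguments
    return json.dumps(cmd) + "\n"


def set_iops(host, port, device, iops):
    """Negotiate QMP capabilities, then change the throttle on one drive."""
    with socket.create_connection((host, port)) as sock:
        f = sock.makefile("rw")
        f.readline()                                    # QMP greeting
        f.write(qmp_line("qmp_capabilities")); f.flush()
        f.readline()                                    # {"return": {}}
        # qemu 2.9 requires all six bps/iops fields; 0 means unlimited.
        f.write(qmp_line("block_set_io_throttle", {
            "device": device,
            "bps": 0, "bps_rd": 0, "bps_wr": 0,
            "iops": iops, "iops_rd": 0, "iops_wr": 0,
        })); f.flush()
        return f.readline()                             # command reply

# Usage against a running guest: set_iops("127.0.0.1", 4000, "drive-1", 200)
```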

2. Run dd testing on the data disk.
e.g.:
[root@localhost home]# cat debug-auto-dd.sh
#!/bin/bash
for BS in 512b 1K 4K 8K 16K
do
    echo "dd if=/dev/zero of=/dev/vdb bs=$BS count=1000 oflag=direct status=progress"
    dd if=/dev/zero of=/dev/vdb bs=$BS count=1000 oflag=direct status=progress
done

3. Stop the guest.
e.g.:
{"execute": "stop"}

4. Resume (cont) the guest after a while (60s–120s).
e.g.:
{"execute": "cont"}
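Steps 3–4 toggle the run state by hand in the monitor. A minimal sketch automating the same stop/wait/cont cycle over the QMP socket from the command line above (the host/port and the run_qmp helper are assumptions, not from the report):

```python
# Sketch (not from the report): send QMP commands in order over the guest's
# QMP socket, optionally sleeping after one of them to mimic step 4's wait.
import json
import socket
import time


def run_qmp(host, port, commands, pause_after=None, pause=90):
    """Negotiate capabilities, send each command, return the raw replies."""
    with socket.create_connection((host, port)) as sock:
        f = sock.makefile("rw")
        f.readline()                                    # QMP greeting
        replies = []
        for name in ["qmp_capabilities"] + list(commands):
            f.write(json.dumps({"execute": name}) + "\n")
            f.flush()
            replies.append(f.readline())
            if name == pause_after:
                time.sleep(pause)       # 60s-120s window from step 4
        return replies

# Usage: run_qmp("127.0.0.1", 4000, ["stop", "cont"], pause_after="stop")
```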

Actual results:
The dd testing hangs and the guest hits a call trace.
[root@localhost ~]# [  243.693446] INFO: task dd:1454 blocked for more than 120 seconds.
[  243.693552]       Not tainted 4.11.0-32.el7a.ppc64le #1
[  243.693693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  243.693836] dd              D    0  1454   1453 0x00040080
[  243.693888] Call Trace:
[  243.693917] [c0000000f3b2f6c0] [c0000000f3ec84a8] 0xc0000000f3ec84a8 (unreliable)
[  243.693990] [c0000000f3b2f890] [c00000000001cfb0] __switch_to+0x310/0x480
[  243.694052] [c0000000f3b2f8f0] [c000000000beb354] __schedule+0x3b4/0xb90
[  243.694111] [c0000000f3b2f9c0] [c000000000170494] io_schedule+0x84/0x120
[  243.694171] [c0000000f3b2f9f0] [c00000000048da8c] __blkdev_direct_IO_simple+0x19c/0x400
[  243.694243] [c0000000f3b2fb20] [c00000000048e150] blkdev_direct_IO+0x460/0x4a0
[  243.694319] [c0000000f3b2fbd0] [c0000000002feae8] generic_file_direct_write+0x298/0x3a0
[  243.694393] [c0000000f3b2fc60] [c0000000002fed4c] __generic_file_write_iter+0x15c/0x240
[  243.694463] [c0000000f3b2fcc0] [c00000000048f474] blkdev_write_iter+0x94/0x140
[  243.694549] [c0000000f3b2fd30] [c000000000414a28] __vfs_write+0x178/0x1f0
[  243.694611] [c0000000f3b2fdd0] [c00000000041929c] SyS_write+0x16c/0x340
[  243.694675] [c0000000f3b2fe30] [c00000000000b004] system_call+0x38/0xe0

Expected results:
The testing should continue without a call trace.

Additional info:
The bug is also reproducible on x86 and POWER8:

X86:
Host: 3.10.0-693.2.1.el7.x86_64
Guest: 3.10.0-693.2.1.el7.x86_64
Qemu-kvm: qemu-kvm-rhev-2.9.0-16.el7_4.6.x86_64

Sep 16 03:41:26 localhost kernel: INFO: task dd:1406 blocked for more than 120 seconds.
Sep 16 03:41:26 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 16 03:41:26 localhost kernel: dd              D ffff880136388000     0  1406   1404 0x00000080
Sep 16 03:41:26 localhost kernel: ffff88003663b920 0000000000000082 ffff8800364a4f10 ffff88003663bfd8
Sep 16 03:41:26 localhost kernel: ffff88003663bfd8 ffff88003663bfd8 ffff8800364a4f10 ffff88013fd16cc0
Sep 16 03:41:26 localhost kernel: 0000000000000000 7fffffffffffffff 0000000000000000 ffff880136388000
Sep 16 03:41:26 localhost kernel: Call Trace:
Sep 16 03:41:26 localhost kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
Sep 16 03:41:26 localhost kernel: [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
Sep 16 03:41:26 localhost kernel: [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
Sep 16 03:41:26 localhost kernel: [<ffffffff816a8b4d>] io_schedule_timeout+0xad/0x130
Sep 16 03:41:26 localhost kernel: [<ffffffff816a8be8>] io_schedule+0x18/0x20
Sep 16 03:41:26 localhost kernel: [<ffffffff812421d5>] do_blockdev_direct_IO+0x1c45/0x2020
Sep 16 03:41:26 localhost kernel: [<ffffffff8123cb30>] ? I_BDEV+0x10/0x10
Sep 16 03:41:26 localhost kernel: [<ffffffff81242605>] __blockdev_direct_IO+0x55/0x60
Sep 16 03:41:26 localhost kernel: [<ffffffff8123cb30>] ? I_BDEV+0x10/0x10
Sep 16 03:41:26 localhost kernel: [<ffffffff8123d420>] blkdev_direct_IO+0x60/0xa0
Sep 16 03:41:26 localhost kernel: [<ffffffff8123cb30>] ? I_BDEV+0x10/0x10
Sep 16 03:41:26 localhost kernel: [<ffffffff8118464d>] generic_file_direct_write+0xcd/0x190
Sep 16 03:41:26 localhost kernel: [<ffffffff81184947>] __generic_file_aio_write+0x237/0x400
Sep 16 03:41:26 localhost kernel: [<ffffffff8123de16>] blkdev_aio_write+0x56/0xb0
Sep 16 03:41:26 localhost kernel: [<ffffffff8120026d>] do_sync_write+0x8d/0xd0
Sep 16 03:41:26 localhost kernel: [<ffffffff81200d2d>] vfs_write+0xbd/0x1e0
Sep 16 03:41:26 localhost kernel: [<ffffffff81201b3f>] SyS_write+0x7f/0xe0
Sep 16 03:41:26 localhost kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b

POWER8:
Host: 3.10.0-693.el7.ppc64le
Guest: 3.10.0-693.el7.ppc64le
Qemu-kvm: qemu-kvm-rhev-2.9.0-16.el7_4.3.ppc64le

[root@dhcp71-168 ~]# [10800.790048] INFO: task fio:23344 blocked for more than 120 seconds.
[10800.790149] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10800.790197] fio             D 00003fffb78e47bc     0 23344  23342 0x00040080
[10800.790249] Call Trace:
[10800.790372] [c0000000a12d74a0] [c0000000a12d75f0] 0xc0000000a12d75f0 (unreliable)
[10800.790464] [c0000000a12d7670] [c000000000019dbc] __switch_to+0x25c/0x470
[10800.790529] [c0000000a12d76d0] [c0000000009c98e8] __schedule+0x438/0xb00
[10800.790580] [c0000000a12d77b0] [c0000000009c59a8] schedule_timeout+0x398/0x460
[10800.790638] [c0000000a12d78b0] [c0000000009c93fc] io_schedule+0xcc/0x180
[10800.790695] [c0000000a12d78f0] [c000000000399bac] __blockdev_direct_IO+0xf7c/0x2170
[10800.790755] [c0000000a12d7ae0] [c000000000392590] blkdev_direct_IO+0x60/0xc0
[10800.790806] [c0000000a12d7b60] [c00000000025ab38] generic_file_aio_read+0x548/0xd80
[10800.790862] [c0000000a12d7c60] [c0000000003930bc] blkdev_aio_read+0x5c/0xb0
[10800.790914] [c0000000a12d7ca0] [c000000000324fb4] do_sync_read+0x94/0x110
[10800.790965] [c0000000a12d7da0] [c000000000325f50] vfs_read+0x110/0x290
[10800.791015] [c0000000a12d7de0] [c000000000327f70] SyS_pread64+0xb0/0xd0
[10800.791065] [c0000000a12d7e30] [c00000000000a184] system_call+0x38/0xb4
Comment 1 Stefan Hajnoczi 2017-09-19 11:59:29 EDT
Patch posted upstream:
https://patchwork.ozlabs.org/patch/815627/
Comment 2 Gu Nini 2017-11-03 04:42:04 EDT
The bug also occurs on the following qemu-kvm-rhev versions:

qemu-kvm-rhev-2.9.0-16.el7.x86_64 (RHEL7.4 released)
qemu-kvm-rhev-2.10.0-4.el7.x86_64 (Latest RHEL7.5)

It's not a regression.
Comment 4 Miroslav Rezanina 2017-11-27 09:46:52 EST
Fix included in qemu-kvm-rhev-2.10.0-8.el7
Comment 5 Gu Nini 2017-11-30 01:11:36 EST
Verified the bug with the same steps as in the bug report:

host kernel: 3.10.0-799.el7.x86_64
guest kernel: 3.10.0-768.el7.x86_64
qemu: qemu-kvm-rhev-2.10.0-9.el7.x86_64
Comment 7 errata-xmlrpc 2018-04-10 20:36:03 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104
