Bug 1614623
| Summary: | qemu and guest hang when guest poweroff after live commit with data-plane | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Yihuang Yu <yihyu> |
| Component: | qemu-kvm | Assignee: | Kevin Wolf <kwolf> |
| qemu-kvm sub component: | General | QA Contact: | aihua liang <aliang> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | aliang, chayang, coli, juzhang, kwolf, ngu, qzhang, rbalakri, virt-maint, xutian, xuwei, yihyu |
| Version: | --- | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-03-03 06:11:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Yihuang,
After step2, you didn't issue an { "execute": "block-job-complete", "arguments": { "device": "dataqcow"}} command to switch the active block BEFORE powering off the guest. Could you redo the test with the 'block-job-complete' command included?
Thanks,
Niny
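To make the ordering constraint above explicit: the pivot command has to be sent after BLOCK_JOB_READY and before the guest is powered off. This is an illustrative Python sketch, not part of the original report; the helper name `active_commit_sequence` is hypothetical, and the device name matches the test in this bug.

```python
# Hypothetical helper: builds the QMP command sequence for an active
# block commit that pivots to the base image before guest poweroff.
# The ordering matters: "block-job-complete" must be sent after the
# BLOCK_JOB_READY event and before "system_powerdown".

def active_commit_sequence(device, base, top):
    """Return the ordered QMP commands for commit + pivot + poweroff."""
    return [
        {"execute": "block-commit",
         "arguments": {"device": device, "base": base, "top": top}},
        # ... wait for the BLOCK_JOB_READY event here before continuing ...
        {"execute": "block-job-complete", "arguments": {"device": device}},
        {"execute": "system_powerdown"},
    ]

if __name__ == "__main__":
    seq = active_commit_sequence("dataqcow",
                                 "/home/qemukvm/image.qcow2",
                                 "/home/qemukvm/testsn1")
    for cmd in seq:
        print(cmd["execute"])
```

A real test harness would send each dict as JSON over the QMP socket and wait for the intervening events; the sketch only captures the required ordering.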
(In reply to Gu Nini from comment #1)
> Yihuang,
>
> After step2, you didn't issue an { "execute": "block-job-complete",
> "arguments": { "device": "dataqcow"}} command to switch the active block
> BEFORE power off the guest. Could you redo the test with
> 'block-job-complete' cmd included?
>
> Thanks,
> Niny

Hi Niny,

Sorry that I missed the "block-job-complete" command before poweroff. When I do { "execute": "block-job-complete", "arguments": { "device": "dataqcow" } }, both guest and qemu exit normally. QMP output and gdb info are below:

{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "dataqcow", "snapshot-file": "/home/qemukvm/testsn1", "format": "qcow2" } }
{"return": {}}
{ "execute": "block-commit", "arguments": { "device": "dataqcow", "base": "/home/qemukvm/image.qcow2", "top": "/home/qemukvm/testsn1", "speed": 1000000000 } }
{"timestamp": {"seconds": 1533886670, "microseconds": 980390}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "dataqcow"}}
{"timestamp": {"seconds": 1533886670, "microseconds": 980541}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "dataqcow"}}
{"return": {}}
{"timestamp": {"seconds": 1533886670, "microseconds": 981704}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "dataqcow"}}
{"timestamp": {"seconds": 1533886670, "microseconds": 981819}, "event": "BLOCK_JOB_READY", "data": {"device": "dataqcow", "len": 0, "offset": 0, "speed": 1000000000, "type": "commit"}}
{ "execute": "block-job-complete", "arguments": { "device": "dataqcow" } }
{"return": {}}
{"timestamp": {"seconds": 1533886674, "microseconds": 267251}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "dataqcow"}}
{"timestamp": {"seconds": 1533886674, "microseconds": 267303}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "dataqcow"}}
{"timestamp": {"seconds": 1533886674, "microseconds": 267352}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "dataqcow", "len": 0, "offset": 0, "speed": 1000000000, "type": "commit"}}
{"timestamp": {"seconds": 1533886674, "microseconds": 267407}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "dataqcow"}}
{"timestamp": {"seconds": 1533886674, "microseconds": 267455}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "dataqcow"}}
{"timestamp": {"seconds": 1533886684, "microseconds": 271839}, "event": "SHUTDOWN", "data": {"guest": true}}
Connection closed by foreign host.

# gdb -batch -ex bt -p 31551
[New LWP 31731]
[New LWP 31720]
[New LWP 31719]
[New LWP 31718]
[New LWP 31717]
[New LWP 31716]
[New LWP 31715]
[New LWP 31714]
[New LWP 31573]
[New LWP 31572]
[New LWP 31571]
[New LWP 31570]
[New LWP 31554]
[New LWP 31553]
[New LWP 31552]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00003fff89807a10 in ppoll () from /lib64/libc.so.6
#0 0x00003fff89807a10 in ppoll () from /lib64/libc.so.6
#1 0x0000000133ecc500 in ppoll (__ss=0x0, __timeout=<optimized out>, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at util/qemu-timer.c:334
#3 0x0000000133ecdcb8 in os_host_main_loop_wait (timeout=574967274) at util/main-loop.c:233
#4 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
#5 0x0000000133a58b78 in main_loop () at vl.c:1963
#6 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4768

But I still think it may be a bug, because when I redo it on RHEL 8 without { "execute": "block-job-complete", "arguments": { "device": "dataqcow" } }, guest and qemu do not hang.

Finally, here is the gdb info when guest and QEMU hang:

# gdb -batch -ex bt -p 33152
[New LWP 33174]
[New LWP 33173]
[New LWP 33172]
[New LWP 33171]
[New LWP 33155]
[New LWP 33154]
[New LWP 33153]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00003fffa3697a10 in ppoll () from /lib64/libc.so.6
#0 0x00003fffa3697a10 in ppoll () from /lib64/libc.so.6
#1 0x000000010760c500 in ppoll (__ss=0x0, __timeout=<optimized out>, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at util/qemu-timer.c:334
#3 0x000000010760f330 in aio_poll (ctx=0x10032d21680, blocking=<optimized out>) at util/aio-posix.c:629
#4 0x000000010754a6e4 in bdrv_drain_recurse (bs=0x10032db3800) at block/io.c:197
#5 0x000000010754b1d8 in bdrv_do_drained_begin (bs=0x10032db3800, recursive=<optimized out>, parent=0x0) at block/io.c:290
#6 0x00000001074df99c in bdrv_child_cb_drained_begin (child=<optimized out>) at block.c:821
#7 0x000000010754b1b0 in bdrv_parent_drained_begin (ignore=0x0, bs=0x10032db0400) at block/io.c:53
#8 bdrv_do_drained_begin (bs=0x10032db0400, recursive=<optimized out>, parent=0x0) at block/io.c:288
#9 0x00000001074df99c in bdrv_child_cb_drained_begin (child=<optimized out>) at block.c:821
#10 0x000000010754b1b0 in bdrv_parent_drained_begin (ignore=0x0, bs=0x10032da6800) at block/io.c:53
#11 bdrv_do_drained_begin (bs=0x10032da6800, recursive=<optimized out>, parent=0x0) at block/io.c:288
#12 0x000000010753a768 in blk_drain (blk=0x10032f1c8c0) at block/block-backend.c:1591
#13 0x0000000107546518 in mirror_drain (job=<optimized out>) at block/mirror.c:984
#14 0x00000001074eb9b0 in block_job_drain (job=0x10032e9be00) at blockjob.c:125
#15 0x00000001074ee648 in job_drain (job=0x10032e9be00) at job.c:529
#16 0x00000001074eecc8 in job_finish_sync (job=0x10032e9be00, finish=<optimized out>, errp=<optimized out>) at job.c:1004
#17 0x00000001074ef4b0 in job_cancel_sync (job=0x10032e9be00) at job.c:917
#18 job_cancel_sync_all () at job.c:928
#19 0x0000000107198c28 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4775

Thanks Yihuang. I agree that it is a real bug around a negative test. Let's wait for the dev's response.

I could reproduce the core dump on RHEL 7.5z; I think they have the same cause, so this bz is not a regression. Besides, I failed to reproduce either the guest hang or the core dump by powering off the guest after executing 'virsh blockcommit rhel76 --active' but before executing 'virsh blockjob rhel76 sda --pivot'.
#### core dump info in hmp:
qemu-kvm: block/io.c:199: bdrv_drain_recurse: Assertion `!wait_->need_kick' failed.
./vm22.sh: line 36: 14835 Aborted (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1'
#### gdb debug info:
(gdb)
(gdb) bt full
#0 0x00007f52c1623207 in raise () at /lib64/libc.so.6
#1 0x00007f52c16248f8 in abort () at /lib64/libc.so.6
#2 0x00007f52c161c026 in __assert_fail_base () at /lib64/libc.so.6
#3 0x00007f52c161c0d2 in () at /lib64/libc.so.6
#4 0x000055bab926b1e0 in bdrv_drain_recurse (bs=0x55babcfa2000, begin=true) at block/io.c:199
waited_ = false
busy_ = true
wait_ = 0x55babcfa5234
ctx_ = 0x55babb9bfb80
bs_ = 0x55babcfa2000
child = <optimized out>
tmp = <optimized out>
waited = <optimized out>
__PRETTY_FUNCTION__ = "bdrv_drain_recurse"
#5 0x000055bab926baa6 in bdrv_drained_begin (bs=0x55babe63c800) at block/io.c:47
c = <optimized out>
next = 0x0
bs = 0x55babe63c800
#6 0x000055bab926baa6 in bdrv_drained_begin (bs=0x55babe63c800) at block/io.c:276
bs = 0x55babe63c800
#7 0x000055bab926bfcd in bdrv_co_drain_bh_cb (opaque=0x7f52b857ee60) at block/io.c:233
data = 0x7f52b857ee60
co = 0x55babbd8a8c0
bs = 0x55babe63c800
#8 0x000055bab92e20f1 in aio_bh_poll (bh=0x55babe696cf0) at util/async.c:90
bh = <optimized out>
bhp = <optimized out>
next = 0x55babdd46d80
ret = 1
deleted = false
#9 0x000055bab92e20f1 in aio_bh_poll (ctx=ctx@entry=0x55babb9bfb80) at util/async.c:118
bh = <optimized out>
bhp = <optimized out>
next = 0x55babdd46d80
ret = 1
deleted = false
#10 0x000055bab92e53e4 in aio_poll (ctx=ctx@entry=0x55babb9bfb80, blocking=blocking@entry=false) at util/aio-posix.c:689
node = <optimized out>
i = <optimized out>
ret = 0
progress = <optimized out>
timeout = <optimized out>
start = 1366454032467482
__PRETTY_FUNCTION__ = "aio_poll"
#11 0x000055bab926b061 in bdrv_drain_recurse (bs=0x55babcfa2000, begin=true) at block/io.c:199
waited_ = true
busy_ = true
wait_ = 0x55babcfa5234
ctx_ = 0x55babb9bfb80
bs_ = 0x55babcfa2000
child = <optimized out>
---Type <return> to continue, or q <return> to quit---
tmp = <optimized out>
waited = <optimized out>
__PRETTY_FUNCTION__ = "bdrv_drain_recurse"
#12 0x000055bab926bbe9 in bdrv_drain (bs=0x55babcfa2000) at block/io.c:318
#13 0x000055bab925f4c6 in blk_drain (blk=<optimized out>) at block/block-backend.c:1479
#14 0x000055bab9227022 in block_job_drain (job=0x55babe60d680) at blockjob.c:206
#15 0x000055bab9227171 in block_job_finish_sync (job=job@entry=0x55babe60d680, finish=finish@entry=0x55bab92277e0 <block_job_cancel_err>, errp=errp@entry=0x0) at blockjob.c:406
local_err = 0x0
ret = <optimized out>
__PRETTY_FUNCTION__ = "block_job_finish_sync"
#16 0x000055bab922726d in block_job_cancel_sync_all (job=0x55babe60d680) at blockjob.c:554
job = 0x55babe60d680
aio_context = 0x55babb9bfb80
#17 0x000055bab922726d in block_job_cancel_sync_all () at blockjob.c:565
job = 0x55babe60d680
aio_context = 0x55babb9bfb80
#18 0x000055bab92208f9 in bdrv_close_all () at block.c:3225
__PRETTY_FUNCTION__ = "bdrv_close_all"
#19 0x000055bab8fc3bb4 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4833
i = <optimized out>
snapshot = <optimized out>
linux_boot = <optimized out>
initrd_filename = <optimized out>
kernel_filename = <optimized out>
kernel_cmdline = <optimized out>
boot_order = <optimized out>
boot_once = 0x55babb8faab8 "c"
cyls = <optimized out>
heads = <optimized out>
secs = <optimized out>
translation = <optimized out>
opts = <optimized out>
machine_opts = <optimized out>
hda_opts = <optimized out>
icount_opts = <optimized out>
accel_opts = <optimized out>
olist = <optimized out>
optind = 69
optarg = 0x7ffe011624cc "tcp:0:5555,server,nowait"
loadvm = <optimized out>
machine_class = 0x0
cpu_model = <optimized out>
vga_model = 0x7ffe01161e48 "std"
qtest_chrdev = <optimized out>
qtest_log = <optimized out>
pid_file = <optimized out>
incoming = <optimized out>
defconfig = <optimized out>
userconfig = <optimized out>
nographic = <optimized out>
display_type = <optimized out>
---Type <return> to continue, or q <return> to quit---
display_remote = <optimized out>
log_mask = <optimized out>
log_file = <optimized out>
trace_file = <optimized out>
maxram_size = <optimized out>
ram_slots = <optimized out>
vmstate_dump_file = <optimized out>
main_loop_err = 0x0
err = 0x0
list_data_dirs = <optimized out>
bdo_queue = {sqh_first = 0x0, sqh_last = 0x7ffe011615a0}
__func__ = "main"
__FUNCTION__ = "main"
Gu Nini suggested that this could be related to bug 1601212, and after doing some debugging, I agree that it's very likely at least a related root cause. I'm not closing this as a duplicate just yet because it might be a different instance, but I expect that we can close it as such later.

In short, while debugging the other report, I found several bugs in two categories:

1. Block jobs making progress that would allow their .drained_poll callback to go from true to false, but the AioContext wasn't kicked. This is solved by calling bdrv_wakeup() for all BlockDriverStates involved in a job whenever state relevant for .drained_poll changes (essentially when a job coroutine yields or terminates).

2. Some instances of bugs where we get recursive locking of the I/O thread AioContext and drop only one of the locks in AIO_WAIT_WHILE(). This is a deadlock and the I/O thread can't make progress. Typical symptoms for this are I/O request functions spawning a coroutine and waiting for its result with BDRV_POLL_WHILE(), but bs->in_flight == 0 because the coroutine didn't even start running.

I'm hoping to send some patches upstream today, which should also make a difference for this report.

The "Assertion `!wait_->need_kick' failed" bug is fixed in 2.12 with 7376eda7c2e0451e "block: make BDRV_POLL_WHILE() re-entrancy safe". The hang problem, I think, is the same as bug 1601212.

I agree we should keep it open until we have a definitive understanding of all these issues.

Can also reproduce it on qemu-kvm-rhev-2.12.0-18.el7_6.2.x86_64.
kernel version: 3.10.0-957.1.2.el7.x86_64

(In reply to Kevin Wolf from comment #5)
> Gu Nini suggested that this could be related to bug 1601212, and after doing
> some debugging, I agree that it's very likely at least a related root cause.
> I'm not closing this as a duplicate just yet because it might be a different
> instance, but I expect that we can close it as such later.
>
> In short, while debugging the other report, I found several bugs in two
> categories:
>
> 1. Block jobs making progress that would allow their .drained_poll callback
> to go from true to false, but the AioContext wasn't kicked. This is solved
> by calling bdrv_wakeup() for all BlockDriverStates involved in a job
> whenever state relevant for .drained_poll changes (essentially when a job
> coroutine yields or terminates).
>
> 2. Some instances of bugs where we get recursive locking of the I/O thread
> AioContext and drop only one of the locks in AIO_WAIT_WHILE(). This is a
> deadlock and the I/O thread can't make progress. Typical symptoms for this
> are I/O request functions spawning a coroutine and waiting for its result
> with BDRV_POLL_WHILE(), but bs->in_flight == 0 because the coroutine didn't
> even start running.
>
> I'm hoping to send some patches upstream today, which should also make a
> difference for this report.

Kevin, reassigning this to you for a follow-up. See also comment #6 and comment #7.

(In reply to aihua liang from comment #7)
> Can also reproduce it on qemu-kvm-rhev-2.12.0-18.el7_6.2.x86_64.
> kernel version: 3.10.0-957.1.2.el7.x86_64

Does the gdb backtrace of the hanging process still look the same?

Run 20 times, only hit this issue on virtio_blk, not on virtio_scsi.
Gdb info when guest and qemu hang on virtio_blk:
# gdb -batch -ex bt -p 30607
warning: the debug information found in "/usr/lib/debug//usr/libexec/qemu-kvm.debug" does not match "/usr/libexec/qemu-kvm" (CRC mismatch).
warning: the debug information found in "/usr/lib/debug/usr/libexec/qemu-kvm.debug" does not match "/usr/libexec/qemu-kvm" (CRC mismatch).
[New LWP 30718]
[New LWP 30669]
[New LWP 30668]
[New LWP 30667]
[New LWP 30665]
[New LWP 30664]
[New LWP 30659]
[New LWP 30658]
[New LWP 30656]
[New LWP 30655]
[New LWP 30653]
[New LWP 30651]
[New LWP 30609]
[New LWP 30608]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00005599207370b5 in bdrv_do_drained_end ()
#0 0x00005599207370b5 in bdrv_do_drained_end ()
#1 0x00005599206ec5d9 in block_job_drain ()
#2 0x00005599206ec558 in block_job_detach_aio_context ()
#3 0x00005599206e8911 in bdrv_detach_aio_context ()
#4 0x00005599206e8ab6 in bdrv_set_aio_context ()
#5 0x00005599204dac58 in virtio_blk_data_plane_stop ()
#6 0x0000559920680145 in virtio_bus_stop_ioeventfd ()
#7 0x0000559920505317 in virtio_vmstate_change ()
#8 0x0000559920595642 in vm_state_notify ()
#9 0x00005599204a5cda in do_vm_stop ()
#10 0x00005599204619a1 in main ()
**********Details**************
Test Env:
kernel version: 3.10.0-957.1.2.el7.x86_64
qemu-kvm-rhev version: qemu-kvm-rhev-2.12.0-18.el7_6.2.x86_64
Test steps:
1. Start guest with qemu cmds:
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox off \
-machine pc \
-nodefaults \
-device VGA,bus=pci.0,addr=0x2 \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1-20181107-005924-PkIxnG9p,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor-20181107-005924-PkIxnG9p,server,nowait \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pvpanic,ioport=0x505,id=idkp9HYI \
-chardev socket,id=serial_id_serial0,path=/var/tmp/serial-serial0-20181107-005924-PkIxnG9p,server,nowait \
-device isa-serial,chardev=serial_id_serial0 \
-chardev socket,id=seabioslog_id_20181107-005924-PkIxnG9p,path=/var/tmp/seabios-20181107-005924-PkIxnG9p,server,nowait \
-device isa-debugcon,chardev=seabioslog_id_20181107-005924-PkIxnG9p,iobase=0x402 \
-device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pci.0 \
-device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pci.0 \
-device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pci.0 \
-device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pci.0 \
-device virtio-net-pci,mac=9a:44:45:46:47:48,id=iddDGLIi,vectors=4,netdev=idDdrbRp,bus=pci.0,addr=0x7 \
-netdev tap,id=idDdrbRp,vhost=on \
-m 2048 \
-smp 10,maxcpus=10,cores=5,threads=1,sockets=2 \
-cpu host \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-vnc :0 \
-rtc base=localtime,clock=host,driftfix=slew \
-boot order=cdn,once=c,menu=off,strict=off \
-enable-kvm \
-monitor stdio \
-object iothread,id=iothread0 \
-drive if=none,id=drive_image1,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel76-64-virtio-scsi.qcow2 \
-device virtio-blk-pci,id=image1,drive=drive_image1,bootindex=0,bus=pci.0,iothread=iothread0 \
-drive if=none,id=drive_data1,aio=threads,cache=none,format=qcow2,file=/home/data1.qcow2 \
-device virtio-blk-pci,id=data1,drive=drive_data1,bus=pci.0,iothread=iothread0 \
2. Create live snapshot on data disk.
{"execute":"qmp_capabilities"}
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_data1", "snapshot-file": "/home/testsn1", "format": "qcow2" } }
3. Do block commit on data disk.
{ "execute": "block-commit", "arguments": { "device": "drive_data1", "base": "/home/data1.qcow2", "top": "/home/testsn1", "speed": 1000000000 } }
4. After the block commit job reaches ready status, power down the VM.
{"execute":"system_powerdown"}
Actual Result:
qemu and guest hang.
Pstack info:
[root@intel-e52650-16-4 ~]# pstack 30607
Thread 15 (Thread 0x7fa6d83f8700 (LWP 30608)):
#0 0x00007fa6dfbc31c9 in syscall () at /lib64/libc.so.6
#1 0x00005599207c1970 in qemu_event_wait ()
#2 0x00005599207d1e9e in call_rcu_thread ()
#3 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#4 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 14 (Thread 0x7fa6d7bf7700 (LWP 30609)):
#0 0x00007fa6dfea64ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007fa6dfea1de6 in _L_lock_941 () at /lib64/libpthread.so.0
#2 0x00007fa6dfea1cdf in pthread_mutex_lock () at /lib64/libpthread.so.0
#3 0x00005599207c1299 in qemu_mutex_lock_impl ()
#4 0x00005599207bc35c in co_schedule_bh_cb ()
#5 0x00005599207bc4d1 in aio_bh_poll ()
#6 0x00005599207bf75b in aio_poll ()
#7 0x000055992058f70e in iothread_run ()
#8 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#9 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 13 (Thread 0x7fa6d6af4700 (LWP 30651)):
#0 0x00007fa6dfbbe2cf in ppoll () at /lib64/libc.so.6
#1 0x00005599207bd96b in qemu_poll_ns ()
#2 0x00005599207bf6e7 in aio_poll ()
#3 0x000055992058f70e in iothread_run ()
#4 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#5 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 12 (Thread 0x7fa6d62f3700 (LWP 30653)):
#0 0x00007fa6dfea64ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007fa6dfea8cdc in _L_cond_lock_847 () at /lib64/libpthread.so.0
#2 0x00007fa6dfea8b71 in __pthread_mutex_cond_lock () at /lib64/libpthread.so.0
#3 0x00007fa6dfea39f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#4 0x00005599207c1549 in qemu_cond_wait_impl ()
#5 0x00005599204a3b9f in qemu_wait_io_event ()
#6 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#7 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#8 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 11 (Thread 0x7fa6d5af2700 (LWP 30655)):
#0 0x00007fa6dfea64ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007fa6dfea8cdc in _L_cond_lock_847 () at /lib64/libpthread.so.0
#2 0x00007fa6dfea8b71 in __pthread_mutex_cond_lock () at /lib64/libpthread.so.0
#3 0x00007fa6dfea39f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#4 0x00005599207c1549 in qemu_cond_wait_impl ()
#5 0x00005599204a3b9f in qemu_wait_io_event ()
#6 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#7 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#8 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 10 (Thread 0x7fa6d52f1700 (LWP 30656)):
#0 0x00007fa6dfea64ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007fa6dfea8cdc in _L_cond_lock_847 () at /lib64/libpthread.so.0
#2 0x00007fa6dfea8b71 in __pthread_mutex_cond_lock () at /lib64/libpthread.so.0
#3 0x00007fa6dfea39f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#4 0x00005599207c1549 in qemu_cond_wait_impl ()
#5 0x00005599204a3b9f in qemu_wait_io_event ()
#6 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#7 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#8 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 9 (Thread 0x7fa6d4af0700 (LWP 30658)):
#0 0x00007fa6dfea64ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007fa6dfea8cdc in _L_cond_lock_847 () at /lib64/libpthread.so.0
#2 0x00007fa6dfea8b71 in __pthread_mutex_cond_lock () at /lib64/libpthread.so.0
#3 0x00007fa6dfea39f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#4 0x00005599207c1549 in qemu_cond_wait_impl ()
#5 0x00005599204a3b9f in qemu_wait_io_event ()
#6 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#7 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#8 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 8 (Thread 0x7fa6d42ef700 (LWP 30659)):
#0 0x00007fa6dfea64ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007fa6dfea8cdc in _L_cond_lock_847 () at /lib64/libpthread.so.0
#2 0x00007fa6dfea8b71 in __pthread_mutex_cond_lock () at /lib64/libpthread.so.0
#3 0x00007fa6dfea39f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#4 0x00005599207c1549 in qemu_cond_wait_impl ()
#5 0x00005599204a3b9f in qemu_wait_io_event ()
#6 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#7 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#8 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 7 (Thread 0x7fa6d3aee700 (LWP 30664)):
#0 0x00007fa6dfea64ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007fa6dfea8cdc in _L_cond_lock_847 () at /lib64/libpthread.so.0
#2 0x00007fa6dfea8b71 in __pthread_mutex_cond_lock () at /lib64/libpthread.so.0
#3 0x00007fa6dfea39f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#4 0x00005599207c1549 in qemu_cond_wait_impl ()
#5 0x00005599204a3b9f in qemu_wait_io_event ()
#6 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#7 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#8 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 6 (Thread 0x7fa6d32ed700 (LWP 30665)):
#0 0x00007fa6dfea64ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007fa6dfea8cdc in _L_cond_lock_847 () at /lib64/libpthread.so.0
#2 0x00007fa6dfea8b71 in __pthread_mutex_cond_lock () at /lib64/libpthread.so.0
#3 0x00007fa6dfea39f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#4 0x00005599207c1549 in qemu_cond_wait_impl ()
#5 0x00005599204a3b9f in qemu_wait_io_event ()
#6 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#7 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#8 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 5 (Thread 0x7fa6d2aec700 (LWP 30667)):
#0 0x00007fa6dfea64ed in __lll_lock_wait () at /lib64/libpthread.so.0
#1 0x00007fa6dfea8cdc in _L_cond_lock_847 () at /lib64/libpthread.so.0
#2 0x00007fa6dfea8b71 in __pthread_mutex_cond_lock () at /lib64/libpthread.so.0
#3 0x00007fa6dfea39f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#4 0x00005599207c1549 in qemu_cond_wait_impl ()
#5 0x00005599204a3b9f in qemu_wait_io_event ()
#6 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#7 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#8 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 4 (Thread 0x7fa6d22eb700 (LWP 30668)):
#0 0x00007fa6dfea3965 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1 0x00005599207c1549 in qemu_cond_wait_impl ()
#2 0x00005599204a3b9f in qemu_wait_io_event ()
#3 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#4 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#5 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 3 (Thread 0x7fa6d1aea700 (LWP 30669)):
#0 0x00007fa6dfea3965 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1 0x00005599207c1549 in qemu_cond_wait_impl ()
#2 0x00005599204a3b9f in qemu_wait_io_event ()
#3 0x00005599204a52a0 in qemu_kvm_cpu_thread_fn ()
#4 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#5 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 2 (Thread 0x7fa64f7ff700 (LWP 30718)):
#0 0x00007fa6dfea3965 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1 0x00005599207c1549 in qemu_cond_wait_impl ()
#2 0x00005599206dc28f in vnc_worker_thread_loop ()
#3 0x00005599206dc858 in vnc_worker_thread ()
#4 0x00007fa6dfe9fdd5 in start_thread () at /lib64/libpthread.so.0
#5 0x00007fa6dfbc8ead in clone () at /lib64/libc.so.6
Thread 1 (Thread 0x7fa6f92afdc0 (LWP 30607)):
#0 0x00005599206ec526 in child_job_drained_poll ()
#1 0x0000559920737568 in bdrv_drain_poll ()
#2 0x0000559920737568 in bdrv_drain_poll ()
#3 0x0000559920737568 in bdrv_drain_poll ()
#4 0x0000559920737790 in bdrv_do_drained_begin ()
#5 0x000055992072a3e8 in blk_drain ()
#6 0x0000559920732b0d in mirror_drain ()
#7 0x00005599206ec558 in block_job_detach_aio_context ()
#8 0x00005599206e8911 in bdrv_detach_aio_context ()
#9 0x00005599206e8ab6 in bdrv_set_aio_context ()
#10 0x00005599204dac58 in virtio_blk_data_plane_stop ()
#11 0x0000559920680145 in virtio_bus_stop_ioeventfd ()
#12 0x0000559920505317 in virtio_vmstate_change ()
#13 0x0000559920595642 in vm_state_notify ()
#14 0x00005599204a5cda in do_vm_stop ()
#15 0x00005599204619a1 in main ()
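For anyone scripting this reproduction, step 4 ("after block commit job reaches ready status") maps onto the QMP events quoted earlier in this bug. The following is an illustrative Python sketch, not part of the report; the helper name `job_is_ready` is hypothetical. It accepts either the BLOCK_JOB_READY event or the JOB_STATUS_CHANGE "ready" transition, matching the job/device name.

```python
import json

def job_is_ready(event_line, job_id):
    """Return True if a raw QMP event line signals that the given block
    job has reached the 'ready' state, i.e. it is now safe to pivot
    with block-job-complete (or, in the buggy scenario, the point at
    which system_powerdown was issued)."""
    try:
        event = json.loads(event_line)
    except ValueError:
        return False  # not a JSON line (e.g. a QMP greeting or garbage)
    if event.get("event") == "BLOCK_JOB_READY":
        # In these logs the job was started by device name, so "device"
        # carries the same identifier as the job id.
        return event.get("data", {}).get("device") == job_id
    if event.get("event") == "JOB_STATUS_CHANGE":
        data = event.get("data", {})
        return data.get("id") == job_id and data.get("status") == "ready"
    return False

# Event taken verbatim from the transcript above:
ready_line = ('{"timestamp": {"seconds": 1533886670, "microseconds": 981704}, '
              '"event": "JOB_STATUS_CHANGE", '
              '"data": {"status": "ready", "id": "dataqcow"}}')
```

A QMP client would read lines from the monitor socket and call this predicate until it returns True before issuing the next command.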
Test with qemu-kvm-rhev-2.12.0-29.el7.x86_64 on RHEL7.7 with -drive/-device: also hit this issue as described in comment 11.

QEMU has recently been split into sub-components, and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Test on qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae: no longer hit this issue, so close it as CURRENTRELEASE.
Test Steps:
1. Start guest with qemu cmds (iothread enabled):
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox on \
-machine q35 \
-nodefaults \
-device VGA,bus=pcie.0,addr=0x1 \
-m 14336 \
-smp 16,maxcpus=16,cores=8,threads=1,dies=1,sockets=2 \
-cpu 'EPYC',+kvm_pv_unhalt \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1-20200203-033416-61dmcn92,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor-20200203-033416-61dmcn92,server,nowait \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pvpanic,ioport=0x505,id=idy8YPXp \
-chardev socket,path=/var/tmp/serial-serial0-20200203-033416-61dmcn92,server,nowait,id=chardev_serial0 \
-device isa-serial,id=serial0,chardev=chardev_serial0 \
-chardev socket,id=seabioslog_id_20200203-033416-61dmcn92,path=/var/tmp/seabios-20200203-033416-61dmcn92,server,nowait \
-device isa-debugcon,chardev=seabioslog_id_20200203-033416-61dmcn92,iobase=0x402 \
-device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
-device qemu-xhci,id=usb1,bus=pcie.0-root-port-2,addr=0x0 \
-object iothread,id=iothread0 \
-device pcie-root-port,id=pcie.0-root-port-3,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
-blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/kvm_autotest_root/images/rhel820-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
-device virtio-blk-pci,id=image1,drive=drive_image1,write-cache=on,bus=pcie.0-root-port-3,iothread=iothread0 \
-device pcie-root-port,id=pcie.0-root-port-6,slot=6,chassis=6,addr=0x6,bus=pcie.0 \
-blockdev node-name=file_data1,driver=file,aio=threads,filename=/home/data.qcow2,cache.direct=on,cache.no-flush=off \
-blockdev node-name=drive_data1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_data1 \
-device virtio-blk-pci,id=data1,drive=drive_data1,write-cache=on,bus=pcie.0-root-port-6 \
-device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
-device virtio-net-pci,mac=9a:6c:ca:b7:36:85,id=idz4QyVp,netdev=idNnpx5D,bus=pcie.0-root-port-4,addr=0x0 \
-netdev tap,id=idNnpx5D,vhost=on \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-vnc :0 \
-rtc base=utc,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-enable-kvm \
-device pcie-root-port,id=pcie_extra_root_port_0,slot=5,chassis=5,addr=0x5,bus=pcie.0 \
-monitor stdio \
-qmp tcp:0:3000,server,nowait \
2. Create snapshot target
{'execute':'blockdev-create','arguments':{'options': {'driver':'file','filename':'/root/sn1','size':21474836480},'job-id':'job1'}}
{'execute':'blockdev-add','arguments':{'driver':'file','node-name':'drive_sn1','filename':'/root/sn1'}}
{'execute':'blockdev-create','arguments':{'options': {'driver': 'qcow2','file':'drive_sn1','size':21474836480,'backing-file':'/home/kvm_autotest_root/images/rhel820-64-virtio-scsi.qcow2','backing-fmt':'qcow2'},'job-id':'job2'}}
{'execute':'blockdev-add','arguments':{'driver':'qcow2','node-name':'sn1','file':'drive_sn1','backing':null}}
3. Do live snapshot
{"execute":"blockdev-snapshot","arguments":{"node":"drive_image1","overlay":"sn1"}}
4. Check block info
(qemu) info block
sn1: /root/sn1 (qcow2)
Attached to: /machine/peripheral/image1/virtio-backend
Cache mode: writeback
Backing file: /home/kvm_autotest_root/images/rhel820-64-virtio-scsi.qcow2 (chain depth: 1)
drive_data1: /home/data.qcow2 (qcow2)
Attached to: /machine/peripheral/data1/virtio-backend
Cache mode: writeback, direct
5. In guest, dd a file
(guest)# dd if=/dev/urandom of=te bs=1M count=1000
6. Do live commit
{'execute': 'block-commit', 'arguments': { 'device': 'sn1','job-id':'j1'}}
{"timestamp": {"seconds": 1583215467, "microseconds": 248030}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1583215467, "microseconds": 248129}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1583215468, "microseconds": 494774}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "j1"}}
{"timestamp": {"seconds": 1583215468, "microseconds": 494862}, "event": "BLOCK_JOB_READY", "data": {"device": "j1", "len": 1050542080, "offset": 1050542080, "speed": 0, "type": "commit"}}
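As comment 1 points out, an active-layer commit stays in the "ready" state until block-job-complete pivots the device back to the base image; that completion should be issued before powering the guest off. A small sketch of the pattern, assuming the event dicts have already been parsed from the QMP stream (the function name is hypothetical):

```python
def complete_when_ready(events, job_id):
    """Return a 'block-job-complete' QMP command once a BLOCK_JOB_READY
    event for job_id has been seen; return None to keep waiting."""
    for ev in events:
        if (ev.get("event") == "BLOCK_JOB_READY"
                and ev.get("data", {}).get("device") == job_id):
            return {"execute": "block-job-complete",
                    "arguments": {"device": job_id}}
    return None
```

After sending the completion, wait for BLOCK_JOB_COMPLETED before shutting the guest down, so no job is left in the ready state at poweroff.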
7. Powerdown vm in guest
(guest)#powerdown
After step 7, the guest powers down successfully.
Description of problem:
qemu and guest hang when guest poweroff after live commit with data-plane

Version-Release number of selected component (if applicable):
host kernel version: 3.10.0-931.el7.ppc64le
qemu version: qemu-kvm-rhev-2.12.0-9.el7.ppc64le
guest kernel version: 3.10.0-931.el7.ppc64le

How reproducible:
75%

Steps to Reproduce:
1. Launch a guest using data-plane:
/usr/libexec/qemu-kvm \
-name 'ppc-block-testvm1' \
-sandbox off \
-machine pseries \
-nodefaults \
-vga std \
-chardev socket,id=qmp_catch_monitorppc1,path=/var/tmp/monitor-catch_monitorppc1,server,nowait \
-mon chardev=qmp_catch_monitorppc1,mode=control \
-chardev socket,id=serial_id_serialppc1,path=/var/tmp/serial-serialppc1,server,nowait \
-device spapr-vty,reg=0x30000000,chardev=serial_id_serialppc1 \
-device nec-usb-xhci,id=usbppc1,bus=pci.0,addr=0x3 \
-object iothread,id=iothread0 \
-drive file=/home/qemukvm/rhel76.qcow2,id=drive_imageppc1,if=none,cache=none,format=qcow2,werror=stop,rerror=stop \
-m 8192 \
-smp 2,maxcpus=4,cores=2,threads=1,sockets=2 \
-device usb-tablet,id=usb-tabletppc1,bus=usbppc1.0,port=1 \
-device usb-kbd,id=usbppc1sb-kbdppc1,bus=usbppc1.0,port=2 \
-vnc :1 \
-boot menu=off,strict=off,order=cdn,once=c \
-rtc base=utc,clock=host \
-enable-kvm \
-qmp tcp:0:4441,server,nowait \
-netdev tap,id=tap1,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,vhost=on \
-device spapr-vlan,netdev=tap1,id=nic1,mac=1a:59:0a:4b:5a:94 \
-serial tcp:0:3331,server,nowait \
-monitor stdio \
-device virtio-scsi-pci,id=virtio_scsi_pcippc1,bus=pci.0,addr=0x4,iothread=iothread0 \
-device scsi-hd,id=imageppc1,drive=drive_imageppc1 \
-object iothread,id=iothread1 \
-device virtio-scsi-pci,id=datapscsi1,iothread=iothread1 \
-drive file=/home/qemukvm/image.qcow2,if=none,id=dataqcow,format=qcow2,cache=none,rerror=stop,werror=stop \
-device scsi-hd,drive=dataqcow,bus=datapscsi1.0,scsi-id=1,lun=1,id=scsi-disk1
2. Do "blockdev-snapshot-sync" and "block-commit":
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "dataqcow", "snapshot-file": "/home/qemukvm/testsn1", "format": "qcow2" } }
{"return": {}}
{ "execute": "block-commit", "arguments": { "device": "dataqcow", "base": "/home/qemukvm/image.qcow2", "top": "/home/qemukvm/testsn1", "speed": 1000000000 } }
{"timestamp": {"seconds": 1533868674, "microseconds": 941103}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "dataqcow"}}
{"timestamp": {"seconds": 1533868674, "microseconds": 941250}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "dataqcow"}}
{"return": {}}
{"timestamp": {"seconds": 1533868674, "microseconds": 942412}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "dataqcow"}}
{"timestamp": {"seconds": 1533868674, "microseconds": 942563}, "event": "BLOCK_JOB_READY", "data": {"device": "dataqcow", "len": 0, "offset": 0, "speed": 1000000000, "type": "commit"}}
3. Power off the guest, either:
1) [root@dhcp70-153 ~]# poweroff
or
2) {"execute":"system_powerdown"}
{"return": {}}
{"timestamp": {"seconds": 1533869348, "microseconds": 700005}, "event": "POWERDOWN"}
{"timestamp": {"seconds": 1533869354, "microseconds": 215695}, "event": "SHUTDOWN", "data": {"guest": true}}

Actual results:
Both qemu and the guest hang.

Expected results:
The guest powers down successfully and QEMU quits normally.

Additional info:
x86 also has this problem.

Backtrace when qemu and the guest hang:
gdb -batch -ex bt -p 14976
[New LWP 14998]
[New LWP 14997]
[New LWP 14996]
[New LWP 14995]
[New LWP 14979]
[New LWP 14978]
[New LWP 14977]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00003fff86c37a10 in ppoll () from /lib64/libc.so.6
#0  0x00003fff86c37a10 in ppoll () from /lib64/libc.so.6
#1  0x00000001127fc500 in qemu_poll_ns ()
#2  0x00000001127ff330 in aio_poll ()
#3  0x000000011273a6e4 in bdrv_drain_recurse ()
#4  0x000000011273a77c in bdrv_drain_recurse ()
#5  0x000000011273a77c in bdrv_drain_recurse ()
#6  0x000000011273b1d8 in bdrv_do_drained_begin ()
#7  0x00000001126cf99c in bdrv_child_cb_drained_begin ()
#8  0x000000011273b1b0 in bdrv_do_drained_begin ()
#9  0x000000011272a768 in blk_drain ()
#10 0x0000000112736518 in mirror_drain ()
#11 0x00000001126db9b0 in block_job_drain ()
#12 0x00000001126de648 in job_drain ()
#13 0x00000001126decc8 in job_finish_sync ()
#14 0x00000001126df4b0 in job_cancel_sync_all ()
#15 0x0000000112388c28 in main ()
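The hang signature in the backtrace is the main thread polling forever inside recursive bdrv_drain_recurse() while job_cancel_sync_all() runs at shutdown. When triaging many captured backtraces, a hypothetical helper like the following can flag this pattern (marker names taken from the trace above; this is a plain string heuristic, not part of any QEMU tooling):

```python
# Frames that, taken together, suggest the drain-during-job-cancel hang
# seen in this report. These names come from the backtrace above.
DRAIN_HANG_MARKERS = ("aio_poll", "bdrv_drain_recurse", "job_cancel_sync_all")

def looks_like_drain_hang(backtrace: str) -> bool:
    """Heuristic: True if a captured gdb backtrace shows the main loop
    polling inside recursive bdrv_drain_recurse() during job cancellation."""
    return all(marker in backtrace for marker in DRAIN_HANG_MARKERS)
```

A backtrace captured with `gdb -batch -ex bt -p <qemu-pid>` (as above) can be piped straight into this check.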