Bug 1707733
Summary: | Qemu hang when do backup jobs with completion-mode "grouped" and with one job failed | |
---|---|---|---
Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | aihua liang <aliang>
Component: | qemu-kvm | Assignee: | John Snow <jsnow>
qemu-kvm sub component: | Incremental Live Backup | QA Contact: | aihua liang <aliang>
Status: | CLOSED CURRENTRELEASE | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | coli, jsnow, juzhang, ngu, qzhang, virt-maint, xutian
Version: | --- | |
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-03-10 18:36:08 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
aihua liang
2019-05-08 08:33:15 UTC
completion-mode "individual" works OK. When testing with only one backup job plus a block-dirty-bitmap-add action, it reports that "block-dirty-bitmap-add" does not support completion-mode "grouped". Details:

{ "execute": "transaction", "arguments": { "actions": [ {"type":"drive-backup","data":{ "device": "drive_image1","target": "full.img", "sync": "full","format": "qcow2","mode":"existing"}},{"type":"block-dirty-bitmap-add", "data":{"node": "drive_image1", "name": "bitmap0"}}],"properties":{"completion-mode":"grouped"}}}
{"timestamp": {"seconds": 1557303031, "microseconds": 879414}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557303031, "microseconds": 879471}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557303031, "microseconds": 879514}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557303031, "microseconds": 879555}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_image1"}}
{"error": {"class": "GenericError", "desc": "Action 'block-dirty-bitmap-add' does not support Transaction property completion-mode = grouped"}}

And when checking completion-mode "grouped" with only backup jobs, it works OK (the transaction is accepted and both jobs start):

{ "execute": "transaction", "arguments": { "actions": [ {"type":"drive-backup","data":{ "device": "drive_image1","target": "full.img", "sync": "full","format": "qcow2","mode":"existing"}},{"type":"drive-backup","data":{ "device": "drive_data1","target": "full_data.img", "sync": "full","format": "qcow2"}}],"properties":{"completion-mode":"grouped"}}}
{"timestamp": {"seconds": 1557301819, "microseconds": 868356}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966543}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966604}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966626}, "event": "JOB_STATUS_CHANGE", "data": {"status": "paused", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966662}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966692}, "event": "JOB_STATUS_CHANGE", "data": {"status": "paused", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966728}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966802}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_data1"}}
{"return": {}}

{"execute":"query-block-jobs"}
{"return": [{"auto-finalize": true, "io-status": "ok", "device": "drive_data1", "auto-dismiss": true, "busy": true, "len": 1073741824, "offset": 232194048, "status": "running", "paused": false, "speed": 0, "ready": false, "type": "backup"}, {"auto-finalize": true, "io-status": "ok", "device": "drive_image1", "auto-dismiss": true, "busy": true, "len": 21474836480, "offset": 114098176, "status": "running", "paused": false, "speed": 0, "ready": false, "type": "backup"}]}

{"timestamp": {"seconds": 1557301861, "microseconds": 709364}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941528}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive_image1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941576}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941596}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941647}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "drive_data1", "len": 1073741824, "offset": 1073741824, "speed": 0, "type": "backup"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941682}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941750}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 973974}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image1", "len": 21474836480, "offset": 1073741824, "speed": 0, "type": "backup", "error": "Input/output error"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 974021}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 974047}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_image1"}}

The same hang issue also exists when I do a backup with an exposed NBD disk as the target:

(NBD server)
# qemu-img create -f qcow2 full.img 20G
# qemu-nbd full.img -t -p 9000 -f raw

(NBD client)
{"execute":"drive-backup","arguments":{"device": "drive_image1", "target": "nbd://10.66.144.75:9000", "sync": "full", "format": "qcow2","mode":"existing"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 421636}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 421686}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_image1"}}
{"return": {}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466582}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive_image1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466731}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466810}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image1", "len": 21474836480, "offset": 0, "speed": 0, "type": "backup", "error": "Invalid argument"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466885}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466917}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_image1"}}

Then qemu hangs; pstack info as below:

# pstack 22850
Thread 7 (Thread 0x7f9ab262c700 (LWP 22851)):
#0 0x00007f9ab9e1cba9 in syscall () at /lib64/libc.so.6
#1 0x0000555d8fdb67a0 in qemu_event_wait (val=<optimized out>, f=<optimized out>) at /usr/src/debug/qemu-2.12.0/include/qemu/futex.h:29
#2 0x0000555d8fdb67a0 in qemu_event_wait (ev=ev@entry=0x555d90a40a28 <rcu_call_ready_event>) at util/qemu-thread-posix.c:445
#3 0x0000555d8fdc6e7e in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:261
#4 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#5 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 6 (Thread 0x7f9ab1e2b700 (LWP 22852)):
#0 0x00007f9ab9e17caf in ppoll () at /lib64/libc.so.6
#1 0x0000555d8fdb2779 in qemu_poll_ns (__ss=0x0, __timeout=0x7f9ab1e28780, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 0x0000555d8fdb2779 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at util/qemu-timer.c:334
#3 0x0000555d8fdb4517 in aio_poll (ctx=0x555d91517900, blocking=blocking@entry=true) at util/aio-posix.c:645
#4 0x0000555d8fb8204e in iothread_run (opaque=0x555d91551ea0) at iothread.c:64
#5 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#6 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 5 (Thread 0x7f9ab0e29700 (LWP 22865)):
#0 0x00007f9ab9e17caf in ppoll () at /lib64/libc.so.6
#1 0x0000555d8fdb279b in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 0x0000555d8fdb279b in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at util/qemu-timer.c:322
#3 0x0000555d8fdb4517 in aio_poll (ctx=0x555d91517b80, blocking=blocking@entry=true) at util/aio-posix.c:645
#4 0x0000555d8fb8204e in iothread_run (opaque=0x555d9162a0e0) at iothread.c:64
#5 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#6 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 4 (Thread 0x7f9ab0628700 (LWP 22867)):
#0 0x00007f9ab9e192b7 in ioctl () at /lib64/libc.so.6
#1 0x0000555d8fab9e85 in kvm_vcpu_ioctl (cpu=cpu@entry=0x555d9191e000, type=type@entry=44672) at /usr/src/debug/qemu-2.12.0/accel/kvm/kvm-all.c:2109
#2 0x0000555d8fab9f53 in kvm_cpu_exec (cpu=cpu@entry=0x555d9191e000) at /usr/src/debug/qemu-2.12.0/accel/kvm/kvm-all.c:1946
#3 0x0000555d8fa974c6 in qemu_kvm_cpu_thread_fn (arg=0x555d9191e000) at /usr/src/debug/qemu-2.12.0/cpus.c:1215
#4 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#5 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 3 (Thread 0x7f9aafe27700 (LWP 22868)):
#0 0x00007f9ab9e192b7 in ioctl () at /lib64/libc.so.6
#1 0x0000555d8fab9e85 in kvm_vcpu_ioctl (cpu=cpu@entry=0x555d9197e000, type=type@entry=44672) at /usr/src/debug/qemu-2.12.0/accel/kvm/kvm-all.c:2109
#2 0x0000555d8fab9f53 in kvm_cpu_exec (cpu=cpu@entry=0x555d9197e000) at /usr/src/debug/qemu-2.12.0/accel/kvm/kvm-all.c:1946
#3 0x0000555d8fa974c6 in qemu_kvm_cpu_thread_fn (arg=0x555d9197e000) at /usr/src/debug/qemu-2.12.0/cpus.c:1215
#4 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#5 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 2 (Thread 0x7f99adbff700 (LWP 22946)):
#0 0x00007f9aba0fd9f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1 0x0000555d8fdb6379 in qemu_cond_wait_impl (cond=cond@entry=0x555d91500ab0, mutex=mutex@entry=0x555d91500ae8, file=file@entry=0x555d8ff20287 "ui/vnc-jobs.c", line=line@entry=212) at util/qemu-thread-posix.c:164
#2 0x0000555d8fccf0bf in vnc_worker_thread_loop (queue=queue@entry=0x555d91500ab0) at ui/vnc-jobs.c:212
#3 0x0000555d8fccf688 in vnc_worker_thread (arg=0x555d91500ab0) at ui/vnc-jobs.c:319
#4 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#5 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 1 (Thread 0x7f9ad350fdc0 (LWP 22850)):
#0 0x00007f9ab9e17caf in ppoll () at /lib64/libc.so.6
#1 0x0000555d8fdb279b in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 0x0000555d8fdb279b in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at util/qemu-timer.c:322
#3 0x0000555d8fdb4517 in aio_poll (ctx=0x555d915177c0, blocking=blocking@entry=true) at util/aio-posix.c:645
#4 0x0000555d8fd35f5a in nbd_client_close (bs=0x555d91909400) at block/nbd-client.c:62
#5 0x0000555d8fd35f5a in nbd_client_close (bs=0x555d91909400) at block/nbd-client.c:984
#6 0x0000555d8fd3387a in nbd_close (bs=<optimized out>) at block/nbd.c:455
#7 0x0000555d8fcdc1b2 in bdrv_unref (bs=0x555d91909400) at block.c:3414
#8 0x0000555d8fcdc1b2 in bdrv_unref (bs=0x555d91909400) at block.c:3598
#9 0x0000555d8fcdc1b2 in bdrv_unref (bs=0x555d91909400) at block.c:4681
#10 0x0000555d8fcdc1df in bdrv_unref (bs=0x555d9190c800) at block.c:3421
#11 0x0000555d8fcdc1df in bdrv_unref (bs=0x555d9190c800) at block.c:3598
#12 0x0000555d8fcdc1df in bdrv_unref (bs=0x555d9190c800) at block.c:4681
#13 0x0000555d8fcdf804 in block_job_remove_all_bdrv (job=job@entry=0x555d92471680) at blockjob.c:200
#14 0x0000555d8fcdf853 in block_job_free (job=0x555d92471680) at blockjob.c:94
#15 0x0000555d8fce0bad in job_unref (job=0x555d92471680) at job.c:360
#16 0x0000555d8fce0da8 in job_finalize_single (job=0x555d92471680) at job.c:626
#17 0x0000555d8fce0da8 in job_finalize_single (job=0x555d92471680) at job.c:694
#18 0x0000555d8fce167a in job_completed_txn_abort (job=0x555d92471680) at job.c:754
#19 0x0000555d8fce19d0 in job_exit (opaque=0x555d92471680) at job.c:869
#20 0x0000555d8fdb1301 in aio_bh_poll (bh=0x555d91911f20) at util/async.c:90
#21 0x0000555d8fdb1301 in aio_bh_poll (ctx=ctx@entry=0x555d915177c0) at util/async.c:118
#22 0x0000555d8fdb43b0 in aio_dispatch (ctx=0x555d915177c0) at util/aio-posix.c:440
#23 0x0000555d8fdb11de in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#24 0x00007f9ad2c13049 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#25 0x0000555d8fdb36a7 in main_loop_wait () at util/main-loop.c:215
#26 0x0000555d8fdb36a7 in main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#27 0x0000555d8fdb36a7 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#28 0x0000555d8fa540e7 in main () at vl.c:1964
#29 0x0000555d8fa540e7 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4782
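For contrast with the failing cases above, here is a minimal sketch (not taken from the report's logs) of the combination the opening sentence says works: the same drive-backup plus block-dirty-bitmap-add actions submitted with the default completion-mode "individual" (equivalently, with the properties field omitted).

{ "execute": "transaction", "arguments": { "actions": [
  {"type": "drive-backup", "data": {"device": "drive_image1", "target": "full.img", "sync": "full", "format": "qcow2", "mode": "existing"}},
  {"type": "block-dirty-bitmap-add", "data": {"node": "drive_image1", "name": "bitmap0"}}],
  "properties": {"completion-mode": "individual"}}}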
Let's try this again with 4.2.0; a lot has changed since then. block-dirty-bitmap-add should not be able to be used in grouped completion mode; it's a problem that QEMU is accepting this command. I am not presently sure of how it was ever able to. From what I can guess, QEMU will try to roll back these commands in reverse order, but it's likely that "drive_image1" and "drive_data1" now refer to different nodes, which might have some negative effects. Not sure why it's exhibiting as a hang, exactly.

Some thoughts, if it is still reproducible on 4.2:

1. Try this with blockdev node names and see if behavior changes. It might "just work". I suspect that when it is trying to remove the bitmaps on failure, it is trying to get a lock on the wrong nodes, because the jobs are still running.
2. Using either blockdev OR drive, try putting the bitmap-add commands FIRST (see the sketch below).

Will investigate.
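As a concrete reading of suggestion 2, a minimal sketch (not from the bug's logs) of a grouped transaction with the bitmap-add actions placed first, written blockdev-style with explicit job IDs; the node names drive_image1/drive_data1 and target nodes sn1/sn2 follow the retest below and are assumed to already exist. As that retest confirms, grouped completion mode still rejects block-dirty-bitmap-add, so the ordering mainly probes whether the failure path changes.

{ "execute": "transaction", "arguments": { "actions": [
  {"type": "block-dirty-bitmap-add", "data": {"node": "drive_image1", "name": "bitmap0"}},
  {"type": "block-dirty-bitmap-add", "data": {"node": "drive_data1", "name": "bitmap0"}},
  {"type": "blockdev-backup", "data": {"device": "drive_image1", "target": "sn1", "sync": "full", "job-id": "j1"}},
  {"type": "blockdev-backup", "data": {"device": "drive_data1", "target": "sn2", "sync": "full", "job-id": "j2"}}],
  "properties": {"completion-mode": "grouped"}}}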
Test on qemu-kvm-4.2.0-8.module+el8.2.0+5607+dc756904 with -blockdev:

1. Yes, "block-dirty-bitmap-add" is not supported in grouped completion mode.

{ "execute": "transaction", "arguments": { "actions": [ {"type": "blockdev-backup", "data": { "device": "drive_image1", "target": "sn1", "sync": "full", "job-id":"j1" } }, {"type": "block-dirty-bitmap-add", "data": { "node": "drive_image1", "name": "bitmap0" } },{"type": "blockdev-backup", "data": { "device": "drive_data1", "target": "sn2", "sync": "full", "job-id":"j2" } }, {"type": "block-dirty-bitmap-add", "data": { "node": "drive_data2", "name": "bitmap0"}}],"properties":{"completion-mode":"grouped"}}}
{"timestamp": {"seconds": 1580720422, "microseconds": 545532}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1580720422, "microseconds": 545606}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1580720422, "microseconds": 545677}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1580720422, "microseconds": 545733}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}
{"error": {"class": "GenericError", "desc": "Action 'block-dirty-bitmap-add' does not support Transaction property completion-mode = grouped"}}

2. Grouped completion mode works normally on multi-disk backup jobs.

{ "execute": "transaction", "arguments": { "actions": [ {"type": "blockdev-backup", "data": { "device": "drive_image1", "target": "sn1", "sync": "full", "job-id":"j1" } },{"type": "blockdev-backup", "data": { "device": "drive_data1", "target": "sn2", "sync": "full", "job-id":"j2" } }],"properties":{"completion-mode":"grouped"}}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463511}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463738}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j2"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463790}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463831}, "event": "JOB_STATUS_CHANGE", "data": {"status": "paused", "id": "j1"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463870}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j2"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463908}, "event": "JOB_STATUS_CHANGE", "data": {"status": "paused", "id": "j2"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463947}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 464003}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j2"}}
{"return": {}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223694}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j2", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223822}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j2"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223899}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j2", "len": 1073741824, "offset": 492503040, "speed": 0, "type": "backup", "error": "No space left on device"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223953}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j2"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223991}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j2"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 376294}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 376354}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "j1", "len": 21474836480, "offset": 269549568, "speed": 0, "type": "backup"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 376401}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 376439}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}

{"execute":"query-block-jobs"}
{"return": []}
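For completeness, the generic job list can be checked the same way after the abort; a minimal sketch of the expected exchange once every job has reached the "null" state (an illustration, not output captured from the retest):

{"execute": "query-jobs"}
{"return": []}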
So, John, if you agree, I will close it as CURRENTRELEASE.

QEMU has been recently split into sub-components and, as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Yes, this looks like the correct behavior. CLOSED CURRENTRELEASE works.