Bug 1707733
Summary: | Qemu hang when do backup jobs with completion-mode "grouped" and with one job failed | |
---|---|---|---
Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | aihua liang <aliang>
Component: | qemu-kvm | Assignee: | John Snow <jsnow>
qemu-kvm sub component: | Incremental Live Backup | QA Contact: | aihua liang <aliang>
Status: | CLOSED CURRENTRELEASE | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | coli, jsnow, juzhang, ngu, qzhang, virt-maint, xutian
Version: | --- | |
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-03-10 18:36:08 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
aihua liang
2019-05-08 08:33:15 UTC
completion-mode "individual" works OK. When testing with only one backup job plus a block-dirty-bitmap-add action, it reports that "block-dirty-bitmap-add" does not support completion-mode "grouped". Details:

{ "execute": "transaction", "arguments": { "actions": [ {"type":"drive-backup","data":{ "device": "drive_image1","target": "full.img", "sync": "full","format": "qcow2","mode":"existing"}},{"type":"block-dirty-bitmap-add", "data":{"node": "drive_image1", "name": "bitmap0"}}],"properties":{"completion-mode":"grouped"}}}
{"timestamp": {"seconds": 1557303031, "microseconds": 879414}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557303031, "microseconds": 879471}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557303031, "microseconds": 879514}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557303031, "microseconds": 879555}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_image1"}}
{"error": {"class": "GenericError", "desc": "Action 'block-dirty-bitmap-add' does not support Transaction property completion-mode = grouped"}}

And when checking completion-mode "grouped" with only backup jobs, it works OK (the transaction is accepted and both jobs start):

{ "execute": "transaction", "arguments": { "actions": [ {"type":"drive-backup","data":{ "device": "drive_image1","target": "full.img", "sync": "full","format": "qcow2","mode":"existing"}},{"type":"drive-backup","data":{ "device": "drive_data1","target": "full_data.img", "sync": "full","format": "qcow2"}}],"properties":{"completion-mode":"grouped"}}}
{"timestamp": {"seconds": 1557301819, "microseconds": 868356}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966543}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966604}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966626}, "event": "JOB_STATUS_CHANGE", "data": {"status": "paused", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966662}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966692}, "event": "JOB_STATUS_CHANGE", "data": {"status": "paused", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966728}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301819, "microseconds": 966802}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_data1"}}
{"return": {}}

{"execute":"query-block-jobs"}
{"return": [{"auto-finalize": true, "io-status": "ok", "device": "drive_data1", "auto-dismiss": true, "busy": true, "len": 1073741824, "offset": 232194048, "status": "running", "paused": false, "speed": 0, "ready": false, "type": "backup"}, {"auto-finalize": true, "io-status": "ok", "device": "drive_image1", "auto-dismiss": true, "busy": true, "len": 21474836480, "offset": 114098176, "status": "running", "paused": false, "speed": 0, "ready": false, "type": "backup"}]}

{"timestamp": {"seconds": 1557301861, "microseconds": 709364}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941528}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive_image1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941576}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941596}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941647}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "drive_data1", "len": 1073741824, "offset": 1073741824, "speed": 0, "type": "backup"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941682}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 941750}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_data1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 973974}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image1", "len": 21474836480, "offset": 1073741824, "speed": 0, "type": "backup", "error": "Input/output error"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 974021}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557301863, "microseconds": 974047}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_image1"}}

The same hang issue also exists when I do a backup with an exposed NBD disk as the target:

(NBD server)
# qemu-img create -f qcow2 full.img 20G
# qemu-nbd full.img -t -p 9000 -f raw

(NBD client)
{"execute":"drive-backup","arguments":{"device": "drive_image1", "target": "nbd://10.66.144.75:9000", "sync": "full", "format": "qcow2","mode":"existing"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 421636}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 421686}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "drive_image1"}}
{"return": {}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466582}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive_image1", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466731}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466810}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive_image1", "len": 21474836480, "offset": 0, "speed": 0, "type": "backup", "error": "Invalid argument"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466885}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "drive_image1"}}
{"timestamp": {"seconds": 1557468632, "microseconds": 466917}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "drive_image1"}}

Then qemu hangs; pstack info as below:

# pstack 22850
Thread 7 (Thread 0x7f9ab262c700 (LWP 22851)):
#0 0x00007f9ab9e1cba9 in syscall () at /lib64/libc.so.6
#1 0x0000555d8fdb67a0 in qemu_event_wait (val=<optimized out>, f=<optimized out>) at /usr/src/debug/qemu-2.12.0/include/qemu/futex.h:29
#2 0x0000555d8fdb67a0 in qemu_event_wait (ev=ev@entry=0x555d90a40a28 <rcu_call_ready_event>) at util/qemu-thread-posix.c:445
#3 0x0000555d8fdc6e7e in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:261
#4 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#5 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 6 (Thread 0x7f9ab1e2b700 (LWP 22852)):
#0 0x00007f9ab9e17caf in ppoll () at /lib64/libc.so.6
#1 0x0000555d8fdb2779 in qemu_poll_ns (__ss=0x0, __timeout=0x7f9ab1e28780, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 0x0000555d8fdb2779 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at util/qemu-timer.c:334
#3 0x0000555d8fdb4517 in aio_poll (ctx=0x555d91517900, blocking=blocking@entry=true) at util/aio-posix.c:645
#4 0x0000555d8fb8204e in iothread_run (opaque=0x555d91551ea0) at iothread.c:64
#5 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#6 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 5 (Thread 0x7f9ab0e29700 (LWP 22865)):
#0 0x00007f9ab9e17caf in ppoll () at /lib64/libc.so.6
#1 0x0000555d8fdb279b in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 0x0000555d8fdb279b in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at util/qemu-timer.c:322
#3 0x0000555d8fdb4517 in aio_poll (ctx=0x555d91517b80, blocking=blocking@entry=true) at util/aio-posix.c:645
#4 0x0000555d8fb8204e in iothread_run (opaque=0x555d9162a0e0) at iothread.c:64
#5 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#6 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 4 (Thread 0x7f9ab0628700 (LWP 22867)):
#0 0x00007f9ab9e192b7 in ioctl () at /lib64/libc.so.6
#1 0x0000555d8fab9e85 in kvm_vcpu_ioctl (cpu=cpu@entry=0x555d9191e000, type=type@entry=44672) at /usr/src/debug/qemu-2.12.0/accel/kvm/kvm-all.c:2109
#2 0x0000555d8fab9f53 in kvm_cpu_exec (cpu=cpu@entry=0x555d9191e000) at /usr/src/debug/qemu-2.12.0/accel/kvm/kvm-all.c:1946
#3 0x0000555d8fa974c6 in qemu_kvm_cpu_thread_fn (arg=0x555d9191e000) at /usr/src/debug/qemu-2.12.0/cpus.c:1215
#4 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#5 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 3 (Thread 0x7f9aafe27700 (LWP 22868)):
#0 0x00007f9ab9e192b7 in ioctl () at /lib64/libc.so.6
#1 0x0000555d8fab9e85 in kvm_vcpu_ioctl (cpu=cpu@entry=0x555d9197e000, type=type@entry=44672) at /usr/src/debug/qemu-2.12.0/accel/kvm/kvm-all.c:2109
#2 0x0000555d8fab9f53 in kvm_cpu_exec (cpu=cpu@entry=0x555d9197e000) at /usr/src/debug/qemu-2.12.0/accel/kvm/kvm-all.c:1946
#3 0x0000555d8fa974c6 in qemu_kvm_cpu_thread_fn (arg=0x555d9197e000) at /usr/src/debug/qemu-2.12.0/cpus.c:1215
#4 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#5 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 2 (Thread 0x7f99adbff700 (LWP 22946)):
#0 0x00007f9aba0fd9f5 in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1 0x0000555d8fdb6379 in qemu_cond_wait_impl (cond=cond@entry=0x555d91500ab0, mutex=mutex@entry=0x555d91500ae8, file=file@entry=0x555d8ff20287 "ui/vnc-jobs.c", line=line@entry=212) at util/qemu-thread-posix.c:164
#2 0x0000555d8fccf0bf in vnc_worker_thread_loop (queue=queue@entry=0x555d91500ab0) at ui/vnc-jobs.c:212
#3 0x0000555d8fccf688 in vnc_worker_thread (arg=0x555d91500ab0) at ui/vnc-jobs.c:319
#4 0x00007f9aba0f9e65 in start_thread () at /lib64/libpthread.so.0
#5 0x00007f9ab9e2288d in clone () at /lib64/libc.so.6
Thread 1 (Thread 0x7f9ad350fdc0 (LWP 22850)):
#0 0x00007f9ab9e17caf in ppoll () at /lib64/libc.so.6
#1 0x0000555d8fdb279b in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2 0x0000555d8fdb279b in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at util/qemu-timer.c:322
#3 0x0000555d8fdb4517 in aio_poll (ctx=0x555d915177c0, blocking=blocking@entry=true) at util/aio-posix.c:645
#4 0x0000555d8fd35f5a in nbd_client_close (bs=0x555d91909400) at block/nbd-client.c:62
#5 0x0000555d8fd35f5a in nbd_client_close (bs=0x555d91909400) at block/nbd-client.c:984
#6 0x0000555d8fd3387a in nbd_close (bs=<optimized out>) at block/nbd.c:455
#7 0x0000555d8fcdc1b2 in bdrv_unref (bs=0x555d91909400) at block.c:3414
#8 0x0000555d8fcdc1b2 in bdrv_unref (bs=0x555d91909400) at block.c:3598
#9 0x0000555d8fcdc1b2 in bdrv_unref (bs=0x555d91909400) at block.c:4681
#10 0x0000555d8fcdc1df in bdrv_unref (bs=0x555d9190c800) at block.c:3421
#11 0x0000555d8fcdc1df in bdrv_unref (bs=0x555d9190c800) at block.c:3598
#12 0x0000555d8fcdc1df in bdrv_unref (bs=0x555d9190c800) at block.c:4681
#13 0x0000555d8fcdf804 in block_job_remove_all_bdrv (job=job@entry=0x555d92471680) at blockjob.c:200
#14 0x0000555d8fcdf853 in block_job_free (job=0x555d92471680) at blockjob.c:94
#15 0x0000555d8fce0bad in job_unref (job=0x555d92471680) at job.c:360
#16 0x0000555d8fce0da8 in job_finalize_single (job=0x555d92471680) at job.c:626
#17 0x0000555d8fce0da8 in job_finalize_single (job=0x555d92471680) at job.c:694
#18 0x0000555d8fce167a in job_completed_txn_abort (job=0x555d92471680) at job.c:754
#19 0x0000555d8fce19d0 in job_exit (opaque=0x555d92471680) at job.c:869
#20 0x0000555d8fdb1301 in aio_bh_poll (bh=0x555d91911f20) at util/async.c:90
#21 0x0000555d8fdb1301 in aio_bh_poll (ctx=ctx@entry=0x555d915177c0) at util/async.c:118
#22 0x0000555d8fdb43b0 in aio_dispatch (ctx=0x555d915177c0) at util/aio-posix.c:440
#23 0x0000555d8fdb11de in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#24 0x00007f9ad2c13049 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#25 0x0000555d8fdb36a7 in main_loop_wait () at util/main-loop.c:215
#26 0x0000555d8fdb36a7 in main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#27 0x0000555d8fdb36a7 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#28 0x0000555d8fa540e7 in main () at vl.c:1964
#29 0x0000555d8fa540e7 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4782
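For contrast with the failing cases above, here is a minimal sketch (not taken from the report's logs) of the combination the opening sentence says works: the same drive-backup plus block-dirty-bitmap-add actions submitted with the default completion-mode "individual" (equivalently, with the properties field omitted).

{ "execute": "transaction", "arguments": { "actions": [
  {"type": "drive-backup", "data": {"device": "drive_image1", "target": "full.img", "sync": "full", "format": "qcow2", "mode": "existing"}},
  {"type": "block-dirty-bitmap-add", "data": {"node": "drive_image1", "name": "bitmap0"}}],
  "properties": {"completion-mode": "individual"}}}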
Let's try this again with 4.2.0; a lot has changed since then. block-dirty-bitmap-add should not be able to be used in grouped completion mode; it's a problem that QEMU is accepting this command. I am not presently sure of how it was ever able to. From what I can guess, QEMU will try to roll back these commands in reverse order, but it's likely that "drive_image1" and "drive_data1" now refer to different nodes, which might have some negative effects. Not sure why it's exhibiting as a hang, exactly.

Some thoughts, if it is still reproducible on 4.2:

1. Try this with blockdev node names and see if behavior changes. It might "just work". I suspect that when it is trying to remove the bitmaps on failure, it is trying to get a lock on the wrong nodes, because the jobs are still running.
2. Using either blockdev OR drive, try putting the bitmap-add commands FIRST (see the sketch below).

Will investigate.
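As a concrete reading of suggestion 2, a minimal sketch (not from the bug's logs) of a grouped transaction with the bitmap-add actions placed first, written blockdev-style with explicit job IDs; the node names drive_image1/drive_data1 and target nodes sn1/sn2 follow the retest below and are assumed to already exist. As that retest confirms, grouped completion mode still rejects block-dirty-bitmap-add, so the ordering mainly probes whether the failure path changes.

{ "execute": "transaction", "arguments": { "actions": [
  {"type": "block-dirty-bitmap-add", "data": {"node": "drive_image1", "name": "bitmap0"}},
  {"type": "block-dirty-bitmap-add", "data": {"node": "drive_data1", "name": "bitmap0"}},
  {"type": "blockdev-backup", "data": {"device": "drive_image1", "target": "sn1", "sync": "full", "job-id": "j1"}},
  {"type": "blockdev-backup", "data": {"device": "drive_data1", "target": "sn2", "sync": "full", "job-id": "j2"}}],
  "properties": {"completion-mode": "grouped"}}}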
Test on qemu-kvm-4.2.0-8.module+el8.2.0+5607+dc756904 with -blockdev:

1. Yes, "block-dirty-bitmap-add" is not supported in grouped completion mode.

{ "execute": "transaction", "arguments": { "actions": [ {"type": "blockdev-backup", "data": { "device": "drive_image1", "target": "sn1", "sync": "full", "job-id":"j1" } }, {"type": "block-dirty-bitmap-add", "data": { "node": "drive_image1", "name": "bitmap0" } },{"type": "blockdev-backup", "data": { "device": "drive_data1", "target": "sn2", "sync": "full", "job-id":"j2" } }, {"type": "block-dirty-bitmap-add", "data": { "node": "drive_data2", "name": "bitmap0"}}],"properties":{"completion-mode":"grouped"}}}
{"timestamp": {"seconds": 1580720422, "microseconds": 545532}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1580720422, "microseconds": 545606}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1580720422, "microseconds": 545677}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1580720422, "microseconds": 545733}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}
{"error": {"class": "GenericError", "desc": "Action 'block-dirty-bitmap-add' does not support Transaction property completion-mode = grouped"}}

2. Grouped completion mode works normally on multi-disk backup jobs.

{ "execute": "transaction", "arguments": { "actions": [ {"type": "blockdev-backup", "data": { "device": "drive_image1", "target": "sn1", "sync": "full", "job-id":"j1" } },{"type": "blockdev-backup", "data": { "device": "drive_data1", "target": "sn2", "sync": "full", "job-id":"j2" } }],"properties":{"completion-mode":"grouped"}}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463511}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463738}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j2"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463790}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463831}, "event": "JOB_STATUS_CHANGE", "data": {"status": "paused", "id": "j1"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463870}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j2"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463908}, "event": "JOB_STATUS_CHANGE", "data": {"status": "paused", "id": "j2"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 463947}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"timestamp": {"seconds": 1580720520, "microseconds": 464003}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j2"}}
{"return": {}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223694}, "event": "BLOCK_JOB_ERROR", "data": {"device": "j2", "operation": "write", "action": "report"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223822}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j2"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223899}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j2", "len": 1073741824, "offset": 492503040, "speed": 0, "type": "backup", "error": "No space left on device"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223953}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j2"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 223991}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j2"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 376294}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "j1"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 376354}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "j1", "len": 21474836480, "offset": 269549568, "speed": 0, "type": "backup"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 376401}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1580720522, "microseconds": 376439}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}

{"execute":"query-block-jobs"}
{"return": []}
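For completeness, the generic job list can be checked the same way after the abort; a minimal sketch of the expected exchange once every job has reached the "null" state (an illustration, not output captured from the retest):

{"execute": "query-jobs"}
{"return": []}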
So, John, if you agree, I will close it as CURRENTRELEASE.

QEMU has been recently split into sub-components and, as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Yes, this looks like the correct behavior. CLOSED CURRENTRELEASE works.