Bug 1408653
| Field | Value |
|---|---|
| Summary | qemu aborts when taking internal snapshot if vcpus are not resumed after migration |
| Product | Red Hat Enterprise Linux 7 |
| Component | qemu-kvm-rhev |
| Version | 7.3 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DEFERRED |
| Severity | medium |
| Priority | low |
| Reporter | yisun |
| Assignee | Dr. David Alan Gilbert <dgilbert> |
| QA Contact | Li Xiaohui <xiaohli> |
| CC | ailan, aliang, chayang, coli, dyuan, hachen, hhan, hhuang, jinzhao, juzhang, lvivier, ngu, peterx, pingl, quintela, qzhang, virt-maint, xuwei, xuzhang, yisun |
| Target Milestone | rc |
| Target Release | --- |
| Type | Bug |
| Doc Type | If docs needed, set a value |
| Last Closed | 2019-07-22 20:31:25 UTC |
| Bug Blocks | 1473046 |
| Attachments | gdb backtrace (attachment 1235327) |
This issue was **NOT** reproduced with:

- libvirt-2.0.0-10.el7_3.2.x86_64
- qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64

but was reproduced with:

- libvirt-2.0.0-10.el7_3.3.x86_64
- qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64  <<===== qemu version not changed here

So it is not clear whether qemu or libvirt should carry the fix; if this is assigned to the wrong component, please reassign it promptly. Note also that the issue does not reproduce with libvirt-2.5.0-1.el7.x86_64 and qemu-kvm-rhev-2.6.0-29.el7.x86_64.

The libvirt bug that allowed the sequence of commands described above was fixed in the original bug, which tracks the patch that broke the code (bug 1403691). The problem reproduces even with current upstream qemu.

Steps to reproduce:

1) start qemu with a qcow2 volume
2) { "execute": "migrate", "arguments": { "uri": "tcp:0:4446" } }
3) { "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }

After that qemu aborts. Line numbers correspond to commit a92f7fe5a82ac9e8d127e92c5dce1a84064126da.

```
Thread 1 "qemu-system-x86" received signal SIGABRT, Aborted.
0x00007f6ce380e137 in raise () from target:/lib64/libc.so.6
(gdb) t a a bt

Thread 5 (Thread 0x7f6cd37fe700 (LWP 23180)):
#0  0x00007f6ce38b984d in poll () from target:/lib64/libc.so.6
#1  0x00007f6ce5b9a89c in g_main_context_iterate.isra () from target:/usr/lib64/libglib-2.0.so.0
#2  0x00007f6ce5b9ac22 in g_main_loop_run () from target:/usr/lib64/libglib-2.0.so.0
#3  0x00007f6ce4ce9778 in red_worker_main () from target:/usr/lib64/libspice-server.so.1
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 4 (Thread 0x7f6cd3fff700 (LWP 23178)):
#0  0x00007f6ce809a13f in pthread_cond_wait () from target:/lib64/libpthread.so.0
#1  0x000055e521148599 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x55e52177c300 <qemu_global_mutex>) at util/qemu-thread-posix.c:137
#2  0x000055e520dfe913 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/pipo/git/qemu.git/cpus.c:964
#3  qemu_kvm_cpu_thread_fn (arg=0x55e522041b50) at /home/pipo/git/qemu.git/cpus.c:1003
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 3 (Thread 0x7f6cd8f29700 (LWP 23177)):
#0  0x00007f6ce809a13f in pthread_cond_wait () from target:/lib64/libpthread.so.0
#1  0x000055e521148599 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x55e52177c300 <qemu_global_mutex>) at util/qemu-thread-posix.c:137
#2  0x000055e520dfe913 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/pipo/git/qemu.git/cpus.c:964
#3  qemu_kvm_cpu_thread_fn (arg=0x55e521fe1250) at /home/pipo/git/qemu.git/cpus.c:1003
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 2 (Thread 0x7f6cdb26c700 (LWP 23168)):
#0  0x00007f6ce38be429 in syscall () from target:/lib64/libc.so.6
#1  0x000055e5211488a5 in futex_wait (val=<optimized out>, ev=<optimized out>) at util/qemu-thread-posix.c:306
#2  qemu_event_wait (ev=ev@entry=0x55e521ba2a04 <rcu_call_ready_event>) at util/qemu-thread-posix.c:422
#3  0x000055e5211575fe in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:249
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 1 (Thread 0x7f6ceb935b00 (LWP 23165)):
#0  0x00007f6ce380e137 in raise () from target:/lib64/libc.so.6
#1  0x00007f6ce380f5ba in abort () from target:/lib64/libc.so.6
#2  0x00007f6ce38071bd in __assert_fail_base () from target:/lib64/libc.so.6
#3  0x00007f6ce3807272 in __assert_fail () from target:/lib64/libc.so.6
#4  0x000055e5210e6f7e in bdrv_co_pwritev (child=<optimized out>, offset=<optimized out>, bytes=<optimized out>, qiov=<optimized out>, flags=0) at block/io.c:1514
#5  0x000055e5210e7032 in bdrv_rw_co_entry (opaque=0x7ffeaf5e3fa0) at block/io.c:595
#6  0x000055e52115882a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79
#7  0x00007f6ce381ee00 in ?? () from target:/lib64/libc.so.6
#8  0x00007ffeaf5e3810 in ?? ()
#9  0x0000000000000000 in ?? ()
```

To reproduce via libvirt you can use the qemu-monitor-command facility, since the bug described above was already patched (qemu is resumed after migration and does not crash):

```
$ virsh qemu-monitor-command test '{ "execute": "migrate", "arguments": { "uri": "tcp:0:4446" } }'
$ virsh qemu-monitor-command test '{ "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }'
```

Reproduces with current upstream (commit 5e19aed59). Backtrace:

```
#0  0x00007fffda8d69fb in raise () at /lib64/libc.so.6
#1  0x00007fffda8d8800 in abort () at /lib64/libc.so.6
#2  0x00007fffda8cf0da in __assert_fail_base () at /lib64/libc.so.6
#3  0x00007fffda8cf152 in () at /lib64/libc.so.6
#4  0x0000555555c80a71 in bdrv_co_pwritev (child=0x555556bf83b0, offset=37289984, bytes=65536, qiov=0x7fffc95ffd90, flags=0) at /work/armbru/qemu/block/io.c:1619
#5  0x0000555555c4b98b in do_perform_cow_write (bs=0x555556bed350, cluster_offset=37289984, offset_in_cluster=0, qiov=0x7fffc95ffd90) at /work/armbru/qemu/block/qcow2-cluster.c:488
#6  0x0000555555c4c782 in perform_cow (bs=0x555556bed350, m=0x555556de47e0) at /work/armbru/qemu/block/qcow2-cluster.c:875
#7  0x0000555555c4c943 in qcow2_alloc_cluster_link_l2 (bs=0x555556bed350, m=0x555556de47e0) at /work/armbru/qemu/block/qcow2-cluster.c:924
#8  0x0000555555c3c28c in qcow2_co_pwritev (bs=0x555556bed350, offset=4294967296, bytes=32768, qiov=0x7fffffffbd90, flags=0) at /work/armbru/qemu/block/qcow2.c:1999
#9  0x0000555555c40fd9 in qcow2_save_vmstate (bs=0x555556bed350, qiov=0x7fffffffbd90, pos=0) at /work/armbru/qemu/block/qcow2.c:3875
#10 0x0000555555c820c4 in bdrv_co_rw_vmstate (bs=0x555556bed350, qiov=0x7fffffffbd90, pos=0, is_read=false) at /work/armbru/qemu/block/io.c:2215
#11 0x0000555555c8214d in bdrv_co_rw_vmstate_entry (opaque=0x7fffffffbce0) at /work/armbru/qemu/block/io.c:2228
```

We flunk

```
assert(!(bs->open_flags & BDRV_O_INACTIVE));
```

in bdrv_co_pwritev(). I'm not familiar with the logic around BDRV_O_INACTIVE, so I asked Kevin Wolf. He considers this a migration bug: savevm cannot work after migration has completed and transferred ownership of the images. With an explicit QMP command to transfer ownership, even a savevm command issued while migration runs would be safe: the (synchronous) savevm command would complete before ownership is transferred via the explicit command. Without one, migration would have to wait for jobs that cannot cope with the ownership transfer to complete. Reassigning to the migration team for further triage.
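Kevin Wolf's analysis also suggests why the scenario in the summary ("vcpus are not resumed after migration") matters: resuming the source guest reactivates its block nodes, clearing BDRV_O_INACTIVE, after which vmstate writes are legal again. A minimal sketch of the safe ordering, assuming the same domain name and port as the reproducer above and assuming the migration was abandoned so that no destination has taken over the images (an illustration of the ordering, not a verified fix):

```
$ virsh qemu-monitor-command test '{ "execute": "migrate", "arguments": { "uri": "tcp:0:4446" } }'
$ virsh qemu-monitor-command test '{ "execute": "cont" }'     <===== resuming should reactivate the block nodes
$ virsh qemu-monitor-command test '{ "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }'
```

With the "cont" in between, savevm should no longer trip the assertion in bdrv_co_pwritev(), because the write hits a node whose BDRV_O_INACTIVE flag has been cleared.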
Created attachment 1235327 [details]
gdb backtrace

Description of problem:
vm/qemu crashed when creating two snapshots with snapshot-create-as using specific options in a specific sequence.

Version-Release number of selected component (if applicable):
libvirt-2.0.0-10.el7_3.3.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64

How reproducible:
100%

PLEASE NOTE THIS IS NOT REPRODUCIBLE WITH: libvirt-2.0.0-10.el7_3.2.x86_64

Steps to Reproduce:

1. Prepare a vm:

```
# virsh list
 Id    Name                           State
----------------------------------------------------
 20    avocado-vt-vm1                 running
```

2. Start the vm:

```
# virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started
```

3. Confirm that there are currently no snapshots for the vm:

```
# virsh snapshot-list avocado-vt-vm1
 Name                 Creation Time             State
------------------------------------------------------------
```

4. Create a snapshot with --live and --memspec:

```
# virsh snapshot-create-as avocado-vt-vm1 snap1 --live --memspec /tmp/1.tmp
Domain snapshot snap1 created
```

5. Create another snapshot without any options, as in the workaround sketch after these steps:

```
# virsh snapshot-create-as avocado-vt-vm1 snap2
error: Unable to read from monitor: Connection reset by peer   <===== qemu crashed
```

Actual results:
qemu crashed at step 5.

Expected results:
The snapshot should be created successfully.

Additional info:
For the gdb backtrace, please refer to the attachment.
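Based on the analysis in the comments above, a hedged workaround sketch for this sequence (an assumption, not a verified fix): the --live --memspec snapshot in step 4 internally migrates the guest state to a file and can leave the guest paused with its block nodes inactivated, so resuming the guest before step 5 should avoid the crash:

```
# virsh domstate avocado-vt-vm1           <===== check whether step 4 left the guest paused
# virsh resume avocado-vt-vm1             <===== resume so qemu reactivates its block nodes
# virsh snapshot-create-as avocado-vt-vm1 snap2
```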