Bug 1408653
| Field | Value |
|---|---|
| Summary | qemu aborts when taking internal snapshot if vcpus are not resumed after migration |
| Product | Red Hat Enterprise Linux 7 |
| Component | qemu-kvm-rhev |
| Version | 7.3 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DEFERRED |
| Severity | medium |
| Priority | low |
| Reporter | yisun |
| Assignee | Dr. David Alan Gilbert <dgilbert> |
| QA Contact | Li Xiaohui <xiaohli> |
| CC | ailan, aliang, chayang, coli, dyuan, hachen, hhan, hhuang, jinzhao, juzhang, lvivier, ngu, peterx, pingl, quintela, qzhang, virt-maint, xuwei, xuzhang, yisun |
| Target Milestone | rc |
| Target Release | --- |
| Type | Bug |
| Doc Type | If docs needed, set a value |
| Last Closed | 2019-07-22 20:31:25 UTC |
| Bug Blocks | 1473046 |
| Attachments | gdb backtrace (attachment 1235327) |
This issue was **NOT** reproduced with:

- libvirt-2.0.0-10.el7_3.2.x86_64
- qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64

but was reproduced with:

- libvirt-2.0.0-10.el7_3.3.x86_64
- qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64  <<===== qemu version not changed here

So it is not clear whether qemu or libvirt should carry the fix; if this is assigned to the wrong component, please reassign it promptly. Note also that the issue does not reproduce with libvirt-2.5.0-1.el7.x86_64 and qemu-kvm-rhev-2.6.0-29.el7.x86_64.

The libvirt bug that allowed the sequence of commands described above was fixed in the original bug, which tracks the patch that broke the code (bug 1403691). The problem reproduces even with current upstream qemu.

Steps to reproduce:

1) start qemu with a qcow2 volume
2) { "execute": "migrate", "arguments": { "uri": "tcp:0:4446" } }
3) { "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }

After that qemu aborts. Line numbers correspond to commit a92f7fe5a82ac9e8d127e92c5dce1a84064126da.

```
Thread 1 "qemu-system-x86" received signal SIGABRT, Aborted.
0x00007f6ce380e137 in raise () from target:/lib64/libc.so.6
(gdb) t a a bt

Thread 5 (Thread 0x7f6cd37fe700 (LWP 23180)):
#0  0x00007f6ce38b984d in poll () from target:/lib64/libc.so.6
#1  0x00007f6ce5b9a89c in g_main_context_iterate.isra () from target:/usr/lib64/libglib-2.0.so.0
#2  0x00007f6ce5b9ac22 in g_main_loop_run () from target:/usr/lib64/libglib-2.0.so.0
#3  0x00007f6ce4ce9778 in red_worker_main () from target:/usr/lib64/libspice-server.so.1
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 4 (Thread 0x7f6cd3fff700 (LWP 23178)):
#0  0x00007f6ce809a13f in pthread_cond_wait () from target:/lib64/libpthread.so.0
#1  0x000055e521148599 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x55e52177c300 <qemu_global_mutex>) at util/qemu-thread-posix.c:137
#2  0x000055e520dfe913 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/pipo/git/qemu.git/cpus.c:964
#3  qemu_kvm_cpu_thread_fn (arg=0x55e522041b50) at /home/pipo/git/qemu.git/cpus.c:1003
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 3 (Thread 0x7f6cd8f29700 (LWP 23177)):
#0  0x00007f6ce809a13f in pthread_cond_wait () from target:/lib64/libpthread.so.0
#1  0x000055e521148599 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x55e52177c300 <qemu_global_mutex>) at util/qemu-thread-posix.c:137
#2  0x000055e520dfe913 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/pipo/git/qemu.git/cpus.c:964
#3  qemu_kvm_cpu_thread_fn (arg=0x55e521fe1250) at /home/pipo/git/qemu.git/cpus.c:1003
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 2 (Thread 0x7f6cdb26c700 (LWP 23168)):
#0  0x00007f6ce38be429 in syscall () from target:/lib64/libc.so.6
#1  0x000055e5211488a5 in futex_wait (val=<optimized out>, ev=<optimized out>) at util/qemu-thread-posix.c:306
#2  qemu_event_wait (ev=ev@entry=0x55e521ba2a04 <rcu_call_ready_event>) at util/qemu-thread-posix.c:422
#3  0x000055e5211575fe in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:249
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 1 (Thread 0x7f6ceb935b00 (LWP 23165)):
#0  0x00007f6ce380e137 in raise () from target:/lib64/libc.so.6
#1  0x00007f6ce380f5ba in abort () from target:/lib64/libc.so.6
#2  0x00007f6ce38071bd in __assert_fail_base () from target:/lib64/libc.so.6
#3  0x00007f6ce3807272 in __assert_fail () from target:/lib64/libc.so.6
#4  0x000055e5210e6f7e in bdrv_co_pwritev (child=<optimized out>, offset=<optimized out>, bytes=<optimized out>, qiov=<optimized out>, flags=0) at block/io.c:1514
#5  0x000055e5210e7032 in bdrv_rw_co_entry (opaque=0x7ffeaf5e3fa0) at block/io.c:595
#6  0x000055e52115882a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79
#7  0x00007f6ce381ee00 in ?? () from target:/lib64/libc.so.6
#8  0x00007ffeaf5e3810 in ?? ()
#9  0x0000000000000000 in ?? ()
```

To reproduce via libvirt you can use the qemu-monitor-command facility, since the bug described above was already patched (qemu is resumed after migration and does not crash):

```
$ virsh qemu-monitor-command test '{ "execute": "migrate", "arguments": { "uri": "tcp:0:4446" } }'
$ virsh qemu-monitor-command test '{ "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }'
```

Reproduces with current upstream (commit 5e19aed59). Backtrace:

```
#0  0x00007fffda8d69fb in raise () at /lib64/libc.so.6
#1  0x00007fffda8d8800 in abort () at /lib64/libc.so.6
#2  0x00007fffda8cf0da in __assert_fail_base () at /lib64/libc.so.6
#3  0x00007fffda8cf152 in () at /lib64/libc.so.6
#4  0x0000555555c80a71 in bdrv_co_pwritev (child=0x555556bf83b0, offset=37289984, bytes=65536, qiov=0x7fffc95ffd90, flags=0) at /work/armbru/qemu/block/io.c:1619
#5  0x0000555555c4b98b in do_perform_cow_write (bs=0x555556bed350, cluster_offset=37289984, offset_in_cluster=0, qiov=0x7fffc95ffd90) at /work/armbru/qemu/block/qcow2-cluster.c:488
#6  0x0000555555c4c782 in perform_cow (bs=0x555556bed350, m=0x555556de47e0) at /work/armbru/qemu/block/qcow2-cluster.c:875
#7  0x0000555555c4c943 in qcow2_alloc_cluster_link_l2 (bs=0x555556bed350, m=0x555556de47e0) at /work/armbru/qemu/block/qcow2-cluster.c:924
#8  0x0000555555c3c28c in qcow2_co_pwritev (bs=0x555556bed350, offset=4294967296, bytes=32768, qiov=0x7fffffffbd90, flags=0) at /work/armbru/qemu/block/qcow2.c:1999
#9  0x0000555555c40fd9 in qcow2_save_vmstate (bs=0x555556bed350, qiov=0x7fffffffbd90, pos=0) at /work/armbru/qemu/block/qcow2.c:3875
#10 0x0000555555c820c4 in bdrv_co_rw_vmstate (bs=0x555556bed350, qiov=0x7fffffffbd90, pos=0, is_read=false) at /work/armbru/qemu/block/io.c:2215
#11 0x0000555555c8214d in bdrv_co_rw_vmstate_entry (opaque=0x7fffffffbce0) at /work/armbru/qemu/block/io.c:2228
```

We flunk

```
assert(!(bs->open_flags & BDRV_O_INACTIVE));
```

in bdrv_co_pwritev(). I'm not familiar with the logic around BDRV_O_INACTIVE, so I asked Kevin Wolf. He considers this a migration bug: savevm cannot work after migration has completed and transferred ownership of the images. With an explicit QMP command to transfer ownership, even a savevm command issued while migration runs would be safe: the (synchronous) savevm command would complete before ownership is transferred via the explicit command. Without one, migration would have to wait for jobs that cannot cope with the ownership transfer to complete. Reassigning to the migration team for further triage.
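Kevin Wolf's analysis also suggests why the scenario in the summary ("vcpus are not resumed after migration") matters: resuming the source guest reactivates its block nodes, clearing BDRV_O_INACTIVE, after which vmstate writes are legal again. A minimal sketch of the safe ordering, assuming the same domain name and port as the reproducer above and assuming the migration was abandoned so that no destination has taken over the images (an illustration of the ordering, not a verified fix):

```
$ virsh qemu-monitor-command test '{ "execute": "migrate", "arguments": { "uri": "tcp:0:4446" } }'
$ virsh qemu-monitor-command test '{ "execute": "cont" }'     <===== resuming should reactivate the block nodes
$ virsh qemu-monitor-command test '{ "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }'
```

With the "cont" in between, savevm should no longer trip the assertion in bdrv_co_pwritev(), because the write hits a node whose BDRV_O_INACTIVE flag has been cleared.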
Created attachment 1235327 [details]
gdb backtrace

Description of problem:
vm/qemu crashed when creating two snapshots with snapshot-create-as using specific options in a specific sequence.

Version-Release number of selected component (if applicable):
libvirt-2.0.0-10.el7_3.3.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64

How reproducible:
100%

PLEASE NOTE THIS IS NOT REPRODUCIBLE WITH: libvirt-2.0.0-10.el7_3.2.x86_64

Steps to Reproduce:

1. Prepare a vm:

```
# virsh list
 Id    Name                           State
----------------------------------------------------
 20    avocado-vt-vm1                 running
```

2. Start the vm:

```
# virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started
```

3. Confirm that there are currently no snapshots for the vm:

```
# virsh snapshot-list avocado-vt-vm1
 Name                 Creation Time             State
------------------------------------------------------------
```

4. Create a snapshot with --live and --memspec:

```
# virsh snapshot-create-as avocado-vt-vm1 snap1 --live --memspec /tmp/1.tmp
Domain snapshot snap1 created
```

5. Create another snapshot without any options, as in the workaround sketch after these steps:

```
# virsh snapshot-create-as avocado-vt-vm1 snap2
error: Unable to read from monitor: Connection reset by peer   <===== qemu crashed
```

Actual results:
qemu crashed at step 5.

Expected results:
The snapshot should be created successfully.

Additional info:
For the gdb backtrace, please refer to the attachment.
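Based on the analysis in the comments above, a hedged workaround sketch for this sequence (an assumption, not a verified fix): the --live --memspec snapshot in step 4 internally migrates the guest state to a file and can leave the guest paused with its block nodes inactivated, so resuming the guest before step 5 should avoid the crash:

```
# virsh domstate avocado-vt-vm1           <===== check whether step 4 left the guest paused
# virsh resume avocado-vt-vm1             <===== resume so qemu reactivates its block nodes
# virsh snapshot-create-as avocado-vt-vm1 snap2
```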