Bug 1408653 - qemu aborts when taking internal snapshot if vcpus are not resumed after migration
Summary: qemu aborts when taking internal snapshot if vcpus are not resumed after migration
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Dr. David Alan Gilbert
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks: 1473046
 
Reported: 2016-12-26 08:44 UTC by yisun
Modified: 2020-01-16 02:27 UTC
CC List: 20 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-22 20:31:25 UTC
Target Upstream Version:
Embargoed:


Attachments
gdb backtrace (8.50 KB, text/plain), 2016-12-26 08:44 UTC, yisun

Description yisun 2016-12-26 08:44:10 UTC
Created attachment 1235327 [details]
gdb backtrace

Description of problem:
The VM's qemu process crashed when two snapshots were created with snapshot-create-as using a specific combination of options and ordering.

Version-Release number of selected component (if applicable):
libvirt-2.0.0-10.el7_3.3.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64

How reproducible:
100%

PLEASE NOTE THIS IS NOT REPRODUCIBLE WITH: libvirt-2.0.0-10.el7_3.2.x86_64

Steps to Reproduce:
1. prepare a vm
# virsh list
 Id    Name                           State
----------------------------------------------------
 20    avocado-vt-vm1                 running

2. # virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started


3. currently there are no snapshots for that vm
# virsh snapshot-list avocado-vt-vm1
 Name                 Creation Time             State
------------------------------------------------------------

4. create a snapshot with --live and --memspec
# virsh snapshot-create-as avocado-vt-vm1 snap1 --live --memspec /tmp/1.tmp
Domain snapshot snap1 created

5. create another snapshot without any options
# virsh snapshot-create-as avocado-vt-vm1 snap2
error: Unable to read from monitor: Connection reset by peer
<===== qemu crashed 


Actual results:
qemu crashed at step 5

Expected results:
The snapshot should be created successfully.


Additional info:
For the gdb backtrace, please refer to the attachment. A consolidated reproduction sketch follows below.
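
For convenience, here is a minimal consolidated sketch of the reproduction steps above. It is illustrative only: it assumes the guest avocado-vt-vm1 from the listing is already defined and running with no existing snapshots, and it reuses the memory snapshot path /tmp/1.tmp from step 4.

#!/bin/sh
# Consolidated reproduction sketch; see the assumptions in the note above.
set -x

# Steps 1-3: confirm the guest is running and has no snapshots yet.
virsh domstate avocado-vt-vm1
virsh snapshot-list avocado-vt-vm1

# Step 4: snapshot with --live and --memspec.
virsh snapshot-create-as avocado-vt-vm1 snap1 --live --memspec /tmp/1.tmp

# Step 5: snapshot with no options; on the affected versions this is where
# qemu crashes and virsh reports "Unable to read from monitor".
virsh snapshot-create-as avocado-vt-vm1 snap2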

Comment 2 yisun 2016-12-26 09:00:18 UTC
this issue was **NOT** reproduced with
libvirt-2.0.0-10.el7_3.2.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64

but reproduced with:
libvirt-2.0.0-10.el7_3.3.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.2.x86_64  <<===== qemu version not changed here

So I am not sure whether qemu or libvirt should fix this; if it is assigned to the wrong component, please change it. Thanks.

Comment 3 yisun 2016-12-26 11:01:18 UTC
By the way, this issue is not reproduced with libvirt-2.5.0-1.el7.x86_64 and qemu-kvm-rhev-2.6.0-29.el7.x86_64.

Comment 6 Peter Krempa 2017-01-10 13:44:46 UTC
The libvirt bug that allowed the sequence of commands described above was fixed in the original bug that tracks the patch which broke the code (bug 1403691).

The problem reproduces even with current upstream qemu.

Steps to reproduce:
1) start qemu with a qcow2 volume
2) { "execute": "migrate", "arguments": { "uri": "tcp:0:4446" } }
3) { "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }

After that, qemu aborts. Line numbers in the backtrace below correspond to commit a92f7fe5a82ac9e8d127e92c5dce1a84064126da.

Thread 1 "qemu-system-x86" received signal SIGABRT, Aborted.
0x00007f6ce380e137 in raise () from target:/lib64/libc.so.6
(gdb) t a a bt

Thread 5 (Thread 0x7f6cd37fe700 (LWP 23180)):
#0  0x00007f6ce38b984d in poll () from target:/lib64/libc.so.6
#1  0x00007f6ce5b9a89c in g_main_context_iterate.isra () from target:/usr/lib64/libglib-2.0.so.0
#2  0x00007f6ce5b9ac22 in g_main_loop_run () from target:/usr/lib64/libglib-2.0.so.0
#3  0x00007f6ce4ce9778 in red_worker_main () from target:/usr/lib64/libspice-server.so.1
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 4 (Thread 0x7f6cd3fff700 (LWP 23178)):
#0  0x00007f6ce809a13f in pthread_cond_wait () from target:/lib64/libpthread.so.0
#1  0x000055e521148599 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x55e52177c300 <qemu_global_mutex>)
    at util/qemu-thread-posix.c:137
#2  0x000055e520dfe913 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/pipo/git/qemu.git/cpus.c:964
#3  qemu_kvm_cpu_thread_fn (arg=0x55e522041b50) at /home/pipo/git/qemu.git/cpus.c:1003
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 3 (Thread 0x7f6cd8f29700 (LWP 23177)):
#0  0x00007f6ce809a13f in pthread_cond_wait () from target:/lib64/libpthread.so.0
#1  0x000055e521148599 in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x55e52177c300 <qemu_global_mutex>)
    at util/qemu-thread-posix.c:137
#2  0x000055e520dfe913 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/pipo/git/qemu.git/cpus.c:964
#3  qemu_kvm_cpu_thread_fn (arg=0x55e521fe1250) at /home/pipo/git/qemu.git/cpus.c:1003
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 2 (Thread 0x7f6cdb26c700 (LWP 23168)):
#0  0x00007f6ce38be429 in syscall () from target:/lib64/libc.so.6
#1  0x000055e5211488a5 in futex_wait (val=<optimized out>, ev=<optimized out>) at util/qemu-thread-posix.c:306
#2  qemu_event_wait (ev=ev@entry=0x55e521ba2a04 <rcu_call_ready_event>) at util/qemu-thread-posix.c:422
#3  0x000055e5211575fe in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:249
#4  0x00007f6ce8094494 in start_thread () from target:/lib64/libpthread.so.0
#5  0x00007f6ce38c294d in clone () from target:/lib64/libc.so.6

Thread 1 (Thread 0x7f6ceb935b00 (LWP 23165)):
#0  0x00007f6ce380e137 in raise () from target:/lib64/libc.so.6
#1  0x00007f6ce380f5ba in abort () from target:/lib64/libc.so.6
#2  0x00007f6ce38071bd in __assert_fail_base () from target:/lib64/libc.so.6
#3  0x00007f6ce3807272 in __assert_fail () from target:/lib64/libc.so.6
#4  0x000055e5210e6f7e in bdrv_co_pwritev (child=<optimized out>, offset=<optimized out>, bytes=<optimized out>, 
    qiov=<optimized out>, flags=0) at block/io.c:1514
#5  0x000055e5210e7032 in bdrv_rw_co_entry (opaque=0x7ffeaf5e3fa0) at block/io.c:595
#6  0x000055e52115882a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79
#7  0x00007f6ce381ee00 in ?? () from target:/lib64/libc.so.6
#8  0x00007ffeaf5e3810 in ?? ()
#9  0x0000000000000000 in ?? ()

To reproduce via libvirt, you can use the qemu-monitor-command facility, since the libvirt bug described above has already been patched (libvirt now resumes qemu after migration, so the normal snapshot path does not crash).
 
$ virsh qemu-monitor-command test '{ "execute": "migrate", "arguments": { "uri": "tcp:0:4446" } }'
$ virsh qemu-monitor-command test '{ "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }'
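
If you want to drive the same three steps against a bare qemu process (no libvirt), a sketch along the following lines should work. Everything here is an assumption rather than part of the report: the qemu binary name, image path, QMP socket path, and sleep durations are placeholders, socat is assumed to be available, and a destination qemu listening with -incoming tcp:0:4446 is assumed so that the migration can actually complete.

#!/bin/sh
# Illustrative only; see the assumptions in the note above.
qemu-img create -f qcow2 /tmp/snaptest.qcow2 1G

# Step 1: start qemu with a qcow2 volume and a scriptable QMP socket.
qemu-system-x86_64 -nodefaults -display none \
    -drive file=/tmp/snaptest.qcow2,format=qcow2,if=virtio \
    -qmp unix:/tmp/qmp.sock,server,nowait &

sleep 2   # give qemu time to create the QMP socket

# Steps 2 and 3: QMP requires the capabilities handshake first.
{
    echo '{ "execute": "qmp_capabilities" }'
    echo '{ "execute": "migrate", "arguments": { "uri": "tcp:0:4446" } }'
    sleep 10   # wait for the migration to complete
    echo '{ "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }'
    sleep 5    # keep the connection open long enough to observe the abort
} | socat - UNIX-CONNECT:/tmp/qmp.sock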

Comment 8 Markus Armbruster 2017-11-28 14:38:42 UTC
Reproduces with current upstream (commit 5e19aed59).  Backtrace:

#0  0x00007fffda8d69fb in raise () at /lib64/libc.so.6
#1  0x00007fffda8d8800 in abort () at /lib64/libc.so.6
#2  0x00007fffda8cf0da in __assert_fail_base () at /lib64/libc.so.6
#3  0x00007fffda8cf152 in  () at /lib64/libc.so.6
#4  0x0000555555c80a71 in bdrv_co_pwritev (child=0x555556bf83b0, offset=37289984, bytes=65536, qiov=0x7fffc95ffd90, flags=0)
    at /work/armbru/qemu/block/io.c:1619
#5  0x0000555555c4b98b in do_perform_cow_write (bs=0x555556bed350, cluster_offset=37289984, offset_in_cluster=0, qiov=0x7fffc95ffd90)
    at /work/armbru/qemu/block/qcow2-cluster.c:488
#6  0x0000555555c4c782 in perform_cow (bs=0x555556bed350, m=0x555556de47e0)
    at /work/armbru/qemu/block/qcow2-cluster.c:875
#7  0x0000555555c4c943 in qcow2_alloc_cluster_link_l2 (bs=0x555556bed350, m=0x555556de47e0) at /work/armbru/qemu/block/qcow2-cluster.c:924
#8  0x0000555555c3c28c in qcow2_co_pwritev (bs=0x555556bed350, offset=4294967296, bytes=32768, qiov=0x7fffffffbd90, flags=0)
    at /work/armbru/qemu/block/qcow2.c:1999
#9  0x0000555555c40fd9 in qcow2_save_vmstate (bs=0x555556bed350, qiov=0x7fffffffbd90, pos=0) at /work/armbru/qemu/block/qcow2.c:3875
#10 0x0000555555c820c4 in bdrv_co_rw_vmstate (bs=0x555556bed350, qiov=0x7fffffffbd90, pos=0, is_read=false) at /work/armbru/qemu/block/io.c:2215
#11 0x0000555555c8214d in bdrv_co_rw_vmstate_entry (opaque=0x7fffffffbce0)
    at /work/armbru/qemu/block/io.c:2228

Comment 9 Markus Armbruster 2017-11-28 15:43:09 UTC
We flunk

    assert(!(bs->open_flags & BDRV_O_INACTIVE));

in bdrv_co_pwritev().  I'm not familiar with the logic around BDRV_O_INACTIVE, so I asked Kevin Wolf.  He considers this a migration bug: savevm can't work after migration has completed and transferred ownership.

With an explicit QMP command to transfer ownership, even a savevm command issued while migration runs would be safe: the (synchronous) savevm command completes before ownership is transferred via explicit command.

Without, migration would have to wait for jobs that can't cope with ownership transfer to complete.

Reassigning to migration team for further triage.
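
In the meantime, the ordering constraint can be illustrated from the management side. The sketch below is illustrative only, not a fix: it assumes a libvirt guest named "test" (as in comment 6) and the jq tool, checks the run state first, and resumes the guest if it is sitting in postmigrate, since per the bug summary and comment 6 the abort only happens when the vcpus are not resumed after migration.

#!/bin/sh
# Illustration of the ordering constraint, not a fix; see the note above.
state=$(virsh qemu-monitor-command test '{ "execute": "query-status" }' \
        | jq -r '.return.status')

if [ "$state" = "postmigrate" ]; then
    # Per the summary, savevm only aborts when the vcpus were not
    # resumed after a completed migration, so resume first.
    virsh qemu-monitor-command test '{ "execute": "cont" }'
fi

virsh qemu-monitor-command test \
    '{ "execute": "human-monitor-command", "arguments": { "command-line": "savevm ble" } }'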

