Bug 733993

Summary: migration target can crash (assert(d->ssd.running))
Product: Red Hat Enterprise Linux 6 Reporter: Yonit Halperin <yhalperi>
Component: qemu-kvmAssignee: Yonit Halperin <yhalperi>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1CC: alevy, bcao, dblechte, juzhang, mkenneth, shuang, tburke, virt-maint, xfu
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: qemu-kvm-0.12.1.2-2.192.el6 Doc Type: Bug Fix
Doc Text:
Cause: qxl->ssd.running=true was set after telling the target spice server to start; the spice server thread can call qxl_send_events while qxl->ssd.running is still false. Consequence: the target qemu aborts on assert(d->ssd.running). Fix: set qxl->ssd.running=true before telling spice to start. Result: the target qemu does not crash.
Story Points: ---
Clone Of:
Clones: 734784 (view as bug list)
Environment:
Last Closed: 2011-12-06 15:56:44 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:
Bug Blocks: 734784, 743047

Description Yonit Halperin 2011-08-29 06:01:40 UTC
Description of problem:
After migration completes, the target can crash with assert(d->ssd.running)
in qxl_send_events

When migration completes and the target guest is started, the following
occurs:
qemu_spice_vm_change_state_handler is called
1.1) qemu_spice_vm_change_state_handler calls qemu_spice_start
1.2) qemu_spice_vm_change_state_handler sets ssd->running = true

The problem is that ssd->running is accessed both from spice's red_worker thread and from the qemu thread, so the following interleaving is possible:
1) qemu thread: qemu_spice_start is called (but hasn't set ssd->running=true yet)
2) red_worker thread: red_worker starts
3) red_worker thread: calls qxl->interface_get_command and triggers
   qxl_send_events
4) qxl_send_events hits assert(d->ssd.running)
The simplest solution is to set ssd.running = true before calling qemu_spice_start. Alternatively, we could use locks.

Comment 1 Yonit Halperin 2011-08-29 10:05:24 UTC
(In reply to comment #0)
> Description of problem:
> After migration completes, the target can crash with assert(d->ssd.running)
> in qxl_send_events
> 
> When migration completes and the target guest is started, the following
> occurs:
> qemu_spice_vm_change_state_handler is called
> 1.1) qemu_spice_vm_change_state_handler calls qemu_spice_start
> 1.2)  qemu_spice_vm_change_state_handle sets ssd->running = true
> 
> The problem is ssd->running is accessed both from spice's red_worker thread and
> qemu thread.
> 1) qemu thread: qemu_spice_start (but doesn't set ssd->running=true yet)
> 2) red_worker thread: red_worker starts
> 3) red_worker thread: calls qxl->interface_get_command and triggers
>    qxl_send_events
> 4) assert(d->ssd.running)
> The simplest solution is to just set ssd.running = true, before calling
> qemu_spice_start. Alternatively, we can use locks.
Correction: we can't just move the ssd.running assignment. Until start/stop have actually been performed in the red_worker, the worker can still perform other operations that trigger qxl_send_events, for example, so ssd->running must stay synchronized with the current worker state.
In addition, I think that qemu_spice_start should be changed in spice-server to be synchronous.

Comment 9 juzhang 2011-10-20 03:14:45 UTC
According to comment 7 and comment 8, could you please tell us whether this information is enough to mark this issue as verified?

Comment 11 Mike Cao 2011-10-25 08:13:45 UTC
Reproduced this issue on qemu-kvm-0.12.1.2-2.159.el6
Verified this issue on qemu-kvm-0.12.1.2-2.200.el6

steps:
1. Start the guest with -spice
CLI: /usr/libexec/qemu-kvm -M rhel6.2.0 -cpu Westmere -enable-kvm -m 2G -smp 2G -name rhel6 -uuid 716f1b4a-32f7-494a-ae38-d6371b7642c8 -monitor stdio -rtc base=utc -boot dc -drive file=/home/rhel6u2,if=none,id=drive-virtio0-0-0,format=raw,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,drive=drive-virtio0-0-0,id=virtio0-0-0 -netdev tap,script=/etc/qemu-ifup,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d0:4d:60,bus=pci.0,addr=0x4 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -spice port=8002,disable-ticketing -vga qxl -global qxl-vga.vram_size=9437184 -device virtio-balloon-pci,id=balloon1 -usb -device usb-tablet,id=input0
2. Install flash-plugin in the guest
3. Open http://v.youku.com/v_show/id_XMjc0MzU3OTUy.html
4. During step 3, perform a live migration

Actual Results:
on qemu-kvm-0.12.1.2-2.159.el6
 qemu-kvm: /builddir/build/BUILD/qemu-kvm-0.12.1.2/hw/qxl.c:684: qxl_check_state: Assertion `((&ram->cmd_ring)->cons == (&ram->cmd_ring)->prod)' failed.
(gdb) bt
#0  0x00000033d8432885 in raise () from /lib64/libc.so.6
#1  0x00000033d8434065 in abort () from /lib64/libc.so.6
#2  0x00000033d842b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000033d842bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000047552b in qxl_check_state (d=0x2aa7840)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/qxl.c:684
#5  qxl_soft_reset (d=0x2aa7840) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/qxl.c:707
#6  0x0000000000475ea5 in qxl_hard_reset (d=0x2aa7840, loadvm=1)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/qxl.c:733
#7  0x00000000004764ed in qxl_pre_load (opaque=0x2aa7840)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/qxl.c:1469
#8  0x00000000004c1d4c in vmstate_load_state (f=0x2b1f0a0, vmsd=0x8d7e60, 
    opaque=0x2aa7840, version_id=21) at savevm.c:1301
#9  0x00000000004c2399 in qemu_loadvm_state (f=0x2b1f0a0) at savevm.c:1784
#10 0x00000000004baaf9 in process_incoming_migration (f=<value optimized out>)
    at migration.c:73
#11 0x00000000004bae0f in tcp_accept_incoming_migration (opaque=<value optimized out>)
    at migration-tcp.c:165
#12 0x000000000040ba2f in main_loop_wait (timeout=1000)
    at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4430
#13 0x000000000042b52a in kvm_main_loop ()
    at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:2164
#14 0x000000000040ef55 in main_loop (argc=<value optimized out>, 
    argv=<value optimized out>, envp=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:4640
#15 main (argc=<value optimized out>, argv=<value optimized out>, 
    envp=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/vl.c:6845

On qemu-kvm-0.12.1.2-2.200.el6, no coredump occurred during migration.

Comment 12 Mike Cao 2011-10-25 08:15:19 UTC
Hi, Yonit

Could you take a look at my comment #11? Do the results on -159 mean I reproduced the issue successfully?

TIA,
Mike

Comment 13 Yonit Halperin 2011-10-25 10:58:39 UTC
(In reply to comment #12)
> Hi, Yonit
> 
> Could you view my comment #11 ? Does the results in -159 means I reproduce the
> issue successfully ?
Sorry, no. You reproduced bug 728984.
> 
> TIA,
> Mike

Comment 14 juzhang 2011-10-25 11:22:09 UTC
Hi Yonit

According to comment 10 and comment 13, we both failed to reproduce this issue, but reproduced bz728984 and bz729621 respectively. It seems the same scenario causes different bugs; would you please double-check that our steps are right? If they are, we can repeat comment 11's steps 1000 times using a script on the fixed qemu-kvm version (qemu-kvm-0.12.1.2-2.200.el6). If all migration iterations pass, can we change this issue to verified?

Best Regards,
Junyi

Comment 15 Yonit Halperin 2011-10-25 11:43:45 UTC
(In reply to comment #14)
> Hi Yonit
> 
>    According to comment 10 and comment13,we both failed to reproduce this issue
> but reproduced bz728984 and bz729621 respectively.seems same scenario cause
> different bugs,would you please double check our steps are right?if our steps
> are right,can we repeat repeat comment11's steps 1000 times using script in
> fixed qemu-kvm version(qemu-kvm-0.12.1.2-2.200.el6).if we pass all migration
> iterations,can change this issue as verified?
> 
Yes, you can change it to verified if the script passes.
> Best Regards,
> Junyi

Comment 16 FuXiangChun 2011-10-26 08:34:06 UTC
Trying to verify this bug with qemu-kvm-0.12.1.2-2.200.el6.x86_64.

  Executed a script to repeat migration 1000 times. The guest always core dumps when playing video during the migration; the script was run 4 times and produced the same backtrace (below) each time. Comparing this issue with bug 744518 via the trace file, this always seems to reproduce the new bug 744518 instead.

(gdb) bt
#0  0x00000032c2c32945 in raise () from /lib64/libc.so.6
#1  0x00000032c2c34125 in abort () from /lib64/libc.so.6
#2  0x0000003519831639 in handle_dev_update (listener=0x7f87b5851c00, events=<value optimized out>) at red_worker.c:9725
#3  handle_dev_input (listener=0x7f87b5851c00, events=<value optimized out>) at red_worker.c:9982
#4  0x00000035198305d5 in red_worker_main (arg=<value optimized out>) at red_worker.c:10304
#5  0x00000032c34077e1 in start_thread () from /lib64/libpthread.so.0
#6  0x00000032c2ce57bd in clone () from /lib64/libc.so.6

Comment 17 juzhang 2011-10-27 06:32:50 UTC
According to comment 16, I will set this issue as verified and track bz744518. After bz744518 is fixed, we will run more than 1000 iterations as well.

Comment 19 Yonit Halperin 2011-11-20 06:45:51 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
 qxl->ssd.running=true was set after telling the target spice server to start.
 The spice server thread can call qxl_send_events while qxl->ssd.running is still false.

Consequence
  the target qemu aborts on assert(d->ssd.running)

Fix
  set qxl->ssd.running=true before telling spice to start

Result
  the target qemu does not crash

Comment 20 errata-xmlrpc 2011-12-06 15:56:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1531.html