Bug 729621

Summary:	ASSERT worker->running failed on source qemu during migration with Spice session
Product:	Red Hat Enterprise Linux 6	Reporter:	Marian Krcmarik <mkrcmari>
Component:	qemu-kvm	Assignee:	Alon Levy <alevy>
Status:	CLOSED ERRATA	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	6.2	CC:	alevy, bcao, cmeadors, dblechte, gyue, juzhang, kraxel, mkenneth, qzhang, tburke, virt-maint, xwei
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	qemu-kvm-0.12.1.2-2.192.el6	Doc Type:	Bug Fix
Doc Text:	Cause pausing all vcpus was done as a special registered handler in the kvm_vm_state_change_handler list Consequence on migration, a source vm having been moved to the stopped condition would receive an io exit after it's state change handler was called but before the vcpus are paused, causing an assert on worker->running. Fix make vcpu pause (unpause) happen after (before) all handlers are called, respectively. Result migration doesn't cause this assert.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-12-06 15:56:01 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	743047

Description Marian Krcmarik 2011-08-10 11:50:03 UTC

Description of problem:
ASSERT worker->running failed occurs on source qemu during migration of RHEL guest to target host.
Source qemu process says:
handle_dev_input: stop
handle_dev_update: ASSERT worker->running failed
2011-08-10 13:30:19.183: shutting down

Target process says:
qemu: warning: error while loading state section id 4
load of migration failed
2011-08-10 13:30:09.923: shutting down

- resulting in VM crash

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.177
spice-server-0.8.2-3
spice-client-0.8.2-3

How reproducible:
~ 1/4

Steps to Reproduce:
1. Connect to a VM using spice client (using RHEVM User Portal or spicec cli)
2. Start system monitor application on VM and keep the spice session.
3. Migrate VM while keeping Spice session (Repeatedly until Actual results)
  
Actual results:
ASSERT worker->running failed on source qemu - > VM crashes - > Spice session is terminated

Expected results:
VM is migrated to target host properly -> Spice session is kept.

Additional info:

Comment 2 Mike Cao 2011-08-18 09:44:03 UTC

Reproduced this issue on qemu-kvm-0.12.1.2-2.183
spice-client-0.8.2-3 spice-server-0.8.2.3 

*note* :only can reproduced on linux guest (rhel6u1 64bit ) ,tried with windows guest ,can not reproduce.

steps:
1.start guest with following CLI:
/usr/libexec/qemu-kvm -M rhel6.2.0 -cpu Westmere -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name RHEL6 -uuid 196160f2-7e48-48e1-ab51-8c21e79d74d5 -monitor stdio -rtc base=localtime -boot c -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -drive file=/rhev/data-center/e390e371-703e-4e36-a95b-cd85aaf0e34b/677134f8-9e35-4bad-a4cb-0f8a088d7756/images/c79d47f2-057c-44e8-97ed-2ff1000826be/2d1f64db-cbb0-4693-9647-f31d0ceb2e65,if=none,id=drive-virtio-disk0,format=raw,serial=e8-97ed-2ff1000826be,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/rhev/data-center/e390e371-703e-4e36-a95b-cd85aaf0e34b/71bd942b-e339-427f-b528-1a45f337483d/images/11111111-1111-1111-1111-111111111111/RHEL6.1-20110510.1-Server-x86_64-DVD1.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:42:0b:03,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/RHEL6.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -usb -spice port=5901,disable-ticketing -k en-us -vga qxl -global qxl-vga.vram_size=67108864

2.start listening port
3.start spice client .
4 in the guest ,open http://www.jiujiuba.com/picnews/shownews.asp?infoid=1363 ,then scroll the mouse 
5,during step4 ,doing live migration.

Actual Results:
migration failed .


additional info :
 bt
#0  0x0000003591e32a45 in raise () from /lib64/libc.so.6
#1  0x0000003591e34225 in abort () from /lib64/libc.so.6
#2  0x0000003627230ad9 in handle_dev_update (listener=0x7ff620bb4c00,
events=<value optimized out>) at red_worker.c:9725
#3  handle_dev_input (listener=0x7ff620bb4c00, events=<value optimized out>)
at red_worker.c:9982
#4  0x000000362722fa75 in red_worker_main (arg=<value optimized out>) at
red_worker.c:10304
#5  0x00000035926077e1 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003591ee68ed in clone () from /lib64/libc.so.6

Comment 4 Alon Levy 2011-09-13 14:44:48 UTC

I have a scratchbuild that I'd like you to test:

 https://brewweb.devel.redhat.com/taskinfo?taskID=3630214

Alon

The patch it applies (it also applies a bunch of unrelated patches to fix
compilation under fedora 16 because of the newer gcc, which are unrelated) is:

commit 00505234300724c18d5758b5b957648f08a127d8
Author: Alon Levy <alevy>
Date:   Tue Sep 13 15:10:18 2011 +0300

    qemu-kvm: pause threads before calling other vmstate handlers

    RHBZ: 729621

    upstream: not relevant (upstream always uses io thread since recently)

    There are two different paths for handling vm_stop:
     1. with io thread - not interesting since it isn't used by RHEL-6.2
        qemu-kvm.spec
     2. without io thread.

    Without io thread:
     during kvm_create_context (via kvm_init_ap) kvm_vm_state_change_handler
     is registered. This handler is run after any handlers already registered.
     It is responsible for stopping the vcpu threads (signaling the threads and
     waiting on a condition variable).

    The problem:
     kvm_vm_state_change_handler is run after qxl_state_change_handler, which
     sets the spice worker thread to stopped state. But since the vcpus are not
     yet stopped, an io exit then occurs, that leads to a call to the spice
     thread, which asserts because it is stopped.

    Possible solutions:
     1. register the kvm_vm_state_change_handler before all others. This seems
        fragile.
     2. Use the already existing pause_all_vcpus in vl.c (and corresponding
        resume_all_vcpus), currenty empty when io thread is not in use, to call
the
        kvm specific functions. This is what this patch does.

    RFC on:
     solution.
     exporting kvm_pause_all_threads and kvm_resume_all_threads (1. is it a
     good solution? 2. is it the right place to put them in the headers)

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 732d6ec..6e9c400 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -1874,7 +1874,7 @@ static int all_threads_paused(void)
     return 1;
 }

-static void pause_all_threads(void)
+void kvm_pause_all_threads(void)
 {
     CPUState *penv = first_cpu;

@@ -1894,7 +1894,7 @@ static void pause_all_threads(void)
         qemu_cond_wait(&qemu_pause_cond);
 }

-static void resume_all_threads(void)
+void kvm_resume_all_threads(void)
 {
     CPUState *penv = first_cpu;

@@ -1908,14 +1908,6 @@ static void resume_all_threads(void)
     }
 }

-static void kvm_vm_state_change_handler(void *context, int running, int
reason)
-{
-    if (running)
-        resume_all_threads();
-    else
-        pause_all_threads();
-}
-
 static void setup_kernel_sigmask(CPUState *env)
 {
     sigset_t set;
@@ -1937,7 +1929,7 @@ static void qemu_kvm_system_reset(void)
 {
     CPUState *penv = first_cpu;

-    pause_all_threads();
+    kvm_pause_all_threads();

     qemu_system_reset();

@@ -1946,7 +1938,7 @@ static void qemu_kvm_system_reset(void)
         penv = (CPUState *) penv->next_cpu;
     }

-    resume_all_threads();
+    kvm_resume_all_threads();
 }

 static void process_irqchip_events(CPUState *env)
@@ -2056,8 +2048,6 @@ int kvm_init_ap(void)
 {
     struct sigaction action;

-    qemu_add_vm_change_state_handler(kvm_vm_state_change_handler, NULL);
-
     signal(SIG_IPI, sig_ipi_handler);

     memset(&action, 0, sizeof(action));
@@ -2221,7 +2211,7 @@ int kvm_main_loop(void)
     }

     bdrv_close_all();
-    pause_all_threads();
+    kvm_pause_all_threads();
     pthread_mutex_unlock(&qemu_mutex);

     return 0;
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 1b905f8..0047c16 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -1167,4 +1167,8 @@ int kvm_tpr_enable_vapic(CPUState *env);

 #endif

+// TODO - put it somewhere appropriate
+void kvm_pause_all_threads(void);
+void kvm_resume_all_threads(void);
+
 #endif
diff --git a/vl.c b/vl.c
index 5cec717..2c4906c 100644
--- a/vl.c
+++ b/vl.c
@@ -3397,10 +3397,16 @@ int qemu_cpu_self(void *env)

 static void resume_all_vcpus(void)
 {
+    if (kvm_allowed) {
+        kvm_resume_all_threads();
+    }
 }

 static void pause_all_vcpus(void)
 {
+    if (kvm_allowed) {
+        kvm_pause_all_threads();
+    }
 }

 void qemu_cpu_kick(void *env)

Comment 5 Marian Krcmarik 2011-09-14 08:50:49 UTC

(In reply to comment #4)
> I have a scratchbuild that I'd like you to test:
> 
>  https://brewweb.devel.redhat.com/taskinfo?taskID=3630214
> 
> Alon

I've tested the scratch build and I've not been able to reproduce the assert fail, I could reproduce it on qemu-kvm-0.12.1.2-2.184, but not on the Alon's scratch build.
So It fixes the problem for me. Thanks

Comment 10 Qunfang Zhang 2011-10-09 07:43:21 UTC

Reproduced at the second time migration according to steps in Comment 2.
CLI:
/usr/libexec/qemu-kvm -M rhel6.2.0 -cpu Westmere -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -name RHEL6 -uuid 196160f2-7e48-48e1-ab51-8c21e79d74d5 -monitor stdio -rtc base=localtime -boot c -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -drive file=/media/RHEL6.1-64-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:42:0b:03,bus=pci.0,addr=0x3 -chardev socket,id=charchannel0,path=/tmp/foo,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -usb -spice port=5901,disable-ticketing -k en-us -vga qxl -global qxl-vga.vram_size=67108864

Result:
handle_dev_update: ASSERT worker->running failed
Aborted (core dumped)

Verified with:
kernel-2.6.32-206.el6.x86_64
qemu-kvm-0.12.1.2-2.195.el6.x86_64
Test with the same steps and command line, tried more than 20 times ping-pong migration, did not hit the issue any more.

So, this bug is fixed.

Comment 12 Dor Laor 2011-10-17 08:02:46 UTC

*** Bug 736631 has been marked as a duplicate of this bug. ***

Comment 13 Xiaoqing Wei 2011-10-19 06:07:30 UTC

I can reproduce this bug on 
qemu-kvm-0.12.1.2-2.198.el6.x86_64

doing migration when guest rebooting

#0  0x0000003e8e832885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#0  0x0000003e8e832885 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003e8e834065 in abort () at abort.c:92
#2  0x0000003e7c231639 in handle_dev_update (listener=0x7f7e72e29c00, events=<value optimized out>) at red_worker.c:9725
#3  handle_dev_input (listener=0x7f7e72e29c00, events=<value optimized out>) at red_worker.c:9982
#4  0x0000003e7c2305d5 in red_worker_main (arg=<value optimized out>) at red_worker.c:10304
#5  0x0000003e8f0077f1 in start_thread (arg=0x7f7e72ffe700) at pthread_create.c:301
#6  0x0000003e8e8e570d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Comment 15 Mike Cao 2011-10-19 06:20:11 UTC

According to comment #13 and comment #14 ,Bug 736631 is not as same as this one .
Move this bug's status to VERIFIED according to comment #10. 
Re-assign BUg 736631 according to comment #13 and comment #14 

Mike

Comment 16 Alon Levy 2011-11-17 18:10:32 UTC

Adding technical note.

Comment 17 Alon Levy 2011-11-17 18:10:32 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause
    pausing all vcpus was done as a special registered handler in the kvm_vm_state_change_handler list
Consequence
    on migration, a source vm having been moved to the stopped condition would receive an io exit after it's state change handler was called but before the vcpus are paused, causing an assert on worker->running.
Fix
    make vcpu pause (unpause) happen after (before) all handlers are called, respectively.
Result
    migration doesn't cause this assert.

Comment 18 errata-xmlrpc 2011-12-06 15:56:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1531.html