Bug 734784

Summary: migration target can crash (assert(d->ssd.running))
Product: Red Hat Enterprise Linux 6 Reporter: Yonit Halperin <yhalperi>
Component: spice-serverAssignee: Yonit Halperin <yhalperi>
Status: CLOSED NOTABUG QA Contact: Desktop QE <desktop-qa-list>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1CC: acathrow, alevy, cfergeau, cmeadors, dblechte, juzhang, mkenneth, shuang, virt-maint
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 733993 Environment:
Last Closed: 2011-09-14 07:34:22 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 733993    
Bug Blocks: 731472    

Description Yonit Halperin 2011-08-31 12:51:56 UTC
qxl_worker_start (in red_dispatcher.c), should wait till spice-server
completes handle_dev_start in order to avoid inconsistency between the actual running state of the worker to the ssd.running in qemu.
(This is in addition to the fix in qemu-kvm for #733993)

+++ This bug was initially created as a clone of Bug #733993 +++

Description of problem:
After migration completes, the target can crash with assert(d->ssd.running)
in qxl_send_events

When migration completes and the target guest is started, the following
occurs:
qemu_spice_vm_change_state_handler is called
1.1) qemu_spice_vm_change_state_handler calls qemu_spice_start
1.2)  qemu_spice_vm_change_state_handle sets ssd->running = true

The problem is ssd->running is accessed both from spice's red_worker thread and qemu thread.
1) qemu thread: qemu_spice_start (but doesn't set ssd->running=true yet)
2) red_worker thread: red_worker starts
3) red_worker thread: calls qxl->interface_get_command and triggers
   qxl_send_events
4) assert(d->ssd.running)
The simplest solution is to just set ssd.running = true, before calling qemu_spice_start. Alternatively, we can use locks.

--- Additional comment from yhalperi on 2011-08-29 06:05:24 EDT ---

(In reply to comment #0)
> Description of problem:
> After migration completes, the target can crash with assert(d->ssd.running)
> in qxl_send_events
> 
> When migration completes and the target guest is started, the following
> occurs:
> qemu_spice_vm_change_state_handler is called
> 1.1) qemu_spice_vm_change_state_handler calls qemu_spice_start
> 1.2)  qemu_spice_vm_change_state_handle sets ssd->running = true
> 
> The problem is ssd->running is accessed both from spice's red_worker thread and
> qemu thread.
> 1) qemu thread: qemu_spice_start (but doesn't set ssd->running=true yet)
> 2) red_worker thread: red_worker starts
> 3) red_worker thread: calls qxl->interface_get_command and triggers
>    qxl_send_events
> 4) assert(d->ssd.running)
> The simplest solution is to just set ssd.running = true, before calling
> qemu_spice_start. Alternatively, we can use locks.
correction: we can't just move ssd.running: until start/stop are actually performed in the red_worker, the worker can perform other operations which trigger qxl_send_events, for example, and the ssd->running must be synchronized with the current worker state.
In addition, I think that qemu_spice_start should be changed in spice-server to be synchronous.

Comment 1 Cameron Meadors 2011-09-06 17:34:09 UTC
What would an end user need to do to hit this bug?  We should also have a way of directly testing the code change.  We need a valid way of reproducing this in order for us to ACK the bug.

Comment 2 Yonit Halperin 2011-09-06 18:04:00 UTC
(In reply to comment #1)
> What would an end user need to do to hit this bug?  We should also have a way
> of directly testing the code change.  We need a valid way of reproducing this
> in order for us to ACK the bug.

I don't have a deterministic way to reproduce this bug. It depends on when context switching between spice-server and qemu threads occur.
I hit the bug several times when I migrated a Windows 7 guest with 2DTom running on it (and then you can alternatively hit 732949).

Comment 3 Yonit Halperin 2011-09-14 07:34:22 UTC
After all the solution only involves qemu. Closing.