qxl_worker_start (in red_dispatcher.c) should wait until spice-server completes handle_dev_start, in order to avoid an inconsistency between the actual running state of the worker and ssd.running in qemu. (This is in addition to the fix in qemu-kvm for #733993.)

+++ This bug was initially created as a clone of Bug #733993 +++

Description of problem:

After migration completes, the target can crash with assert(d->ssd.running) in qxl_send_events.

When migration completes and the target guest is started, the following occurs: qemu_spice_vm_change_state_handler is called, and then:
1.1) qemu_spice_vm_change_state_handler calls qemu_spice_start
1.2) qemu_spice_vm_change_state_handler sets ssd->running = true

The problem is that ssd->running is accessed both from spice's red_worker thread and from the qemu thread:
1) qemu thread: qemu_spice_start runs (but doesn't set ssd->running = true yet)
2) red_worker thread: red_worker starts
3) red_worker thread: calls qxl->interface_get_command and triggers qxl_send_events
4) assert(d->ssd.running) fails

The simplest solution is to just set ssd.running = true before calling qemu_spice_start. Alternatively, we can use locks.

--- Additional comment from yhalperi on 2011-08-29 06:05:24 EDT ---

(In reply to comment #0)
> The simplest solution is to just set ssd.running = true, before calling
> qemu_spice_start. Alternatively, we can use locks.

Correction: we can't just move ssd.running. Until start/stop are actually performed in the red_worker, the worker can still perform other operations that trigger qxl_send_events, for example, so ssd->running must stay synchronized with the current worker state. In addition, I think that qemu_spice_start should be changed in spice-server to be synchronous.
What would an end user need to do to hit this bug? We should also have a way of directly testing the code change. We need a valid way of reproducing this in order for us to ACK the bug.
(In reply to comment #1)
> What would an end user need to do to hit this bug? We should also have a way
> of directly testing the code change. We need a valid way of reproducing this
> in order for us to ACK the bug.

I don't have a deterministic way to reproduce this bug. It depends on when context switching between the spice-server and qemu threads occurs. I hit the bug several times when I migrated a Windows 7 guest with 2DTom running on it (and then you can alternatively hit bug 732949).
In the end, the solution only involves qemu. Closing.