Bug 674055

Summary: non-reproducible abort in spice-server: PANIC_ON(!worker->surfaces.surfaces[surface_id].context.canvas)
Product: Red Hat Enterprise Linux 6
Component: spice-server
Version: 6.1
Reporter: Alon Levy <alevy>
Assignee: Alon Levy <alevy>
QA Contact: Desktop QE <desktop-qa-list>
CC: dblechte, djasa, hdegoede, hellolwq, mhasko, mkenneth
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Target Milestone: rc
Hardware: Unspecified
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-03-20 12:17:37 UTC

Description Alon Levy 2011-01-31 13:47:04 UTC
Description of problem:
spice-server abort at red_worker.c:handle_dev_destroy_primary_surface

Reporting this because I think the problem is real, but I have not been able to reproduce it so far.

The actual problem: we are accessing red_dispatcher from two threads, and it was never designed for that. This can happen as seen in the stack traces below: the first user is a VGA timer callback from the main thread, and the second is a qxl io handler from a vcpu thread. Both call red_dispatcher, which writes to the same pipe. So either protect all calls to red_dispatcher with a mutex (independent of qemu_iothread_lock), or serialize them through a per-vcpu pipe to the main thread, which would then be the only user of red_dispatcher.
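As a rough illustration of the mutex option, here is a minimal sketch. The names dispatcher_mutex and red_dispatcher_write_message are assumptions made up for this example, not the actual qemu/spice-server API:

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t dispatcher_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Serialize every write to the worker pipe, so a message sent from the
 * main-loop timer callback and one sent from a vcpu io handler can never
 * interleave on the dispatcher pipe. */
static void red_dispatcher_write_message(int pipe_fd, const void *message, size_t size)
{
    pthread_mutex_lock(&dispatcher_mutex);
    /* a real implementation would loop on short writes and handle EINTR */
    if (write(pipe_fd, message, size) != (ssize_t)size) {
        /* error handling omitted in this sketch */
    }
    pthread_mutex_unlock(&dispatcher_mutex);
}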

Version-Release number of selected component (if applicable):


How reproducible:
0% so far

Steps to Reproduce:
1. Boot up a WinXP guest; the crash occurred during driver initialization.

Actual results:
crash

Expected results:
no crash

Additional info:

I got a panic in handle_dev_destroy_primary_surface, red_worker.c.

At the moment of the panic the state was inconsistent with respect to qxl: one thread saw it in NATIVE mode while another saw it in VGA mode. The main loop had a timer-triggered VGA refresh leading to a call to qemu_spice_destroy_host_primary (because vga_draw_text calls qemu_console_resize), which is only possible if qxl0->mode == QXL_MODE_VGA.

On the other hand, a guest io triggered qxl_create_guest_primary, which is called after qxl0->mode is set to QXL_MODE_NATIVE and after making sure we exit the VGA state with qxl_exit_vga_mode.
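To make the ordering problem concrete, here is a tiny standalone simulation of the unsynchronized mode check. The names mirror the description above, but everything here is an assumption for illustration, not qemu code:

#include <pthread.h>
#include <stdio.h>

enum { QXL_MODE_VGA, QXL_MODE_NATIVE };
static int mode = QXL_MODE_VGA;   /* shared and unprotected, standing in for qxl0->mode */

/* main-loop thread: timer-driven VGA refresh */
static void *vga_refresh(void *arg)
{
    (void)arg;
    if (mode == QXL_MODE_VGA) {
        /* stands in for qemu_spice_destroy_host_primary() */
        printf("main loop: destroy primary surface (mode read as VGA)\n");
    }
    return NULL;
}

/* vcpu thread: guest io switching the device to native mode */
static void *guest_io(void *arg)
{
    (void)arg;
    mode = QXL_MODE_NATIVE;
    /* stands in for qxl_exit_vga_mode() + qxl_create_guest_primary() */
    printf("vcpu: create guest primary surface (mode now NATIVE)\n");
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, vga_refresh, NULL);
    pthread_create(&b, NULL, guest_io, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Depending on scheduling, both messages can be sent to the worker, and a
     * destroy-primary arriving for a surface the worker no longer considers
     * valid trips the PANIC_ON shown in the red_worker trace below. */
    return 0;
}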

This was running an F14 guest. I couldn't recreate it since, despite running it several times; it did happen one more time, which was enough to run it under a debugger.

more complete stack traces:

kvm_main_loop_cpu:
  ...
  kvm_handle_io
  ...
  qxl_create_guest_primary

kvm_main_loop:
  ...(timer)...
  gui_update
  dpy_refresh
  display_refresh
  qemu_spice_display_refresh
  vga_hw_update
  qxl_hw_update
  vga_update_display
  vga_draw_text
  qemu_console_resize
  dpy_resize
  display_resize
  qemu_spice_display_resize
  qemu_spice_destroy_host_primary

red_worker:
  handle_dev_destroy_primary_surface
   PANIC_ON(!worker->surfaces.surfaces[surface_id].context.canvas)

Comment 2 Uri Lublin 2011-01-31 16:46:53 UTC
Alon, please try to reproduce with an SMP (4 or 8 vcpus) guest.

Comment 3 Alon Levy 2011-03-20 12:09:55 UTC
*** Bug 680114 has been marked as a duplicate of this bug. ***

Comment 4 Alon Levy 2011-03-20 12:17:37 UTC
This situation is prevented by the latest locking fixes for bug 678208. I'm marking this as a duplicate because it is fixed by the same solution, but the bug is actually a different case: here it's an assert caused by dropping the global qemu mutex in the vcpu thread, while in 678208 it's a hang caused by taking the global qemu mutex from the spice server thread.

*** This bug has been marked as a duplicate of bug 678208 ***

Comment 5 leo.liao 2012-11-15 02:02:09 UTC
I hit the same condition with:
CentOS
kernel: 3.6.2
qemu-kvm: 1.2.0
spice
qxl video driver

My Windows XP VM aborted, and I found the following error message in /var/log/libvirt/qemu/xxx.log:

validate_surface: failed on 9
validate_surface: panic !worker->surfaces[surface_id].context.canvas

I have hit this twice in two days.
Are any more details needed?