674055 – non-reproducible abort on spice-server: PANIC_ON(!worker->surfaces.surfaces[surface_id].context.canvas)

Keywords:
Status:	CLOSED DUPLICATE of bug 678208
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	spice-server
Sub Component:
Version:	6.1
Hardware:	Unspecified
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Alon Levy
QA Contact:	Desktop QE
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	680114
Depends On:
Blocks:

Reported:	2011-01-31 13:47 UTC by Alon Levy
Modified:	2014-08-04 22:08 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-03-20 12:17:37 UTC
Target Upstream Version:
Embargoed:

Description Alon Levy 2011-01-31 13:47:04 UTC

Description of problem:
spice-server abort at red_worker.c:handle_dev_destroy_primary_surface

Reporting because I think the problem is real, although I have not been able to reproduce it so far.

The actual problem: we are accessing red_dispatcher from two threads, and it was never designed for that. As the stack traces below show, the first user is a VGA timer callback in the main thread, and the second is a qxl I/O handler in a vcpu thread. Both call red_dispatcher, which writes to the same pipe. So either protect all calls to red_dispatcher with a mutex (independent of qemu_iothread_lock), or serialize them through a per-vcpu pipe to the main thread, which would then be the only user of red_dispatcher.
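To illustrate the first option (a dedicated mutex around every dispatcher call), here is a minimal sketch in C. It is not the spice-server code: the Dispatcher struct, the two-write message layout, and every function name below are hypothetical stand-ins chosen for the example. It only shows why holding one lock across both writes keeps a message from one thread from being interleaved with a message from the other.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for red_dispatcher: one pipe shared by all callers. */
typedef struct {
    int fds[2];           /* fds[0]: worker reads; fds[1]: callers write */
    pthread_mutex_t lock; /* deliberately separate from any global emulator lock */
} Dispatcher;

typedef struct {
    Dispatcher *d;
    uint32_t type; /* illustrative message type */
    int count;     /* number of messages this caller sends */
} Sender;

/* Write exactly len bytes or abort, so short writes are never silent. */
static void write_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) abort();
        p += n;
        len -= (size_t)n;
    }
}

/* Read exactly len bytes or abort. */
static void read_all(int fd, void *buf, size_t len) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) abort();
        p += n;
        len -= (size_t)n;
    }
}

/* A message is two writes: the type, then its payload. Holding the mutex
 * across both keeps another thread from slipping in between them. */
static void dispatcher_send(Dispatcher *d, uint32_t type, uint32_t payload) {
    pthread_mutex_lock(&d->lock);
    write_all(d->fds[1], &type, sizeof type);
    write_all(d->fds[1], &payload, sizeof payload);
    pthread_mutex_unlock(&d->lock);
}

static void *sender_main(void *arg) {
    Sender *s = arg;
    for (int i = 0; i < s->count; i++)
        dispatcher_send(s->d, s->type, s->type * 1000u + (uint32_t)i);
    return NULL;
}

/* Two callers (think: timer callback and vcpu I/O handler) send concurrently.
 * Returns how many messages carry a payload that does not match their type.
 * The sketch reads only after both senders finish, so per_thread is kept
 * small enough for all data to fit in the pipe buffer. */
int count_torn_messages(int per_thread) {
    assert(per_thread > 0 && per_thread < 1000);

    Dispatcher d;
    if (pipe(d.fds) != 0 || pthread_mutex_init(&d.lock, NULL) != 0) abort();

    Sender a = { &d, 1, per_thread };
    Sender b = { &d, 2, per_thread };
    pthread_t ta, tb;
    if (pthread_create(&ta, NULL, sender_main, &a) != 0) abort();
    if (pthread_create(&tb, NULL, sender_main, &b) != 0) abort();
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    int torn = 0;
    for (int i = 0; i < 2 * per_thread; i++) {
        uint32_t type, payload;
        read_all(d.fds[0], &type, sizeof type);
        read_all(d.fds[0], &payload, sizeof payload);
        if (payload / 1000u != type) torn++;
    }

    close(d.fds[0]);
    close(d.fds[1]);
    pthread_mutex_destroy(&d.lock);
    return torn;
}
```

Calling count_torn_messages(500) from a test program should report no torn messages. If the lock and unlock calls in dispatcher_send are removed, the two writes of one caller can be separated by a write from the other, which is the kind of interleaving described above.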

Version-Release number of selected component (if applicable):

How reproducible:
0% so far

Steps to Reproduce:
1. Boot a Windows XP guest; the crash occurs during driver initialization.

Actual results:
crash

Expected results:
no crash

Additional info:

I got a panic in handle_dev_destroy_primary_surface, red_worker.c

At the moment of the panic the state was inconsistent: qxl was treated as being in NATIVE mode in one thread and in VGA mode in another. In the main loop, a timer-triggered VGA refresh led to a call to qemu_spice_destroy_host_primary (because vga_draw_text calls qemu_console_resize), which is only possible if qxl0->mode==QXL_MODE_VGA.

On the other hand, an I/O operation triggered qxl_create_guest_primary, which is called after qxl0->mode is set to QXL_MODE_NATIVE, and after qxl_exit_vga_mode has ensured that we left the VGA state.

This is running an F14 guest. I have not been able to recreate it since, despite running several times. Before that it did happen once more, which was enough to examine it under a debugger.

More complete stack traces:

kvm_main_loop_cpu:
...
kvm_handle_io
...
qxl_create_guest_primary

kvm_main_loop:
...(timer)...
gui_update
dpy_refresh
display_refresh
qemu_spice_display_refresh
vga_hw_update
qxl_hw_update
vga_update_display
vga_draw_text
qemu_console_resize
dpy_resize
display_resize
qemu_spice_display_resize
qemu_spice_destroy_host_primary

red_worker:
handle_dev_destroy_primary_surface
PANIC_ON(!worker->surfaces.surfaces[surface_id].context.canvas)

Comment 2 Uri Lublin 2011-01-31 16:46:53 UTC

Alon, please try to reproduce with an SMP (4 or 8 vcpus) guest.

Comment 3 Alon Levy 2011-03-20 12:09:55 UTC

*** Bug 680114 has been marked as a duplicate of this bug. ***

Comment 4 Alon Levy 2011-03-20 12:17:37 UTC

This situation is prevented by the latest locking fixes for bug 678208. I'm marking this as a duplicate because it is fixed by the same solution, but the bug is actually a different case: here it's an assert caused by dropping the global qemu mutex in the vcpu thread, whereas in 678208 it's a hang caused by taking the global qemu mutex from the spice server thread.

*** This bug has been marked as a duplicate of bug 678208 ***

Comment 5 leo.liao 2012-11-15 02:02:09 UTC

I hit the same condition with:
centos:
kernel:3.6.2
qemu-kvm:1.2.0
spice
qxl video driver

My Windows XP VM aborted, and I found this error message in /var/log/libvirt/qemu/xxx.log:

validate_surface: failed on 9
validate_surface: panic !worker->surfaces[surface_id].context.canvas

I have hit this twice in two days.
Are any further details needed?