Bug 746950

Summary: spice-server: channel_event deadlock on reset + timeout
Product: Red Hat Enterprise Linux 6 Reporter: Alon Levy <alevy>
Component: spice-serverAssignee: Uri Lublin <uril>
Status: CLOSED ERRATA QA Contact: Desktop QE <desktop-qa-list>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.1CC: acathrow, berrange, cfergeau, cmeadors, dblechte, ddumas, djasa, mkenneth, mkrcmari, pvine, tpelka
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: spice-server-0.8.2-5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-12-06 15:22:01 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alon Levy 2011-10-18 11:34:20 UTC
Description of problem:
a system reset in qemu followed by discovering a disconnected client triggers a deadlock on channel_event since the recent added lock taking in channel_event (see bug 697441, commit 34f2a05)

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. launch qemu: qemu-system-x86_64 -spice disable-ticketing,port=1234 -vga qxl -monitor stdio
2. launch spice client (doesn't matter which): spicec -h localhost -p 1234
3. important: don't close the spice client, just background it: ctrl-z in the launching terminal of spicec
4. from qemu monitor: issue system_reset repeatedly. after about the 10th attempt, you will see the message:
(qemu) red_wait_outgoing_item: blocked
5. after a certain timeout you will see the following message:
flush_display_commands: update timeout

Actual results:
No more monitor input is possible. qemu is blocked totally.

Expected results:
client disconnect detected and event delivered, qemu continues normally.

Additional info:
Stack traces:

main thread:
(gdb) bt
#0  0x00000037ec00e5bd in read () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007fbd86bda3da in receive_data (fd=10, in_buf=0x7fffd361411c, n=4) at red_worker.h:143
#2  0x00007fbd86bda45f in read_message (fd=10, message=0x7fffd361411c) at red_worker.h:156
#3  0x00007fbd86bdad6c in red_dispatcher_destroy_surfaces (dispatcher=0x2445380) at red_dispatcher.c:316
#4  0x00007fbd86bdadb5 in qxl_worker_destroy_surfaces (qxl_worker=0x2445380) at red_dispatcher.c:322
#5  0x00000000004a3cce in qxl_spice_destroy_surfaces (qxl=0x24d4840, async=QXL_SYNC) at /home/alon/src/spice_rhel
6/qemu-kvm-rhel6/hw/qxl.c:228
#6  0x00000000004a63ee in qxl_reset_surfaces (d=0x24d4840) at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/hw/qxl.c:
1008
#7  0x00000000004a5c87 in qxl_hard_reset (d=0x24d4840, loadvm=0) at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/hw/
qxl.c:878
#8  0x00000000004a5d41 in qxl_reset_handler (dev=0x24d4840) at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/hw/qxl.c
:896
#9  0x00000000005109bc in qdev_reset (opaque=0x24d4840) at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/hw/qdev.c:27
1
#10 0x000000000040da35 in qemu_system_reset () at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/vl.c:3259
#11 0x0000000000438c16 in qemu_kvm_system_reset () at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/qemu-kvm.c:1934
#12 0x0000000000439404 in kvm_main_loop () at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/qemu-kvm.c:2206
#13 0x000000000040e7aa in main_loop () at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/vl.c:4095
#14 0x00000000004129b7 in main (argc=7, argv=0x7fffd3614728, envp=0x7fffd3614768) at /home/alon/src/spice_rhel6/q
emu-kvm-rhel6/vl.c:6323

(This is less important but just a reminder that the whole timeout handling out of the main event loop is wrong)
red_worker thread before timeout:
(gdb) bt
#0  0x00000037ebcbc17d in nanosleep () at ../sysdeps/unix/syscall-template.S:82
#1  0x00000037ebce9f94 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:33
#2  0x00007fbd86c00e0f in flush_display_commands (worker=0x7fbd84c7fbd0) at red_worker.c:8769
#3  0x00007fbd86c00f88 in flush_all_qxl_commands (worker=0x7fbd84c7fbd0) at red_worker.c:8820
#4  0x00007fbd86c0389f in handle_dev_destroy_surfaces (worker=0x7fbd84c7fbd0) at red_worker.c:9818
#5  0x00007fbd86c041d7 in handle_dev_input (listener=0x7fbd84c7fbd0, events=1) at red_worker.c:10013
#6  0x00007fbd86c04de2 in red_worker_main (arg=0x7fffd3613e70) at red_worker.c:10304

red worker thread after timeout:
#3  0x0000000000439ec3 in kvm_mutex_lock () at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/qemu-kvm.c:2610
#4  0x0000000000439ef8 in qemu_mutex_lock_iothread () at /home/alon/src/spice_rhel6/qemu-kvm-rhel6/qemu-kvm.c:262
3
#5  0x00000000005168cd in channel_event (event=3, info=0x2946dd0) at ui/spice-core.c:253
#6  0x00007f5139ba41d6 in reds_channel_event (stream=0x2946d50, event=3) at reds.c:418
#7  0x00007f5139bb0377 in reds_stream_free (s=0x2946d50) at reds.c:5280
#8  0x00007f5139bc459a in red_disconnect_channel (channel=0x7f5118069970) at red_worker.c:8489
#9  0x00007f5139bc4727 in red_disconnect_display (channel=0x7f5118069970) at red_worker.c:8524
#10 0x00007f5139bc4dff in flush_display_commands (worker=0x7f5136c42bd0) at red_worker.c:8766
#11 0x00007f5139bc4f88 in flush_all_qxl_commands (worker=0x7f5136c42bd0) at red_worker.c:8820
#12 0x00007f5139bc789f in handle_dev_destroy_surfaces (worker=0x7f5136c42bd0) at red_worker.c:9818
#13 0x00007f5139bc81d7 in handle_dev_input (listener=0x7f5136c42bd0, events=1) at red_worker.c:10013
#14 0x00007f5139bc8de2 in red_worker_main (arg=0x7fff7e21b780) at red_worker.c:10304

There are patches to handle this on the mailing list but they need revising: http://lists.freedesktop.org/archives/spice-devel/2011-September/005369.html

Comment 5 Alon Levy 2011-10-23 18:01:31 UTC
fixed with this brew build, verified using the procedure in the bug description.

 https://brewweb.devel.redhat.com/taskinfo?taskID=3734350

Alon

Comment 8 Alon Levy 2011-10-25 12:26:49 UTC
commit pushed to 0.8 branch in upstream.

Comment 9 Alon Levy 2011-10-26 21:27:56 UTC
Hmm, I was just going to complain that my brew build didn't enter the RHEL-6.2-candidate group/tag, but I guess it just took some time. There it is, moving to MODIFIED.

Comment 11 Marian Krcmarik 2011-10-27 15:00:54 UTC
I've tried to verify according to steps, however with spice-server-0.8.2-5 I can get:
(qemu) system_reset
(qemu) red_wait_outgoing_item: blocked
#For about 2 minutes, qemu monitor is not responsive
flush_cursor_commands: flush cursor timeout
(qemu)
#Qemu monitor is responsive again

Is this expected correct behaviour? (Just making sure), Thanks.

Comment 12 Alon Levy 2011-10-28 14:30:25 UTC
(In reply to comment #11)
> I've tried to verify according to steps, however with spice-server-0.8.2-5 I
> can get:
> (qemu) system_reset
> (qemu) red_wait_outgoing_item: blocked
> #For about 2 minutes, qemu monitor is not responsive
> flush_cursor_commands: flush cursor timeout
> (qemu)
> #Qemu monitor is responsive again
> 
> Is this expected correct behaviour? (Just making sure), Thanks.

These are two seperate issues. What you are seeing is a timeout entirely in spice-server that has not been addressed, it's actually part of the solution to a possible temporary delay (albiet not optimal). After the timeout the disconnection happens (you'll see the remaining channels are still there with info spice before and after).

So in short yes, this is the expected correct behavior, and another non blocker bz is probably in order.

Alon

Comment 14 errata-xmlrpc 2011-12-06 15:22:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1634.html

Comment 15 David Jaša 2012-03-05 17:28:39 UTC
After a consultation with Marian, we've decided that no ongoing testing is needed.