Bug 240012

Summary: xen-vncfb segfault
Product: [Fedora] Fedora Reporter: Richard W.M. Jones <rjones>
Component: xenAssignee: Markus Armbruster <armbru>
Status: CLOSED RAWHIDE QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: rawhideCC: katzj, xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-09-24 23:29:30 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 245188    
Attachments:
Description Flags
Core dump
none
Binary of xen-vncfb matching preceeding coredump.
none
Coredump with symbols
none
Binary file matching the above coredump with symbols
none
Binary file matching the above coredump with symbols (2nd version)
none
Another core dump
none
Another core dump
none
Another core dump
none
Reference counting fix
none
Don't continue blindly when the socket is closed
none
Avoid double cleanup
none
Fix rfbClientIterator
none
Avoid double-cleanup none

Description Richard W.M. Jones 2007-05-14 11:52:37 UTC
Description of problem:

The following lines were found in dmesg after I had been aggressively starting
and stopping guests and installing new guests for a period of about 1 hour
(using virt-manager).

# dmesg|grep segfault
xen-vncfb[14705]: segfault at 0000000000000000 rip 000000000040c7f4 rsp
0000000042802c48 error 6
xen-vncfb[13053]: segfault at 0000000000000000 rip 000000000040c7f4 rsp
0000000042802ca8 error 6

There seemed to be no visible indication from virt-manager, _except_ that
sometimes when you hit the Install->Finish button in virt-manager, the console
did not appear even though the guest has started installing.  This might be
related to the appearance of the above, but I am not sure (further testing ongoing).

Version-Release number of selected component (if applicable):

# rpm -qf /usr/lib64/xen/bin/xen-vncfb
xen-3.1.0-0.rc7.1.fc7
# uname -a
Linux lambda 2.6.20-2925.8.fc7xen #1 SMP Thu May 10 17:47:43 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux

Other parts of Fedora 7 up to date as of this morning.

How reproducible:

Further testing of this is ongoing to see if it is related to the console not
coming up when starting to install a guest.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Richard W.M. Jones 2007-05-14 12:36:31 UTC
This segfault is _not_ coincident with the "lost console" when starting an
install from virt-manager.

Comment 2 Markus Armbruster 2007-05-14 22:19:16 UTC
rip 000000000040c7f4 seems to be sraSpanInsertBefore+4

  40c7f0:       48 8b 46 08             mov    0x8(%rsi),%rax
  40c7f4:       48 89 37                mov    %rsi,(%rdi)
  40c7f7:       48 89 47 08             mov    %rax,0x8(%rdi)
  40c7fb:       48 8b 46 08             mov    0x8(%rsi),%rax
  40c7ff:       48 89 7e 08             mov    %rdi,0x8(%rsi)
  40c803:       48 89 38                mov    %rdi,(%rax)
  40c806:       c3                      retq   

static void
sraSpanInsertBefore(sraSpan *newspan, sraSpan *before) {
  newspan->_next = before;
  newspan->_prev = before->_prev;
  before->_prev->_next = newspan;
  before->_prev = newspan;
}

newspan must have been null.  Before I investigate all possible callers, let's
try to capture a core dump.  Richard, could you take care of that?

Comment 3 Richard W.M. Jones 2007-05-16 15:21:21 UTC
Created attachment 154832 [details]
Core dump

This is the core dump.	I'm going to upload the corresponding binary next - it
is slightly different because I have recompiled xen on this machine (to apply
another patch).

Comment 4 Richard W.M. Jones 2007-05-16 15:22:07 UTC
Created attachment 154833 [details]
Binary of xen-vncfb matching preceeding coredump.

This is the binary which matches the preceeding coredump.

Comment 5 Richard W.M. Jones 2007-05-16 16:52:10 UTC
Created attachment 154848 [details]
Coredump with symbols

Comment 6 Richard W.M. Jones 2007-05-16 16:52:47 UTC
Created attachment 154849 [details]
Binary file matching the above coredump with symbols

Comment 7 Markus Armbruster 2007-05-16 19:39:55 UTC
I'm sorry, but the binary in attachment 154849 [details] is stripped.  Can I have one with
symbols?

Comment 8 Richard W.M. Jones 2007-05-17 09:31:11 UTC
Created attachment 154900 [details]
Binary file matching the above coredump with symbols (2nd version)

Sorry about that.  Try this one.

Comment 9 Richard W.M. Jones 2007-05-17 09:57:11 UTC
Created attachment 154903 [details]
Another core dump

This core dump has a slightly different stack trace, although it's in the same
general area of the code.

Comment 10 Richard W.M. Jones 2007-05-17 10:35:34 UTC
Created attachment 154904 [details]
Another core dump

And another core dump, with yet again a slightly different stack trace,
although the same area of the code.

Comment 11 Richard W.M. Jones 2007-05-17 11:01:44 UTC
Created attachment 154905 [details]
Another core dump

And another one - again with a different stack trace from the others.

Comment 12 Mark McLoughlin 2007-05-17 14:38:40 UTC
Suggest something like this untested patch:

  http://www.gnome.org/~markmc/code/xen-libvncserver-threading.patch

Comment 13 Markus Armbruster 2007-05-17 18:13:23 UTC
*** Bug 240424 has been marked as a duplicate of this bug. ***

Comment 14 Richard W.M. Jones 2007-05-18 12:56:53 UTC
Markus suggests running xen-vncfb under valgrind to gain more information.

Comment 15 Markus Armbruster 2007-06-08 19:35:14 UTC
Created attachment 156605 [details]
Reference counting fix

Patch proposed by Mark McLoughlin.  It makes the refcounting match the life
range of cl.

Comment 16 Markus Armbruster 2007-06-08 19:46:52 UTC
Created attachment 156608 [details]
Don't continue blindly when the socket is closed

Patch proposed by Mark McLoughlin.  Avoids passing invalid file descriptor to
FD_SET() etc.

Comment 17 Markus Armbruster 2007-06-08 19:49:56 UTC
This is actually a whole family of bugs, and the patches I just attached fix
just two of them.  There may be more.

The root cause they share is that when the client goes away, the threads doing
work for it terminate in a confused and badly coordinated manner, which only
works with lucky timing. 

Comment 18 Markus Armbruster 2007-07-27 14:15:17 UTC
Created attachment 160117 [details]
Avoid double cleanup

The client iterator (protected by rfbClientListMutex) skips entries
with sock<0.  But rfbClientConnectionGone() neglects to reset
cl->sock.  This leads to double-cleanup, with disastrous results.

Comment 19 Markus Armbruster 2007-07-31 15:06:26 UTC
Created attachment 160331 [details]
Fix rfbClientIterator

rfbClientIterator is swarming with bugs:
* Whoever added rfbClientListMutex didn't know what he was doing.
* Reference counting is broken
* The iterator normally skips closed entries (those with sock<0).  But
  rfbClientConnectionGone() neglects to reset cl->sock.
* Closed entries are *not* skipped when LIBVNCSERVER_HAVE_LIBPTHREAD
  is undefined.

Comment 20 Markus Armbruster 2007-07-31 15:08:39 UTC
Created attachment 160332 [details]
Avoid double-cleanup

Both clientInput() and rfbScreenCleanup() call
rfbClientConnectionGone().  This works only if clientInput() wins the
race with a sufficient margin to take the client off the list before
rfbScreenCleanup() sees it.  Otherwise, rfbClientConnectionGone() is
called twice, with potentially disastrous results.

Comment 21 Daniel Berrangé 2007-09-24 23:29:30 UTC
Rawhide no longer uses the LibVNCServer codebase at all, so closing this bug.
F-7 has had an errata pushed to fix the races

* Sun Sep 23 2007 Daniel P. Berrange <berrange> - 3.1.0-3.fc7
- Fix race conditions in LibVNCServer on client disconnect