Description of problem:
The following lines were found in dmesg after I had been aggressively starting and stopping guests and installing new guests for a period of about 1 hour (using virt-manager).

  # dmesg | grep segfault
  xen-vncfb[14705]: segfault at 0000000000000000 rip 000000000040c7f4 rsp 0000000042802c48 error 6
  xen-vncfb[13053]: segfault at 0000000000000000 rip 000000000040c7f4 rsp 0000000042802ca8 error 6

There was no visible indication from virt-manager, _except_ that sometimes when you hit the Install->Finish button in virt-manager, the console did not appear even though the guest had started installing. This might be related to the appearance of the above, but I am not sure (further testing ongoing).

Version-Release number of selected component (if applicable):

  # rpm -qf /usr/lib64/xen/bin/xen-vncfb
  xen-3.1.0-0.rc7.1.fc7
  # uname -a
  Linux lambda 2.6.20-2925.8.fc7xen #1 SMP Thu May 10 17:47:43 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Other parts of Fedora 7 are up to date as of this morning.

How reproducible:
Further testing of this is ongoing to see if it is related to the console not coming up when starting to install a guest.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This segfault is _not_ coincident with the "lost console" when starting an install from virt-manager.
rip 000000000040c7f4 seems to be sraSpanInsertBefore+4:

  40c7f0:  48 8b 46 08    mov 0x8(%rsi),%rax
  40c7f4:  48 89 37       mov %rsi,(%rdi)
  40c7f7:  48 89 47 08    mov %rax,0x8(%rdi)
  40c7fb:  48 8b 46 08    mov 0x8(%rsi),%rax
  40c7ff:  48 89 7e 08    mov %rdi,0x8(%rsi)
  40c803:  48 89 38       mov %rdi,(%rax)
  40c806:  c3             retq

  static void
  sraSpanInsertBefore(sraSpan *newspan, sraSpan *before) {
      newspan->_next = before;
      newspan->_prev = before->_prev;
      before->_prev->_next = newspan;
      before->_prev = newspan;
  }

The faulting instruction at 0x40c7f4 writes through %rdi, which holds the first argument, so newspan must have been NULL (segfault at 0, error 6 = write fault). Before I investigate all possible callers, let's try to capture a core dump. Richard, could you take care of that?
Created attachment 154832 [details] Core dump This is the core dump. I'm going to upload the corresponding binary next - it is slightly different because I have recompiled xen on this machine (to apply another patch).
Created attachment 154833 [details] Binary of xen-vncfb matching preceding coredump. This is the binary which matches the preceding coredump.
Created attachment 154848 [details] Coredump with symbols
Created attachment 154849 [details] Binary file matching the above coredump with symbols
I'm sorry, but the binary in attachment 154849 [details] is stripped. Can I have one with symbols?
Created attachment 154900 [details] Binary file matching the above coredump with symbols (2nd version) Sorry about that. Try this one.
Created attachment 154903 [details] Another core dump This core dump has a slightly different stack trace, although it's in the same general area of the code.
Created attachment 154904 [details] Another core dump And another core dump, with yet again a slightly different stack trace, although the same area of the code.
Created attachment 154905 [details] Another core dump And another one - again with a different stack trace from the others.
Suggest something like this untested patch: http://www.gnome.org/~markmc/code/xen-libvncserver-threading.patch
*** Bug 240424 has been marked as a duplicate of this bug. ***
Markus suggests running xen-vncfb under valgrind to gain more information.
Created attachment 156605 [details] Reference counting fix Patch proposed by Mark McLoughlin. It makes the refcounting match the life range of cl.
Created attachment 156608 [details] Don't continue blindly when the socket is closed Patch proposed by Mark McLoughlin. Avoids passing invalid file descriptor to FD_SET() etc.
This is actually a whole family of bugs, and the patches I just attached fix just two of them. There may be more. The root cause they share is that when the client goes away, the threads doing work for it terminate in a confused and badly coordinated manner, which only works with lucky timing.
Created attachment 160117 [details] Avoid double cleanup The client iterator (protected by rfbClientListMutex) skips entries with sock<0. But rfbClientConnectionGone() neglects to reset cl->sock. This leads to double-cleanup, with disastrous results.
Created attachment 160331 [details] Fix rfbClientIterator

rfbClientIterator is swarming with bugs:
* Whoever added rfbClientListMutex didn't know what he was doing.
* Reference counting is broken.
* The iterator normally skips closed entries (those with sock<0), but rfbClientConnectionGone() neglects to reset cl->sock.
* Closed entries are *not* skipped when LIBVNCSERVER_HAVE_LIBPTHREAD is undefined.
Created attachment 160332 [details] Avoid double-cleanup Both clientInput() and rfbScreenCleanup() call rfbClientConnectionGone(). This works only if clientInput() wins the race with a sufficient margin to take the client off the list before rfbScreenCleanup() sees it. Otherwise, rfbClientConnectionGone() is called twice, with potentially disastrous results.
Rawhide no longer uses the LibVNCServer codebase at all, so closing this bug. F-7 has had an errata pushed to fix the races * Sun Sep 23 2007 Daniel P. Berrange <berrange> - 3.1.0-3.fc7 - Fix race conditions in LibVNCServer on client disconnect