Description of problem:
The following lines were found in dmesg after I had been aggressively starting and stopping guests and installing new guests for a period of about 1 hour (using virt-manager).

  # dmesg | grep segfault
  xen-vncfb[14705]: segfault at 0000000000000000 rip 000000000040c7f4 rsp 0000000042802c48 error 6
  xen-vncfb[13053]: segfault at 0000000000000000 rip 000000000040c7f4 rsp 0000000042802ca8 error 6

There was no visible indication from virt-manager, _except_ that sometimes when you hit the Install->Finish button in virt-manager, the console did not appear even though the guest had started installing. This might be related to the appearance of the above, but I am not sure (further testing ongoing).

Version-Release number of selected component (if applicable):

  # rpm -qf /usr/lib64/xen/bin/xen-vncfb
  xen-3.1.0-0.rc7.1.fc7
  # uname -a
  Linux lambda 2.6.20-2925.8.fc7xen #1 SMP Thu May 10 17:47:43 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Other parts of Fedora 7 are up to date as of this morning.

How reproducible:
Further testing of this is ongoing to see if it is related to the console not coming up when starting to install a guest.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This segfault is _not_ coincident with the "lost console" when starting an install from virt-manager.
rip 000000000040c7f4 seems to be sraSpanInsertBefore+4:

  40c7f0:  48 8b 46 08    mov 0x8(%rsi),%rax
  40c7f4:  48 89 37       mov %rsi,(%rdi)
  40c7f7:  48 89 47 08    mov %rax,0x8(%rdi)
  40c7fb:  48 8b 46 08    mov 0x8(%rsi),%rax
  40c7ff:  48 89 7e 08    mov %rdi,0x8(%rsi)
  40c803:  48 89 38       mov %rdi,(%rax)
  40c806:  c3             retq

  static void
  sraSpanInsertBefore(sraSpan *newspan, sraSpan *before) {
      newspan->_next = before;
      newspan->_prev = before->_prev;
      before->_prev->_next = newspan;
      before->_prev = newspan;
  }

The faulting instruction at 0x40c7f4 writes through %rdi, which holds the first argument, so newspan must have been NULL (segfault at 0, error 6 = write fault). Before I investigate all possible callers, let's try to capture a core dump. Richard, could you take care of that?
Created attachment 154832 [details] Core dump This is the core dump. I'm going to upload the corresponding binary next - it is slightly different because I have recompiled xen on this machine (to apply another patch).
Created attachment 154833 [details] Binary of xen-vncfb matching preceding coredump. This is the binary which matches the preceding coredump.
Created attachment 154848 [details] Coredump with symbols
Created attachment 154849 [details] Binary file matching the above coredump with symbols
I'm sorry, but the binary in attachment 154849 [details] is stripped. Can I have one with symbols?
Created attachment 154900 [details] Binary file matching the above coredump with symbols (2nd version) Sorry about that. Try this one.
Created attachment 154903 [details] Another core dump This core dump has a slightly different stack trace, although it's in the same general area of the code.
Created attachment 154904 [details] Another core dump And another core dump, with yet again a slightly different stack trace, although the same area of the code.
Created attachment 154905 [details] Another core dump And another one - again with a different stack trace from the others.
Suggest something like this untested patch: http://www.gnome.org/~markmc/code/xen-libvncserver-threading.patch
*** Bug 240424 has been marked as a duplicate of this bug. ***
Markus suggests running xen-vncfb under valgrind to gain more information.
Created attachment 156605 [details] Reference counting fix Patch proposed by Mark McLoughlin. It makes the refcounting match the life range of cl.
Created attachment 156608 [details] Don't continue blindly when the socket is closed Patch proposed by Mark McLoughlin. Avoids passing invalid file descriptor to FD_SET() etc.
This is actually a whole family of bugs, and the patches I just attached fix just two of them. There may be more. The root cause they share is that when the client goes away, the threads doing work for it terminate in a confused and badly coordinated manner, which only works with lucky timing.
Created attachment 160117 [details] Avoid double cleanup The client iterator (protected by rfbClientListMutex) skips entries with sock<0. But rfbClientConnectionGone() neglects to reset cl->sock. This leads to double-cleanup, with disastrous results.
Created attachment 160331 [details] Fix rfbClientIterator

rfbClientIterator is swarming with bugs:
* Whoever added rfbClientListMutex didn't know what he was doing.
* Reference counting is broken.
* The iterator normally skips closed entries (those with sock<0), but rfbClientConnectionGone() neglects to reset cl->sock.
* Closed entries are *not* skipped when LIBVNCSERVER_HAVE_LIBPTHREAD is undefined.
Created attachment 160332 [details] Avoid double-cleanup Both clientInput() and rfbScreenCleanup() call rfbClientConnectionGone(). This works only if clientInput() wins the race with a sufficient margin to take the client off the list before rfbScreenCleanup() sees it. Otherwise, rfbClientConnectionGone() is called twice, with potentially disastrous results.
Rawhide no longer uses the LibVNCServer codebase at all, so closing this bug. F-7 has had an errata pushed to fix the races * Sun Sep 23 2007 Daniel P. Berrange <berrange> - 3.1.0-3.fc7 - Fix race conditions in LibVNCServer on client disconnect