Bug 484553

Summary: PolicyKit/D-Bus related libvirtd segfault
Product: Fedora
Component: libvirt
Version: 11
Hardware: All
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: low
Reporter: Mark McLoughlin <markmc>
Assignee: Daniel Veillard <veillard>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: berrange, clalance, crobinso, davidz, stian, veillard, virt-maint, walters
Keywords: Reopened
Doc Type: Bug Fix
Last Closed: 2009-11-18 08:54:44 EST

Description Mark McLoughlin 2009-02-08 04:34:46 EST
libvirt-0.6.0-2.fc10.i386

Started a KVM guest and libvirtd seemed to die. All I've got is this in /var/log/messages:

  libvirtd[6143]: segfault at 4 ip 00d9f706 sp b5727e98 error 6 in libdbus-1.so.3.4.0[d7a000+3f000]

On the plus side, when I restarted libvirtd the guest was running and I was able to connect to it just fine.
Comment 1 Mark McLoughlin 2009-02-09 04:27:30 EST
Okay, can't seem to reproduce this. Closing for now.
Comment 2 Mark McLoughlin 2009-02-09 15:02:54 EST
Got you ... you little so-and-so! Grr!

I'm not quite sure why, but it was extremely frustrating trying to get any sort of useful debug log, backtrace or strace out of this - is libvirtd ignoring SIGSEGV or something these days? I don't think so, but there's something very odd going on here ...

Anyway, the bug itself - it turns out this is reproducible by connecting to a VM with "virt-viewer -c qemu:///system foo" as an unprivileged user. It happens maybe 1 in 10 times.

Backtrace below, definitely PolicyKit or D-Bus related.

PolicyKit-0.9-4.fc10.i386
dbus-1.2.4-2.fc10.i386
libvirt-0.6.0-2.fc10.i386

Thread 6 (process 22236):
#0  0x00829416 in __kernel_vsyscall ()
#1  0x002dad99 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x002d6149 in _L_lock_89 () from /lib/libpthread.so.0
#3  0x002d5a52 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x04a12bfd in virMutexLock (m=0x4aafd90) at threads-pthread.c:51
#5  0x04a11bd1 in virLogLock () at logging.c:137
#6  virLogFiltersCheck () at logging.c:408
#7  virLogMessage (category=0x809eefc "file.node_device.c", priority=1, funcname=0x809efa7 "nodeDeviceLock", linenr=52, 
    flags=0, fmt=0x809eeef "LOCK node %p") at logging.c:503
#8  0x08093dd0 in nodeDeviceLock (driver=0x8e4c5f0) at node_device.c:52
#9  0x08096010 in dbus_watch_callback (fdatch=2, fd=8, events=<value optimized out>, opaque=0x8e3f0a0)
    at node_device_hal.c:576
#10 0x08052a59 in virEventDispatchHandles () at event.c:427
#11 virEventRunOnce () at event.c:559
#12 0x08058ccf in qemudOneLoop () at qemud.c:2001
#13 qemudRunLoop () at qemud.c:2105
#14 main (argc=1, argv=Cannot access memory at address 0x84
) at qemud.c:2846

Thread 5 (process 22243):
#0  0x00829416 in __kernel_vsyscall ()
#1  0x002d8105 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x04a12ae4 in virCondWait (c=0x8e395b8, m=0x8e395a0) at threads-pthread.c:81
#3  0x080552f4 in qemudWorker (data=0x8e43f1c) at qemud.c:1384
#4  0x002d451f in start_thread () from /lib/libpthread.so.0
#5  0x0021004e in clone () from /lib/libc.so.6

Thread 4 (process 22241):
#0  0x00829416 in __kernel_vsyscall ()
#1  0x002d8105 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x04a12ae4 in virCondWait (c=0x8e395b8, m=0x8e395a0) at threads-pthread.c:81
#3  0x080552f4 in qemudWorker (data=0x8e43f04) at qemud.c:1384
#4  0x002d451f in start_thread () from /lib/libpthread.so.0
#5  0x0021004e in clone () from /lib/libc.so.6

Thread 3 (process 22244):
#0  0x00829416 in __kernel_vsyscall ()
#1  0x002d8105 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x04a12ae4 in virCondWait (c=0x8e395b8, m=0x8e395a0) at threads-pthread.c:81
#3  0x080552f4 in qemudWorker (data=0x8e43f28) at qemud.c:1384
#4  0x002d451f in start_thread () from /lib/libpthread.so.0
#5  0x0021004e in clone () from /lib/libc.so.6

Thread 2 (process 22242):
#0  0x00829416 in __kernel_vsyscall ()
#1  0x002d8105 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x04a12ae4 in virCondWait (c=0x8e395b8, m=0x8e395a0) at threads-pthread.c:81
#3  0x080552f4 in qemudWorker (data=0x8e43f10) at qemud.c:1384
#4  0x002d451f in start_thread () from /lib/libpthread.so.0
#5  0x0021004e in clone () from /lib/libc.so.6

Thread 1 (process 22240):
#0  _dbus_watch_invalidate (watch=0x0) at dbus-watch.c:147
#1  0x00d9f0ea in free_watches (transport=0x8e3efd0) at dbus-transport-socket.c:82
#2  0x00d9f161 in socket_disconnect (transport=0x8e3efd0) at dbus-transport-socket.c:908
#3  0x00d9cd89 in _dbus_transport_disconnect (transport=0x8e3efd0) at dbus-transport.c:494
#4  0x00d9d97e in _dbus_transport_queue_messages (transport=0x8e3efd0) at dbus-transport.c:1137
#5  0x00d87c98 in _dbus_connection_get_dispatch_status_unlocked (connection=0x8e49940) at dbus-connection.c:3983
#6  0x00d89f6f in _dbus_connection_block_pending_call (pending=0x8e5e580) at dbus-connection.c:2322
#7  0x00d97271 in dbus_pending_call_block (pending=0x0) at dbus-pending-call.c:707
#8  0x00d88f3a in dbus_connection_send_with_reply_and_block (connection=0x8e49940, message=0x8e42ad0, 
    timeout_milliseconds=-1, error=0xb80351a0) at dbus-connection.c:3316
#9  0x026a65d7 in polkit_session_new_from_objpath () from /usr/lib/libpolkit-dbus.so.2
#10 0x026a7026 in polkit_caller_new_from_pid () from /usr/lib/libpolkit-dbus.so.2
#11 0x08059c08 in remoteDispatchAuthPolkit (server=0x8e395a0, client=0xb4d8ce78, conn=0x0, rerr=0xb8035218, 
    args=0xb80352d0, ret=0xb8035290) at remote.c:3057
#12 0x08060664 in remoteDispatchClientRequest (server=0x8e395a0, client=0xb4d8ce78, msg=0xb4d4b390) at remote.c:322
#13 0x0805537f in qemudWorker (data=0x8e43ef8) at qemud.c:1406
#14 0x002d451f in start_thread () from /lib/libpthread.so.0
#15 0x0021004e in clone () from /lib/libc.so.6
Comment 3 Colin Walters 2009-02-09 15:17:35 EST
It looks like some code is using libdbus from multiple threads? I can tell you that basically doesn't work reliably right now. It looks like thread 6 is running the equivalent of a GMainLoop (set up with the DBusConnection), and thread 1 is doing synchronous calls.

I'm surprised it's segfaulting; with multiple threads most of the symptoms I've seen have been lost messages and not crashes, but I wouldn't rule it out.

Here is the multi-thread bug I looked at most recently:
http://bugs.freedesktop.org/show_bug.cgi?id=857
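
For illustration, here's a minimal C sketch of the pattern the backtrace shows: one shared bus connection dispatched from a main-loop thread while a worker blocks on a synchronous call against the same connection. This is not libvirt's actual code - the method call, loop bounds, and file name are placeholders - and it assumes libdbus's locking was never initialized:

  /* Racy use of one shared DBusConnection from two threads, roughly
   * mirroring the backtrace.  Without dbus_threads_init_default(),
   * libdbus takes no locks around the connection's watch list.
   * Build: gcc race.c $(pkg-config --cflags --libs dbus-1) -lpthread */
  #include <dbus/dbus.h>
  #include <pthread.h>
  #include <stdio.h>

  static DBusConnection *shared;

  /* Worker thread: a blocking round-trip, as remoteDispatchAuthPolkit
   * ends up doing via libpolkit-dbus (thread 1 in the backtrace). */
  static void *worker(void *arg)
  {
      DBusError err;
      dbus_error_init(&err);
      DBusMessage *msg = dbus_message_new_method_call(
          "org.freedesktop.DBus", "/org/freedesktop/DBus",
          "org.freedesktop.DBus", "GetId");
      DBusMessage *reply = dbus_connection_send_with_reply_and_block(
          shared, msg, -1, &err);
      if (reply)
          dbus_message_unref(reply);
      else
          fprintf(stderr, "call failed: %s\n", err.message);
      dbus_message_unref(msg);
      dbus_error_free(&err);
      return NULL;
  }

  int main(void)
  {
      /* dbus_threads_init_default();   <- the missing call (comment 8) */
      DBusError err;
      dbus_error_init(&err);
      shared = dbus_bus_get(DBUS_BUS_SYSTEM, &err);
      if (!shared) {
          fprintf(stderr, "connect failed: %s\n", err.message);
          return 1;
      }

      pthread_t tid;
      pthread_create(&tid, NULL, worker, NULL);

      /* "Main loop" thread: dispatching the same connection, like
       * libvirtd's virEventRunOnce -> dbus_watch_callback path. */
      for (int i = 0; i < 1000; i++)
          dbus_connection_read_write_dispatch(shared, 10);

      pthread_join(tid, NULL);
      return 0;
  }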
Comment 4 Colin Walters 2009-02-09 15:22:34 EST
Is the design that qemu is in-process but in a separate thread, and there's a main thread which is handling IPC requests, delegating auth to PolKit and delegating some of those IPC requests to the qemu thread?

What is qemu doing with dbus?  Can that functionality be moved to the main thread?
Comment 5 Daniel Berrange 2009-02-10 04:52:58 EST
This is the libvirtd daemon, which handles all the management tasks. QEMU itself is separate and not impacted. You're more or less correct with the architecture. Thread 6 is the initial thread, which runs the main loop and is responsible for all I/O, sending & receiving all RPC calls. The other 5 threads are all workers which actually execute/process the RPC call.

The thing I probably need to look at first is whether there's a bug in our DBus <-> event loop integration causing our use of watches to not be thread safe.
Comment 6 Daniel Berrange 2009-02-10 05:30:04 EST
Actually the easier solution is probably to just create a private bus instance for talking to PolicyKit, and leave the main thread using the global shared bus instance for the device enumeration stuff with HAL.
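
A minimal sketch of what that workaround could look like, assuming dbus_bus_get_private() for the PolicyKit path; the helper name is illustrative, not libvirt's actual code:

  /* Hypothetical helper: a private bus connection for the PolicyKit
   * path, so worker threads never touch the shared connection's
   * watch list that the main loop dispatches. */
  #include <dbus/dbus.h>
  #include <stddef.h>

  static DBusConnection *
  get_polkit_private_bus(DBusError *err)
  {
      /* Unlike dbus_bus_get(), this always opens a new connection
       * rather than returning the process-wide shared singleton. */
      DBusConnection *conn = dbus_bus_get_private(DBUS_BUS_SYSTEM, err);
      if (!conn)
          return NULL;

      /* Don't let libdbus _exit() the daemon if this connection is
       * ever dropped. */
      dbus_connection_set_exit_on_disconnect(conn, 0 /* FALSE */);

      /* Note: private connections must be dbus_connection_close()d
       * before the final unref when done with them. */
      return conn;
  }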
Comment 7 Colin Walters 2009-02-10 09:39:19 EST
The private connection sounds like a reasonable workaround.  Or is there a way to move some of the PolicyKit interaction to the main thread?  You could do async calls from there.

libdbus thread safety is creeping up in priority, but it's a large leap to go from "mostly thread safe" (where we are now) to "we guarantee it".

Wait a second...actually, look at this:

#3  0x00d9cd89 in _dbus_transport_disconnect (transport=0x8e3efd0) at dbus-transport.c:494

You're getting disconnected from the bus!  I missed that before.  That is likely a bigger problem (though thread safety will still bite).  There are multiple reasons one can get kicked off the bus, but a very common one is sending invalid UTF-8.  

Could there be some sort of race condition e.g. with the main thread free()ing the string passed to the worker thread?
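
For reference, the async alternative suggested above could look roughly like this - the helper and callback names are illustrative, not the real PolicyKit client API:

  /* Hedged sketch: issue the request from the main-loop thread and
   * handle the reply in a pending-call notify callback, so no thread
   * ever blocks on the shared connection. */
  #include <dbus/dbus.h>
  #include <stddef.h>

  static void
  on_auth_reply(DBusPendingCall *pending, void *user_data)
  {
      DBusMessage *reply = dbus_pending_call_steal_reply(pending);
      if (reply) {
          /* ... parse the auth result and resume the waiting RPC ... */
          dbus_message_unref(reply);
      }
      dbus_pending_call_unref(pending);
  }

  static int
  start_auth_call(DBusConnection *conn, DBusMessage *msg)
  {
      DBusPendingCall *pending = NULL;

      /* Queues the message; the reply is delivered later through the
       * main loop's normal dispatch. */
      if (!dbus_connection_send_with_reply(conn, msg, &pending, -1) ||
          pending == NULL)
          return 0;

      return dbus_pending_call_set_notify(pending, on_auth_reply,
                                          NULL, NULL);
  }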
Comment 8 Daniel Berrange 2009-02-18 07:46:27 EST
Doh.  dbus_threads_init_default()  found to be useful :-)
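
Roughly where that fix lands - one call at daemon startup, before any other libdbus use from any thread; the placement shown here is illustrative:

  #include <dbus/dbus.h>
  #include <stdio.h>

  int main(void)
  {
      /* Installs libdbus's default pthread-based locking primitives so
       * shared DBusConnection state is protected by mutexes. */
      if (!dbus_threads_init_default()) {
          fprintf(stderr, "failed to initialize dbus threading\n");
          return 1;
      }
      /* Only now is it safe to call dbus_bus_get() and share the
       * connection between the main loop and the worker threads. */
      return 0;
  }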
Comment 9 Mark McLoughlin 2009-02-19 04:38:17 EST
* Wed Feb 18 2009 Daniel P. Berrange <berrange@redhat.com> - 0.6.0-4.fc11
- Fix QEMU startup timeout/race (rhbz #484649)
- Setup DBus threading. Don't allow dbus to call _exit / change SIGPIPE (rhbz #484553)
- Fix timeout when autostarting session daemon
Comment 10 Daniel Berrange 2009-03-10 08:29:18 EDT
Fix now present in F10 updates-testing build too - 0.6.1-1.fc10
Comment 11 Fedora Update System 2009-03-11 13:58:50 EDT
libvirt-0.6.1-1.fc10 has been pushed to the Fedora 10 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with

  su -c 'yum --enablerepo=updates-testing update libvirt'

You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F10/FEDORA-2009-2396
Comment 12 Bug Zapper 2009-11-18 06:02:42 EST
This message is a reminder that Fedora 10 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 10.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue, and we are sorry that 
we may not be able to fix it before Fedora 10 reaches end of life.  If 
you would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 13 Cole Robinson 2009-11-18 08:54:44 EST
libvirt-0.6.1 wasn't pushed to stable, and is still sitting in updates-testing for F10. This bug won't be fixed in the stable F10 package, but it is fixed in F11.

Moving to F11 and closing as CURRENTRELEASE.