Bug 479064 - Rebooting guest domain causes hald to abort.
Rebooting guest domain causes hald to abort.
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: hal (Show other bugs)
4.7
All Linux
high Severity high
: rc
: ---
Assigned To: Richard Hughes
desktop-bugs@redhat.com
: Patch, Regression, ZStream
Depends On:
Blocks: 661723
  Show dependency treegraph
 
Reported: 2009-01-06 15:36 EST by Bryan Mason
Modified: 2012-06-14 16:55 EDT (History)
8 users (show)

See Also:
Fixed In Version: hal-0.4.2-9.EL4
Doc Type: Bug Fix
Doc Text:
Under certain circumstances, rebooting a guest domain could cause the hald service to terminate unexpectedly and create a core dump. With this update, the underlying source code has been modified to address this issue, and hald no longer crashes upon a guest domain reboot.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-14 16:55:57 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Test patch to add verbosity to dbus memory functions. (2.97 KB, patch)
2009-03-05 14:43 EST, Bryan Mason
no flags Details | Diff
Test patch to redirect _dbus_verbose to /var/log/dbus.log (1.05 KB, patch)
2009-03-05 14:44 EST, Bryan Mason
no flags Details | Diff
Log file created using the above patches. (349.19 KB, text/plain)
2009-03-05 14:45 EST, Bryan Mason
no flags Details
Core file resulting from enabling dbus malloc guards (353.80 KB, application/x-gzip)
2009-03-05 16:18 EST, Bryan Mason
no flags Details
Log file created by running "hald --daemon=no --verbose=yes &> logfile" (186.02 KB, text/plain)
2009-03-05 19:30 EST, Bryan Mason
no flags Details
Log file created by setting DBUS_MALLOC_GUARDS=1 and running "hald --daemon=no --verbose=yes" &> logfile" (169.91 KB, application/octet-stream)
2009-03-05 19:49 EST, Bryan Mason
no flags Details
Patch to remove async matching (4.74 KB, patch)
2009-03-24 09:13 EDT, Olivier Fourdan
no flags Details | Diff

  None (edit)
Description Bryan Mason 2009-01-06 15:36:16 EST
Description of problem:

    If a CD-ROM is defined in the xen configuration file, but does not
    actually exist in Dom0, the hald in DomU will abort.

Version-Release number of selected component (if applicable):

    RHEL 4.7: hal-0.4.2-8.EL4
    RHEL 4.6: hal-0.4.2-7.EL4

    Does not occur in RHEL 4.5: hal-0.4.2-6.EL4    

How reproducible:

    Unknown.  Reported to happen every time, but unable to reproduce
    in test lab.

Steps to Reproduce:

    1. Install DomU using virt-install.  Define a non-existent CD-ROM
       in the xen configuration file:

       disk = [ ..., ",hdc:cdrom,r" ]

    2. Enable core-dumps in DomU by adding the following line to
       /etc/init.d/functions:

           DAEMON_COREFILE_LIMIT=unlimited

       and changing the following line in /etc/profile:

           ulimit -S -c 0 > /dev/null 2>> 

       to

           ulimit -S -c unlimited > /dev/null 2>1  <-- after edit

    3. Start haldaemon in DomU.

    4. Reboot DomU.

Actual results:

    hald aborts and a corefile is generated.

Expected results:

    No abort or corefile.

Additional info:

    As described above, this is a regression from RHEL 4.5.  This has
    only been reproduce on ia64, although it may happen on other
    architectures.
Comment 3 RHEL Product and Program Management 2009-01-06 15:42:55 EST
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.
Comment 6 Bryan Mason 2009-03-03 17:03:18 EST
Some additional data points that I've been able to discover over the last couple of weeks:

1) The problem seems to be reproducible only when the system is shut
   down.  Running

       service haldaemon start; sleep 10; service haldaemon stop

   does not seem to cause the problem.

2) I modified the init scripts to run:

       /usr/sbin/hald --daemon=no &> /var/log/hald.log &

   and

       killall -TERM hald

   to start and stop hald so I could capture any extra information
   being written to stdout/stderr.  When I did this, I was unable to
   reproduce the problem.

I don't know if those data helps, but I thought it would be worth reporting anyway.
Comment 9 Bryan Mason 2009-03-05 14:40:32 EST
Here's a report on some investigation that I've done to try and figure out why hald is segfaulting:

To track memory allocation through the dbus memory functions, I made changes to dbus/dbus-memory.c to call _dbus_verbose() from within dbus_malloc(), dbus_malloc0(), dbus_realloc(), and dbus_free().  For each case, I printed the parameters to the call as well as the return value and the value of n_blocks_outstanding.

Because the problem only seems to reproduce if haldaemon is run as a daemon, I made changes to dbus/dbus-internals.c to cause _dbus_verbose() to send messages to /var/log/dbus.log instead of stdout.

I rebuilt libdbus-1.so.0 and modified /etc/rc.d/init.d/haldaemon to include the following:

    export HALD_VERBOSE=1
    export DBUS_VERBOSE=1
    export DBUS_PRINT_BACKTRACE=1
    export LD_PRELOAD=/root/dbus-srpm/libdbus-1.so.0

Based on my analysis of /var/log/dbus.log created when haldaemon segfaulted, it looks like dbus_free() is being called on memory blocks that were never allocated using the dbus memory functions.  Before haldaemon segfaults, the following lines appear in /var/log/dbus.log:

   4332 4468: dbus_free((nil)) called with n_blocks_outstanding=63
   4333 4468: dbus_free(0x60000000000690a0) called with n_blocks_outstanding=63
   4334 4468: dbus_free(0x6000000000069300) called with n_blocks_outstanding=62
   4335 4468: dbus_free(0x6000000000040cd0) called with n_blocks_outstanding=61
   4336 4468: dbus_free(0x600000000006f580) called with n_blocks_outstanding=60
   4337 4468: dbus_free(0x60000000000406a0) called with n_blocks_outstanding=59
   4338 4468: dbus_free(0x600000000006f520) called with n_blocks_outstanding=58
   4339 4468: dbus_free(0x60000000000682b0) called with n_blocks_outstanding=57
   4340 4468: dbus_free(0x600000000006f3d0) called with n_blocks_outstanding=56
   4341 4468: dbus_free(0x6000000000065bc0) called with n_blocks_outstanding=55
   4342 4468: dbus_free(0x6000000000067150) called with n_blocks_outstanding=54
   4343 4468: dbus_free(0x6000000000068460) called with n_blocks_outstanding=53
   4344 4468: dbus_free(0x6000000000066d30) called with n_blocks_outstanding=52
   4345 4468: dbus_free(0x6000000000066970) called with n_blocks_outstanding=51
   4346 4468: dbus_free(0x600000000006e880) called with n_blocks_outstanding=50
   4347 4468: dbus_free(0x600000000006e5d0) called with n_blocks_outstanding=49
   4348 4468: dbus_free(0x6000000000069080) called with n_blocks_outstanding=48
   4349 4468: dbus_free(0x6000000000040e30) called with n_blocks_outstanding=47
   4350 4468: dbus_free(0x6000000000049fd0) called with n_blocks_outstanding=46
   4351 4468: dbus_free(0x600000000004a180) called with n_blocks_outstanding=45
   4352 4468: dbus_free(0x600000000004a300) called with n_blocks_outstanding=44
   4353 4468: dbus_free(0x6000000000050d30) called with n_blocks_outstanding=43
   4354 4468: dbus_free(0x6000000000050ee0) called with n_blocks_outstanding=42
   4355 4468: dbus_free(0x6000000000050f80) called with n_blocks_outstanding=41
   4356 4468: dbus_free(0x600000000005e7d0) called with n_blocks_outstanding=40
   4357 4468: dbus_free(0x6000000000051c60) called with n_blocks_outstanding=39
   4358 4468: dbus_free(0x600000000005e100) called with n_blocks_outstanding=38
   4359 4468: dbus_free(0x600000000005e2b0) called with n_blocks_outstanding=37
   4360 4468: dbus_free(0x600000000005e460) called with n_blocks_outstanding=36
   4361 4468: dbus_free(0x600000000005a230) called with n_blocks_outstanding=35
   4362 4468: dbus_free(0x600000000005a3e0) called with n_blocks_outstanding=34
   4363 4468: dbus_free(0x600000000005a590) called with n_blocks_outstanding=33
   4364 4468: dbus_free(0x6000000000054e70) called with n_blocks_outstanding=32
   4365 4468: dbus_free(0x6000000000055020) called with n_blocks_outstanding=31
   4366 4468: dbus_free(0x60000000000551d0) called with n_blocks_outstanding=30
   4367 4468: dbus_free(0x6000000000037840) called with n_blocks_outstanding=29
   4368 4468: dbus_free(0x60000000000379f0) called with n_blocks_outstanding=28
   4369 4468: dbus_free(0x6000000000037b70) called with n_blocks_outstanding=27
   4370 4468: dbus_free(0x600000000005a820) called with n_blocks_outstanding=26
   4371 4468: dbus_free(0x600000000005a9d0) called with n_blocks_outstanding=25

As far as I can tell, the calls to dbus_free on 4336-4337 and on 4338-4371 are freeing blocks of memory that were not returned by dbus_malloc(), dbus_malloc0(), or dbus_realloc().

If these blocks of memory have not been allocated by the dbus memory routines, then n_blocks_outstanding was not incremented when these blocks were allocated.  Since these blocks are being freed by dbus_free, n_blocks_outstanding is being decremented when it shouldn't be.

The next step I'm planning on taking is to add some debugging code to the dbus memory routines to keep track of all the blocks that have been allocated and print a backtrace when dbus_free is called on a block that was not allocated by one of the functions in dbus-memory.c.  This should tell us who is incorrectly calling dbus_free().
Comment 10 Bryan Mason 2009-03-05 14:43:24 EST
Created attachment 334189 [details]
Test patch to add verbosity to dbus memory functions.
Comment 11 Bryan Mason 2009-03-05 14:44:07 EST
Created attachment 334190 [details]
Test patch to redirect _dbus_verbose to /var/log/dbus.log
Comment 12 Bryan Mason 2009-03-05 14:45:56 EST
Created attachment 334191 [details]
Log file created using the above patches.
Comment 13 Bryan Mason 2009-03-05 16:18:05 EST
Created attachment 334211 [details]
Core file resulting from enabling dbus malloc guards

As it turns out, enabling the memory guards in the dbus memory routines uncovered which routine was freeing memory that hadn't been allocated using dbus_{m,re}alloc*.  Adding

    export DBUS_MALLOC_GUARDS=1

to /etc/rc.d/init.d/haldaemon and rebooting resulted in the attached core file with the following backtrace:

Thread 1 (process 4463):
#0  0xa000000000010640 in __kernel_syscall_via_break ()
#1  0x20000000004ab8b0 in raise () from /lib/tls/libc.so.6.1
#2  0x20000000004ae290 in abort () from /lib/tls/libc.so.6.1
#3  0x20000000002c1ab0 in _dbus_abort () at dbus-sysdeps.c:93
#4  0x2000000000297310 in _dbus_real_assert_not_reached (
    explanation=0x20000000002db708 "Invalid malloc block source ID", 
    file=0x20000000002db728 "dbus-memory.c", line=315) at dbus-internals.c:476
#5  0x20000000002a34f0 in source_string (source=1610612736)
    at dbus-memory.c:315
#6  0x20000000002a35a0 in check_guards (free_block=Variable "free_block" is 
    not available.) at dbus-memory.c:343
#7  0x20000000002a3f20 in dbus_free (memory=0x60000000000348a0)
    at dbus-memory.c:610
#8  0x4000000000027400 in udev_node_created_cb (store=0x6000000000045800, 
    device=0x0, user_data=0x60000000000348a0) at linux/osspec.c:1206
#9  0x4000000000014190 in match_device_async_timeout (
    user_data=0x600000000006b360) at device_store.c:435
#10 0x200000000017f6d0 in g_timeout_dispatch (source=0x600000000006d710, 
    callback=@0x4000000000062970: 0x4000000000014140
    <match_device_async_timeout>, user_data=0x600000000006b360) at gmain.c:3301
#11 0x20000000001763d0 in g_main_context_dispatch (context=0x60000000000334f0)
    at gmain.c:1942
#12 0x200000000017bc50 in g_main_context_iterate (context=0x60000000000334f0, 
    block=6, dispatch=1, self=0x1) at gmain.c:2573
#13 0x200000000017c5f0 in g_main_loop_run (loop=0x6000000000035c10)
    at gmain.c:2777
#14 0x4000000000015440 in main (argc=2, argv=0x60000fffffffbc78) at hald.c:513
#15 0x20000000004816b0 in __libc_start_main () from /lib/tls/libc.so.6.1
#16 0x4000000000007820 in _start ()
Comment 14 Bryan Mason 2009-03-05 19:30:29 EST
Created attachment 334245 [details]
Log file created by running "hald --daemon=no --verbose=yes &> logfile"

I take back what I said in Comment #6 and #8.  It is possible to reproduce this by running "hald --daemon=no --verbose=yes".  As is turns out, hald was probably aborting, but wasn't producing a core file because I hadn't run "ulimit -c unlimited" in the init script.

Here's the log file that was created.  I have the corresponding core file if you're interested.
Comment 15 Bryan Mason 2009-03-05 19:49:50 EST
Created attachment 334246 [details]
Log file created by setting DBUS_MALLOC_GUARDS=1 and running "hald --daemon=no --verbose=yes" &> logfile"

I may also have to amend what I said in Comment #13.  I now believe that turning enabling the dbus malloc guards causes hald to abort while running.  After a reboot of the test system, hald was running, but then aborted and dumped core with the same signature as that in Comment #13.  I was running haldaemon with the "--daemon=no" option, so I was able to capture the attached log file.  Again, I have the core file if you're interested.
Comment 25 Olivier Fourdan 2009-04-23 08:55:00 EDT
Comment on attachment 336457 [details]
Patch to remove async matching

> Remove async matching, to mimic upstream git commit 1b5448d2f839f889da56d348990fdbcb0e6d038a
>
>  http://cgit.freedesktop.org/hal/commit/?id=1b5448d2f839f889da56d348990fdbcb0e6d038a
>
> Fixes hald crashing when removing a USB stick
>
> -- Olivier Fourdan <ofourdan@redhat.com>
>
> device_store.c |  127 ---------------------------------------------------------
> device_store.h |    7 ---
> linux/osspec.c |    7 ---
> 3 files changed, 2 insertions(+), 139 deletions(-)
>
>--- hal-0.4.2/hald/device_store.c.remove-async-matching	2009-03-24 11:51:57.000000000 +0000
>+++ hal-0.4.2/hald/device_store.c	2009-03-24 11:52:04.000000000 +0000
>@@ -353,130 +353,3 @@
> 
> 	return matches;
> }
>-
>-typedef struct {
>-	HalDeviceStore *store;
>-	char *key;
>-	char *value;
>-	HalDeviceStoreAsyncCallback callback;
>-	gpointer user_data;
>-
>-	guint prop_signal_id;
>-	guint store_signal_id;
>-	guint timeout_id;
>-} AsyncMatchInfo;
>-
>-static void
>-destroy_async_match_info (AsyncMatchInfo *info)
>-{
>-	g_object_unref (info->store);
>-
>-	g_free (info->key);
>-	g_free (info->value);
>-
>-	g_signal_handler_disconnect (info->store, info->prop_signal_id);
>-	g_signal_handler_disconnect (info->store, info->store_signal_id);
>-	g_source_remove (info->timeout_id);
>-
>-	g_free (info);
>-}
>-
>-static void
>-match_device_async (HalDeviceStore *store, HalDevice *device,
>-		    const char *key, gboolean removed, gboolean added,
>-		    gpointer user_data)
>-{
>-	AsyncMatchInfo *info = (AsyncMatchInfo *) user_data;
>-
>-	/* Only want to do it for added or changed properties */
>-	if (removed)
>-		return;
>-
>-	/* Keys have to match */
>-	if (strcmp (info->key, key) != 0)
>-		return;
>-
>-	/* Values have to match */
>-	if (strcmp (hal_device_property_get_string (device, key),
>-		    info->value) != 0)
>-		return;
>-
>-	info->callback (store, device, info->user_data);
>-
>-	destroy_async_match_info (info);
>-}
>-
>-static void
>-store_changed (HalDeviceStore *store, HalDevice *device,
>-	       gboolean added, gpointer user_data)
>-{
>-	AsyncMatchInfo *info = (AsyncMatchInfo *) user_data;
>-
>-	if (!added)
>-		return;
>-
>-	if (!hal_device_has_property (device, info->key))
>-		return;
>-
>-	if (strcmp (hal_device_property_get_string (device, info->key),
>-		    info->value) != 0)
>-		return;
>-
>-	info->callback (store, device, info->user_data);
>-
>-	destroy_async_match_info (info);
>-}
>-
>-static gboolean
>-match_device_async_timeout (gpointer user_data)
>-{
>-	AsyncMatchInfo *info = (AsyncMatchInfo *) user_data;
>-
>-	info->callback (info->store, NULL, info->user_data);
>-
>-	destroy_async_match_info (info);
>-
>-	return FALSE;
>-}
>-
>-void
>-hal_device_store_match_key_value_string_async (HalDeviceStore *store,
>-					       const char *key,
>-					       const char *value,
>-					       HalDeviceStoreAsyncCallback callback,
>-					       gpointer user_data,
>-					       int timeout)
>-{
>-	HalDevice *device;
>-	AsyncMatchInfo *info;
>-
>-	/* First check to see if it's already there */
>-	device = hal_device_store_match_key_value_string (store, key, value);
>-
>-	if (device != NULL || timeout == 0) {
>-		callback (store, device, user_data);
>-
>-		return;
>-	}
>-
>-	info = g_new0 (AsyncMatchInfo, 1);
>-
>-	info->store = g_object_ref (store);
>-	info->key = g_strdup (key);
>-	info->value = g_strdup (value);
>-	info->callback = callback;
>-	info->user_data = user_data;
>-
>-	info->prop_signal_id = g_signal_connect (store,
>-						 "device_property_changed",
>-						 G_CALLBACK (match_device_async),
>-						 info);
>-	info->store_signal_id = g_signal_connect (store,
>-						  "store_changed",
>-						  G_CALLBACK (store_changed),
>-						  info);
>-
>-	info->timeout_id = g_timeout_add (timeout,
>-					  match_device_async_timeout,
>-					  info);
>-}
>--- hal-0.4.2/hald/linux/osspec.c.remove-async-matching	2009-03-24 11:52:38.000000000 +0000
>+++ hal-0.4.2/hald/linux/osspec.c	2009-03-24 11:55:16.000000000 +0000
>@@ -1406,13 +1406,10 @@
> 	if (strcmp(action, "add") == 0) {
> 
> 		/* When udev gives the SEQNUM this can be synchronous. */
>-		hal_device_store_match_key_value_string_async (
>+		hal_device_store_match_key_value_string (
> 			hald_get_tdl (),
> 			".udev.sysfs_path",
>-			sysfs_path_full,
>-			udev_node_created_cb, 
>-			g_strdup (device_name), /* will be freed in udev_node_created_cb */
>-			HAL_LINUX_HOTPLUG_TIMEOUT);
>+			sysfs_path_full);
> 	}
> 
> 	g_free (subsystem);
>--- hal-0.4.2/hald/device_store.h.remove-async-matching	2009-03-24 11:51:50.000000000 +0000
>+++ hal-0.4.2/hald/device_store.h	2009-03-24 11:52:04.000000000 +0000
>@@ -109,13 +109,6 @@
> 								  const char *key,
> 								  const char *value);
> 
>-void           hal_device_store_match_key_value_string_async (HalDeviceStore *store,
>-							      const char *key,
>-							      const char *value,
>-							      HalDeviceStoreAsyncCallback callback,
>-							      gpointer user_data,
>-							      int timeout);
>-
> void hal_device_store_print (HalDeviceStore *store);
> 
>
Comment 33 Vladimir Benes 2010-11-09 10:05:50 EST
reproducer in comment #0,
granting qa_ack
Comment 37 Richard Hughes 2010-12-09 07:59:01 EST
Build as http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2962064
Comment 39 Jaromir Hradilek 2011-01-07 07:02:30 EST
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Under certain circumstances, rebooting a guest domain could cause the hald service to terminate unexpectedly and create a core dump. With this update, the underlying source code has been modified to address this issue, and hald no longer crashes upon a guest domain reboot.
Comment 40 Shad L. Lords 2011-03-14 14:45:51 EDT
After installing hal-0.4.2-9.EL4 that was released under bug #661723 there is a regression in the detection of usb devices.

/var/log /messages shows the following:

Mar 14 16:45:32 iwebos kernel: Attached scsi disk sdb at scsi1, channel 0, id 0, lun 0
Mar 14 16:45:33 iwebos last message repeated 2 times
Mar 14 16:45:46 iwebos hald[4506]: Timed out waiting for hotplug event 563. Rebasing to 562

If I roll back to hal-0.4.2-8.EL4 then the messages files ends with the following:

Mar 14 18:27:10 iwebos kernel: Attached scsi disk sdb at scsi1, channel 0, id 0, lun 0
Mar 14 18:27:10 iwebos scsi.agent[5178]: disk at /devices/pci0000:00/0000:00:11.0/0000:02:03.0/usb1/1-1/1-1:1.0/host1/target1:0:0/1:0:0:0
Mar 14 18:27:10 iwebos kernel: USB Mass Storage device found at 2
Mar 14 18:27:10 iwebos kernel: usbcore: registered new driver usb-storage
Mar 14 18:27:10 iwebos kernel: USB Mass Storage support registered.
Mar 14 18:27:11 iwebos fstab-sync[5256]: added mount point /media/usbdisk for /dev/sdb1

Note You need to log in before you can comment on or make changes to this bug.