Bug 142183 - hald crash on sk98lin
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: hal
Version: 3
Hardware: i686 Linux
Priority: medium, Severity: medium
Assigned To: David Zeuthen
Duplicates: 142218 142671 143176
Reported: 2004-12-07 17:56 EST by Marek Kassur
Modified: 2007-11-30 17:10 EST
Fixed In Version: 0.4.5
Doc Type: Bug Fix
Last Closed: 2005-01-13 14:08:22 EST

Attachments
last fragment of: strace hald --verbose=yes --daemon=no 2>strace.hald (10.39 KB, text/plain)
2004-12-07 17:59 EST, Marek Kassur
backtrace of hald (2.55 KB, text/plain)
2004-12-09 11:43 EST, Serg Oskin
tree /sys (48.04 KB, text/plain)
2004-12-09 12:34 EST, Serg Oskin
Rever patch of gobject/gsignal.c (3.81 KB, patch)
2004-12-16 17:40 EST, Marek Kassur
More precise backtrace of hald (3.40 KB, text/plain)
2004-12-16 21:27 EST, Marek Kassur
Description Marek Kassur 2004-12-07 17:56:51 EST
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3)
Gecko/20041020

Description of problem:
hald crashes when the sk98lin module is in use; earlier versions of hal worked.

sh-3.00# hald --verbose=yes --daemon=no
<..snip..>
23:39:21.704 [I] linux/osspec.c:793: handling /sys/class/net/eth1 net
23:39:21.711 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address

23:39:21.711 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address

23:39:21.711 [W] linux/net_class_device.c:257: Error reading link info
23:39:21.711 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address

23:39:21.711 [W] linux/net_class_device.c:193: Error reading rate info
Segmentation fault




Version-Release number of selected component (if applicable):
hal-0.4.2-1.FC3

How reproducible:
Always

Steps to Reproduce:
1. service haldaemon start

Additional info:
Comment 1 Marek Kassur 2004-12-07 17:59:47 EST
Created attachment 108070 [details]
last fragment of: strace hald --verbose=yes --daemon=no 2>strace.hald
Comment 2 Serg Oskin 2004-12-08 09:24:21 EST
I have exactly the same problem.
Comment 3 Marek Kassur 2004-12-08 18:47:36 EST
# lspci -v -s 02:05.0
02:05.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T [Marvell] (rev 12)
        Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
        Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 217
        Memory at feaf8000 (32-bit, non-prefetchable) [size=16K]
        I/O ports at d800 [size=256]
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data
Comment 4 Jan Bernhardt 2004-12-09 04:01:28 EST
Exactly the same problem and hardware here. See bug id 142218.
Comment 5 David Zeuthen 2004-12-09 11:01:16 EST
*** Bug 142218 has been marked as a duplicate of this bug. ***
Comment 6 David Zeuthen 2004-12-09 11:04:14 EST
Can anyone give me a backtrace? You'll need the hal-debuginfo package
installed and do

# gdb /usr/sbin/hald

and then give the command 'run --daemon=no --verbose=yes'. When the
crash occurs give the command 'backtrace' and post the output.

Also see http://fedora.linux.duke.edu/wiki/index.cgi/StackTraces

Thanks,
David
Comment 7 Serg Oskin 2004-12-09 11:43:58 EST
Created attachment 108225 [details]
backtrace of hald
Comment 8 David Zeuthen 2004-12-09 12:18:40 EST
Serg: Thanks for the backtrace - please also attach the output of
'tree /sys' (you'll need the tree package for that command).
Comment 9 Serg Oskin 2004-12-09 12:34:41 EST
Created attachment 108239 [details]
tree /sys

This information may also help:

# mii-tool
SIOCGMIIPHY on 'eth0' failed: Bad address
no MII interfaces found
#
Comment 10 David Zeuthen 2004-12-10 16:46:37 EST
Interesting - I fixed one little glitch; I'm not sure it will solve the
issue, but please try out these RPMs

 http://people.redhat.com/davidz/hal-testing2/

and let me know; thanks.
Comment 11 Benjamin Lebsanft 2004-12-10 17:12:54 EST
still doesn't work for me
Comment 12 Anthony Rumble 2004-12-10 17:19:41 EST
ditto
Comment 13 Fernando J. Leal 2004-12-10 17:59:39 EST
After installing the package above (hal-0.4.2.cvs20041210-1.i386.rpm) the
problem is still the same: the same segmentation fault occurs when running
"hald --verbose=yes --daemon=no".
Comment 14 David Zeuthen 2004-12-10 18:08:44 EST
Hmm, try running '/usr/bin/valgrind --tool=memcheck /usr/sbin/hald
--daemon=no --verbose=yes' and post the output.
Comment 15 Marek Kassur 2004-12-10 18:52:20 EST
Here is the interesting part:

00:49:05.362 [I] linux/osspec.c:793: handling /sys/class/net/eth1 net
00:49:05.450 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address

00:49:05.451 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address

00:49:05.452 [W] linux/net_class_device.c:257: Error reading link info
00:49:05.458 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address

00:49:05.459 [W] linux/net_class_device.c:193: Error reading rate info
==4082==
==4082== Invalid read of size 2
==4082==    at 0x3E6550: (within /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x3E7CBB: g_signal_emit_valist (in /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x3E7F59: g_signal_emit (in /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x804E1DE: ??? (device.c:742)
==4082==  Address 0x5614 is not stack'd, malloc'd or (recently) free'd
==4082==
==4082== Process terminating with default action of signal 11 (SIGSEGV)
==4082==  Access not within mapped region at address 0x5614
==4082==    at 0x3E6550: (within /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x3E7CBB: g_signal_emit_valist (in /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x3E7F59: g_signal_emit (in /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x804E1DE: ??? (device.c:742)
==4082==
==4082== ERROR SUMMARY: 92 errors from 24 contexts (suppressed: 21 from 1)
==4082== malloc/free: in use at exit: 3688412 bytes in 13978 blocks.
==4082== malloc/free: 63738 allocs, 49760 frees, 15613063 bytes allocated.
==4082== For a detailed leak analysis,  rerun with: --leak-check=yes
==4082== For counts of detected errors, rerun with: -v
Comment 16 David Zeuthen 2004-12-12 12:59:09 EST
*** Bug 142671 has been marked as a duplicate of this bug. ***
Comment 17 Jason Grant 2004-12-13 22:24:32 EST
I have this problem too, on the same hardware, after installing the
fc3 updates.

Just one more observation - even when I roll back HAL to the version
shipped with the release of fc3 (0.4.0-10), I still get the
segmentation fault.  This occurs whether I use the original kernel, or
the one currently available as a fc3 update (2.6.9-1.681_FC3smp). 
This makes me wonder whether the bug has been introduced via one of
the other fc3 updates.
Comment 18 Jason Grant 2004-12-13 23:40:27 EST
Here are the fc3 packages that were installed on my host just prior to
the onset of HAL seg faults:

[Fri Dec 10 21:28:16 2004] up2date installing packages:
['Omni-0.9.2-1.1', 'Omni-foomatic-0.9.2-1.1', 'gaim-1.1.0-0.FC3',
'glib2-2.4.8-1.fc3', 'glib2-devel-2.4.8-1.fc3', 'gtk2-2.4.14-1.fc3',
'gtk2-devel-2.4.14-1.fc3', 'libpng-1.2.8-1.fc3',
'libpng-devel-1.2.8-1.fc3', 'libpng10-1.0.18-1.fc3',
'libpng10-devel-1.0.18-1.fc3', 'nfs-utils-1.0.6-44', 'rhpl-0.148.1-2',
'rsh-0.17-24.1', 'selinux-policy-targeted-1.17.30-2.39',
'shadow-utils-4.0.3-56', 'udev-039-10.FC3.5',
'wireless-tools-27-0.pre25.3', 'xorg-x11-6.8.1-12.FC3.21',
'xorg-x11-Mesa-libGL-6.8.1-12.FC3.21',
'xorg-x11-Mesa-libGLU-6.8.1-12.FC3.21',
'xorg-x11-deprecated-libs-6.8.1-12.FC3.21',
'xorg-x11-deprecated-libs-devel-6.8.1-12.FC3.21',
'xorg-x11-devel-6.8.1-12.FC3.21',
'xorg-x11-font-utils-6.8.1-12.FC3.21',
'xorg-x11-libs-6.8.1-12.FC3.21', 'xorg-x11-tools-6.8.1-12.FC3.21',
'xorg-x11-twm-6.8.1-12.FC3.21', 'xorg-x11-xauth-6.8.1-12.FC3.21',
'xorg-x11-xfs-6.8.1-12.FC3.21'
Comment 19 Marek Kassur 2004-12-14 11:16:13 EST
/usr/lib/libgobject-2.0.so.0.400.8 is where the segfault occurred; it is
part of glib2 (glib2-2.4.8-1.fc3).

https://www.redhat.com/archives/fedora-announce-list/2004-December/msg00055.html
Comment 20 Jason Grant 2004-12-14 22:40:52 EST
I just rolled back to the glib2 version that was released with fc3
(2.4.7-1).  HAL is no longer crashing on startup.

(I did the downgrade with 'rpm -U --oldpackage glib2-2.4.7-1*rpm' - this
leaves dangling symlinks in /usr/lib that need to be fixed manually)
Comment 21 Mike Voxx 2004-12-15 00:51:00 EST
so what do we do from here? where do we get updates on the status of
this problem? 
Comment 22 Jason Grant 2004-12-15 06:45:45 EST
yes, I'm unclear on whether it's a glib2 bug, or whether the crash
occurs in glib2 because HAL is passing it junk.
Comment 23 David Zeuthen 2004-12-15 12:54:03 EST
Matthias, do you know of any known regressions between glib2-2.4.8-1
and glib2-2.4.7-1?
Comment 24 Marek Kassur 2004-12-16 17:40:01 EST
Created attachment 108756 [details]
Rever patch of gobject/gsignal.c

This small patch reverts gsignal.c to its 2.4.7 version, which makes HAL work
again. Maybe our bug is hiding somewhere in there, but where?
Comment 25 David Zeuthen 2004-12-16 21:26:39 EST
*** Bug 143176 has been marked as a duplicate of this bug. ***
Comment 26 Marek Kassur 2004-12-16 21:27:15 EST
Created attachment 108771 [details]
More precise backtrace of hald

static inline void
handler_ref (Handler *handler)
{
  g_return_if_fail (handler->ref_count > 0);
^^^^^^^
It segfaults here (if I have it right): glib-2.4.8/gobject/gsignal.c:564
Comment 27 Matthias Clasen 2004-12-16 22:21:29 EST
David, I'm not aware of any problems with the gsignal optimization in 2.4.8
so far. Looking at the patch, nothing obvious jumps out. Can you
reproduce the segfault? It might be worth trying to run the thing
under valgrind to see if the handler list becomes corrupted at some point.

I'll join your efforts to debug this on Monday, as I won't be there
tomorrow.
Comment 28 Matthias Clasen 2004-12-16 22:23:18 EST
One further question: is hald using threads, so that reentrancy issues
could be involved?
Comment 29 David Zeuthen 2004-12-16 22:34:33 EST
Hi Matthias, 

No, hald is not using threads; there is some reentrancy involved
though due to the rather asynchronous nature of how hald works. There
is also a valgrind trace in comment 15.

Btw, the bug only seems to occur with the sk98lin network driver. I
suspect it's writing too much data into a struct allocated on the
stack when doing an ioctl(), thereby corrupting memory. I will try to
dig into the driver source to see what is happening. I'll also try to
allocate the struct for the ioctl on the heap to see if that makes the
crash go away.
Comment 30 Matthias Clasen 2004-12-16 22:50:18 EST
Ok, my next idea would be to write a function to check the integrity
of the handler list and call that from suitable places to catch when
and how it might get corrupted. 
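
A rough sketch of what such an integrity check could look like. The real
Handler struct is private to gobject/gsignal.c and has more fields, so
DemoHandler, demo_handler_list_ok and DEMO_CHECK_HANDLERS below are
hypothetical stand-ins for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a signal handler node (assumption, not GLib's
 * actual layout). */
typedef struct DemoHandler DemoHandler;
struct DemoHandler {
    DemoHandler *next;
    DemoHandler *prev;
    unsigned int ref_count;
};

/* Walk the doubly linked list from head and report whether it is
 * consistent: every back link matches the node we came from, every node
 * has a positive ref_count, and the walk ends within max_nodes steps. */
bool demo_handler_list_ok(const DemoHandler *head, size_t max_nodes)
{
    const DemoHandler *prev = NULL;
    const DemoHandler *h;
    size_t seen = 0;

    for (h = head; h != NULL; prev = h, h = h->next) {
        if (++seen > max_nodes)
            return false;           /* too long: probably a cycle */
        if (h->prev != prev)
            return false;           /* broken back link */
        if (h->ref_count == 0)
            return false;           /* node should already be gone */
    }
    return true;
}

/* Drop this at suspicious call sites to abort at the first inconsistency. */
#define DEMO_CHECK_HANDLERS(head) assert(demo_handler_list_ok((head), 100000))
```

A check like this cannot validate a wild pointer such as 0x5614 without
faulting itself, but a crash inside the checker still narrows down the point
at which the list went bad.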
Comment 31 Jason Grant 2004-12-20 03:51:36 EST
David,

Relating to your idea about struct overflow on an ioctl(), my host has
three ethernet controllers as shown below (output from lspci).  The
3Com one is on the motherboard, and it is *disabled* under Linux
(i.e. not listed by ifconfig), so there's no network traffic going
through it.

I'm hoping that this observation will save you some time, since it
means that any such leak is occurring even without transmission of
ethernet data.

02:05.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T [Marvell] (rev 12)
02:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
02:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
Comment 32 David Zeuthen 2005-01-03 16:19:04 EST
I've tried the approach I mentioned in comment 29. Please try the
RPMs from

 http://people.redhat.com/davidz/cvs20050103/

If you're not on x86 you may rebuild from the SRPM.

Thanks,
David
Comment 33 Marek Kassur 2005-01-03 16:37:36 EST
Yes, it works for me.

Big Thanks,
Marek.
Comment 34 Igor Miletic 2005-01-03 16:42:48 EST
Works fine here too (glib2-2.6).

Thanks,
Igor
Comment 35 Serg Oskin 2005-01-03 16:59:54 EST
Works for me too.

Thanks,
Serg.
Comment 36 Jamie Zawinski 2005-01-03 22:45:28 EST
Yes, this makes hald stop crashing for me.

I still can't seem to talk to either of my CF card readers, though,
but I imagine that's a different bug?  When I plug either of them
in, I get this in syslog, and nothing shows up in /media/.

I have seen both of these readers work (erratically) on RH9.
Generally they'd work once, then if I tried to use them again
the next day, my only option would be to reboot first.


Dazzle USB 2.0 reader (unsure of model number):

  Jan  3 19:35:39 grendel kernel: usb 5-1: new full speed USB device using address 52
  Jan  3 19:35:39 grendel kernel: usb 5-1: device not accepting address 52, error -71
  Jan  3 19:35:39 grendel kernel: usb 5-1: new full speed USB device using address 53
  Jan  3 19:35:40 grendel kernel: usb 5-1: device not accepting address 53, error -71

SanDisk ImageMate SDDR-31 USB 1.0 reader:

  Jan  3 19:37:57 grendel kernel: usb 5-1: new full speed USB device using address 56
  Jan  3 19:37:58 grendel kernel: usb 5-1: device not accepting address 56, error -71
  Jan  3 19:37:58 grendel kernel: usb 5-1: new full speed USB device using address 57
  Jan  3 19:37:58 grendel kernel: usb 5-1: device not accepting address 57, error -71

The SanDisk works fine in an (elderly, USB-1) Macintosh; the Dazzle
doesn't.
Comment 37 Jason Grant 2005-01-04 02:45:48 EST
Fixed for me.  Now using glib2-2.4.8-1 OK.  Thanks.
Comment 38 Benjamin Lebsanft 2005-01-05 07:31:07 EST
hal daemon now works for me, but my cf card reader is not recognized
Comment 39 David Zeuthen 2005-01-13 14:08:22 EST
This fix is in hal-0.4.5, available from Rawhide, and it will also
appear as an FC3 update. Closing.
