142183 – hald crash on sk98lin

Summary: hald crash on sk98lin

Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	hal
Version:	3
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	David Zeuthen
Duplicates (3):	142218 142671 143176

Reported:	2004-12-07 22:56 UTC by Marek Kassur
Modified:	2007-11-30 22:10 UTC
CC List:	12 users
Fixed In Version:	0.4.5
Last Closed:	2005-01-13 19:08:22 UTC
Type:	---

Attachments
last fragment of: strace hald --verbose=yes --daemon=no 2>strace.hald (10.39 KB, text/plain) 2004-12-07 22:59 UTC, Marek Kassur
backtrace of hald (2.55 KB, text/plain) 2004-12-09 16:43 UTC, Serg Oskin
tree /sys (48.04 KB, text/plain) 2004-12-09 17:34 UTC, Serg Oskin
Revert patch of gobject/gsignal.c (3.81 KB, patch) 2004-12-16 22:40 UTC, Marek Kassur
More precise backtrace of hald (3.40 KB, text/plain) 2004-12-17 02:27 UTC, Marek Kassur

Description Marek Kassur 2004-12-07 22:56:51 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3)
Gecko/20041020

Description of problem:
hald crashes with the sk98lin module loaded, although it worked with earlier versions of hal.

sh-3.00# hald --verbose=yes --daemon=no
<..snip..>
23:39:21.704 [I] linux/osspec.c:793: handling /sys/class/net/eth1 net
23:39:21.711 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address
23:39:21.711 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address
23:39:21.711 [W] linux/net_class_device.c:257: Error reading link info
23:39:21.711 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address
23:39:21.711 [W] linux/net_class_device.c:193: Error reading rate info
Segmentation fault

Version-Release number of selected component (if applicable):
hal-0.4.2-1.FC3

How reproducible:
Always

Steps to Reproduce:
1. service haldaemon start

Comment 1 Marek Kassur 2004-12-07 22:59:47 UTC

Created attachment 108070
last fragment of: strace hald --verbose=yes --daemon=no 2>strace.hald

Comment 2 Serg Oskin 2004-12-08 14:24:21 UTC

I have exactly the same problem.

Comment 3 Marek Kassur 2004-12-08 23:47:36 UTC

# lspci -v -s 02:05.0
02:05.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T
[Marvell] (rev 12)
        Subsystem: ASUSTeK Computer Inc. P4P800 Mainboard
        Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 217
        Memory at feaf8000 (32-bit, non-prefetchable) [size=16K]
        I/O ports at d800 [size=256]
        Capabilities: [48] Power Management version 2
        Capabilities: [50] Vital Product Data

Comment 4 Jan Bernhardt 2004-12-09 09:01:28 UTC

Exactly the same problem and hardware here. See bug id 142218.

Comment 5 David Zeuthen 2004-12-09 16:01:16 UTC

*** Bug 142218 has been marked as a duplicate of this bug. ***

Comment 6 David Zeuthen 2004-12-09 16:04:14 UTC

Can anyone give me a backtrace? You'll need the hal-debuginfo package
installed; then run

# gdb /usr/sbin/hald

and enter the command 'run --daemon=no --verbose=yes'. When the
crash occurs, enter the command 'backtrace' and post the output.

Also see http://fedora.linux.duke.edu/wiki/index.cgi/StackTraces

Thanks,
David

Comment 7 Serg Oskin 2004-12-09 16:43:58 UTC

Created attachment 108225
backtrace of hald

Comment 8 David Zeuthen 2004-12-09 17:18:40 UTC

Serg: Thanks for the backtrace. Please also attach the output of
'tree /sys' (you'll need the tree package for that command).

Comment 9 Serg Oskin 2004-12-09 17:34:41 UTC

Created attachment 108239
tree /sys

This information may also help:

# mii-tool
SIOCGMIIPHY on 'eth0' failed: Bad address
no MII interfaces found
#

Comment 10 David Zeuthen 2004-12-10 21:46:37 UTC

Interesting. I fixed one small glitch; I'm not sure it will solve the
issue, but please try out these RPMs

 http://people.redhat.com/davidz/hal-testing2/

and let me know; thanks.

Comment 11 Benjamin Lebsanft 2004-12-10 22:12:54 UTC

still doesn't work for me

Comment 12 Anthony Rumble 2004-12-10 22:19:41 UTC

ditto

Comment 13 Fernando J. Leal 2004-12-10 22:59:39 UTC

After installing the RPM above (hal-0.4.2.cvs20041210-1.i386.rpm), the
problem persists: the same segmentation fault occurs when running "hald
--verbose=yes --daemon=no".

Comment 14 David Zeuthen 2004-12-10 23:08:44 UTC

Hmm, try running '/usr/bin/valgrind --tool=memcheck /usr/sbin/hald
--daemon=no --verbose=yes' and post the output.

Comment 15 Marek Kassur 2004-12-10 23:52:20 UTC

Here is the interesting part:

00:49:05.362 [I] linux/osspec.c:793: handling /sys/class/net/eth1 net
00:49:05.450 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address
00:49:05.451 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address
00:49:05.452 [W] linux/net_class_device.c:257: Error reading link info
00:49:05.458 [E] linux/net_class_device.c:137: SIOCGMIIREG on eth1 failed: Bad address
00:49:05.459 [W] linux/net_class_device.c:193: Error reading rate info
==4082==
==4082== Invalid read of size 2
==4082==    at 0x3E6550: (within /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x3E7CBB: g_signal_emit_valist (in /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x3E7F59: g_signal_emit (in /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x804E1DE: ??? (device.c:742)
==4082==  Address 0x5614 is not stack'd, malloc'd or (recently) free'd
==4082==
==4082== Process terminating with default action of signal 11 (SIGSEGV)
==4082==  Access not within mapped region at address 0x5614
==4082==    at 0x3E6550: (within /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x3E7CBB: g_signal_emit_valist (in /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x3E7F59: g_signal_emit (in /usr/lib/libgobject-2.0.so.0.400.8)
==4082==    by 0x804E1DE: ??? (device.c:742)
==4082==
==4082== ERROR SUMMARY: 92 errors from 24 contexts (suppressed: 21 from 1)
==4082== malloc/free: in use at exit: 3688412 bytes in 13978 blocks.
==4082== malloc/free: 63738 allocs, 49760 frees, 15613063 bytes allocated.
==4082== For a detailed leak analysis,  rerun with: --leak-check=yes
==4082== For counts of detected errors, rerun with: -v

Comment 16 David Zeuthen 2004-12-12 17:59:09 UTC

*** Bug 142671 has been marked as a duplicate of this bug. ***

Comment 17 Jason Grant 2004-12-14 03:24:32 UTC

I have this problem too, on the same hardware, after installing the
fc3 updates.

Just one more observation - even when I roll back HAL to the version
shipped with the release of fc3 (0.4.0-10), I still get the
segmentation fault.  This occurs whether I use the original kernel, or
the one currently available as a fc3 update (2.6.9-1.681_FC3smp). 
This makes me wonder whether the bug has been introduced via one of
the other fc3 updates.

Comment 18 Jason Grant 2004-12-14 04:40:27 UTC

Here are the fc3 packages that were installed on my host just prior to
the onset of HAL seg faults:

[Fri Dec 10 21:28:16 2004] up2date installing packages:
['Omni-0.9.2-1.1', 'Omni-foomatic-0.9.2-1.1', 'gaim-1.1.0-0.FC3',
'glib2-2.4.8-1.fc3', 'glib2-devel-2.4.8-1.fc3', 'gtk2-2.4.14-1.fc3',
'gtk2-devel-2.4.14-1.fc3', 'libpng-1.2.8-1.fc3',
'libpng-devel-1.2.8-1.fc3', 'libpng10-1.0.18-1.fc3',
'libpng10-devel-1.0.18-1.fc3', 'nfs-utils-1.0.6-44', 'rhpl-0.148.1-2',
'rsh-0.17-24.1', 'selinux-policy-targeted-1.17.30-2.39',
'shadow-utils-4.0.3-56', 'udev-039-10.FC3.5',
'wireless-tools-27-0.pre25.3', 'xorg-x11-6.8.1-12.FC3.21',
'xorg-x11-Mesa-libGL-6.8.1-12.FC3.21',
'xorg-x11-Mesa-libGLU-6.8.1-12.FC3.21',
'xorg-x11-deprecated-libs-6.8.1-12.FC3.21',
'xorg-x11-deprecated-libs-devel-6.8.1-12.FC3.21',
'xorg-x11-devel-6.8.1-12.FC3.21',
'xorg-x11-font-utils-6.8.1-12.FC3.21',
'xorg-x11-libs-6.8.1-12.FC3.21', 'xorg-x11-tools-6.8.1-12.FC3.21',
'xorg-x11-twm-6.8.1-12.FC3.21', 'xorg-x11-xauth-6.8.1-12.FC3.21',
'xorg-x11-xfs-6.8.1-12.FC3.21'

Comment 19 Marek Kassur 2004-12-14 16:16:13 UTC

/usr/lib/libgobject-2.0.so.0.400.8 is where the segfault occurred; it is
part of glib2 (glib2-2.4.8-1.fc3).

https://www.redhat.com/archives/fedora-announce-list/2004-December/msg00055.html

Comment 20 Jason Grant 2004-12-15 03:40:52 UTC

I just rolled back to the glib2 version that was released with fc3
(2.4.7-1).  HAL is no longer crashing on startup.

(I did the downgrade with 'rpm -U --oldpackage glib2-2.4.7-1*rpm'; this
leaves dangling symlinks in /usr/lib that need to be fixed manually.)

Comment 21 Mike Voxx 2004-12-15 05:51:00 UTC

So what do we do from here? Where do we get updates on the status of
this problem?

Comment 22 Jason Grant 2004-12-15 11:45:45 UTC

yes, I'm unclear on whether it's a glib2 bug, or whether the crash
occurs in glib2 because HAL is passing it junk.

Comment 23 David Zeuthen 2004-12-15 17:54:03 UTC

Matthias, do you know of any regressions between glib2-2.4.7-1
and glib2-2.4.8-1?

Comment 24 Marek Kassur 2004-12-16 22:40:01 UTC

Created attachment 108756
Revert patch of gobject/gsignal.c

This small patch reverts gsignal.c to the 2.4.7 version, which makes HAL work
again. The bug may be hiding somewhere in that change, but where?

Comment 25 David Zeuthen 2004-12-17 02:26:39 UTC

*** Bug 143176 has been marked as a duplicate of this bug. ***

Comment 26 Marek Kassur 2004-12-17 02:27:15 UTC

Created attachment 108771
More precise backtrace of hald

static inline void
handler_ref (Handler *handler)
{
  g_return_if_fail (handler->ref_count > 0);
^^^^^^^
It segfaults here (if I read it right): glib-2.4.8/gobject/gsignal.c:564

Comment 27 Matthias Clasen 2004-12-17 03:21:29 UTC

David, I wasn't aware of any problems with the gsignal optimization in
2.4.8 so far. Looking at the patch, nothing obvious jumps out. Can you
reproduce the segfault? It might be worth running hald under valgrind
to see if the handler list becomes corrupted at some point.

I'll join your efforts to debug this on Monday, as I won't be there
tomorrow.

Comment 28 Matthias Clasen 2004-12-17 03:23:18 UTC

One further question: is hald using threads, so that reentrancy issues
could be involved?

Comment 29 David Zeuthen 2004-12-17 03:34:33 UTC

Hi Matthias, 

No, hald is not using threads; there is some reentrancy involved
though due to the rather asynchronous nature of how hald works. There
is also a valgrind trace in comment 15.

Btw, the bug only seems to occur with the sk98lin network driver. I
suspect it's writing too much data into a struct allocated on the
stack when doing an ioctl(), thereby corrupting memory. I will try to
dig into the driver source to see what is happening. I'll also try to
allocate the struct for the ioctl on the heap to see if that makes the
crash go away.
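
The experiment described above can be sketched roughly as follows. This is a minimal illustration and not the actual hal patch: the function name read_mii_reg_heap, the local struct demo_mii_data (laid out like the kernel's struct mii_ioctl_data) and the DEMO_IFREQ_SLACK padding size are assumptions made for this example. The request struct is allocated on the heap with spare room behind it, so that if a driver really did write past the end of struct ifreq, the excess would land in the padding instead of on the caller's stack.

```c
#define _DEFAULT_SOURCE          /* for struct ifreq with strict -std= modes */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/sockios.h>       /* SIOCGMIIPHY, SIOCGMIIREG */

/* Local mirror of the kernel's struct mii_ioctl_data (four 16-bit fields). */
struct demo_mii_data {
    uint16_t phy_id;
    uint16_t reg_num;
    uint16_t val_in;
    uint16_t val_out;
};

#define DEMO_MII_BMSR    0x01    /* basic mode status register */
#define DEMO_IFREQ_SLACK 256     /* hypothetical padding behind the ifreq */

/* Reads one MII register of `ifname` into *out.
 * Returns 0 on success, -1 on any failure (*out is left untouched). */
int read_mii_reg_heap(const char *ifname, int reg, uint16_t *out)
{
    struct ifreq *ifr;
    struct demo_mii_data *mii;
    int fd;
    int rc = -1;

    assert(ifname != NULL && out != NULL);

    /* Heap allocation, zero-filled, with slack after the struct. */
    ifr = calloc(1, sizeof(*ifr) + DEMO_IFREQ_SLACK);
    if (ifr == NULL)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        free(ifr);
        return -1;
    }

    /* calloc() zeroed the buffer, so the name stays NUL-terminated. */
    strncpy(ifr->ifr_name, ifname, IFNAMSIZ - 1);
    mii = (struct demo_mii_data *)&ifr->ifr_data;

    /* First ask for the PHY address, then read the requested register. */
    if (ioctl(fd, SIOCGMIIPHY, ifr) == 0) {
        mii->reg_num = (uint16_t)reg;
        if (ioctl(fd, SIOCGMIIREG, ifr) == 0) {
            *out = mii->val_out;
            rc = 0;
        }
    }

    close(fd);
    free(ifr);
    return rc;
}
```

A caller would use it as read_mii_reg_heap("eth1", DEMO_MII_BMSR, &value). On an interface whose driver rejects the request, as in the "Bad address" log lines earlier in this report, the function just returns -1.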

Comment 30 Matthias Clasen 2004-12-17 03:50:18 UTC

Ok, my next idea would be to write a function to check the integrity
of the handler list and call that from suitable places to catch when
and how it might get corrupted.
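
Such an integrity check might look roughly like the sketch below. It does not use GLib's real Handler struct: DemoHandler, its fields, DEMO_MAX_HANDLERS and the function names are hypothetical stand-ins chosen for this example, and the "implausibly small pointer" test is only a heuristic, motivated by the invalid read at address 0x5614 in the valgrind trace from comment 15.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a signal handler node; NOT GLib's Handler. */
typedef struct DemoHandler DemoHandler;
struct DemoHandler {
    DemoHandler *next;
    DemoHandler *prev;
    unsigned int ref_count;
};

#define DEMO_MAX_HANDLERS 4096   /* assumed upper bound on list length */

/* Heuristic: user-space objects do not normally live in the first 64 KiB,
 * so a non-NULL "pointer" that small is most likely an overwritten field. */
static int demo_ptr_plausible(const DemoHandler *p)
{
    return p == NULL || (uintptr_t)p >= 0x10000u;
}

/* Returns 1 if the list looks intact, 0 at the first inconsistency. */
int demo_handler_list_is_sane(const DemoHandler *head, size_t max_len)
{
    const DemoHandler *prev = NULL;
    const DemoHandler *h = head;
    size_t n = 0;

    if (!demo_ptr_plausible(h))
        return 0;

    while (h != NULL) {
        if (++n > max_len)                 /* runaway or circular list */
            return 0;
        if (h->ref_count == 0)             /* handler already released */
            return 0;
        if (h->prev != prev)               /* broken back-link */
            return 0;
        if (!demo_ptr_plausible(h->next))  /* do not follow a bogus pointer */
            return 0;
        prev = h;
        h = h->next;
    }
    return 1;
}

/* Debug hook to call from suitable places: aborts at the first call that
 * sees a damaged list (compiled out when NDEBUG is defined). */
void demo_handler_list_check(const DemoHandler *head)
{
    assert(demo_handler_list_is_sane(head, DEMO_MAX_HANDLERS));
}
```

Calling demo_handler_list_check() before and after each suspect operation, for example around the ioctl() calls, would narrow down which step damages the list, under the assumption that the corruption shows up in one of the checked fields.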

Comment 31 Jason Grant 2004-12-20 08:51:36 UTC

David,

Relating to your idea about struct overflow on an ioctl(), my host has
three ethernet controllers, as shown below (output from lspci). The
3Com one is on the motherboard, and it is *disabled* under linux
(i.e. not listed by ifconfig), so there's no network traffic going
through it.

I'm hoping that this observation will save you some time, since it
means that any such leak is occurring even without transmission of
ethernet data.

02:05.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T
[Marvell] (rev 12)
02:09.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)
02:0a.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)

Comment 32 David Zeuthen 2005-01-03 21:19:04 UTC

I've tried the approach I mentioned in comment 29. Please try the
RPMs from

 http://people.redhat.com/davidz/cvs20050103/

If you're not on x86 you may rebuild from the SRPM.

Thanks,
David

Comment 33 Marek Kassur 2005-01-03 21:37:36 UTC

Yes, it works for me.

Big Thanks,
Marek.

Comment 34 Igor Miletic 2005-01-03 21:42:48 UTC

Works fine here too (glib2-2.6).

Thanks,
Igor

Comment 35 Serg Oskin 2005-01-03 21:59:54 UTC

It works for me too.

Thanks,
Serg.

Comment 36 Jamie Zawinski 2005-01-04 03:45:28 UTC

Yes, this makes hald stop crashing for me.

I still can't seem to talk to either of my CF card readers, though,
but I imagine that's a different bug?  When I plug either of them
in, I get this in syslog, and nothing shows up in /media/.

I have seen both of these readers work (erratically) on RH9.
Generally they'd work once, then if I tried to use them again
the next day, my only option would be to reboot first.


Dazzle USB 2.0 reader (unsure of model number):

  Jan  3 19:35:39 grendel kernel: usb 5-1: new full speed USB device using address 52
  Jan  3 19:35:39 grendel kernel: usb 5-1: device not accepting address 52, error -71
  Jan  3 19:35:39 grendel kernel: usb 5-1: new full speed USB device using address 53
  Jan  3 19:35:40 grendel kernel: usb 5-1: device not accepting address 53, error -71

SanDisk ImageMate SDDR-31 USB 1.0 reader:

  Jan  3 19:37:57 grendel kernel: usb 5-1: new full speed USB device using address 56
  Jan  3 19:37:58 grendel kernel: usb 5-1: device not accepting address 56, error -71
  Jan  3 19:37:58 grendel kernel: usb 5-1: new full speed USB device using address 57
  Jan  3 19:37:58 grendel kernel: usb 5-1: device not accepting address 57, error -71

The SanDisk works fine in an (elderly, USB-1) Macintosh; the Dazzle
doesn't.

Comment 37 Jason Grant 2005-01-04 07:45:48 UTC

Fixed for me.  Now using glib2-2.4.8-1 OK.  Thanks.

Comment 38 Benjamin Lebsanft 2005-01-05 12:31:07 UTC

hal daemon now works for me, but my cf card reader is not recognized

Comment 39 David Zeuthen 2005-01-13 19:08:22 UTC

This fix is in hal-0.4.5, available from Rawhide, and it will also
appear as an FC3 update. Closing.
