Bug 69558

Summary: Boot 'fsck' of ext2 or ext3 root FS before 3C579-TP init corrupts kernel, breaks 3c509.c init
Product: [Retired] Red Hat Linux
Reporter: starlight
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: low
Version: 7.3
CC: alan
Target Milestone: ---
Target Release: ---
Hardware: i386
OS: Linux
Doc Type: Bug Fix
Last Closed: 2002-07-25 06:49:27 UTC
Attachments (description; flags):
hardware description of system where problem occurs; none
kernel configuration; none
kernel symbol map; none
kernel traceback; none
Author's reply; none
assembler listing for 3c509.o as compiled on target system, object matches original from kernel build; none
command used to compile 3c509.c for assembler listing and matching object; none
source for 3c509.c; none
ksymoops rendering of Oops traceback; none

Description starlight 2002-07-23 03:46:53 UTC
Description of Problem:

Have 3C579-TP card, which is supported by the 3c509.c driver.
If too much time elapses between the kernel initialization of
the driver and the activation of the card via 'ifconfig',
the driver fails and removes the eth0 device from the network
device list.  Whenever 'fsck' is run during boot, the extra
delay causes this to happen.  When 'fsck' doesn't run, the
card comes up fine.  Created work-around by inserting
'ifconfig eth0 up' in '/etc/rc.d/rc.sysinit' before the
'fsck' step.  Contacted author of driver (Donald Becker
<becker>) and he indicated that the bug is specific
to the RedHat distribution and that RedHat should supply
the fix.

Version-Release number of selected component (if applicable):

/usr/src/linux-2.4.18-3/drivers/net/3c509.c

#define DRV_NAME    "3c509"
#define DRV_VERSION "1.18c"
#define DRV_RELDATE "1Mar2002"

How Reproducible:

Remove or relocate '/etc/sysconfig/network-scripts/ifcfg-eth0'
and reboot system.  Wait two or three minutes and then bring
up 'eth0' by hand with 'ifconfig eth0 192.1.1.10 netmask 255.255.255.0 up'.

Expected Results:

Happiness and Light

Actual Results:

Unable to handle kernel NULL pointer dereference at virtual address 00000800
 printing eip:
c02412d0
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<c02412d0>]    Not tainted
EFLAGS: 00010246

EIP is at  (2.4.18-3custom)
eax: 00000800   ebx: c022d884   ecx: 00000006   edx: 0000400e
esi: 0000400e   edi: 00004000   ebp: c022d880   esp: c9f99ec0
ds: 0018   es: 0018   ss: 0018
Process ifconfig (pid: 880, stackpage=c9f99000)
Stack: c0198dfc 00004000 00000014 0000400e c022d880 00004000 00000000 c0198213
       c022d880 c022d880 00000000 00001043 c01c3b02 c022d880 c022d880 00001002
       c01c4a12 c022d880 00000000 c9f99f59 cb264784 cb264760 c01ef3b7 c022d880
Call Trace: [<c0198dfc>]
[<c0198213>]
[<c01c3b02>]
[<c01c4a12>]
[<c01ef3b7>]
[<c01bdcee>]
[<c013ec77>]
[<c0108823>]

Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Comment 1 starlight 2002-07-23 03:52:16 UTC
Created attachment 66445 [details]
hardware description of system where problem occurs

Comment 2 starlight 2002-07-23 03:54:37 UTC
Created attachment 66446 [details]
kernel configuration

Comment 3 starlight 2002-07-23 03:59:48 UTC
Created attachment 66447 [details]
kernel symbol map

Comment 4 starlight 2002-07-23 04:00:48 UTC
Created attachment 66448 [details]
kernel traceback

Comment 5 starlight 2002-07-23 04:01:47 UTC
Created attachment 66449 [details]
Author's reply

Comment 6 starlight 2002-07-23 07:09:20 UTC
Created attachment 66464 [details]
assembler listing for 3c509.o as compiled on target system, object matches original from kernel build

Comment 7 starlight 2002-07-23 07:10:22 UTC
Created attachment 66465 [details]
command used to compile 3c509.c for assembler listing and matching object

Comment 8 starlight 2002-07-23 07:11:02 UTC
Created attachment 66466 [details]
source for 3c509.c

Comment 9 Arjan van de Ven 2002-07-23 07:12:20 UTC
Could you please enable CONFIG_KALLSYMS and try again? That way the oops gets
decoded nicely (and accurately) and automatically.

Comment 10 starlight 2002-07-23 07:45:12 UTC
Will doing this provide anything other than a symbolic
stack traceback?  I've already determined that the
trap occurred on the initial "pushl" instruction of
'read_eeprom' which is being called from el3_up at
0C5C in the assembler listing.  I can complete the
traceback with much less trouble than recompiling
the kernel and recreating the problem.
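
Completing a traceback by hand, as described above, amounts to finding for each Call Trace address the symbol with the greatest start address that does not exceed it, then reporting the offset into that symbol. What follows is only a minimal sketch in C, assuming a symbol table already sorted by address. The names ksym, ksym_lookup and example_map, and every address in the table, are invented for illustration and are not taken from the attached symbol map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One symbol-map entry: start address and symbol name. */
struct ksym {
    uint32_t addr;
    const char *name;
};

/*
 * Return the entry with the greatest start address <= addr,
 * or NULL if addr lies below the first entry.
 * The table must be sorted by ascending address.
 */
const struct ksym *ksym_lookup(const struct ksym *table, size_t n,
                               uint32_t addr)
{
    const struct ksym *best = NULL;
    size_t lo = 0, hi = n;

    assert(table != NULL || n == 0);

    /* Binary search for the last entry that starts at or before addr. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].addr <= addr) {
            best = &table[mid];
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return best;
}

/* Hypothetical excerpt: names and addresses are made up for the example. */
static const struct ksym example_map[] = {
    { 0xc0100000u, "example_text_start" },
    { 0xc0198000u, "example_func_a" },
    { 0xc0198d00u, "example_func_b" },
};
```

With this invented table, ksym_lookup(example_map, 3, 0xc0198d3c) returns the example_func_b entry, and subtracting its start address gives an offset of 0x3c. An address below the first entry returns NULL.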

Comment 11 starlight 2002-07-23 08:28:12 UTC
Created attachment 66484 [details]
ksymoops rendering of Oops traceback

Comment 12 starlight 2002-07-23 09:09:31 UTC
EVIL BUG ALERT!!

On a hunch, I went back and tried bringing up the
3C579 after a ten minute post-boot delay, but without
the 'fsck root' step (which I had had forced always on).
No problemo.  This of course means that this is really
some hideous and evil bug in the JFS file system
code that causes kernel memory corruption when 'fsck'
is run.  Good luck finding it.  I don't have time
for something this nasty and which I can work around.
Note that the problem isn't dependent on my custom-
kernel build.  It appears with the default RedHat
kernel as well.


Comment 13 Stephen Tweedie 2002-07-23 16:42:09 UTC
It means nothing of the sort!  It is much more likely to be a dangling pointer
somewhere --- a pointer to freed memory, which just happens to work if the
memory hasn't been used for something else in the mean time, but which fails if
you run a large task like fsck which is almost guaranteed to reuse the freed
memory for disk buffering.
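
The dangling-pointer scenario described in this comment can be imitated in a few lines. The sketch below is a toy model only: a one-slot "allocator" backed by a static array stands in for kernel memory, so the stale read is well-defined C rather than a real use-after-free. The names toy_alloc, toy_free and stale_read are hypothetical, and nothing here is taken from the 3c509 driver or the kernel allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 16

/* One-slot pool standing in for the kernel's free memory. */
static unsigned char slot[SLOT_SIZE];
static bool slot_in_use = false;

/* Hand out the single slot, or NULL if it is already taken. */
void *toy_alloc(void)
{
    if (slot_in_use)
        return NULL;
    slot_in_use = true;
    return slot;
}

/* Mark the slot free again; the caller's pointer is NOT cleared. */
void toy_free(void *p)
{
    assert(p == slot);
    slot_in_use = false;
}

/*
 * Return the byte a stale pointer sees after the memory was freed.
 * If run_big_task is true, another user reallocates and overwrites
 * the slot first, as a large task such as fsck might.
 */
unsigned char stale_read(bool run_big_task)
{
    unsigned char *driver_state = toy_alloc();
    assert(driver_state != NULL);
    driver_state[0] = 0x42;      /* "driver" stores its state */
    toy_free(driver_state);      /* bug: pointer is kept after the free */

    if (run_big_task) {
        unsigned char *buf = toy_alloc();   /* receives the same memory */
        assert(buf != NULL);
        memset(buf, 0, SLOT_SIZE);          /* e.g. reused as a disk buffer */
        toy_free(buf);
    }

    /* Stale access: still "works" unless the slot was reused. */
    return driver_state[0];
}
```

In this model stale_read(false) still returns 0x42, while stale_read(true) returns 0: the same bug is invisible until something else reuses the freed memory.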

Comment 14 starlight 2002-07-23 17:26:24 UTC
A valid point.  However, the instruction which is trapping
is the initial 'pushl' of a function, not function-body
code making references via a pointer.
I'm still scratching my head over this as the register
contents all seem to be valid in light of the 'pushl %ebx'
that trapped.  Perhaps 'Oops' has mislabeled a stack
overflow trap as a GPF.  'Oops' references the %eax
value, but this value is not in play at the point of
the trap.

On balance I still think it's the JFS code, which
is new and still labeled EXPERIMENTAL rather than the
3c509 code which is about as old and mature as it gets.
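
On the question of how the trap was classified: the number after "Oops:" is the i386 page-fault error code, and its low three bits can be decoded by hand. The sketch below assumes the standard i386 bit layout (bit 0 protection violation vs. page not present, bit 1 write vs. read, bit 2 user vs. kernel mode) and 4 KiB pages. The names fault_info, decode_fault and print_fault are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define I386_PAGE_SIZE 4096u   /* assumed 4 KiB pages */

/* Decoded view of the page-fault error code and fault address. */
struct fault_info {
    bool protection;  /* bit 0: 1 = protection violation, 0 = page not present */
    bool write;       /* bit 1: 1 = write access, 0 = read access */
    bool user_mode;   /* bit 2: 1 = fault raised in user mode, 0 = kernel mode */
    bool first_page;  /* fault address lies inside the first page */
};

/* Split the error code into its three low flag bits. */
struct fault_info decode_fault(uint32_t error_code, uint32_t fault_addr)
{
    struct fault_info f;

    f.protection = (error_code & 0x1u) != 0;
    f.write      = (error_code & 0x2u) != 0;
    f.user_mode  = (error_code & 0x4u) != 0;
    f.first_page = fault_addr < I386_PAGE_SIZE;
    return f;
}

/* Print the decoded fields in a human-readable form. */
void print_fault(const struct fault_info *f)
{
    assert(f != NULL);
    printf("%s, %s access, %s mode, address %s the first page\n",
           f->protection ? "protection violation" : "page not present",
           f->write ? "write" : "read",
           f->user_mode ? "user" : "kernel",
           f->first_page ? "inside" : "outside");
}
```

Applied to the values in this report, decode_fault(0x0002, 0x00000800) yields a kernel-mode write to a page that is not present, at an address inside the first page. Under that reading the oops records a page fault rather than a general protection fault, and the reported fault address equals the %eax value shown in the register dump.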


Comment 15 starlight 2002-07-25 06:46:35 UTC
I'm now a bit less certain about the cause of this bug.  I 
discovered that there is almost no difference in the structures 
of the 'ext2' and 'ext3' file systems (just that 'ext3' has a
"journal" section) and that one can bring up an 'ext3' file
system as 'ext2' if desired.  I tried this and the kernel blows
up even worse with an infinite "Oops" loop.

It could be that the perpetrator is really the 3c509 driver, or 
it could be that any 'fsck' activity at all at boot causes 
corruption.  It would take a lot of effort to determine which,
and I don't have the time.


Comment 16 starlight 2002-07-26 21:35:43 UTC
OK, now I'm thinking that Red Hat is a bit on the lame side.
While wrestling with another kernel bug (which I didn't report here
in part because of the lameness of the support), I discover that,
wouldn't you know it, there is a major bug-fix release of the
7.3 distribution kernel [2.4.18-5] available!  Sure enough this
bug-fix kernel fixes both the other problem and this problem.

What has me really annoyed is that Red Hat has gone to seemingly
great lengths to hide their fix-patch download section (under a
minuscule and deceiving "Errata" link on the home page), and
that nobody in this forum thought of suggesting I try the latest
kernel before spending a ridiculous amount of time qualifying
this bug.  You guys are *not* IBM.  You can say a lot of bad
things about IBM, but they definitely have their act together
when it comes to support.  Get a clue.


Comment 17 Alan Cox 2002-07-27 00:37:22 UTC
We publish the errata, and keep complete archives of them. Errata are posted to
the red hat mailing lists and also a complete list of current updates is
available from the "up2date" command.

If you had trouble finding the errata then that's something we need to look at to
see if we can make it more obvious and easier to do.

Comment 18 starlight 2002-07-27 09:25:46 UTC
You can start by calling it "DOWNLOAD FIX UPDATES" and making it 
prominent on the Red Hat home page.  "Errata?"  That's what 
wussy journalists call it when they get their facts wrong in an 
article and retract it in ultra-fine print on the back page. 
Programs have what we call BUGS.  BUGS are repaired by BUG FIXES 
and PATCHES, not "errata."  It would also be helpful to remind 
people to check for the latest patches in the report submission 
forms before they waste hours on a bug report.  Since I bought 
the latest distribution of Red Hat, it never occurred to me that 
there would have been two kernel fix releases and one 'gdb' 
update since my CDs were mastered in late April, not to mention 
a dozen other assorted though less important items.  And as I 
pointed out above, those who support this forum should mention 
it when an update is available.  The versions I'm working with 
are clearly presented at the beginning of the report.  The 
2.4.18-5 kernel update to 7.3 is over a month old and should be 
known to people who support kernel components.  I've been 
working with AIX for years.  You can't even begin to submit a 
problem report with IBM until you certify you are running the 
latest released version of the relevant component.


Comment 19 Arjan van de Ven 2002-07-28 08:44:49 UTC
Three things:
1) You run a self-compiled kernel, which is not supported
2) Bugzilla is NOT a support forum. It's a bug-report tool.
3) Your bug did not look like anything fixed in errata. And yes we try to
support/help even people who for whatever reason don't want to upgrade to the
latest errata kernels.