Red Hat Bugzilla – Bug 69558
Boot 'fsck' of ext2 or ext3 root FS before 3C579-TP init corrupts kernel, breaks 3c509.c init
Last modified: 2005-10-31 17:00:50 EST
Description of Problem:
Have 3C579-TP card, which is supported by the 3c509.c driver.
If too much time elapses between the kernel initialization of
the driver and the activation of the card via 'ifconfig',
the driver fails and removes the eth0 device from the network
device list. Whenever 'fsck' is run during boot, the extra
delay causes this to happen. When 'fsck' doesn't run, the
card comes up fine. Created work-around by inserting
'ifconfig eth0 up' in '/etc/rc.d/rc.sysinit' before the
'fsck' step. Contacted author of driver (Donald Becker
<firstname.lastname@example.org>) and he indicated that the bug is specific
to the RedHat distribution and that RedHat should supply
Version-Release number of selected component (if applicable):
#define DRV_NAME "3c509"
#define DRV_VERSION "1.18c"
#define DRV_RELDATE "1Mar2002"
Remove or relocate '/etc/sysconfig/network-scripts/ifcfg-eth0'
and reboot system. Wait two or three minutes and then bring
up 'eth0' by hand with 'ifconfig eth0 188.8.131.52 netmask 255.255.255.0 up'.
Happiness and Light
Unable to handle kernel NULL pointer dereference at virtual address 00000800
*pde = 00000000
EIP: 0010:[<c02412d0>] Not tainted
EIP is at (2.4.18-3custom)
eax: 00000800 ebx: c022d884 ecx: 00000006 edx: 0000400e
esi: 0000400e edi: 00004000 ebp: c022d880 esp: c9f99ec0
ds: 0018 es: 0018 ss: 0018
Process ifconfig (pid: 880, stackpage=c9f99000)
Stack: c0198dfc 00004000 00000014 0000400e c022d880 00004000 00000000 c0198213
c022d880 c022d880 00000000 00001043 c01c3b02 c022d880 c022d880 00001002
c01c4a12 c022d880 00000000 c9f99f59 cb264784 cb264760 c01ef3b7 c022d880
Call Trace: [<c0198dfc>]
Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Created attachment 66445 [details]
hardware description of system where problem occurs
Created attachment 66446 [details]
Created attachment 66447 [details]
kernel symbol map
Created attachment 66448 [details]
Created attachment 66449 [details]
Created attachment 66464 [details]
assembler listing for 3c509.o as compiled on target system, object matches original from kernel build
Created attachment 66465 [details]
command used to compile 3c509.c for assembler listing and matching object
Created attachment 66466 [details]
source for 3c509.c
Could you please enable CONFIG_KALLSYMS and try again? That way the oops gets
nicely (and accurately) decoded automatically...
Will doing this provide anything other than a symbolic
stack traceback? I've already determined that the
trap occurred on the initial "pushl" instruction of
'read_eeprom' which is being called from el3_up at
0C5C in the assembler listing. I can complete the
traceback with much less trouble than recompiling
the kernel and recreating the problem.
Created attachment 66484 [details]
ksymoops rendering of Oops traceback
EVIL BUG ALERT!!
On a hunch, I went back and tried bringing up the
3C579 after a ten minute post-boot delay, but without
the 'fsck root' step (which I had forced to always run).
No problemo. This of course means that this is really
some hideous and evil bug in the JFS file system
code that causes kernel memory corruption when 'fsck'
is run. Good luck finding it. I don't have time
for something this nasty and which I can work around.
Note that the problem isn't dependent on my custom-
kernel build. It appears with the default RedHat
kernel as well.
It means nothing of the sort! It is much more likely to be a dangling pointer
somewhere --- a pointer to freed memory, which just happens to work if the
memory hasn't been used for something else in the meantime, but which fails if
you run a large task like fsck which is almost guaranteed to reuse the freed
memory for disk buffering.
A valid point. However the instruction which is trapping
is the initial 'pushl' in a function, not function
body code making references via a pointer.
I'm still scratching my head over this as the register
contents all seem to be valid in light of the 'pushl %ebx'
that trapped. Perhaps 'Oops' has mislabeled a stack
overflow trap as a GPF. 'Oops' references the %eax
value, but this value is not in play at the point of
On balance I still think it's the JFS code, which
is new and still labeled EXPERIMENTAL rather than the
3c509 code which is about as old and mature as it gets.
I'm now a bit less certain about the cause of this bug. I
discovered that there is almost no difference in the structures
of the 'ext2' and 'ext3' file systems (just that 'ext3' has a
"journal" section) and that one can bring up an 'ext3' file
system as 'ext2' if desired. I tried this and the kernel blows
up even worse with an infinite "Oops" loop.
It could be that the perpetrator is really the 3c509 driver, or
it could be that any 'fsck' activity at all at boot causes
corruption. It would take a lot of effort to determine which,
and I don't have the time.
OK, now I'm thinking that Red Hat is a bit on the lame side.
While wrestling with another kernel bug (which I didn't report here
in part because of the lameness of the support), I discover that,
wouldn't you know it, there is a major bug-fix release of the
7.3 distribution kernel [2.4.18-5] available! Sure enough this
bug-fix kernel fixes both the other problem and this problem.
What has me really annoyed is that Red Hat has gone to seemingly
great lengths to hide their fix-patch download section (under a
minuscule and deceiving "Errata" link on the home page), and
that nobody in this forum thought of suggesting I try the latest
kernel before spending a ridiculous amount of time qualifying
this bug. You guys are *not* IBM. You can say a lot of bad
things about IBM, but they definitely have their act together
when it comes to support. Get a clue.
We publish the errata, and keep complete archives of them. Errata are posted to
the red hat mailing lists and also a complete list of current updates is
available from the "up2date" command.
If you had trouble finding the errata then that's something we need to look at to
see if we can make it more obvious and easier to do.
You can start by calling it "DOWNLOAD FIX UPDATES" and making it
prominent on the Red Hat home page. "Errata?" That's what
wussy journalists call it when they get their facts wrong in an
article and retract it in ultra-fine print on the back page.
Programs have what we call BUGS. BUGS are repaired by BUG FIXES
and PATCHES, not "errata." It would also be helpful to remind
people to check for the latest patches in the report submission
forms before they waste hours on a bug report. Since I bought
the latest distribution of Red Hat, it never occurred to me that
there would have been two kernel fix releases and one 'gdb'
update since my CDs were mastered in late April, not to mention
a dozen other assorted though less important items. And as I
pointed out above, those who support this forum should mention
it when an update is available. The versions I'm working with
are clearly presented at the beginning of the report. The
2.4.18-5 kernel update to 7.3 is over a month old and should be
known to people who support kernel components. I've been
working with AIX for years. You can't even begin to submit a
problem report with IBM until you certify you are running the
latest released version of the relevant component.
1) You run a self compiled kernel, which is not supported
2) Bugzilla is NOT a support forum. It's a bug report tool.
3) Your bug did not look like anything fixed in errata. And yes we try to
support/help even people who for whatever reason don't want to upgrade to the
latest errata kernels.