Description of Problem:
Have 3C579-TP card, which is supported by the 3c509.c driver. If too much time elapses between the kernel initialization of the driver and the activation of the card via 'ifconfig', the driver fails and removes the eth0 device from the network device list. Whenever 'fsck' is run during boot, the extra delay causes this to happen. When 'fsck' doesn't run, the card comes up fine. Created work-around by inserting 'ifconfig eth0 up' in '/etc/rc.d/rc.sysinit' before the 'fsck' step. Contacted author of driver (Donald Becker <becker>) and he indicated that the bug is specific to the RedHat distribution and that RedHat should supply the fix.

Version-Release number of selected component (if applicable):
/usr/src/linux-2.4.18-3/drivers/net/3c509.c
    #define DRV_NAME "3c509"
    #define DRV_VERSION "1.18c"
    #define DRV_RELDATE "1Mar2002

How Reproducible:
Remove or relocate '/etc/sysconfig/network-scripts/ifcfg-eth0' and reboot system. Wait two or three minutes and then bring up 'eth0' by hand with 'ifconfig eth0 192.1.1.10 netmask 255.255.255.0 up'.

Expected Results:
Happiness and Light

Actual Results:
Unable to handle kernel NULL pointer dereference at virtual address 00000800
 printing eip:
c02412d0
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c02412d0>] Not tainted
EFLAGS: 00010246
EIP is at (2.4.18-3custom)
eax: 00000800 ebx: c022d884 ecx: 00000006 edx: 0000400e
esi: 0000400e edi: 00004000 ebp: c022d880 esp: c9f99ec0
ds: 0018 es: 0018 ss: 0018
Process ifconfig (pid: 880, stackpage=c9f99000)
Stack: c0198dfc 00004000 00000014 0000400e c022d880 00004000 00000000 c0198213
       c022d880 c022d880 00000000 00001043 c01c3b02 c022d880 c022d880 00001002
       c01c4a12 c022d880 00000000 c9f99f59 cb264784 cb264760 c01ef3b7 c022d880
Call Trace: [<c0198dfc>] [<c0198213>] [<c01c3b02>] [<c01c4a12>] [<c01ef3b7>] [<c01bdcee>] [<c013ec77>] [<c0108823>]
Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Created attachment 66445 [details] hardware description of system where problem occurs
Created attachment 66446 [details] kernel configuration
Created attachment 66447 [details] kernel symbol map
Created attachment 66448 [details] kernel traceback
Created attachment 66449 [details] Author's reply
Created attachment 66464 [details] assembler listing for 3c509.o as compiled on target system, object matches original from kernel build
Created attachment 66465 [details] command used to compile 3c509.c for assembler listing and matching object
Created attachment 66466 [details] source for 3c509.c
Could you please enable CONFIG_KALLSYMS and try again? That way the oops gets nicely (and accurately) decoded automatically.
Will doing this provide anything other than a symbolic stack traceback? I've already determined that the trap occurred on the initial "pushl" instruction of 'read_eeprom', which is being called from el3_up at 0C5C in the assembler listing. I can complete the traceback with much less trouble than recompiling the kernel and recreating the problem.
Created attachment 66484 [details] ksymoops rendering of Oops traceback
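For readers following the exchange above, symbolic decoding of a traceback comes down to finding, for each raw address, the last symbol in an address-sorted map that starts at or before it. The sketch below illustrates that lookup only; it is not ksymoops code. The `ksym` struct, `lookup_symbol`, and the `demo_map` table are hypothetical names with made-up addresses, not values from the attached symbol map.

```c
#include <stddef.h>

/* One entry of an address-sorted symbol table (hypothetical layout). */
struct ksym {
    unsigned long addr;   /* start address of the symbol */
    const char *name;     /* symbol name */
};

/* Made-up table for illustration only; NOT taken from the attached map. */
static const struct ksym demo_map[] = {
    { 0x1000UL, "func_a" },
    { 0x1400UL, "func_b" },
    { 0x2000UL, "func_c" },
};

/*
 * Binary search for the last entry whose start address is <= addr.
 * The table must be sorted by ascending address.
 * On success, stores the distance from the symbol start in *offset
 * (if offset is non-NULL) and returns the entry; returns NULL when
 * addr lies below the first symbol.
 */
const struct ksym *lookup_symbol(const struct ksym *tab, size_t n,
                                 unsigned long addr, unsigned long *offset)
{
    const struct ksym *best = NULL;
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].addr <= addr) {
            best = &tab[mid];   /* candidate; look for a later one */
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (best != NULL && offset != NULL)
        *offset = addr - best->addr;
    return best;
}
```

With the made-up table, looking up 0x1410 yields `func_b` at offset 0x10. The same lookup, run by hand against a real symbol map, is what turns an entry such as [<c0198dfc>] into a function name plus offset.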
EVIL BUG ALERT!! On a hunch, I went back and tried bringing up the 3C579 after a ten minute post-boot delay, but without the 'fsck root' step (which I had had forced always on). No problemo. This of course means that this is really some hideous and evil bug in the JFS file system code that causes kernel memory corruption when 'fsck' is run. Good luck finding it. I don't have time for something this nasty and which I can work around. Note that the problem isn't dependent on my custom-kernel build. It appears with the default RedHat kernel as well.
It means nothing of the sort! It is much more likely to be a dangling pointer somewhere --- a pointer to freed memory, which just happens to work if the memory hasn't been used for something else in the meantime, but which fails if you run a large task like fsck, which is almost guaranteed to reuse the freed memory for disk buffering.
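The dangling-pointer scenario described above can be sketched in a few lines. This is a toy model under stated assumptions, not code from 3c509.c: `toy_alloc`, `toy_free`, `struct toy_dev`, and the 0x300 I/O address are all hypothetical, and a one-slot static pool stands in for kernel memory so that the reuse is deterministic and the program stays well-defined C.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical device record; not the real driver structure. */
struct toy_dev {
    unsigned int ioaddr;
};

/* One-slot pool standing in for recycled kernel memory. */
static union {
    struct toy_dev dev;
    unsigned char bytes[16];
} slot;
static int slot_in_use;

static void *toy_alloc(void)
{
    if (slot_in_use)
        return NULL;
    slot_in_use = 1;
    return &slot;
}

static void toy_free(void *p)
{
    if (p == (void *)&slot)
        slot_in_use = 0;   /* contents are left as they were */
}

/*
 * Reads a field through a pointer that has already been freed.
 * With disk_activity == 0 nothing reuses the slot, so the old value
 * is still there. With disk_activity != 0 the slot is handed out
 * again and zeroed (the "disk buffer") before the stale read.
 */
unsigned int read_through_stale_pointer(int disk_activity)
{
    struct toy_dev *dev = toy_alloc();
    unsigned int seen;

    if (dev == NULL)
        return 0;
    dev->ioaddr = 0x300;
    toy_free(dev);                      /* the bug: dev is used below */

    if (disk_activity) {
        void *buf = toy_alloc();        /* same memory, new owner */
        if (buf != NULL)
            memset(buf, 0, sizeof slot);
        seen = dev->ioaddr;             /* stale read sees the overwrite */
        toy_free(buf);
    } else {
        seen = dev->ioaddr;             /* stale read happens to work */
    }
    return seen;
}
```

Here `read_through_stale_pointer(0)` returns 0x300 and `read_through_stale_pointer(1)` returns 0: the same bug is invisible until something else claims the freed memory, which is the pattern suggested for the with-fsck and without-fsck boots.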
A valid point. However, the instruction which is trapping is the initial 'pushl' in a function rather than function body code making references via a pointer. I'm still scratching my head over this as the register contents all seem to be valid in light of the 'pushl %ebx' that trapped. Perhaps 'Oops' has mislabeled a stack overflow trap as a GPF. 'Oops' references the %eax value, but this value is not in play at the point of the trap. On balance I still think it's the JFS code, which is new and still labeled EXPERIMENTAL, rather than the 3c509 code, which is about as old and mature as it gets.
I'm now a bit less certain about the cause of this bug. I discovered that there is almost no difference in the structures of the 'ext2' and 'ext3' file systems (just that 'ext3' has a "journal" section) and that one can bring up an 'ext3' file system as 'ext2' if desired. I tried this and the kernel blows up even worse with an infinite "Oops" loop. It could be that the perpetrator is really the 3c509 driver, or it could be that any 'fsck' activity at all at boot causes corruption. It would take a lot of effort to determine which, and I don't have the time.
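On the question of how the trap was classified: for an i386 page fault, the number printed after "Oops:" is the CPU's page-fault error code, whose low three bits say what kind of access failed. The sketch below decodes those bits. The macro names and `describe_fault` are chosen here for illustration and are not taken from the 2.4.18 source.

```c
#include <stdio.h>

/* Low bits of the i386 page-fault error code (names are illustrative). */
#define PF_PROT  0x1UL  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2UL  /* 0: read access,      1: write access */
#define PF_USER  0x4UL  /* 0: kernel mode,      1: user mode */

/*
 * Writes a short description of the error code into out (at most len
 * bytes, including the terminating NUL) and returns snprintf's result.
 */
int describe_fault(unsigned long code, char *out, size_t len)
{
    return snprintf(out, len, "%s, %s, %s mode",
                    (code & PF_PROT)  ? "protection violation"
                                      : "page not present",
                    (code & PF_WRITE) ? "write" : "read",
                    (code & PF_USER)  ? "user" : "kernel");
}
```

For the "Oops: 0002" in this report, `describe_fault(0x0002UL, buf, sizeof buf)` produces "page not present, write, kernel mode". Taken together with the "Unable to handle kernel NULL pointer dereference" header, that reads as a page fault on a write to the reported address 00000800 rather than a general protection fault, though it does not by itself say which instruction performed the write.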
OK, now I'm thinking that Red Hat is a bit on the lame side. While wrestling with another kernel bug (which I didn't report here in part because of the lameness of the support), I discover that, wouldn't you know it, there is a major bug-fix release of the 7.3 distribution kernel [2.4.18-5] available! Sure enough this bug-fix kernel fixes both the other problem and this problem. What has me really annoyed is that Red Hat has gone to seemingly great lengths to hide their fix-patch download section (under a minuscule and deceiving "Errata" link on the home page), and that nobody in this forum thought of suggesting I try the latest kernel before spending a ridiculous amount of time qualifying this bug. You guys are *not* IBM. You can say a lot of bad things about IBM, but they definitely have their act together when it comes to support. Get a clue.
We publish the errata, and keep complete archives of them. Errata are posted to the Red Hat mailing lists and also a complete list of current updates is available from the "up2date" command. If you had trouble finding the errata then that's something we need to look at to see if we can make it more obvious and easier to do.
You can start by calling it "DOWNLOAD FIX UPDATES" and making it prominent on the Red Hat home page. "Errata?" That's what wussy journalists call it when they get their facts wrong in an article and retract it in ultra-fine print on the back page. Programs have what we call BUGS. BUGS are repaired by BUG FIXES and PATCHES, not "errata." It would also be helpful to remind people to check for the latest patches in the report submission forms before they waste hours on a bug report. Since I bought the latest distribution of Red Hat, it never occurred to me that there would have been two kernel fix releases and one 'gdb' update since my CDs were mastered in late April, not to mention a dozen other assorted though less important items. And as I pointed out above, those who support this forum should mention it when an update is available. The versions I'm working with are clearly presented at the beginning of the report. The 2.4.18-5 kernel update to 7.3 is over a month old and should be known to people who support kernel components. I've been working with AIX for years. You can't even begin to submit a problem report with IBM until you certify you are running the latest released version of the relevant component.
Three things:
1) You run a self-compiled kernel, which is not supported.
2) Bugzilla is NOT a support forum. It's a bug-reporting tool.
3) Your bug did not look like anything fixed in errata.
And yes, we try to support/help even people who for whatever reason don't want to upgrade to the latest errata kernels.