From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

Description of problem:
After some time (never more than two weeks), one of our servers stops responding (freezes completely), and a reboot is needed to get the system back up and running. The server in question is an IBM xSeries eServer 330 with dual PIII-800MHz and 1 GB ECC RAM. Both CPUs have the same family, model and stepping. No bugs are reported in /proc/cpuinfo. We are running a self-compiled 2.2.19 kernel, but with no special or experimental support compiled into the kernel.

How reproducible:
Always

Steps to Reproduce:
1. Boot server
2. Wait for anywhere between a few hours and 10-12 days
3. System hangs

Actual Results:
After a few days up and running, the server freezes. With 100GB of disk, fsck takes quite some time, and our users get angry and frustrated waiting for the server to get back online.

Expected Results:
The server should keep on truck^H^H^H^H^Hrunning.

Additional info:
The error message is as follows (abbreviated - stacks and registers snipped):

warning: kfree_skb passed an skb still on a list (from c0267fbf)
Unable to handle kernel NULL pointer dereference at virtual address 00000004
current -> tss.cr3 = 00101000, %cr3=00101000
*pde = 00000000
Oops: 0002
CPU: 1
EIP: 0010[<80185481>]
EFLAGS: 00010047
<snipped registers and stack... available if wanted>
Code: 89 58 04 89 03 c7 02 00 00 00 00 c7 42 04 00 00 00 00 c7 42
Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
In swapping task - not syncing
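A note for anyone reading the oops above: the number after "Oops:" is the x86 page-fault error code, and its low three bits can be decoded by hand. The snippet below is only a minimal sketch of that decoding. The function name `decode_oops_code` is made up for illustration and is not part of any kernel tool. It does not replace a proper ksymoops run, which is still needed to map the EIP to a function.

```python
# Decode the hex value printed after "Oops:" on 32-bit x86 (e.g. "0002").
# Bit layout is the x86 page-fault error code:
#   bit 0: 0 = page not present, 1 = protection violation
#   bit 1: 0 = read access,      1 = write access
#   bit 2: 0 = kernel mode,      1 = user mode
# decode_oops_code is a hypothetical helper name, not an existing utility.

def decode_oops_code(code_hex: str) -> dict:
    code = int(code_hex, 16)
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


if __name__ == "__main__":
    # Value taken from the report above.
    print(decode_oops_code("0002"))
```

For the reported value 0002 this gives a kernel-mode write to a page that is not present. That fits the fault address 00000004: if the Code: bytes begin at the faulting instruction, 89 58 04 is a store to offset 4 from a register that would then hold NULL.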
A ksymoopsed version of the oops would be very welcome. Dave: does this look familiar ? (feel free to assign the bug back if not)
"We are running a self-compiled 2.2.19 kernel, but with no special or experimental support compiled into the kernel." Is that RH 2.2.19 or an "upstream" 2.2.19? And are you willing to try out a patch?
It's a vanilla "kernel.org" 2.2.19 kernel, not a RH kernel. Before we try out any patches, it would be nice to know what the patch is supposed to do. The server is in full production, and we don't want to risk any instability in addition to the occasional halt described in this bugzilla report. The server seems to be running OK with a reboot every night, BTW. Could this be an indicator of a resource leak of some sort?
That or some timeout which happens to be > 24 hours ;) The patch I propose is a change that will also be in 2.2.20 whenever that comes out.
Send me the patch, and we'll see what happens... Can't guarantee anything today, but maybe tomorrow or Thursday.
Long time no update. Since the last update, we have tried several kernels:
- 2.2.16smp
- 2.2.16enterprise
- 2.2.16 self compiled
- 2.2.19 self compiled
- 2.4.12 self compiled
- 2.4.14 self compiled

The server is running 2.4.14 now, but the only way to get it 100% stable is to remove one CPU. With two CPUs it freezes after anything from 1 day to 1 month. BTW, it's a NetFinity 5600 with dual PIII-800, 1 GB RAM and an Adaptec 3200S RAID controller (lost faith in IBM ServeRAIDs after a while).
What network driver are you using ? (As you're the only one seeing this problem something must be different)
The network drivers are one "3c59x" (main) and one "dmfe" (crossover to webmail server). We are using TCP/IP only, with an NFS mount from one server to the other. No ipchains, iptables, netfilter or pppoe.

We are not alone with this bug, btw. I have seen one more report in a Norwegian usenet group (no.it.os.unix.linux.diverse - read by at least one RedHat employee), and the reporter got the error on both dual PIIIs and dual Athlon MPs using different 2.4 kernels. This guy was using the following libraries: libc-5.3.12-31, glibc-devel-2.2.4-18.7.0, glibc-2.2.4-18.7.0, glibc-common-2.2.4-18.7.0, compat-libstdc++-6.2-2.9.0.9 on a RedHat 7.0 system (which he has modified). We are only running libraries supplied with the RedHat distributions we have tried, to keep it as simple as possible.
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/