Red Hat Bugzilla – Bug 126936
Kernel 2.6.x locks up hard, 2.4.x works
Last modified: 2015-01-04 17:07:27 EST
I have never been able to use any 2.6.* kernel. I've most recently
tried 2.6.7-mm3 as well as 2.6.6-1.435smp. Both of them lock up hard
under heavy load. No magic SysRq, no log messages indicating the
problem, NMI watchdog is ineffective, etc. In contrast, 2.4.25-pac1
runs fine for me for months at a time.
I would love to give you whatever information you need to be able to
diagnose and fix this problem. Because of it, I am stuck compiling my
own 2.4.x kernels when I'd really rather just be using the current
Fedora 2.6.x kernel.
On this topic, surely somebody by now has written a utility which
captures all the vital statistics about a Linux machine (hardware,
kernel version, glibc version, kernel modules, etc.) and packages it
up to be submitted with a bug report such as this one?
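
For what it's worth, a rough sketch of such a collector is below. It is only an illustration, not an existing tool: the function names and the choice of fields are my own assumptions, and it simply skips anything (such as lspci) that isn't available on the machine.

```python
import platform
import shutil
import subprocess


def run_if_available(cmd):
    """Return the command's stdout, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout if result.returncode == 0 else None


def read_if_present(path):
    """Return the file's text, or None if it cannot be read."""
    try:
        with open(path, "r", errors="replace") as handle:
            return handle.read()
    except OSError:
        return None


def collect_system_report():
    """Gather the basics a kernel bug report usually needs."""
    uname = platform.uname()
    libc_name, libc_version = platform.libc_ver()
    return {
        "kernel_release": uname.release,
        "kernel_version": uname.version,
        "machine": uname.machine,
        "libc": f"{libc_name} {libc_version}".strip(),
        "cmdline": read_if_present("/proc/cmdline"),
        "modules": read_if_present("/proc/modules"),
        "interrupts": read_if_present("/proc/interrupts"),
        "lspci": run_if_available(["lspci", "-v"]),
    }


if __name__ == "__main__":
    # Print each section so the output can be pasted into a bug report.
    for key, value in collect_system_report().items():
        print(f"=== {key} ===")
        print(value if value is not None else "(unavailable)")
```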
I'm suspecting that the most likely culprit for the problems I'm
seeing is my SiI680 PCI133 off-board IDE controller. I have a vague
recollection that there were known problems with Silicon Image in
2.6.x; have those problems been addressed, or at least do you believe
that they've been addressed, in the Fedora kernel referenced above?
If not, do you know if/when they'll be addressed? Or is using an
SiI680 controller with 2.6.x simply never going to work?
In terms of settings on the hard disks attached to that controller,
I'm using the following non-default settings in my
/etc/sysconfig/harddisk* files: USE_DMA=1, EIDE_32BIT=1,
EXTRA_PARAMS="-u1 -W0". Would changing these settings eliminate the
hangs? (I hope you're not going to tell me to turn off DMA, which would
make the card so slow as to be essentially useless!)
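
To make the settings above concrete, here is a minimal sketch of how they would translate into an hdparm command line, plus a helper for reverting one non-default setting at a time while leaving DMA on. The variable-to-flag mapping is my assumption based on how the classic Red Hat rc.sysinit handles /etc/sysconfig/harddisk*, and /dev/hde is a hypothetical device name; check both against your own system. As far as I know, the hdparm man page itself cautions that -u1 is not tolerated by every drive/controller combination, so it may be worth reverting that one first.

```python
import shlex

# Assumed mapping from sysconfig variable to hdparm flag (verify against
# the initscripts shipped with your release).
SYSCONFIG_TO_HDPARM = {
    "USE_DMA": "-d",
    "MULTIPLE_IO": "-m",
    "EIDE_32BIT": "-c",
    "LOOKAHEAD": "-A",
}


def hdparm_args(settings, device):
    """Build the hdparm argument list implied by the sysconfig settings."""
    args = ["hdparm"]
    for key, flag in SYSCONFIG_TO_HDPARM.items():
        value = settings.get(key)
        if value not in (None, ""):
            args.append(f"{flag}{value}")
    # EXTRA_PARAMS is passed through verbatim, split like a shell would.
    args.extend(shlex.split(settings.get("EXTRA_PARAMS", "")))
    args.append(device)
    return args


def bisection_candidates(settings):
    """Yield (dropped_key, reduced_settings), never dropping USE_DMA."""
    for key in settings:
        if key == "USE_DMA":
            continue
        yield key, {k: v for k, v in settings.items() if k != key}


if __name__ == "__main__":
    current = {"USE_DMA": "1", "EIDE_32BIT": "1", "EXTRA_PARAMS": "-u1 -W0"}
    print(" ".join(hdparm_args(current, "/dev/hde")))
    for dropped, reduced in bisection_candidates(current):
        print(f"without {dropped}:", " ".join(hdparm_args(reduced, "/dev/hde")))
```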
Is this still causing problems in the current kernels?
It appears to no longer be a problem with kernel-smp-2.6.9-1.643 and
I may have spoken too soon. My computer locked up hard a few hours
after I went to bed after switching to the kernel referenced above.
This may be because of the SiImage problem again, or I suppose there
may be something else wrong with this kernel because it's a
OK, my system is unusable right now, so I hope you had a good reason
to ask me to retest.
I can't now back down to my old kernel because I had to install the
new MAKEDEV for compatibility with the kernel you asked me to test,
and that apparently deleted most of the devices from /dev, one of
which is apparently the device required for the initial console. When
I try to boot with the old kernel now, which is 2.4 so keep in mind
that I can't use udev, it says it can't open the initial console and
Recreating the missing devices is rather difficult because I can't get
to the actual /dev directory that the 2.4 kernel uses -- it's hidden
by udev! I tried using a rescue disk which caused all kinds of
problems because it's old enough that it's incompatible with most of
the binaries in my root partition. I managed to get my root partition
mounted with the rescue disk and to use mknod to create some console
devices (I couldn't use MAKEDEV to create all of them because of the
above-referenced binary incompatibility problem), but apparently
whatever devices I created weren't enough because the kernel still
couldn't create the initial console.
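
In case it helps anyone in the same spot, below is a sketch of recreating the handful of character devices an early boot usually wants. The major/minor numbers are the standard ones from the kernel's devices.txt; which nodes a given 2.4 setup actually needs is an assumption on my part, as is the /mnt/rootfs path. One way to reach the on-disk /dev while udev is mounted over it is a plain (non-recursive) bind mount of the root filesystem, e.g. `mount --bind / /mnt/rootfs`, after which /mnt/rootfs/dev is the real directory. The function defaults to a dry run that only returns the equivalent mknod commands.

```python
import os
import stat

# (name, major, minor, permission bits); numbers from the kernel's devices.txt.
ESSENTIAL_CHAR_DEVICES = [
    ("console", 5, 1, 0o600),
    ("null", 1, 3, 0o666),
    ("zero", 1, 5, 0o666),
    ("tty", 5, 0, 0o666),
    ("tty0", 4, 0, 0o620),
    ("tty1", 4, 1, 0o620),
]


def restore_device_nodes(dev_dir, dry_run=True):
    """Return the mknod commands; create the nodes too unless dry_run is set."""
    commands = []
    for name, major, minor, perms in ESSENTIAL_CHAR_DEVICES:
        path = os.path.join(dev_dir, name)
        commands.append(f"mknod -m {perms:o} {path} c {major} {minor}")
        if dry_run or os.path.lexists(path):
            continue
        # Creating device nodes requires root.
        os.mknod(path, perms | stat.S_IFCHR, os.makedev(major, minor))
        # mknod honours the umask, so set the mode explicitly afterwards.
        os.chmod(path, perms)
    return commands


if __name__ == "__main__":
    for command in restore_device_nodes("/mnt/rootfs/dev"):
        print(command)
```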
Do you have any suggestions for what I might do now to restore the
ability to boot with the old kernel?
Also, can you recommend a decent PCI IDE controller which is rock
solid with Red Hat kernels? Preferably one that's relatively fast but
also relatively inexpensive. I'm perfectly willing to throw a little
money at the problem if I can fix it by replacing the SiI card,
although obviously it would be preferable if the Linux kernel worked
with this card.
Well, I managed to recreate /dev/console and I'm back up and running
with 2.4, with a few tweaks I still need to fix.
I'd still like a better solution.
I've not heard any reports (good or bad) about the SiI controllers recently, so
I'm unsure if that could be the cause. Alan?
There were definitely e-mail messages on LKML indicating that the
Silicon Image PCI controller support in the 2.6.x kernel should be
considered unstable, and I don't see anything in the ChangeLog files
for the 2.6.x kernel series indicating that this instability has been
SI controllers in 2.6.x should be rock solid. The new SATA driver
layer won't be used for the SI/CMD680. If it kicks in with 2.6 I'd
start with acpi=off because the IDE driver is identical
Alan, I must confess that much of what you wrote above is too cryptic
for me to understand it. I did get that you want me to try "acpi=off"
in my kernel boot, which I've done and I'll let you know the results,
but I'm under the impression that even before I did that ACPI was
still off for my system, because it's SMP and I see only these
ACPI-related messages when I boot:
Oct 31 07:24:39 jik kernel: ACPI: Unable to locate RSDP
Oct 31 07:24:42 jik kernel: ACPI: Subsystem revision 20040816
Oct 31 07:24:42 jik kernel: ACPI: Interpreter disabled.
Sorry, it's never easy to gauge the level of a reply in bugzilla.
There are two sets of Linux drivers for some of the SI devices. The
SI680 that you have is handled by an existing 2.4 and 2.6 stable
driver with essentially the same code in both.
The driver that has been unstable, but now seems far better, is a
SATA specific driver for the SI3112 which in 2.4 used the old IDE
As such I'm puzzled that the SI680 driver should be the cause of the
Am I right that ACPI was already turned off when I booted 2.6, even
before I booted with "acpi=off"?
No. ACPI is on by default.
I tried acpi=off and it didn't help -- the system hung anyway.
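
To double-check that a boot really had the option applied, the running kernel's command line can be read back from /proc/cmdline. A small sketch (the function names are my own, nothing standard):

```python
def parse_kernel_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        # Repeated names (e.g. several console= entries) keep the last value.
        params[name] = value if sep else None
    return params


def acpi_disabled_on_cmdline(cmdline):
    """True only if the command line explicitly contains acpi=off."""
    return parse_kernel_cmdline(cmdline).get("acpi") == "off"
```

On the affected machine this would be called as `acpi_disabled_on_cmdline(open("/proc/cmdline").read())`. It only reports what was passed at boot, not whether the ACPI interpreter ended up enabled.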
My newest theory is that there may be an issue with software raid. I
had five different software raid partitions, /dev/md0
through /dev/md4, each with two mirrors, some with one mirror on my
PIIX4 controller and the other on the SiI680 and some with both
mirrors on the SiI680. I've disabled all of the RAID and switched
back to using the plain partitions, and so far I haven't crashed or
hung. We'll see if that keeps up.
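
For reference, a sketch of how the array layout can be read out of /proc/mdstat to see which arrays span both controllers. The regular expressions assume the usual "mdN : active raidX member[N] ..." line format, and the disk-to-controller mapping passed in is purely hypothetical; fill in your own.

```python
import re

ARRAY_LINE = re.compile(r"^(md\d+) : active (\S+) (.+)$")
MEMBER = re.compile(r"([a-z]+)\d*\[\d+\]")


def arrays_by_disk(mdstat_text):
    """Map each active md array to the sorted list of whole disks it uses."""
    arrays = {}
    for line in mdstat_text.splitlines():
        match = ARRAY_LINE.match(line)
        if match:
            name, _level, members = match.groups()
            arrays[name] = sorted(set(MEMBER.findall(members)))
    return arrays


def spanning_arrays(arrays, controller_of):
    """Return the arrays whose member disks sit on more than one controller."""
    return sorted(
        name
        for name, disks in arrays.items()
        if len({controller_of.get(disk, "unknown") for disk in disks}) > 1
    )
```

Typical use would be `arrays_by_disk(open("/proc/mdstat").read())` followed by `spanning_arrays(...)` with a mapping such as `{"hda": "PIIX4", "hde": "SiI680"}` (again, made-up device names).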
Code review shows no apparent problematic differences. I found another bug
involving the SI3112 only, but that's another story.
No luck -- the 2.6 kernel just locked up on me with all the RAID
disabled. Alan, do you know of *any* extant kernel lockups that I
might be able to test for to explain why my system is locking up under
I just ran memtest86+ on my machine overnight for >7 hours and it
didn't report a single error, so I think bad memory is unlikely to be
the explanation for the lockups. Any suggestions for other things I
I guess you're right that it isn't a problem with the SIIG
controller, because I removed the controller and the machine locked
OK, putting this one down to problematic hardware.
The same hardware works perfectly with 2.4.27-pre2-pac1 and various other 2.4.x
kernels, so how can it be reasonably classified as a hardware problem?
comment 17 seemed to imply that the problem is elsewhere.
Is there any improvement with the latest 2.6.9 kernel?
2.6 and 2.4 will show up differing hardware problems in differing ways. OK, so
it's not the SI680, so let's see what else it might be.
Can you attach an lspci please? (Off either 2.4 or 2.6, doesn't matter.)
Created attachment 108417
kernel-smp-2.6.9-1.1021_FC4 is no better than any of the other 2.6.x kernels
One thing I don't think I mentioned before is that when the kernel locks up, it
seems to get very sluggish for a few seconds before the lock-up, i.e., the
lock-up is not instantaneous.
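
Since there are a few seconds of sluggishness before the hang, one thing that might capture something is a once-a-second sampler that forces each line to disk. This is only a sketch under assumptions: the function names and log path are mine, it reads /proc/loadavg and /proc/meminfo, and if the disk path itself is what is wedging, the final lines may never make it out.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def log_samples(log_path, interval=1.0, count=None):
    """Append load and memory samples, syncing each line to disk."""
    taken = 0
    with open(log_path, "a") as log:
        while count is None or taken < count:
            with open("/proc/loadavg") as handle:
                load = handle.read().split()[0]
            with open("/proc/meminfo") as handle:
                mem = parse_meminfo(handle.read())
            log.write(
                f"{time.strftime('%H:%M:%S')} load={load} "
                f"free_kb={mem.get('MemFree')} swapfree_kb={mem.get('SwapFree')}\n"
            )
            log.flush()
            os.fsync(log.fileno())  # make sure the line survives a hard reset
            taken += 1
            time.sleep(interval)
```

Running `log_samples("/var/tmp/lockup.log")` in the background before starting the heavy load would leave the last few samples in the log after the reset.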
I think I have a similar problem. Until a couple of months ago, I
was using FC3 with kernel 2.6.9-1.667 and no problem. I updated to
2.6.9-1.681 and I get these lockups. I've done further updates since
then, am now on kernel 2.6.10-1.737 and it's still happening.
The bug is quite reproducible, I just have to do something which
initiates some heavy disk action (preferably including swap space)
and the system hangs with no option but to hard-reset :-(
Steps to reproduce:
1. Get lots (3.5GB) of data and write an ISO image to hard disk. If
this does not do it try:
2. Execute: split -b 500M <filename> on the ISO image.
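
The two steps above can be approximated with a small script, in case a self-contained reproducer is useful. This is a sketch, not what I actually ran: it writes a big file in chunks and then cuts it into fixed-size pieces roughly the way `split -b` does. The file name and piece naming scheme are made up.

```python
import os


def write_large_file(path, total_bytes, chunk_bytes=1 << 20):
    """Write total_bytes of data to path in chunks, then fsync it."""
    chunk = os.urandom(chunk_bytes)
    written = 0
    with open(path, "wb") as out:
        while written < total_bytes:
            size = min(chunk_bytes, total_bytes - written)
            out.write(chunk[:size])
            written += size
        out.flush()
        os.fsync(out.fileno())
    return written


def split_file(path, piece_bytes, chunk_bytes=1 << 20):
    """Rough equivalent of `split -b`: writes path.000, path.001, ..."""
    pieces = []
    with open(path, "rb") as src:
        index = 0
        while True:
            remaining = piece_bytes
            data = src.read(min(chunk_bytes, remaining))
            if not data:
                break  # source exhausted, no further pieces
            piece_path = f"{path}.{index:03d}"
            with open(piece_path, "wb") as dst:
                while data:
                    dst.write(data)
                    remaining -= len(data)
                    if remaining == 0:
                        break
                    data = src.read(min(chunk_bytes, remaining))
            pieces.append(piece_path)
            index += 1
    return pieces
```

For the sizes in this report that would be something like `write_large_file("big.iso", 3500 * 2**20)` followed by `split_file("big.iso", 500 * 2**20)`, which yields seven pieces.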
For a while I thought it might be kjournald or kswapd, as they always
appeared high up in a top window at the time of the crash, so I
switched both off but the crash occurs just the same.
I agree with the last comment; that the system becomes very sluggish
just beforehand. In fact, during one of my attempts to perform the
split command above, I realised that the system became very sluggish,
killed the process, and the system recovered just fine.
I eventually got the ISO image split into its 7 pieces, so
unfortunately the bug does not _always_ occur.
Is the machine pingable?
The problem I have is that if one process goes berserk (memory-wise) and
the machine starts to swap heavily, it will virtually block the entire
machine. For a desktop user, it might look as if the machine halted
(keyboard not reacting, mouse not reacting).
For example, for some time now, for whatever reason, updating the kernel
package in Fedora Core 2 and 3 takes enormous amounts of memory (both
using rpm and yum). The only way to update the kernel RPM is to
limit the amount of memory that rpm/yum is allowed to use (and I also
# ulimit -Sm 131072
# ulimit -Hm 131072
# nice yum update
Without these restrictions, the yum process (which seems to be very hard
on memory) would lock up the entire machine. I can still ping, so I know
the kernel is alive. And if I am *very* patient (patience measured in
minutes), I might even get some feedback from the keyboard or mouse. I
can see the disk activity light constantly on. And if I leave the machine
powered on overnight, and then go to work, when I get back home the kernel
RPM package has been upgraded, and everything is back to normal (until the
next kernel upgrade).
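
The same idea as the ulimit/nice lines can be sketched in Python, for anyone who wants to wrap it in a script. Two caveats. First, `ulimit -m` sets RLIMIT_RSS, and as far as I can tell from the getrlimit(2) man page that limit is not enforced by 2.6 kernels, so this sketch swaps in RLIMIT_AS (what `ulimit -v` sets) by default; that is a substitution on my part, not what the commands above do. Second, `run_limited` is a made-up helper name. Note that 131072 KiB is 128 MiB.

```python
import os
import resource
import subprocess


def run_limited(cmd, max_bytes, niceness=10, limit=resource.RLIMIT_AS, **run_kwargs):
    """Run cmd with a soft and hard resource limit and a lowered priority."""

    def apply_limits():
        # Runs in the child process between fork and exec.
        resource.setrlimit(limit, (max_bytes, max_bytes))
        os.nice(niceness)

    return subprocess.run(cmd, preexec_fn=apply_limits, **run_kwargs)


if __name__ == "__main__":
    # Rough counterpart of the shell lines: cap at 128 MiB, then update.
    run_limited(["yum", "update"], 128 * 1024 * 1024)
```

With an address-space cap the process fails its allocations once it hits the limit instead of pushing the machine into swap, so the update may simply abort rather than finish slowly.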
I've noticed this on two older machines (one desktop, one laptop) with
Pentium MMX processors, as well as on one two-year-old Pentium 4
machine with 256 of RAM and a (good old) IDE disk (actually I just
locked it up completely by running yum update without first restricting
how much memory it is allowed to take).
Now the question is, how is it possible that the kernel allows a single
process to bring the entire machine to a complete halt? Sure, if one process
makes the machine start swapping heavily, it will be noticed in
overall performance and everything will slow down drastically. But it
shouldn't bring everything to a stop; the kernel shouldn't allow a single
process to monopolize the entire physical memory and not allow anything
else to be run at all.
No, the machine does not respond to anything. I've even left the
computer in its crashed state for hours to see if it finally
recovered by itself, but no dice.
I'll try your renice/ulimit suggestion...
I don't understand why this ticket is still in "NEEDINFO" state. It
would seem that the requested information has been provided, and
furthermore, several other people besides me have attested to having
the same problem, so I think we've ruled out a problem that is unique
to my hardware. What can we do to move this along?
00:0d.0 RAID bus controller: Silicon Image, Inc. (formerly CMD
Technology Inc) PCI0680 Ultra ATA-133 Host Controller (rev 02)
I have TWO of these in my home server driving four disks with heavy
load. I have had perfect stability so far.
It doesn't matter if the bug is NEEDINFO or ASSIGNED, because there is
NO USEFUL INFORMATION in this report to isolate a cause.
You don't need to shout.
There are several people here who have experienced this problem and
who are eager to give you whatever support you need to isolate it
further. If there is no useful information here, then tell us what
would constitute useful information and we will endeavor to provide it.
This may not work, but usually you will get useful oops or panic
tracebacks if you connect a serial console from this computer to
another and set up the tty properly.
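
As a sketch of what that setup could look like: the crashing machine is booted with console= arguments pointing at its serial port, and the second machine logs whatever arrives. The port name and baud rate below are assumptions (ttyS0 at 115200 is merely common), and both helper names are made up.

```python
import time


def serial_console_args(port="ttyS0", baud=115200, keep_vga=True):
    """Build kernel boot arguments that send console output to a serial port."""
    args = [f"console={port},{baud}n8"]
    if keep_vga:
        # Listing tty0 last keeps /dev/console on the local screen as well.
        args.append("console=tty0")
    return " ".join(args)


def capture(stream, log):
    """Copy lines from a binary stream to a text log with timestamps."""
    for raw in stream:
        line = raw.decode("ascii", errors="replace").rstrip("\r\n")
        log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {line}\n")
        log.flush()
```

On the receiving side the port would be configured first (for example with `stty -F /dev/ttyS0 115200 raw`) and then passed in as `capture(open("/dev/ttyS0", "rb", buffering=0), open("console.log", "a"))`.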
I don't believe the two bugs in this bugzilla entry are at all
related. One is a vm/out of memory handling bug (known, mostly fixed
in 2.6.11), the other a total mystery that smells hardware related (NMI
watchdog didn't help get data).
In terms of IDE fixes, the only potentially relevant fixes for 2.6.11 were
an IDE shared IRQ handling corner case, which wouldn't produce the
symptoms, and several md/raid fixes, which again don't fit the symptoms.
An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem. Please update to this new kernel, and
report whether or not it fixes your problem.
If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.
Sorry, I finally gave up a couple of months ago and bought a new computer. I'm
no longer using the computer that locks up with 2.6.x kernels, and I don't have
time to put it back together enough to test this.