126936 – Kernel 2.6.x locks up hard, 2.4.x works

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	3
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-06-29 14:01 UTC by Jonathan Kamens
Modified:	2015-01-04 22:07 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-07-15 22:43:08 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
lspci output (828 bytes, text/plain) 2004-12-13 03:27 UTC, Jonathan Kamens

Description Jonathan Kamens 2004-06-29 14:01:03 UTC

I have never been able to use any 2.6.* kernel.  I've most recently
tried 2.6.7-mm3 as well as 2.6.6-1.435smp.  Both of them lock up hard
under heavy load.  No magic SysRq, no log messages indicating the
problem, NMI watchdog is ineffective, etc.  In contrast, 2.4.25-pac1
runs fine for me for months at a time.

I would love to give you whatever information you need to be able to
diagnose and fix this problem.  Because of it, I am stuck compiling my
own 2.4.x kernels when I'd really rather just be using the current
Fedora 2.6.x kernel.

On this topic, surely somebody by now has written a utility which
captures all the vital statistics about a Linux machine (hardware,
kernel version, glibc version, kernel modules, etc.) and packages it
up to be submitted with a bug report such as this one?
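
A minimal sketch of what such a collector might look like, using only
standard tools. The function name collect_sysinfo and the section layout
are hypothetical and do not refer to an existing utility; the set of
facts gathered is an assumption about what a kernel bug report needs.

```shell
#!/bin/sh
# Hypothetical helper: print basic system facts for a bug report.
collect_sysinfo() {
    echo "== kernel =="
    uname -a
    echo "== glibc =="
    # getconf reports the C library version on glibc systems
    getconf GNU_LIBC_VERSION 2>/dev/null || echo "unknown"
    echo "== modules =="
    # /proc/modules lists the currently loaded kernel modules
    cat /proc/modules 2>/dev/null || echo "unavailable"
    echo "== pci =="
    # lspci may be missing, or installed in /sbin outside the user's PATH
    lspci 2>/dev/null || /sbin/lspci 2>/dev/null || echo "lspci unavailable"
}
```

Running "collect_sysinfo > sysinfo.txt" would produce a file suitable for
attaching to a report.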

I'm suspecting that the most likely culprit for the problems I'm
seeing is my SiI680 PCI133 off-board IDE controller.  I have a vague
recollection that there were known problems with Silicon Image in
2.6.x; have those problems been addressed, or at least do you believe
that they've been addressed, in the Fedora kernel referenced above? 
If not, do you know if/when they'll be addressed?  Or is using an
SiI680 controller with 2.6.x simply never going to work?

In terms of settings on the hard disks attached to that controller,
I'm using the following non-default settings in my
/etc/sysconfig/harddisk* files: USE_DMA=1, EIDE_32BIT=1,
EXTRA_PARAMS="-u1 -W0".  Would changing these settings eliminate the
hangs?  (I hope you're not going to tell me to turn off DMA, which would
make the card so slow as to be essentially useless!)
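
For reference, a sketch of how these sysconfig variables are believed to
translate into hdparm flags. The mapping (USE_DMA to -d, EIDE_32BIT to
-c, EXTRA_PARAMS passed through) is an assumption based on the Fedora
init scripts of that era, and hdparm_flags is a hypothetical helper that
only builds the flag string without touching any disk.

```shell
# Hypothetical helper: build the hdparm flag string from the
# /etc/sysconfig/harddisk* variables quoted above.
hdparm_flags() {
    flags=""
    # -d toggles DMA, -c toggles 32-bit I/O support
    [ -n "$USE_DMA" ] && flags="$flags -d$USE_DMA"
    [ -n "$EIDE_32BIT" ] && flags="$flags -c$EIDE_32BIT"
    # EXTRA_PARAMS is appended verbatim (here: -u1 unmask IRQs, -W0 write cache off)
    [ -n "$EXTRA_PARAMS" ] && flags="$flags $EXTRA_PARAMS"
    # Strip the single leading space left by the concatenation
    echo "${flags# }"
}
```

With the settings above this prints "-d1 -c1 -u1 -W0", so individual
settings could be tried by hand with hdparm on one disk before editing
the sysconfig files.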

Comment 1 Dave Jones 2004-10-30 04:00:35 UTC

Is this still causing problems in the current kernels?

Comment 2 Jonathan Kamens 2004-10-31 04:31:04 UTC

It appears to no longer be a problem with kernel-smp-2.6.9-1.643 and
glibc-2.3.3-74.

Comment 3 Jonathan Kamens 2004-10-31 11:24:02 UTC

I may have spoken too soon.  My computer locked up hard a few hours
after I went to bed after switching to the kernel referenced above. 
This may be because of the SiImage problem again, or I suppose there
may be something else wrong with this kernel because it's a
development kernel.

Comment 4 Jonathan Kamens 2004-10-31 14:32:39 UTC

OK, my system is unusable right now, so I hope you had a good reason
to ask me to retest.

I can't now back down to my old kernel because I had to install the
new MAKEDEV for compatibility with the kernel you asked me to test,
and that apparently deleted most of the devices from /dev, one of
which is apparently the device required for the initial console.  When
I try to boot with the old kernel now, which is 2.4 so keep in mind
that I can't use udev, it says it can't open the initial console and
won't boot.

Recreating the missing devices is rather difficult because I can't get
to the actual /dev directory that the 2.4 kernel uses -- it's hidden
by udev!  I tried using a rescue disk which caused all kinds of
problems because it's old enough that it's incompatible with most of
the binaries in my root partition.  I managed to get my root partition
mounted with the rescue disk and to use mknod to create some console
devices (I couldn't use MAKEDEV to create all of them because of the
above-referenced binary incompatibility problem), but apparently
whatever devices I created weren't enough because the kernel still
couldn't create the initial console.

Do you have any suggestions for what I might do now to restore the
ability to boot with the old kernel?

Also, can you recommend a decent PCI IDE controller which is rock
solid with Red Hat kernels?  Preferably one that's relatively fast but
also relatively inexpensive.  I'm perfectly willing to throw a little
money at the problem if I can fix it by replacing the SiI card,
although obviously it would be preferable if the Linux kernel worked
with this card.

Comment 5 Jonathan Kamens 2004-10-31 14:49:28 UTC

Well, I managed to recreate /dev/console and I'm back up and running
with 2.4, with a few tweaks I still need to fix.
I'd still like a better solution.

Comment 6 Dave Jones 2004-11-01 19:26:10 UTC

I've not heard any reports (good or bad) about the Sil controllers recently, so
I'm unsure if that could be the cause.  Alan ?

Comment 7 Jonathan Kamens 2004-11-01 19:46:46 UTC

There were definitely e-mail messages on LKML indicating that the 
Silicon Image PCI controller support in the 2.6.x kernel should be 
considered unstable, and I don't see anything in the ChangeLog files 
for the 2.6.x kernel series indicating that this instability has been 
addressed.

Comment 8 Alan Cox 2004-11-01 22:03:41 UTC

SI controllers in 2.6.x should be rock solid. The new SATA driver
layer won't be used for the SI/CMD680. If it kicks in with 2.6, I'd
start with acpi=off, because the IDE driver is identical.

Comment 9 Jonathan Kamens 2004-11-02 00:47:21 UTC

Alan, I must confess that much of what you wrote above is too cryptic
for me to understand.  I did get that you want me to try "acpi=off"
in my kernel boot, which I've done and I'll let you know the results,
but I'm under the impression that even before I did that ACPI was
still off for my system, because it's SMP and I see only these
ACPI-related messages when I boot:

Oct 31 07:24:39 jik kernel: ACPI: Unable to locate RSDP
...
Oct 31 07:24:42 jik kernel: ACPI: Subsystem revision 20040816
Oct 31 07:24:42 jik kernel: ACPI: Interpreter disabled.
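
One way to check this from the boot messages is sketched below. The
function acpi_status is a hypothetical helper; it assumes the kernel
logs either "ACPI: Interpreter enabled" or "ACPI: Interpreter disabled"
at boot, as in the lines quoted above, and reports "unknown" otherwise.

```shell
# Hypothetical helper: read kernel log text on stdin and report
# whether the ACPI interpreter was enabled or disabled at boot.
acpi_status() {
    awk '
        /ACPI: Interpreter disabled/ { state = "disabled" }
        /ACPI: Interpreter enabled/  { state = "enabled" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

Typical use would be "dmesg | acpi_status" or
"acpi_status < /var/log/messages".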

Comment 10 Alan Cox 2004-11-02 00:51:23 UTC

Sorry, it's never easy to gauge the level of a reply in bugzilla.

There are two sets of Linux drivers for some of the SI devices. The
SI680 that you have is handled by an existing 2.4 and 2.6 stable
driver with essentially the same code in both.

The driver that has been unstable, but now seems far better, is a
SATA-specific driver for the SI3112, which in 2.4 used the old IDE
style driver.

As such I'm puzzled that the SI680 driver should be the cause of the
problems.

Comment 11 Jonathan Kamens 2004-11-02 01:18:22 UTC

Am I right that ACPI was already turned off when I booted 2.6, even
before I booted with "acpi=off"?

Comment 12 Dave Jones 2004-11-02 20:42:00 UTC

no. ACPI is on by default.

Comment 13 Jonathan Kamens 2004-11-02 20:47:47 UTC

I tried acpi=off and it didn't help -- the system hung anyway.

My newest theory is that there may be an issue with software raid.  I 
had five different software raid partitions, /dev/md0 
through /dev/md4, each with two mirrors, some with one mirror on my 
PIIX4 controller and the other on the SiI680 and some with both 
mirrors on the SiI680.  I've disabled all of the RAID and switched 
back to using the plain partitions, and so far I haven't crashed or 
hung.  We'll see if that keeps up.

Comment 14 Alan Cox 2004-11-02 21:53:59 UTC

Code review shows no apparent problematic differences. I found another bug
involving the SI3112 only, but that's another story.

Comment 15 Jonathan Kamens 2004-11-03 03:49:05 UTC

No luck -- the 2.6 kernel just locked up on me with all the RAID
disabled.  Alan, do you know of *any* extant kernel lockups that I
might be able to test for to explain why my system is locking up under
2.6?

Comment 16 Jonathan Kamens 2004-11-03 11:14:35 UTC

I just ran memtest86+ on my machine overnight for >7 hours and it
didn't report a single error, so I think bad memory is unlikely to be
the explanation for the lockups. Any suggestions for other things I
can try?

Comment 17 Jonathan Kamens 2004-11-04 19:50:24 UTC

I guess you're right that it isn't a problem with the SIIG 
controller, because I removed the controller and the machine locked 
up again.

Comment 18 Dave Jones 2004-12-08 06:57:36 UTC

ok, putting this one down to problematic hardware.

Comment 19 Jonathan Kamens 2004-12-08 17:39:50 UTC

The same hardware works perfectly with 2.4.27-pre2-pac1 and various other 2.4.x
kernels, so how can it be reasonably classified as a hardware problem?

Comment 20 Dave Jones 2004-12-08 17:51:13 UTC

comment 17 seemed to imply that the problem is elsewhere.

Is there any improvement with the latest 2.6.9 kernel?

Comment 21 Alan Cox 2004-12-08 17:55:44 UTC

2.6 and 2.4 will show up differing hardware problems in differing ways. OK, so
it's not the SI680, so let's see what else it might be.

Can you attach an lspci, please? (Off either 2.4 or 2.6, it doesn't matter.)

Comment 22 Jonathan Kamens 2004-12-13 03:27:47 UTC

Created attachment 108417
lspci output

Comment 23 Jonathan Kamens 2004-12-15 01:49:16 UTC

kernel-smp-2.6.9-1.1021_FC4 is no better than any of the other 2.6.x kernels
I've tried.

One thing I don't think I mentioned before is that when the kernel locks up, it
seems to get very sluggish for a few seconds before the lock-up, i.e., the
lock-up is not instantaneous.

Comment 24 fergal 2005-01-19 15:43:30 UTC

I think I have a similar problem.  Until a couple of months ago, I 
was using FC3 with kernel 2.6.9-1.667 and no problem.  I updated to 
2.6.9-1.681 and I get these lockups.  I've done further updates since 
then, am now on kernel 2.6.10-1.737 and it's still happening. 
 
The bug is quite reproducible, I just have to do something which 
initiates some heavy disk action (preferably including swap space) 
and the system hangs with no option but to hard-reset :-( 
 
Steps to reproduce: 
 
1. Get lots (3.5GB) of data and write an ISO image to hard disk.  If 
this does not do it try: 
2. Execute:  split -b 500M <filename>   on the ISO image. 
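 
The two steps above can be approximated with a small script. This is only
a sketch: disk_load is a hypothetical helper, the file names are
placeholders, and a file of zeros stands in for the real ISO image, so it
exercises the write path but not necessarily the same I/O pattern.

```shell
# Hypothetical load generator modeled on the reproduction steps.
# Usage: disk_load DIR SIZE_MB CHUNK_MB
disk_load() {
    dir=$1
    size_mb=$2
    chunk_mb=$3
    # Step 1: write a large file to disk (stand-in for the ISO image)
    dd if=/dev/zero of="$dir/image.iso" bs=1048576 count="$size_mb" 2>/dev/null || return 1
    # Step 2: split it into fixed-size pieces named part.aa, part.ab, ...
    ( cd "$dir" && split -b "${chunk_mb}m" image.iso part. )
}
```

Something like "disk_load /var/tmp 3500 500" would roughly match the
sizes described; smaller values are safer for a first try.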
 
 
For a while I thought it might be kjournald or kswapd, as they always 
appeared high up in a top window at the time of the crash, so I 
switched both off but the crash occurs just the same. 
 
I agree with the last comment that the system becomes very sluggish 
just beforehand.  In fact, during one of my attempts to perform the 
split command above, I realised that the system became very sluggish, 
killed the process, and the system recovered just fine. 
 
I eventually got the ISO image split into its 7 pieces, so 
unfortunately the bug does not _always_ occur. 
 
Fergal

Comment 25 Aleksandar Milivojevic 2005-01-19 17:17:47 UTC

Is the machine pingable?

The problem I have is that if one process goes berserk (memory-wise) and
the machine starts to swap heavily, it will virtually block the entire
machine.  For a desktop user, it might look as if the machine halted
(keyboard not reacting, mouse not reacting).

For example, for some time now, for whatever reason, updating the kernel
package in Fedora Core 2 and 3 takes enormous amounts of memory (both
using rpm and yum).  The only way to update the kernel RPM is to
limit the amount of memory that rpm/yum is allowed to use (and I also
renice it):

   # ulimit -Sm 131072
   # ulimit -Hm 131072
   # nice yum update

Without these restrictions, the yum process (which seems to be very hard
on memory) would lock up the entire machine.  I can still ping, so I know
the kernel is alive.  And if I am *very* patient (patience measured in
minutes), I might even get some feedback from keyboard or mouse.  I
can see the disk activity light constantly on.  And if I leave the machine
powered on overnight, and then go to work, when I get back home the kernel
RPM package has been upgraded, and everything is back to normal (until the
next kernel upgrade).
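
The limit-then-renice approach can be wrapped so that the limits apply
only to the one command and not to the login shell. This is a sketch:
run_limited is a hypothetical helper, the limit is given in kilobytes
(131072 KB is 128 MB), and whether the kernel actually enforces the
"-m" (resident set size) limit depends on the kernel version.

```shell
# Hypothetical wrapper: run a command at lower priority with the
# resident-memory limit set only for that command.
# Usage: run_limited LIMIT_KB COMMAND [ARGS...]
run_limited() {
    limit_kb=$1
    shift
    (
        # The subshell keeps the limits from sticking to the caller
        ulimit -Sm "$limit_kb" || exit 1
        ulimit -Hm "$limit_kb" || exit 1
        # nice with no options lowers the scheduling priority
        exec nice "$@"
    )
}
```

For example, "run_limited 131072 yum update" would mirror the commands
above.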

I've noticed this on two older machines (one desktop, one laptop) with
Pentium MMX processors, as well as on one two-year-old Pentium 4
machine with 256 MB of RAM and a (good old) IDE disk (actually I just
locked it up completely by running yum update without first restricting
how much memory it is allowed to take).

Now the question is, how is it possible that the kernel allows a single
process to bring the entire machine to a complete halt?  Sure, if one process
makes the machine start swapping heavily, it will be noticed in
overall performance and everything will slow down drastically.  But it
shouldn't make everything stop; the kernel shouldn't allow a single
process to monopolize the entire physical memory and not allow anything
else to run at all.

Comment 26 fergal 2005-01-20 10:00:37 UTC

No, the machine does not respond to anything.  I've even left the 
computer in its crashed state for hours to see if it finally 
recovered by itself, but no dice. 
 
I'll try your renice/ulimit suggestion...

Comment 27 Jonathan Kamens 2005-03-15 04:11:56 UTC

I don't understand why this ticket is still in "NEEDINFO" state.  It
would seem that the requested information has been provided, and
furthermore, several other people besides me have attested to having
the same problem, so I think we've ruled out a problem that is unique
to my hardware.  What can we do to move this along?

Comment 28 Warren Togami 2005-03-15 04:22:24 UTC

00:0d.0 RAID bus controller: Silicon Image, Inc. (formerly CMD
Technology Inc) PCI0680 Ultra ATA-133 Host Controller (rev 02)
I have TWO of these in my home server driving four disks with heavy
load.  I have had perfect stability so far.

It doesn't matter if the bug is NEEDINFO or ASSIGNED, because there is
NO USEFUL INFORMATION in this report to isolate a cause.

Comment 29 Jonathan Kamens 2005-03-15 04:25:52 UTC

You don't need to shout.

There are several people here who have experienced this problem and
who are eager to give you whatever support you need to isolate it
further.  If there is no useful information here, then tell us what
would constitute useful information and we will endeavor to provide it.

Comment 30 Warren Togami 2005-03-15 04:34:43 UTC

This may not work, but usually you will get useful oops or panic
tracebacks if you connect a serial console from this computer to
another and set up the tty properly.
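
A sketch of what that setup might involve on the machine that locks up.
The port (ttyS0) and speed (115200) are assumptions, and console_args is
a hypothetical helper that only prints the kernel parameters to append to
the boot line in the bootloader configuration.

```shell
# Hypothetical helper: print kernel boot parameters that send kernel
# messages to both the local screen and a serial port.
# Usage: console_args [PORT] [BAUD]
console_args() {
    port=${1:-ttyS0}
    baud=${2:-115200}
    # console=tty0 keeps output on the screen; console=PORT,BAUD adds serial
    echo "console=tty0 console=$port,$baud"
}

# On the second machine, a terminal program attached to its own serial
# port at the same speed captures the output, for example:
#   screen /dev/ttyS0 115200
```

The two machines would be connected with a null-modem cable, and the
capture left running until the next lockup.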

Comment 31 Alan Cox 2005-03-15 13:17:23 UTC

I don't believe the two bugs in this bugzilla entry are at all
related. One is a vm/out of memory handling bug (known, mostly fixed
in 2.6.11), the other a total mystery that smells hardware related (NMI
watchdog didn't help get data).

In terms of IDE fixes, the only potentially relevant fix for 2.6.11 was
an IDE shared IRQ handling corner case, which wouldn't produce the
symptoms, and several md/raid fixes, which again don't fit the symptoms.

Comment 32 Dave Jones 2005-07-15 19:20:17 UTC

An update has been released for Fedora Core 3 (kernel-2.6.12-1.1372_FC3) which
may contain a fix for your problem.   Please update to this new kernel, and
report whether or not it fixes your problem.

If you have updated to Fedora Core 4 since this bug was opened, and the problem
still occurs with the latest updates for that release, please change the version
field of this bug to 'fc4'.

Thank you.

Comment 33 Jonathan Kamens 2005-07-15 19:52:59 UTC

Sorry, I finally gave up a couple of months ago and bought a new computer.  I'm
no longer using the computer that locks up with 2.6.x kernels, and I don't have
time to put it back together enough to test this.
