From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.2.19-6.2.1 i586)

Description of problem:
Under heavy load, this Dell 1U 2-processor server stops responding. It is still pingable, and the console displays the following (repeating) messages:

SCSI bus host 0 abort (pid 4732551) timed out - resetting
SCSI bus is being reset for host 0 channel 0
SCSI bus host 0 abort (pid 4732552) timed out - resetting
SCSI bus is being reset for host 0 channel 0
SCSI host 0 channel 0 reset (pid 4732551) timed out - trying harder
SCSI bus is being reset for host 0 channel 0

(the above 6 messages repeat indefinitely)

How reproducible:
Always

Steps to Reproduce:
This has happened to me three times so far. It's probably difficult to reproduce it outside this machine's setup. It seems to be dependent on some property of the load.

Additional info:

* The kernel installed is the 2.2.19-7.0.1smp kernel.

* The machine is maat.cygnus.com.

* The machine has one 9G SCSI disk attached.

* The reboots occurred as follows:

/var/log/messages:May 14 15:39:31 maat syslogd 1.3-3: restart.
/var/log/messages:May 19 23:19:16 maat syslogd 1.3-3: restart.

but the crashes would have occurred some time previous to this (the machine is currently down).

* The last log message before the first crash was:

/var/log/messages:May 13 19:42:33 maat kernel: unexpected IRQ vector 209 on CPU#0!

These messages have been appearing occasionally, usually as follows:

/var/log/messages:May 18 13:37:33 maat kernel: unexpected IRQ vector 209 on CPU#0!
/var/log/messages:May 18 13:37:33 maat kernel: stuck on TLB IPI wait (CPU#1)

The message before the crash had no 'stuck on...' message after it, but it's possible that the second message didn't get written out due to the SCSI lockup. Sometimes the message has CPU#1 and CPU#0 swapped, but always IRQ 209. The message was not in the logfile after the second crash.

* `Update to 7.1' is an acceptable resolution for this bug, if there's reason to believe it's fixed there.
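For anyone who wants to check their own logs for the same pattern, here is a rough sketch of how the 'unexpected IRQ vector' messages could be paired with a following 'stuck on TLB IPI wait' message. The function name `find_irq_events` and the regular expressions are my own assumptions, written against the message formats quoted above. The sample list only reuses lines quoted in this report, arranged in timestamp order; the real logfile has other entries between them.

```python
import re

# Patterns written against the messages quoted in this report (assumed format).
IRQ_RE = re.compile(r"kernel: unexpected IRQ vector (\d+) on CPU#(\d+)!")
TLB_RE = re.compile(r"kernel: stuck on TLB IPI wait \(CPU#(\d+)\)")


def find_irq_events(lines):
    """Return one dict per 'unexpected IRQ vector' line.

    'tlb_stuck_cpu' is the CPU number from a 'stuck on TLB IPI wait'
    message on the immediately following line, or None if there is none.
    """
    events = []
    for i, line in enumerate(lines):
        irq = IRQ_RE.search(line)
        if irq is None:
            continue
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        tlb = TLB_RE.search(next_line)
        events.append({
            "vector": int(irq.group(1)),
            "cpu": int(irq.group(2)),
            "tlb_stuck_cpu": int(tlb.group(1)) if tlb else None,
        })
    return events


# Lines quoted in the report; their adjacency here is illustrative only.
SAMPLE = [
    "May 13 19:42:33 maat kernel: unexpected IRQ vector 209 on CPU#0!",
    "May 14 15:39:31 maat syslogd 1.3-3: restart.",
    "May 18 13:37:33 maat kernel: unexpected IRQ vector 209 on CPU#0!",
    "May 18 13:37:33 maat kernel: stuck on TLB IPI wait (CPU#1)",
]

if __name__ == "__main__":
    for event in find_irq_events(SAMPLE):
        print(event)
```

To run it against a real log, pass the lines of /var/log/messages instead of `SAMPLE`. An event with `tlb_stuck_cpu` set to None corresponds to the case described above, where no 'stuck on...' message followed.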
Could you try booting with the "noapic" parameter? This looks like evil IRQ routing fuckups.
Created attachment 19554 [details] Kernel output on boot.
The 'noapic' parameter seems to suppress the problem, at least so far. I've attached the log output the kernel used to produce on boot, which may be helpful in working out what the APIC configuration is.
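A quick way to confirm that the parameter actually reached the kernel is to look at the boot command line. The sketch below is only an illustration: `has_boot_param` is a hypothetical helper, and it assumes the running kernel exposes its command line in /proc/cmdline.

```python
def has_boot_param(cmdline, name):
    """Return True if `name` appears as a whole token on the kernel
    command line, either bare ('noapic') or as 'name=value'."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


if __name__ == "__main__":
    # Assumption: /proc/cmdline exists on this system; fall back to an
    # empty command line if it cannot be read.
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        cmdline = ""
    print("noapic on command line:", has_boot_param(cmdline, "noapic"))
```

Matching whole tokens rather than substrings avoids a false positive from an unrelated parameter that merely contains the text "noapic".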
The machine has been up for two weeks now, so disabling the APIC seems to be an effective workaround. There have been no side-effects, and this kernel is no longer the main focus of development, so I'm downgrading the importance and severity of this bug. Thank you!
I'll close this bug as "worksforsome" as there is a workaround; it's not something easily fixable in a stable 2.2 series (heck, even in 2.4 or 2.5 it will be hard to fix).