Bug 41454 - SCSI lockup in SMP kernel under heavy load
Summary: SCSI lockup in SMP kernel under heavy load
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel   
Version: 7.0
Hardware: i686 Linux
Target Milestone: ---
Assignee: Arjan van de Ven
QA Contact: Brock Organ
Depends On:
Reported: 2001-05-20 20:23 UTC by Geoff Keating
Modified: 2007-03-27 03:44 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Last Closed: 2001-06-12 16:24:55 UTC

Attachments
Kernel output on boot. (17.58 KB, text/plain)
2001-05-24 21:31 UTC, Geoff Keating

Description Geoff Keating 2001-05-20 20:23:57 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.2.19-6.2.1 i586)

Description of problem:
Under heavy load, this Dell 1U 2-processor server stops responding.  It is
still pingable, and the console repeatedly displays the following messages:
SCSI bus host 0 abort (pid 4732551) timed out - resetting
SCSI bus is being reset for host 0 channel 0
SCSI bus host 0 abort (pid 4732552) timed out - resetting
SCSI bus is being reset for host 0 channel 0
SCSI host 0 channel 0 reset (pid 4732551) timed out - trying harder
SCSI bus is being reset for host 0 channel 0
(the above 6 messages repeat indefinitely)

How reproducible:

Steps to Reproduce:
This has happened to me three times so far.  It's probably difficult to
reproduce it outside this machine's setup.  It seems to be dependent on
some property of the load.

Additional info:

* The kernel installed is the 2.2.19-7.0.1smp kernel.
* The machine is maat.cygnus.com.
* The machine has one 9G SCSI disk attached.
* The reboots occurred as follows:
/var/log/messages:May 14 15:39:31 maat syslogd 1.3-3: restart.
/var/log/messages:May 19 23:19:16 maat syslogd 1.3-3: restart.
but the crashes would have occurred some time before these restarts (the
machine is currently down).
* The last log message before the first crash was:
/var/log/messages:May 13 19:42:33 maat kernel: unexpected IRQ vector 209 on
These messages have been appearing occasionally, usually as follows:
/var/log/messages:May 18 13:37:33 maat kernel: unexpected IRQ vector 209 on
/var/log/messages:May 18 13:37:33 maat kernel: stuck on TLB IPI wait
The message before the crash had no 'stuck on...' message after it, but
it's possible that the second message didn't get written out due to the
SCSI lockup. Sometimes the message has CPU#1 and CPU#0 swapped, but the
vector is always 209.  The message was not in the logfile after the second
crash.
* `Update to 7.1' is an acceptable resolution for this bug, if there's
reason to believe it's fixed there.
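
A minimal sketch for tallying the two kernel messages quoted above in a syslog file. The function name `summarize` and the regular expressions are illustrative assumptions based only on the message text shown in this report; they are not part of any kernel or Red Hat tool.

```python
import re

# Patterns derived from the messages quoted in this report (assumed wording).
IRQ_RE = re.compile(r"unexpected IRQ vector (\d+)")
TLB_RE = re.compile(r"stuck on TLB IPI wait")


def summarize(lines):
    """Return (count of warnings per IRQ vector, number of TLB IPI stalls)."""
    vectors = {}
    tlb_stalls = 0
    for line in lines:
        match = IRQ_RE.search(line)
        if match:
            vec = int(match.group(1))
            vectors[vec] = vectors.get(vec, 0) + 1
        elif TLB_RE.search(line):
            tlb_stalls += 1
    return vectors, tlb_stalls
```

It could be run against the log with something like `summarize(open("/var/log/messages", errors="replace"))`, which would show whether any vector other than 209 ever appears.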

Comment 1 Arjan van de Ven 2001-05-20 20:28:19 UTC
Could you try booting with the "noapic" parameter?
This looks like evil IRQ routing fuckups.

Comment 2 Geoff Keating 2001-05-24 21:31:06 UTC
Created attachment 19554
Kernel output on boot.

Comment 3 Geoff Keating 2001-05-24 21:32:48 UTC
The 'noapic' parameter seems to suppress the problem, at least so far.

I've attached the log output the kernel used to produce on boot, which may be
helpful in working out what the APIC configuration is.
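
A small sketch for confirming that the running kernel was actually booted with the workaround parameter. It assumes the boot command line is readable from `/proc/cmdline`; the helper name `has_boot_param` is hypothetical.

```python
def has_boot_param(cmdline, name):
    """Return True if `name` appears as a whole word on the kernel command line."""
    # Split on whitespace so that e.g. "noapictest" does not match "noapic".
    return name in cmdline.split()
```

For example, `has_boot_param(open("/proc/cmdline").read(), "noapic")` should be true on a machine booted with the parameter.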

Comment 4 Geoff Keating 2001-06-12 16:24:50 UTC
The machine has been up for two weeks now, so disabling the APIC seems to be an
effective workaround, and there have been no side effects.  This kernel is also
no longer the main focus of development, so I'm downgrading the importance and
severity of this bug.  Thank you!

Comment 5 Arjan van de Ven 2001-06-12 16:30:56 UTC
I'll close this bug as "worksforsome" since there is a workaround.
It's not something easily fixable in the stable 2.2 series (heck, even in
2.4 or 2.5 it will be hard to fix).
