Bug 41454

Summary: SCSI lockup in SMP kernel under heavy load
Product: [Retired] Red Hat Linux Reporter: Geoff Keating <geoffk>
Component: kernel Assignee: Arjan van de Ven <arjanv>
Status: CLOSED WORKSFORME QA Contact: Brock Organ <borgan>
Severity: medium Docs Contact:
Priority: low    
Version: 7.0   
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Last Closed: 2001-06-12 16:24:55 UTC Type: ---
Attachments:
Description: Kernel output on boot.  Flags: none

Description Geoff Keating 2001-05-20 20:23:57 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.2.19-6.2.1 i586)

Description of problem:
Under heavy load, this Dell 1U 2-processor server stops responding.  It is
still pingable, and the console displays the following (repeating)
messages:
SCSI bus host 0 abort (pid 4732551) timed out - resetting
SCSI bus is being reset for host 0 channel 0
SCSI bus host 0 abort (pid 4732552) timed out - resetting
SCSI bus is being reset for host 0 channel 0
SCSI host 0 channel 0 reset (pid 4732551) timed out - trying harder
SCSI bus is being reset for host 0 channel 0
(the above 6 messages repeat indefinitely)


How reproducible:
Always

Steps to Reproduce:
This has happened to me three times so far.  It's probably difficult to
reproduce it outside this machine's setup.  It seems to be dependent on
some property of the load.

Additional info:

* The kernel installed is the 2.2.19-7.0.1smp kernel.
* The machine is maat.cygnus.com.
* The machine has one 9G SCSI disk attached.
* The reboots occurred as follows:
/var/log/messages:May 14 15:39:31 maat syslogd 1.3-3: restart.
/var/log/messages:May 19 23:19:16 maat syslogd 1.3-3: restart.
but the crashes would have occurred some time previous to this (the machine
is currently down).
* The last log message before the first crash was:
/var/log/messages:May 13 19:42:33 maat kernel: unexpected IRQ vector 209 on CPU#0!
These messages have been appearing occasionally, usually as follows:
/var/log/messages:May 18 13:37:33 maat kernel: unexpected IRQ vector 209 on CPU#0!
/var/log/messages:May 18 13:37:33 maat kernel: stuck on TLB IPI wait (CPU#1)
The message before the crash had no 'stuck on...' message after it, but
it's possible that the second message didn't get written out due to the
SCSI lockup. Sometimes the message has CPU#1 and CPU#0 swapped, but always
IRQ 209.  The message was not in the logfile after the second crash.
* `Update to 7.1' is an acceptable resolution for this bug, if there's
reason to believe it's fixed there.
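
For anyone who wants to check their own logs for the same pattern, here is a
minimal Python sketch. It assumes the syslog lines carry the two kernel
messages quoted above, each on a single unwrapped line. The function name
scan_irq_events and the sample data are illustrative only and are not part of
any existing tool.

```python
import re

# Patterns for the two kernel messages quoted in this report.
IRQ_RE = re.compile(r"unexpected IRQ vector (\d+) on\s+CPU#(\d+)!")
STUCK_RE = re.compile(r"stuck on TLB IPI wait\s+\(CPU#(\d+)\)")

# Sample lines copied from the report (log wrapping undone).
SAMPLE = [
    "May 13 19:42:33 maat kernel: unexpected IRQ vector 209 on CPU#0!",
    "May 18 13:37:33 maat kernel: unexpected IRQ vector 209 on CPU#0!",
    "May 18 13:37:33 maat kernel: stuck on TLB IPI wait (CPU#1)",
]


def scan_irq_events(lines):
    """Collect 'unexpected IRQ vector' events from an iterable of log lines.

    Each event records the vector, the reporting CPU, and the CPU named in a
    'stuck on TLB IPI wait' message if one appears on the very next line
    (None otherwise).
    """
    events = []
    pending = None  # the event seen on the previous line, if any
    for line in lines:
        irq = IRQ_RE.search(line)
        if irq:
            pending = {
                "line": line.rstrip("\n"),
                "vector": int(irq.group(1)),
                "cpu": int(irq.group(2)),
                "stuck_cpu": None,
            }
            events.append(pending)
            continue
        stuck = STUCK_RE.search(line)
        if stuck and pending is not None:
            pending["stuck_cpu"] = int(stuck.group(1))
        pending = None  # only the immediately following line is paired
    return events


if __name__ == "__main__":
    for event in scan_irq_events(SAMPLE):
        print(event["vector"], event["cpu"], event["stuck_cpu"])
```

To run it against a real log, pass an open file object (for example the
result of open("/var/log/messages")) instead of SAMPLE. An event with no
'stuck on...' follow-up matches the case described above, where the second
message may never have reached the disk.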

Comment 1 Arjan van de Ven 2001-05-20 20:28:19 UTC
Could you try booting with the "noapic" parameter?
This looks like evil IRQ routing fuckups.

Comment 2 Geoff Keating 2001-05-24 21:31:06 UTC
Created attachment 19554 [details]
Kernel output on boot.

Comment 3 Geoff Keating 2001-05-24 21:32:48 UTC
The 'noapic' parameter seems to suppress the problem, at least so far.

I've attached the log output the kernel used to produce on boot, which may be
helpful in working out what the APIC configuration is.
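
One way to confirm that the workaround is actually in effect after a reboot
is to look for the flag in the kernel command line. The sketch below is a
hedged illustration, not something used in this report: it assumes the
running kernel exposes its boot parameters in /proc/cmdline, and the helper
names has_boot_param and read_cmdline, as well as the sample command line,
are hypothetical.

```python
def has_boot_param(cmdline, name):
    """Return True if `name` appears in `cmdline` as a bare flag or name=value."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def read_cmdline(path="/proc/cmdline"):
    # Assumption: the kernel publishes its boot parameters at this path.
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    # Hypothetical command line, used so the example runs anywhere.
    sample = "auto BOOT_IMAGE=linux ro root=/dev/sda1 noapic"
    print(has_boot_param(sample, "noapic"))
```

On the affected machine, has_boot_param(read_cmdline(), "noapic") would be
the call to make. Matching on whole tokens avoids a false positive from a
longer parameter that merely begins with the same letters.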

Comment 4 Geoff Keating 2001-06-12 16:24:50 UTC
The machine has been up for two weeks now, so disabling the APIC seems to be an
effective workaround, and there have been no side effects.  Since this kernel
is no longer the main focus of development, I'm downgrading the importance and
severity of this bug.  Thank you!

Comment 5 Arjan van de Ven 2001-06-12 16:30:56 UTC
I'll close this bug as "worksforsome" as there is a workaround.
It's not something easily fixable in a stable 2.2 series (heck, even in
2.4 or 2.5 it will be hard to fix).