41454 – SCSI lockup in SMP kernel under heavy load

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.0
Hardware:	i686
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-05-20 20:23 UTC by Geoff Keating
Modified:	2007-03-27 03:44 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2001-06-12 16:24:55 UTC
Embargoed:

Attachments
Kernel output on boot. (17.58 KB, text/plain) 2001-05-24 21:31 UTC, Geoff Keating

Description Geoff Keating 2001-05-20 20:23:57 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.2.19-6.2.1 i586)

Description of problem:
Under heavy load, this Dell 1U 2-processor server stops responding.  It is
still pingable, and the console displays the following (repeating)
messages:
SCSI bus host 0 abort (pid 4732551) timed out - resetting
SCSI bus is being reset for host 0 channel 0
SCSI bus host 0 abort (pid 4732552) timed out - resetting
SCSI bus is being reset for host 0 channel 0
SCSI host 0 channel 0 reset (pid 4732551) timed out - trying harder
SCSI bus is being reset for host 0 channel 0
(the above 6 messages repeat indefinitely)


How reproducible:
Always

Steps to Reproduce:
This has happened to me three times so far.  It's probably difficult to
reproduce it outside this machine's setup.  It seems to be dependent on
some property of the load.

Additional info:

* The kernel installed is the 2.2.19-7.0.1smp kernel.
* The machine is maat.cygnus.com.
* The machine has one 9G SCSI disk attached.
* The reboots occurred as follows:
/var/log/messages:May 14 15:39:31 maat syslogd 1.3-3: restart.
/var/log/messages:May 19 23:19:16 maat syslogd 1.3-3: restart.
but the crashes would have occurred some time previous to this (the machine
is currently down).
* The last log message before the first crash was:
/var/log/messages:May 13 19:42:33 maat kernel: unexpected IRQ vector 209 on CPU#0!
These messages have been appearing occasionally, usually as follows:
/var/log/messages:May 18 13:37:33 maat kernel: unexpected IRQ vector 209 on CPU#0!
/var/log/messages:May 18 13:37:33 maat kernel: stuck on TLB IPI wait (CPU#1)
The message before the crash had no 'stuck on...' message after it, but
it's possible that the second message didn't get written out due to the
SCSI lockup. Sometimes the message has CPU#1 and CPU#0 swapped, but the
vector is always 209.  The message was not in the logfile after the second crash.
* `Update to 7.1' is an acceptable resolution for this bug, if there's
reason to believe it's fixed there.
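
For anyone wanting to check their own logs for the same pattern, here is a minimal Python sketch that counts the two kernel messages quoted above. The regular expressions are derived only from the lines in this report; the function name summarize and the sample list are made up for illustration, and other kernels may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the two log lines quoted in this report.
IRQ_RE = re.compile(r"unexpected IRQ vector (\d+) on CPU#(\d+)!")
TLB_RE = re.compile(r"stuck on TLB IPI wait \(CPU#(\d+)\)")


def summarize(lines):
    """Count unexpected-IRQ events per (vector, cpu) and TLB IPI stalls per cpu."""
    irq = Counter()
    tlb = Counter()
    for line in lines:
        m = IRQ_RE.search(line)
        if m:
            irq[(int(m.group(1)), int(m.group(2)))] += 1
            continue
        m = TLB_RE.search(line)
        if m:
            tlb[int(m.group(1))] += 1
    return irq, tlb


# Sample input: the two lines from May 18 quoted in this report.
sample = [
    "May 18 13:37:33 maat kernel: unexpected IRQ vector 209 on CPU#0!",
    "May 18 13:37:33 maat kernel: stuck on TLB IPI wait (CPU#1)",
]
irq, tlb = summarize(sample)
print("unexpected IRQ (vector, cpu) counts:", dict(irq))
print("TLB IPI stalls per cpu:", dict(tlb))
```

To run it against a real log, pass an open file object (for example, open("/var/log/messages")) to summarize instead of the sample list.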

Comment 1 Arjan van de Ven 2001-05-20 20:28:19 UTC

Could you try booting with the "noapic" parameter?
This looks like evil IRQ routing fuckups.

Comment 2 Geoff Keating 2001-05-24 21:31:06 UTC

Created attachment 19554 [details]
Kernel output on boot.

Comment 3 Geoff Keating 2001-05-24 21:32:48 UTC

The 'noapic' parameter seems to suppress the problem, at least so far.

I've attached the log output the kernel used to produce on boot, which may be
helpful in working out what the APIC configuration is.
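
To confirm that a running kernel was actually booted with the workaround, one option is to look for the parameter in /proc/cmdline. The Python sketch below assumes /proc is mounted; the helper names has_boot_param and running_with_noapic are hypothetical, and the command lines in the example are invented, not taken from this machine.

```python
def has_boot_param(cmdline, name):
    """Return True if name appears as a bare flag or as name=value."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def running_with_noapic(path="/proc/cmdline"):
    """Check the running kernel's command line (assumes /proc is mounted)."""
    with open(path) as f:
        return has_boot_param(f.read(), "noapic")


# Invented example command lines, for illustration only.
print(has_boot_param("auto BOOT_IMAGE=linux ro root=801 noapic", "noapic"))
print(has_boot_param("auto BOOT_IMAGE=linux ro root=801", "noapic"))
```

Matching whole tokens rather than substrings avoids a false positive on an unrelated parameter that merely contains the text "noapic".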

Comment 4 Geoff Keating 2001-06-12 16:24:50 UTC

The machine has been up for two weeks now, so disabling the APIC seems to be an
effective workaround, and there have been no side effects.  Since this kernel is
no longer the main focus of development, I'm downgrading the importance and
severity of this bug.  Thank you!

Comment 5 Arjan van de Ven 2001-06-12 16:30:56 UTC

I'll close this bug as "worksforsome" as there is a workaround.
It's not something easily fixable in a stable 2.2 series (heck, even in
2.4 or 2.5 it will be hard to fix).
