Bug 208033 - mpt scsi driver malfunction
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 6
Hardware: x86_64 Linux
Priority: medium
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Brian Brock

Reported: 2006-09-25 18:56 EDT by Garrett Mitchener
Modified: 2008-01-01 22:44 EST (History)
Last Closed: 2008-01-01 22:44:59 EST

Attachments
More mptscsih messages (4.25 KB, text/plain), 2006-09-26 15:52 EDT, Garrett Mitchener
part of /var/log/messages (39.89 KB, text/plain), 2006-10-20 18:03 EDT, Garrett Mitchener

Description Garrett Mitchener 2006-09-25 18:56:28 EDT
I posted this as a note under bug # 200787, which is also about problems with the
mpt scsi driver, but I think it's best for me to file a new bug under FC6 test3.

Using kernels kernel-2.6.17-1.2647.fc6 and kernel-2.6.18-1.2689.fc6, my Dell
Precision 690 works pretty well for a while, but then I start getting these
error messages in /var/log/messages:

Sep 25 10:34:08 grograman kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff81006d0d6a30)
Sep 25 10:34:08 grograman kernel: sd 0:0:1:0: 
Sep 25 10:34:08 grograman kernel:         command: Read(10): 28 00 0e b2 1c 05
00 00 08 00
Sep 25 10:34:08 grograman kernel: mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 25 10:34:08 grograman kernel: mptscsih: ioc0: task abort: SUCCESS
(sc=ffff81006d0d6a30)
Sep 25 10:34:08 grograman kernel: INFO: trying to register non-static key.
Sep 25 10:34:08 grograman kernel: the code is fine but needs lockdep annotation.
Sep 25 10:34:08 grograman kernel: turning off the locking correctness validator.
Sep 25 10:34:08 grograman kernel: 
Sep 25 10:34:08 grograman kernel: Call Trace:
Sep 25 10:34:08 grograman kernel:  [<ffffffff8026ebbd>] show_trace+0xae/0x336
Sep 25 10:34:08 grograman kernel:  [<ffffffff8026ee5a>] dump_stack+0x15/0x17
Sep 25 10:34:08 grograman kernel:  [<ffffffff802a8871>] __lock_acquire+0x135/0xa64
Sep 25 10:34:09 grograman kernel:  [<ffffffff802a9743>] lock_acquire+0x4b/0x69
Sep 25 10:34:09 grograman kernel:  [<ffffffff80267dff>] _spin_lock_irq+0x2b/0x38
Sep 25 10:34:09 grograman kernel:  [<ffffffff80265873>]
wait_for_completion_timeout+0x35/0xd7
Sep 25 10:34:09 grograman kernel:  [<ffffffff8807d57d>]
:scsi_mod:scsi_send_eh_cmnd+0x269/0x405
Sep 25 10:34:09 grograman kernel:  [<ffffffff8807d784>]
:scsi_mod:scsi_eh_tur+0x32/0x86
Sep 25 10:34:09 grograman kernel:  [<ffffffff8807e01b>]
:scsi_mod:scsi_error_handler+0x3f5/0xa81
Sep 25 10:34:09 grograman kernel:  [<ffffffff802354ad>] kthread+0x100/0x136
Sep 25 10:34:09 grograman kernel:  [<ffffffff802617a0>] child_rip+0xa/0x12
Sep 25 10:34:09 grograman kernel: DWARF2 unwinder stuck at child_rip+0xa/0x12
Sep 25 10:34:09 grograman kernel: Leftover inexact backtrace:
Sep 25 10:34:09 grograman kernel:  [<ffffffff80267e72>] _spin_unlock_irq+0x2b/0x31
Sep 25 10:34:09 grograman kernel:  [<ffffffff80260ddc>] restore_args+0x0/0x30
Sep 25 10:34:09 grograman kernel:  [<ffffffff8024fca4>] run_workqueue+0x19/0xfa
Sep 25 10:34:09 grograman kernel:  [<ffffffff802353ad>] kthread+0x0/0x136
Sep 25 10:34:09 grograman kernel:  [<ffffffff80261796>] child_rip+0x0/0x12
Sep 25 10:34:09 grograman kernel: 

and many like this:

Sep 25 13:12:58 grograman kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff81013e7caa30)
Sep 25 13:12:58 grograman kernel: sd 0:0:1:0: 
Sep 25 13:12:58 grograman kernel:         command: Write(10): 2a 00 1b 5b 9b a5
00 00 28 00
Sep 25 13:12:58 grograman kernel: mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 25 13:12:58 grograman kernel: mptscsih: ioc0: task abort: SUCCESS
(sc=ffff81013e7caa30)
Sep 25 13:22:33 grograman kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff81010578cd60)
Sep 25 13:22:33 grograman kernel: sd 0:0:1:0: 
Sep 25 13:22:33 grograman kernel:         command: Write(10): 2a 00 0f 5c 82 f5
00 00 08 00
Sep 25 13:22:33 grograman kernel: mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 25 13:22:33 grograman kernel: mptscsih: ioc0: task abort: SUCCESS
(sc=ffff81010578cd60)
Sep 25 13:30:50 grograman kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff8101196ea958)
Sep 25 13:30:50 grograman kernel: sd 0:0:1:0: 
Sep 25 13:30:50 grograman kernel:         command: Write(10): 2a 00 01 5c 77 dd
00 00 08 00
Sep 25 13:30:50 grograman kernel: mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 25 13:30:50 grograman kernel: mptscsih: ioc0: task abort: SUCCESS
(sc=ffff8101196ea958)
Sep 25 13:32:33 grograman kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff81013e7ca3d0)

The second round of messages above is from 2.6.18-1.2689.fc6.  I think the first
is from kernel-2.6.18-1.2647.fc6, but I'm not 100% sure.
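
For anyone reading the "command:" lines above: the ten hex bytes are the raw SCSI command block. For Read(10) and Write(10), bytes 2-5 hold the starting logical block address and bytes 7-8 the transfer length in blocks, both big-endian. A minimal shell sketch of that decoding follows; the function name decode_cdb10 is made up for illustration and is not part of the driver or any existing tool.

```shell
# Decode a 10-byte READ(10)/WRITE(10) command block.
# Arguments: the ten hex bytes exactly as the driver prints them.
# decode_cdb10 is a hypothetical helper name used only for this sketch.
decode_cdb10() {
    if [ "$#" -ne 10 ]; then
        echo "decode_cdb10: expected 10 hex bytes, got $#" >&2
        return 1
    fi
    # Byte 0 is the opcode.
    case "$1" in
        28)    op="READ(10)" ;;
        2a|2A) op="WRITE(10)" ;;
        *)     op="opcode 0x$1" ;;
    esac
    # Bytes 2-5: logical block address, most significant byte first.
    lba=$(( (0x$3 << 24) | (0x$4 << 16) | (0x$5 << 8) | 0x$6 ))
    # Bytes 7-8: transfer length in blocks, most significant byte first.
    blocks=$(( (0x$8 << 8) | 0x$9 ))
    echo "$op lba=$lba blocks=$blocks"
}

# First aborted command from the log above.
decode_cdb10 28 00 0e b2 1c 05 00 00 08 00
```

Every aborted command in these logs is a small transfer (8 to 40 blocks) to sd 0:0:1:0, so the aborts do not appear to depend on large requests.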


Eventually the system gets to the point where any access to the file system
makes a process freeze and I have to power off.  The time it takes to get to
this point varies.

I'm trying kernel-2.6.17-1.2630.fc6, which I think might not generate this
behavior, but I'll have to let it run for a while to find out for sure.

Just so you know, from lspci:

05:0b.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X
Fusion-MPT SAS (rev 01)
Comment 1 Garrett Mitchener 2006-09-26 15:52:27 EDT
Created attachment 137165 [details]
More mptscsih messages

Linux version 2.6.17-1.2630.fc6 (brewbuilder@hs20-bc1-7.build.redhat.com) (gcc
version 4.1.1 20060828 (Red Hat 4.1.1-20)) #1 SMP Wed Sep 6 16:08:02 EDT 2006
Comment 2 Garrett Mitchener 2006-09-26 15:56:34 EDT
I just got the dreaded mptscsih message running kernel 2.6.17-1.2630; see the new
attachment.  Generally this kernel is a lot more stable than the two 2.6.18
updates I mentioned, but it also has this scsi problem.  I don't know if it's
relevant, but I put in a USB key a minute or two before I got this error.  The
first program to freeze was my e-mail program (thunderbird), which I think might
have been trying to access a gpg key that I had just copied from the USB key to my
home directory.
Comment 3 Garrett Mitchener 2006-09-27 13:53:51 EDT
I'm also getting this error using kernel 2.6.18-1.2693.fc6 -- see below.  The
machine was essentially idle when these messages were generated.  I wasn't doing
anything with my USB key at the time.  Even though I'm getting these same
errors, the system seems more stable than the other 2.6.18 kernels.  I haven't
witnessed processes freezing during the scsi error like happened with the
earlier 2.6.18 kernels, but I've only had a day to watch this one.  It also
hasn't generated the backtrace error I was getting, just these messages about
aborting a task.

Sep 27 13:25:52 grograman kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff810113542740)
Sep 27 13:25:52 grograman kernel: sd 0:0:1:0: 
Sep 27 13:25:52 grograman kernel:         command: Write(10): 2a 00 00 05 00 ed
00 00 10 00
Sep 27 13:25:52 grograman kernel: mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 27 13:25:52 grograman kernel: mptscsih: ioc0: task abort: SUCCESS
(sc=ffff810113542740)
Sep 27 13:29:38 grograman kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff81013e5e23d0)
Sep 27 13:29:38 grograman kernel: sd 0:0:1:0: 
Sep 27 13:29:38 grograman kernel:         command: Write(10): 2a 00 1b 3a 31 7d 
00 00 08 00
Sep 27 13:29:39 grograman kernel: mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 27 13:29:39 grograman kernel: mptscsih: ioc0: task abort: SUCCESS
(sc=ffff81013e5e23d0)
Sep 27 13:33:17 grograman kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff81013e5e2700)
Sep 27 13:33:17 grograman kernel: sd 0:0:1:0: 
Sep 27 13:33:17 grograman kernel:         command: Write(10): 2a 00 00 05 0d 45
00 00 10 00
Sep 27 13:33:17 grograman kernel: mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 27 13:33:17 grograman kernel: mptscsih: ioc0: task abort: SUCCESS
(sc=ffff81013e5e2700)
Sep 27 13:37:17 grograman kernel: mptscsih: ioc0: attempting task abort!
(sc=ffff81010ee6c450)
Sep 27 13:37:17 grograman kernel: sd 0:0:1:0: 
Sep 27 13:37:17 grograman kernel:         command: Write(10): 2a 00 00 05 15 25
00 00 10 00
Sep 27 13:37:17 grograman kernel: mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Sep 27 13:37:17 grograman kernel: mptscsih: ioc0: task abort: SUCCESS
(sc=ffff81010ee6c450)
Comment 4 Konrad Rzeszutek 2006-09-27 14:27:40 EDT
Garrett,

Did you try per comment #9 in BZ 200787:

"Garlett,
Could you decrease the io speed of your disks and check whether your issues
still exist? (Normally it can be done before the boot time in some system scsi
bios settings etc.)" ?


Comment 5 Garrett Mitchener 2006-09-28 10:18:39 EDT
I read over the discussion at bug # 200787, and I looked all over the setup
functions on my workstation and I haven't found anything that looks like a way
to change scsi transfer speeds.  I never saw anything that looked like 80 MB/s
or 160 MB/s.  Is there a kernel option or something that would change the speed?
Comment 6 Garrett Mitchener 2006-10-06 13:12:07 EDT
I don't know if this is relevant but I stumbled onto a post by someone who had
some way of reliably creating these error messages:

http://lkml.org/lkml/2006/2/8/132
Comment 7 Garrett Mitchener 2006-10-06 17:02:25 EDT
The problem has been especially bad today.  I start getting these messages and
my workstation becomes unstable within minutes of start up.  I tried booting
into Windows, but it wouldn't start, so I tried running the hardware
diagnostics, and I'm getting errors from the confidence test of both my scsi
hard drives.  They look like this:

Error code 4400:012C
Msg: Block 243456: No targets installed

and I get a whole bunch of different blocks that give this error.  This wasn't
happening when I ran hardware diagnostics before.  Does this mean I have a
hardware fault or that linux has left the scsi devices in some bad state?
Comment 8 Garrett Mitchener 2006-10-06 17:20:22 EDT
And when I power down the machine and unplug it and restart it, the confidence
test runs with no errors on either drive.  Windows starts, etc.

This particular work station seems to "remember" something if I don't unplug it.
 When I plug it in, the fans run for a moment and the hard drive lights blink,
then it goes quiet until I turn the power on.
Comment 9 Garrett Mitchener 2006-10-11 12:21:08 EDT
In a fit of desperation I tried using kernel 2.6.15 from Fedora 5.  It also
generates these errors, and it has mpt driver version 3.03.07, which supposedly
doesn't have this problem (see bug # 200787).

Today my machine lasts a matter of minutes before becoming unusable.
Comment 10 Garrett Mitchener 2006-10-17 10:12:46 EDT
I am cautiously optimistic that I have found a work-around.  After reading some
documentation for the MPT driver, I added this to the kernel command line in
/boot/grub/grub.conf:

mptscsih="width:0 factor:0x0A"

With these kernel options, my workstation has run for a couple of days without
getting the "task abort" error.  I haven't experimented with them much, so I
don't know exactly which of these two options (or maybe the combo?) is the fix.
Comment 11 Garrett Mitchener 2006-10-17 10:15:42 EDT
Oh, and here's a link to a page with the driver documentation:

http://www.lsil.com/storage_home/products_home/standard_product_ics/sas_ics/lsisas1068/index.html?remote=1
Comment 12 Konrad Rzeszutek 2006-10-17 11:32:21 EDT
Garrett,

Thanks for finding that. Let me see if I can find a Qlogic contact and make
him/her aware of this BZ. 
Comment 15 Dmitry Butskoy 2006-10-18 09:17:17 EDT
Konrad,

You can find the upstream guys in the kernel's changelog (all about MPT etc. :) )
I have already corresponded with them a little.

Or mail me, I'll send you their e-mails...
Comment 16 Garrett Mitchener 2006-10-20 18:03:35 EDT
Created attachment 139035 [details]
part of /var/log/messages

Unfortunately, I just discovered that I'm still getting these task abort errors
even with the kernel command line option I posted.  Someone posted a similar
experience under bug # 200787.  I've attached part of a log.  However, the
errors seem to be much rarer with those kernel options than without, and with
them I haven't ever reached a point where my workstation became unusable.

I still haven't found a way to set the scsi speed in the bios on this
workstation.
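
To put a number on "much rarer", the abort attempts in a syslog file can be counted per day. The sketch below assumes the usual one-message-per-line syslog format (the excerpts in this report are wrapped by Bugzilla) and matches both the "mptscsih:" and "mptscsi:" prefixes that appear in this report. The name aborts_per_day is a hypothetical helper, not an existing tool.

```shell
# Count "attempting task abort!" messages per calendar day.
# Reads syslog-format files given as arguments, or stdin if none are given.
# aborts_per_day is a hypothetical helper name used only for this sketch.
aborts_per_day() {
    awk '
        # Fields 1 and 2 of a syslog line are the month and the day.
        /mptscsih?: ioc[0-9]+: attempting task abort!/ { n[$1 " " $2]++ }
        END { for (d in n) print d, n[d] }
    ' "$@" | sort
}

# Typical use (path as in this report):
#   aborts_per_day /var/log/messages
```

Comparing the counts for days before and after a change to the kernel command line gives a rough measure of whether it made a difference.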
Comment 17 Garrett Mitchener 2006-11-03 16:03:52 EST
I've been running linux for a couple of weeks now without seeing this problem
and I thought it was handled, but today it came back with a vengeance.  Windows
crashed and I'm getting errors from the dell diagnostics software.  I even
updated the bios to A02 and it didn't help.

Now, today (and the last time my machine had one of these mptscsih episodes) it
was unusually hot in my office.  Could all or some of this trouble be due to
something overheating?  I've called the help desk here at work, and they're
going to check it out, but I'm curious to know if other people see this problem
with any correlation to heat.
Comment 18 Dmitry Butskoy 2006-11-07 06:32:46 EST
> any correlation to heat.
Nope. 

In my case, the problem always occurs when I write a file more than ~50Mb ... 
Comment 19 Garrett Mitchener 2006-11-20 16:18:50 EST
The helpdesk staff here replaced my workstation's motherboard about a week ago,
and since then I haven't had any more of these task abort episodes.  I suspect
now that what I was seeing was some heat-related hardware problem with the old
motherboard, and the kernel option might have helped by slowing the hard drives
down (and maybe creating less heat? or making the scsi hardware less sensitive
to the heat?).  So it may not be a kernel problem after all, which would mean
that Dmitry's problem is something else entirely.  I'm not absolutely sure of
course -- I'll have to watch it for a few weeks and see.
Comment 20 Eric Moore 2006-12-05 23:53:01 EST
Garrett - You have a SAS controller. I mentioned in the other bugzilla
(200787) that this is not an SPI issue.  Passing mptscsih="width:0
factor:0x0A" only applies to the mptspi driver, not mptsas, which is the
driver that affects you.

It looks like the oops occurred as a result of timeouts. In the oops I see that
it failed at lock_acquire in the context of scsi_eh_tur().
There could be a problem in scsi_error.c; perhaps Tom Coughlan would review
that.

Regarding your timeouts, let's first get you upgraded to the latest firmware.

Can you let me know what firmware version you have?

# cat /proc/mpt/summary

Driver version

# cat /proc/mpt/version

Try to upgrade your firmware from our website:
http://www.lsil.com/cm/DownloadSearch.do
Host Bus Adapters -> SAS HBAs -> (look at barcode sticker to determine
which card you have)

If you have any problems with downloading, try sending
email to support@lsil.com
Comment 21 Eric Moore 2006-12-05 23:54:52 EST
Adding myself to cc list <eric.moore@lsi.com>
Comment 22 Garrett Mitchener 2006-12-13 10:59:30 EST
cat /proc/mpt/summary

gives

ioc0: LSISAS1068, FwRev=00063200h, Ports=1, MaxQ=511, IRQ=98

and

cat /proc/mpt/version

gives

mptlinux-3.04.01
  Fusion MPT base driver
  Fusion MPT SAS host driver

Now, I got a new motherboard for this workstation and since it was installed, I
haven't had any task abort problems.  I can try to upgrade the firmware if you think
that would be informative.
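
The FwRev value in /proc/mpt/summary is a packed hex word. Assuming each of its four bytes is one component of the dotted firmware version (an assumption, though 00063200h then reads as 00.06.50.00, the same form as the vendor firmware version strings), it can be converted with a few lines of shell. Both function names below are hypothetical.

```shell
# Pull the FwRev field out of a /proc/mpt/summary line.
# Reads files given as arguments, or stdin if none are given.
# fwrev_from_summary is a hypothetical helper name.
fwrev_from_summary() {
    sed -n 's/.*FwRev=\([0-9A-Fa-f]*h\).*/\1/p' "$@"
}

# Convert a packed FwRev such as 00063200h to dotted decimal.
# Assumption: one byte per version component, most significant first.
fwrev_to_dotted() {
    hex=${1%h}                      # drop the trailing "h"
    v=$((0x$hex))
    printf '%02d.%02d.%02d.%02d\n' \
        $(( (v >> 24) & 255 )) $(( (v >> 16) & 255 )) \
        $(( (v >> 8) & 255 ))  $(( v & 255 ))
}

# Value reported above.
fwrev_to_dotted 00063200h
```

On a live system the two can be chained: fwrev_to_dotted "$(fwrev_from_summary /proc/mpt/summary)".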
Comment 23 Michael J. Carter 2007-02-09 16:45:23 EST
I'm seeing similar symptoms on a Dell Precision 470:

Feb  5 12:02:26 leprechaun kernel: mptscsi: ioc0: attempting task abort!
(sc=edf506c0)
Feb  5 12:02:26 leprechaun kernel: scsi6 : destination target 0, lun 0
Feb  5 12:02:26 leprechaun kernel:         command = Write (10) 00 01 92 8a 49
00 00 18 00
Feb  5 12:02:26 leprechaun kernel: mptbase: ioc0: LogInfo(0x31140000):
Originator={PL}, Code={IO Executed}, SubCode(0x0000)
Feb  5 12:02:26 leprechaun kernel: mptscsi: ioc0: task abort: SUCCESS (sc=edf506c0)


[root@leprechaun ~]# lspci | grep LSI
04:08.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X
Fusion-MPT SAS (rev 01)

[root@leprechaun ~]# cat /proc/mpt/summary
ioc0: LSISAS1068, FwRev=00063200h, Ports=1, MaxQ=267, IRQ=169

[root@leprechaun ~]# cat /proc/mpt/version
mptlinux-3.02.62.01rh
  Fusion MPT base driver
  Fusion MPT FC host driver
  Fusion MPT SPI host driver
  Fusion MPT SAS host driver

[root@leprechaun ~]# cat /proc/scsi/scsi
Attached devices:
Host: scsi6 Channel: 00 Id: 00 Lun: 00
  Vendor: MAXTOR   Model: ATLAS10K5_073SAS Rev: BP00
  Type:   Direct-Access                    ANSI SCSI revision: 05

I'm going to pursue the firmware upgrade as suggested in comment #20.
Comment 24 Michael J. Carter 2007-02-09 16:49:14 EST
Sorry, that should be "Dell Precision 490"
Comment 25 Eric Moore 2007-02-09 17:11:50 EST
What did you do to get the timeouts?
Comment 26 Michael J. Carter 2007-02-12 11:25:57 EST
This is a customer's machine. He said the system will eventually lock up once he
"starts doing work" - he does imagery analysis so there's a fair amount of I/O.

I have the machine on my bench now, but can't reproduce the lockups w/ simple
bonnie runs.

Comment 27 Michael J. Carter 2007-02-12 11:53:36 EST
This is a Dell card, so I've just installed the latest firmware on it:

Dell SAS 5/iR Adapter, v.00.06.50.00.06.06.00.02, A02
Release Date:  7/28/2006
Description:  Firmware (v. 00.06.50.00) and BIOS (v. 06.06.00.02) for the Dell
SAS5/iR Adapter
Comment 28 Karel Tuma 2007-03-28 21:44:20 EDT
#27: did it help? Where is the A02 reported? Is this what the system BIOS
reports (F2 @ boot)?

I've just spent like 10 hours trying to figure out this bug (task abort, both
during read and writes) Dell Poweredge 680 w/ sas 5/iR - sas1068 pci-e, 2x 70gb
10k RPM hitachi drives, exact firmware as in #27 (shipped already by dell as
such). Here are some observations:

1) placing workload on 1st drive does not seem to trigger the bug at all, tried
the following:
dd if=/dev/sda2 of=/dev/null bs=65536 &
dd if=/dev/zero of=/dev/sda2 bs=65536 &

2) doing the same only on sdb does trigger the bug after a long while (3-5
mins), probably after flushing dirty data to sda, see 12) for speculation.

3) putting dd workload on both drives (four dd's, to be exact) triggers the bug
almost immediately.

4) after creating stripe from 2 drives ('IS' in LSI SAS 5 terminology), it's
hardly possible to even mkfs during operating system installation.

funny thing to try: bootable cd with dell's GUI stuff to partition disks, manage
raids etc (shipped w/ every poweredge) boots busybox linux of some sort. switch
to holographic shell (vty4?) and run the two dds on the background. triggers the
bug too. kernel 2.4.23, pretty old driver version of mptbase/sas.

5) tried fedora5, centos, rhel4, debian etch, with various kernels. all trigger
the timeout bug after bulk workload.

6) freebsd chokes completely, spitting messages of a similar sort in an endless loop
(during extracting base to /)

7) i'm desperate

8) smartctl -t long'ed both drives, reported they're fine.

9) there is an interrupt storm happening in /proc/interrupts on sc0 irq (watch
cat /proc/interrupts & trigger the bug), yet vmstat reports 0/0 in/out until the
damn thing wakes up (usually 15-30 secs to recover, depends on the io queue size?)

10) amateur's conclusion:
this is most likely NOT a software/driver related bug. more likely wiring
interference/bad cabling. no visual defects found on the card, nor the
cabling/drives. maybe wrong drives/adaptor/firmware/bus(pci-e?)/whatever combo?

11) dell insists everything is fine (it is, with single drive).

12) the bug seems to show itself immediately when there is both read/write bulk
workload on both drives at once (the thing is single-channel, right?)

13) i'm going to bug dell till they fix the damn thing.

14) also tried vanilla 2.6.20.4 (debian etch) and freebsd-current. no effect.

15) of course tried the usual suspects (acpi=off,noapic,nosmp etc). no effect.

XX) suggestions? ;-(

hope this helps a bit, i'll try to provide logs of whatever could be of interest
once i'll get serial cable and some courage to touch the damn box ever again.
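
The interrupt storm in 9) can be measured rather than eyeballed by sampling the controller's line in /proc/interrupts twice and subtracting. A small sketch follows; irq_total is a hypothetical helper name, and the IRQ number has to be taken from /proc/mpt/summary or /proc/interrupts on the machine in question.

```shell
# Sum the per-CPU counters for one IRQ from /proc/interrupts-style input.
# $1 = IRQ number; remaining arguments = files to read (stdin if none).
# irq_total is a hypothetical helper name used only for this sketch.
irq_total() {
    irq=$1
    shift
    awk -v irq="${irq}:" '
        $1 == irq {
            t = 0
            # Counter columns come first; stop at the first non-numeric field.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) t += $i
            printf "%.0f\n", t
        }
    ' "$@"
}

# Typical use, with 98 standing in for the controller's IRQ:
#   before=$(irq_total 98 /proc/interrupts)
#   sleep 5
#   after=$(irq_total 98 /proc/interrupts)
#   echo "interrupts per second: $(( (after - before) / 5 ))"
```

Logging this rate alongside vmstat output while the dd workload runs would show whether the interrupt count keeps climbing during the 15-30 second stalls.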
Comment 29 Michael Will 2007-06-12 12:33:12 EDT
I have the same issues with both RHEL4u4 and SLES10-SP1 with a SAS attached
external raid enclosure. The SAS adapter is an LSI SAS1068E.

It first works fine, then I get the aforementioned SCSI timeouts and errors and
the device goes offline. Unloading and reloading the driver fixes it until next
time. 

It's faster to trigger the bug when striping across several luns via LVM, but
even for a single LUN it will eventually show up when doing large block dd read
and write. 

 
The SUSE enterprise driver from SLES10-SP1 that we tried is version
3.04.02-suse.
The Red Hat enterprise driver from RHEL4u4 that we tried is version 3.02.62.01rh.
Comment 31 Sammy 2007-06-13 14:34:43 EDT
The latest bios/firmware update for this SAS card from DELL may solve your
problems:

Firmware (v. 00.10.49.00) and BIOS (v. 06.12.02.00) f

=========================================================================
BIOS Fixes/Enhancements

1. Fixed issue where spontaneous reboot may happen when attempting to enter the
SAS5 BIOS using CTRL-C
2. With a failed drive, the BIOS would timeout port enable before the firmware
was done with it.
3. Disabled verify support for SATA drives since bad block remapping is not
supported on SATA
4. Corrected issue where a drive failing during a verify operation in BIOS
configuration utility would hang the utility.

Firmware Fixes/Enhancements

1. Fixed issue where I/O timeouts can occur during a period of very high I/O
activity when one of the targets of this activity goes missing and the Device
Missing Delay feature is enabled.
2. Fixed issue where Task Management was incorrectly removing devices when any
target reset was performed.
3. Fixed issue where pulling and inserting drives may cause Fault 7202 in
Windows Event log when a Direct Attach SEP device is attached to the controller.
4. Fixed issue where W2k3 enumerates the devices, after sending an Inquiry
command, it then erroneously sends an Inquiry command with the EVPD bit set to
the SEP device. However, the SATA-II SEP devices do not support the EVPD
option. The Inquiry with EVPD bit results in a Check-Condition and the Windows
driver would notice the LogInfo and then create an entry in the Windows Event
Log. This event log message has been suppressed.
5. Added support for > 2TB volumes.
6. Fixed Issue where resync periodically runs slowly after reboot in Windows
7. Corrected issue where a degraded volume with a failed drive comes back
optimal on next reboot
========================================================================
Comment 32 Michael Will 2007-06-13 17:06:49 EDT
Good call. Coincided with us trying the same. We upgraded the LSI 3442E-R which
had firmware 1.09.00.0 on it to LSI's current posted firmware 1.18.00.0, and the
stability issues went away.
Comment 33 chris.eborn.dneg 2007-07-13 09:54:45 EDT
We were having this problem too - it persisted after the above firmware upgrade.
I noticed that all the machines with the problem had HITACHI  Model:
HUS151414VLS300  Rev: A48B disks. Machines with other disks (like SEAGATE 
Model: ST3146854SS      Rev: S410) were fine. We swapped out all of the Hitachi
disks and have not seen the problem since.

We could reproduce the problem by running 3 or 4 concurrent copies of:
dd if=/dev/zero of=/path/to/disk

We run 2.6.17-1.2143_FC4smp.
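
A bounded variant of this reproduction, for anyone who wants to try it without filling a disk: the sketch below starts several dd writers in parallel against files in a directory on the suspect drive and waits for them to finish. The function name, the default of 4 writers, and the 64 MiB size are assumptions chosen for illustration; bs=65536 is borrowed from the dd commands in comment 28.

```shell
# Run several dd writers concurrently against files under one directory.
# $1 = directory on the disk under test
# $2 = number of concurrent writers (default 4)
# $3 = MiB written by each writer (default 64)
# concurrent_dd_writes is a hypothetical helper name used only for this sketch.
concurrent_dd_writes() {
    dir=$1
    jobs=${2:-4}
    mb=${3:-64}
    i=0
    while [ "$i" -lt "$jobs" ]; do
        # 16 blocks of 65536 bytes make 1 MiB.
        dd if=/dev/zero of="$dir/stress.$i" bs=65536 count=$((mb * 16)) \
            2>/dev/null &
        i=$((i + 1))
    done
    # Block until every background writer has exited.
    wait
}

# Typical use, then clean up the test files:
#   concurrent_dd_writes /path/to/disk 4 1024
#   rm -f /path/to/disk/stress.*
```

Because it writes ordinary files rather than a raw device, it is safe to run on a mounted filesystem, at the cost of going through the page cache; watch /var/log/messages for task abort messages while it runs.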
Comment 34 Konrad Rzeszutek 2007-07-13 10:02:16 EDT
Prarit,

The patch you came up with for the Hitachi drives - is this the same model?
Comment 35 Prarit Bhargava 2007-07-16 09:47:02 EDT
konradr, nope.  But that doesn't rule out a similar problem with this model as well.

The failure we're seeing in this BZ is very different from what I was seeing
with the other BZ.  In that case the failure was a communication error when
attempting to use NCQ.  Here we're seeing a task abort on the mpt driver.

P.
Comment 36 Jon Stanley 2007-12-30 21:27:03 EST
I'm not able to determine from this lengthy report if the problem is actually
present in current versions of Fedora.  It is present in RHEL from the notes, I
would clone this bug against the RHEL4 product to get it attention there.

Hello,

I'm reviewing this bug as part of the kernel bug triage project, an attempt to
isolate current bugs in the Fedora kernel.

http://fedoraproject.org/wiki/KernelBugTriage

I am CC'ing myself to this bug, however this version of Fedora is no longer
maintained.

Please attempt to reproduce this bug with a current version of Fedora (presently
Fedora 8). If the bug no longer exists, please close the bug or I'll do so in a
few days if there is no further information lodged.

Thanks for using Fedora!
Comment 37 Garrett Mitchener 2008-01-01 21:58:33 EST
I haven't seen this problem for a very long time now, ever since the campus
helpdesk replaced the motherboard in my workstation.  In my case, the problem
turned out to be in the hardware and probably not the driver.  I'm still running
FC6 on that machine, haven't had time to upgrade.  I plan to upgrade it to
Fedora 8 within the month, and I'll repost the bug if it starts up again.

Other people have been adding to this bug report, and they may be seeing similar
symptoms from a different problem.

Thank you so much for 
Comment 38 Jon Stanley 2008-01-01 22:44:59 EST
Given that data, I'm going to close this as INSUFFICIENT_DATA.  Feel free to
re-open if the problem crops up again.  If someone on the CC list does not have
permission to re-open the bug and feels that it needs to be, please make a
comment in this bug report and I will re-open it.
