Bug 85350

Summary: (SCSI AIC79XX)SCSI Error: Adaptec aic79xx: "Saw underflow" with Seagate ST336607LW
Product: [Retired] Red Hat Linux Reporter: Terry Barnaby <terry1>
Component: kernel Assignee: Arjan van de Ven <arjanv>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: high    
Version: 7.3   
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-09-30 15:40:35 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Attachments:
Description Flags
/var/log/messages none

Description Terry Barnaby 2003-02-28 14:28:52 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2) Gecko/20021203

Description of problem:
System: Dual Xeon 2.4GHz system using SuperMicro X5DA8 Motherboard.
SCSI: Adaptec 7902 onboard dual channel SCSI controller
Disks: 2 off Quantum Atlas 10K2 18G (160LW), 1 off Quantum 9G (80LW)
Disks: 1 off Seagate ST336607LW 36G (320LW)
System: RedHat 7.3 with updates to 18/02/03
Kernel: 2.4.18-24.7.xsmp

This system has been running fine for about a month with the Quantum
Atlas 10K2 SCSI disks. I have just tried to add a Seagate ST336607LW
320LW SCSI disk and am now getting problems.
If I start off a disk to disk copy of a large amount of information,
after about 10mins the SCSI disk will lock up.

I get the kernel message "Saw underflow (16384 of 20480 bytes). Treated as
error" followed by various SCSI error messages. The SCSI disk's LED
remains on and it is impossible to access the SCSI disk.
Resetting the system does not clear the SCSI disk LED and the SCSI
disk is not seen in the Adaptec BIOS on startup. A power off/on cycle
will clear the condition.
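
For anyone comparing logs from similar failures, the two figures in the underflow message can be pulled out mechanically. The following is a minimal sketch in Python, assuming the message keeps exactly the wording quoted above. The name parse_underflow is made up for illustration. The report does not say which figure is the requested size and which is the shortfall, so the sketch only returns both numbers and their difference.

```python
import re

# Matches the driver message quoted in the report, e.g.
# "Saw underflow (16384 of 20480 bytes). Treated as error"
UNDERFLOW_RE = re.compile(r"Saw underflow \((\d+) of (\d+) bytes\)")


def parse_underflow(line):
    """Return (first, second, second - first) from an underflow message.

    Returns None if the line does not contain the message.
    The function name is hypothetical, not part of any driver tooling.
    """
    match = UNDERFLOW_RE.search(line)
    if match is None:
        return None
    first, second = int(match.group(1)), int(match.group(2))
    return first, second, second - first


print(parse_underflow("Saw underflow (16384 of 20480 bytes). Treated as error"))
# prints (16384, 20480, 4096)
```

For the message in this report the two figures differ by 4096 bytes.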

A proper 320LW SCSI cable is being used. I have also tried reducing
the data rate for the driver to 160LW to no effect. I notice this
drive is used in "packetized" mode. I have not tried turning this off yet.

I don't know if this is a Linux kernel SCSI driver problem or a SCSI
disk problem on the Seagate ST336607LW.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Copy a disk partition with "tar -clf - . | (cd /new; tar -xpf -)"

Additional info:

Comment 1 Terry Barnaby 2003-02-28 14:30:58 UTC
Created attachment 90412 [details]
/var/log/messages

Comment 2 R. Gran 2003-05-18 21:01:02 UTC
SCSI hangs on heavy load
(in my case, copy a directory from remote machine to local scsi raid)

This could be a dupe of about 8 different bugs reported here, but they
are all reported with different configurations.  Please transfer this
to the one you are really tracking.  This one seems the most similar, so 
I have posted it here.

System

Intel 440bx (ASUS P2B-D) PentiumIII SMP
Adaptec 39320-R (host raid disabled in scsi-bios)
Two Seagate 373307 drives, RAID 1 configuration
/home is on the SCSI RAID.
/ is on an IDE drive (boot to IDE)
   
Install RedHat 8.0 from CD (except for the no scsi driver in install)

/home is on SCSI RAID
/ and others on IDE drive, boot to IDE.

scsi works for a while with no load
very quickly up2dated to 2.4.20-13.8smp kernel

copy /home directory from remote machine via rsync through ssh.

First try: errors detected and one SCSI drive taken off line.
rsync continues but I stop it in order to inspect the damage.

Second try:  check hardware, reboot, resynchronize RAID, 
and then rsync via ssh again.
Fails harder this time, after minutes.

Am about to try reproducing the error with stock 2.4.18-24smp kernel.


If that fails also, what are my options?
Previous bugs say different driver (1.3.x from gibbs)
Or different kernel (which ones work, which ones do not?)
Different kernel options (noapic?)

There seems to be a lot of speculation, but no clear answers,
and no bug reports tagged FIXED.  So, what's up?

I am somewhat short on time, about to leave the country for a while, 
if you could narrow the possibilities of what I should do to make this work,
that would be very, very helpful.



Some text below.

an example of some error messages that repeat for an awful long time.

May 19 02:22:50 usroom06 kernel: Abort called for cmd c223ba00
May 19 02:22:50 usroom06 kernel: scsi0: Dumping Card State at program address
0x22 Mode 0x11
May 19 02:22:50 usroom06 kernel: Softc pointer is c2566000
May 19 02:22:50 usroom06 kernel: IOWNID == 0x7, TOWNID == 0x7, SCSISEQ1 == 0x12
May 19 02:22:50 usroom06 kernel: SCSISIGI == 0x0
May 19 02:22:50 usroom06 kernel: QFREEZE_COUNT == 0, SEQ_FLAGS2 == 0x1
May 19 02:22:50 usroom06 kernel: scsi0: LASTSCB 0x3f CURRSCB 0x3f NEXTSCB 0xff40
SEQINTCTL 0x0
May 19 02:22:50 usroom06 kernel: SCSISEQ = 0x0
May 19 02:22:50 usroom06 kernel: SCB count = 108
May 19 02:22:50 usroom06 kernel: Kernel NEXTQSCB = 63
May 19 02:22:50 usroom06 kernel: scsi0: LQCTL1 = 0x0
May 19 02:22:50 usroom06 kernel: scsi0: WAITING_TID_LIST == 0xffcf:0xff3f
May 19 02:22:50 usroom06 kernel: scsi0: WAITING_SCB_TAILS: 0(0xff3f) 1(0xff64)
2(0xff00) 3(0xff00) 4(0xff00) 5(0xff00) 6(0xff00) 7(0xff00) 8(0xff00) 9(0xff00)
10(0xff00) 11(0xff00) 12(0xff00) 13(0xff00) 
14(0xff00) 15(0xff00) 
May 19 02:22:50 usroom06 kernel: qinstart = 64719 qinfifonext = 64719
May 19 02:22:50 usroom06 kernel: QINFIFO:
May 19 02:22:50 usroom06 kernel: WAITING_TID_QUEUES:
May 19 02:22:51 usroom06 kernel: Pending list:
May 19 02:22:51 usroom06 kernel:  30(CTRL 0x60 ID 0x17 N 0x64 N2 0x64 SG
0x1ba2f80a, RSG 0xaf68, KSG 0x1ba2f80a)
May 19 02:22:51 usroom06 kernel: ,  43(CTRL 0x60 ID 0x17 N 0xff00 N2 0xffa2 SG
0x1ba2ac0a, RSG 0x4768, KSG 0x1ba2ac0a)
May 19 02:22:51 usroom06 kernel: ,  47(CTRL 0x60 ID 0x17 N 0xffc0 N2 0xffa2 SG
0x1ba29c0a, RSG 0x6767, KSG 0x1ba29c0a)
May 19 02:22:51 usroom06 kernel: ,  37(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffa2 SG
0x1ba2b40a, RSG 0xf766, KSG 0x1ba2b40a)
May 19 02:22:51 usroom06 kernel: ,  24(CTRL 0x60 ID 0x17 N 0xff40 N2 0xffcf SG
0x1ba3000a, RSG 0x4005761, KSG 0x1ba3000a)
May 19 02:22:51 usroom06 kernel: ,  44(CTRL 0x60 ID 0x17 N 0xff40 N2 0xffa2 SG
0x1ba2900a, RSG 0x1005765, KSG 0x1ba2900a)
May 19 02:22:51 usroom06 kernel: ,  61(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffcf SG
0x1ba2440a, RSG 0x400575d, KSG 0x1ba2440a)
May 19 02:22:51 usroom06 kernel: ,  69(CTRL 0x60 ID 0x17 N 0xff00 N2 0xffa2 SG
0x11cef40a, RSG 0x4005759, KSG 0x11cef40a)
May 19 02:22:51 usroom06 kernel: ,  59(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffcf SG
0x1ba25c0a, RSG 0x3002f56, KSG 0x1ba25c0a)
May 19 02:22:51 usroom06 kernel: ,  92(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffa2 SG
0x19fc00a, RSG 0x4002f52, KSG 0x19fc00a)
May 19 02:22:51 usroom06 kernel: ,  67(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffcf SG
0x1ba22c0a, RSG 0x4002f4e, KSG 0x1ba22c0a)
May 19 02:22:51 usroom06 kernel: ,  29(CTRL 0x60 ID 0x17 N 0xff40 N2 0xffcf SG
0x1ba2f40a, RSG 0x4002f4a, KSG 0x1ba2f40a)
May 19 02:22:51 usroom06 kernel: ,  89(CTRL 0x60 ID 0x17 N 0xff40 N2 0xffcf SG
0x19fd40a, RSG 0x4002f46, KSG 0x19fd40a)
May 19 02:22:51 usroom06 kernel: ,  40(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffcf SG
0x1ba2a00a, RSG 0x4002f42, KSG 0x1ba2a00a)
:
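
Since the dump wraps each pending-list entry across two lines and the paste repeats itself, a short script makes it easier to see which entries are actually listed. This is a rough sketch in Python, not part of any driver tooling. The name pending_entries is hypothetical, and reading the leading number of each entry as an SCB index is an assumption, not something the log states.

```python
import re

# Each pending-list entry in the dump starts like "30(CTRL 0x60 ID 0x17 ...".
# WAITING_SCB_TAILS items such as "2(0xff00)" do not match this pattern.
ENTRY_RE = re.compile(r"(\d+)\(CTRL (0x[0-9a-fA-F]+) ID (0x[0-9a-fA-F]+)")


def pending_entries(dump_text):
    """Return unique (number, ctrl, id) tuples in first-seen order."""
    seen = []
    for num, ctrl, ident in ENTRY_RE.findall(dump_text):
        entry = (int(num), ctrl, ident)
        if entry not in seen:  # drop entries repeated by the paste
            seen.append(entry)
    return seen


sample = (
    "May 19 02:22:51 usroom06 kernel:  30(CTRL 0x60 ID 0x17 N 0x64 N2 0x64 SG\n"
    "0x1ba2f80a, RSG 0xaf68, KSG 0x1ba2f80a)\n"
    "May 19 02:22:51 usroom06 kernel: ,  43(CTRL 0x60 ID 0x17 N 0xff00 N2 0xffa2 SG\n"
    "0x1ba2ac0a, RSG 0x4768, KSG 0x1ba2ac0a)\n"
)
print(pending_entries(sample))
# prints [(30, '0x60', '0x17'), (43, '0x60', '0x17')]
```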


Comment 3 R. Gran 2003-05-19 14:38:02 UTC
I totally had better things to do with today than debug scsi "issues".
I spent the day reproducing this failure and watching what happened.
After the software raid kicked out one of the two drives,
I continued to try to load the remaining drive via rsync.

It found a sequence of "hardware failures" like before, but under this
circumstance the driver thought that the following would be a good idea:

...complaints about hardware errors and target failures deleted...
May 19 16:08:29 usroom06 kernel: (scsi0:A:0:0): Locking max tag count at 64

The rsync completed after this point with no further errors.
With the tag count still locked at 64, I deleted and rsynced again
to see if this more conservative setting helps.

This second complete try completed with no further errors.
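
To check whether and when the driver lowered the tag depth on other machines, the lock message can be searched for in /var/log/messages. A minimal sketch in Python, assuming the message format shown above. The name find_tag_locks is invented for this example.

```python
import re

# Pattern for the driver message quoted above:
# "(scsi0:A:0:0): Locking max tag count at 64"
LOCK_RE = re.compile(
    r"\((scsi\d+:[A-Z]:\d+:\d+)\): Locking max tag count at (\d+)"
)


def find_tag_locks(lines):
    """Return a list of (device, depth) for each tag-count lock message."""
    hits = []
    for line in lines:
        match = LOCK_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


sample = [
    "May 19 16:08:29 usroom06 kernel: (scsi0:A:0:0): Locking max tag count at 64",
    "May 19 02:22:50 usroom06 kernel: SCB count = 108",
]
print(find_tag_locks(sample))
# prints [('scsi0:A:0:0', 64)]
```

In practice the lines would come from reading the log file instead of the hard-coded sample.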

So -- I conclude that the fundamental system is half-clever, and that my 
problems might be fixed if I can set this value at 64 or 32 right away, 
at boot time or in hardware BIOS or worst-case at compile time.

So I did not remember or realize that I need to run mkinitrd if I modify
the scsi part of /etc/modules.conf  with something like
options aic79xx aic79xx=global_tag_depth:32
I get the mkinitrd thing now.  Is the syntax right?  I did not get it 
before, so 20 failed reboots later, I gave up and tried something else.
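
For reference, the modules.conf line quoted above has the shape "options <module> <module>=<key>:<value>". The sketch below, in Python, only builds that string for a chosen depth and does not confirm that a given driver build accepts it. The name aic79xx_options_line is hypothetical. The 1 to 253 range is an assumption, with 253 taken from the tag queue setting mentioned later in this comment.

```python
def aic79xx_options_line(tag_depth):
    """Build a modules.conf line that sets the aic79xx global tag depth.

    The 1..253 bounds are an assumption; 253 is the setting mentioned
    elsewhere in this report, not a value checked against the driver.
    """
    if not 1 <= tag_depth <= 253:
        raise ValueError("tag depth outside assumed range 1..253")
    return "options aic79xx aic79xx=global_tag_depth:%d" % tag_depth


print(aic79xx_options_line(32))
# prints options aic79xx aic79xx=global_tag_depth:32
```

After the line is added to /etc/modules.conf, the initrd still has to be rebuilt with mkinitrd before the setting applies at boot.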

Why does gibbs' latest 1.3.8 driver package contain a bug that makes it 
unusable?  Something about too many arguments passed to something that scrolled
past the screen way too fast and I did not bother to try to remember how to
make it scroll more slowly.  1.3.7 installs and reboots fine and is configured
to use 32 as a conservative default just like his documentation says it will,
but unlike the default values compiled into Red Hat.  What does Red Hat know that
the author of the Adaptec driver does not know?  And vice versa?  You kids should
talk to each other more.

Anyway, I hope that fixes in the 1.3.7 driver, or simply using the conservative
tag count settings will be good enough to let me proceed. 




Let me summarize my best guess about what is happening here, so that someone
doing a search will discover this and your reply to it.

Some aspect of the aic79xx driver and Adaptec card and the interaction between
the two is unstable in some respect on some systems with some mix of 
old and new hardware some of the time.  Or all of the time.  I'm not surprised.

Users might want to use an updated driver, supplied by Gibbs but not by Red Hat
or Adaptec, or set the Red Hat driver to more conservative settings.

Or I could have recompiled the module and/or kernel with these conservative 
settings, which is another 2 HOWTO's.

And the intriguing problem in this case is that the RAID folks don't talk
with the SCSI folks and neither talk to Red Hat.  So the oops in scsi causes
the RAID to kick out a disk when ordinarily the scsi would just think that it 
was using unreasonably aggressive settings and back off before any corruption 
occurred?

But running on one disk, the RAID does not kick things out, so finally the scsi 
driver realizes that 253 is an unreasonable setting for the tag queue on this 
combination of hardware and resets it to something sane.  But this is after
losing one or more disks, which gives the sys-admins a heart attack, so 
too late.  In other words, maybe the scsi part is nicely self-correcting,
but the software RAID part does not know about that.

Does this logic also apply to at least 3 other bugs here in bugzilla, plus
the several more I found through google?  But please note that I am not the
gentleman who started this thread, so this bug might not be fixed yet.
And at this point, it looks like I had different problems.

Now if only I can apply all of this to the latest errata 2.4.20 kernel,
so that I can have my scsi raid and my security patch at the same time.
Maybe I can get the mkinitrd thing to work, or gibbs will update his thing.

In 100 minutes, the recovery of the two raid disks will have completed,
and I can try to stress it again in its new enlightened state.

Comment 4 Terry Barnaby 2003-06-11 09:06:42 UTC
I have fixed my original bug. It was caused by a combination of a bug in
the aic7xxx SCSI driver and a bug in the Seagate ST336607LW drive. In a certain
condition the aic7xxx driver would cause the Seagate ST336607LW drive to lock up
solid with no SCSI reset possible.
I have fixed this by updating to the aic7xxx 1.3.4 driver.
This has been running for about a month now with no problems.

Terry

Comment 5 Bugzilla owner 2004-09-30 15:40:35 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/