Bug 85350
Summary: (SCSI AIC79XX) SCSI Error: Adaptec aic79xx: "Saw underflow" with Seagate ST336607LW

| Field | Value |
|---|---|
| Product | [Retired] Red Hat Linux |
| Component | kernel |
| Version | 7.3 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | high |
| Reporter | Terry Barnaby <terry1> |
| Assignee | Arjan van de Ven <arjanv> |
| QA Contact | Brian Brock <bbrock> |
| Doc Type | Bug Fix |
| Last Closed | 2004-09-30 15:40:35 UTC |
Description

Terry Barnaby, 2003-02-28 14:28:52 UTC

Created attachment 90412 [details]: /var/log/messages

SCSI hangs under heavy load (in my case, copying a directory from a remote machine to a local SCSI RAID). This could be a dupe of about eight different bugs reported here, but they are all reported with different configurations. Please transfer this to the one you are really tracking. This one seems the most similar, so I have posted it here.

System:
- Intel 440BX (ASUS P2B-D), Pentium III SMP
- Adaptec 39320-R (host RAID disabled in the SCSI BIOS)
- Two Seagate 373307 drives in a RAID 1 configuration
- /home is on the SCSI RAID; / and the others are on an IDE drive (boot from IDE)

I installed Red Hat 8.0 from CD (except that the installer has no SCSI driver). SCSI worked for a while under no load. I very quickly up2dated to the 2.4.20-13.8smp kernel and copied the /home directory from a remote machine via rsync through ssh.

First try: errors were detected and one SCSI drive was taken offline. rsync continued, but I stopped it in order to inspect the damage.

Second try: I checked the hardware, rebooted, resynchronized the RAID, and then ran rsync via ssh again. It failed harder this time, after minutes.

I am about to try reproducing the error with the stock 2.4.18-24smp kernel. If that fails also, what are my options? Previous bugs suggest a different driver (1.3.x from Gibbs), a different kernel (which ones work, which ones do not?), or different kernel options (noapic?). There seems to be a lot of speculation, but no clear answers, and no bug reports tagged FIXED. So, what's up? I am somewhat short on time, about to leave the country for a while; if you could narrow down what I should do to make this work, that would be very, very helpful.

Below is an example of some error messages that repeat for an awfully long time:
```
May 19 02:22:50 usroom06 kernel: Abort called for cmd c223ba00
May 19 02:22:50 usroom06 kernel: scsi0: Dumping Card State at program address 0x22 Mode 0x11
May 19 02:22:50 usroom06 kernel: Softc pointer is c2566000
May 19 02:22:50 usroom06 kernel: IOWNID == 0x7, TOWNID == 0x7, SCSISEQ1 == 0x12
May 19 02:22:50 usroom06 kernel: SCSISIGI == 0x0
May 19 02:22:50 usroom06 kernel: QFREEZE_COUNT == 0, SEQ_FLAGS2 == 0x1
May 19 02:22:50 usroom06 kernel: scsi0: LASTSCB 0x3f CURRSCB 0x3f NEXTSCB 0xff40 SEQINTCTL 0x0
May 19 02:22:50 usroom06 kernel: SCSISEQ = 0x0
May 19 02:22:50 usroom06 kernel: SCB count = 108
May 19 02:22:50 usroom06 kernel: Kernel NEXTQSCB = 63
May 19 02:22:50 usroom06 kernel: scsi0: LQCTL1 = 0x0
May 19 02:22:50 usroom06 kernel: scsi0: WAITING_TID_LIST == 0xffcf:0xff3f
May 19 02:22:50 usroom06 kernel: scsi0: WAITING_SCB_TAILS: 0(0xff3f) 1(0xff64) 2(0xff00) 3(0xff00) 4(0xff00) 5(0xff00) 6(0xff00) 7(0xff00) 8(0xff00) 9(0xff00) 10(0xff00) 11(0xff00) 12(0xff00) 13(0xff00) 14(0xff00) 15(0xff00)
May 19 02:22:50 usroom06 kernel: qinstart = 64719 qinfifonext = 64719
May 19 02:22:50 usroom06 kernel: QINFIFO:
May 19 02:22:50 usroom06 kernel: WAITING_TID_QUEUES:
May 19 02:22:51 usroom06 kernel: Pending list:
May 19 02:22:51 usroom06 kernel: 30(CTRL 0x60 ID 0x17 N 0x64 N2 0x64 SG 0x1ba2f80a, RSG 0xaf68, KSG 0x1ba2f80a)
May 19 02:22:51 usroom06 kernel: , 43(CTRL 0x60 ID 0x17 N 0xff00 N2 0xffa2 SG 0x1ba2ac0a, RSG 0x4768, KSG 0x1ba2ac0a)
May 19 02:22:51 usroom06 kernel: , 47(CTRL 0x60 ID 0x17 N 0xffc0 N2 0xffa2 SG 0x1ba29c0a, RSG 0x6767, KSG 0x1ba29c0a)
May 19 02:22:51 usroom06 kernel: , 37(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffa2 SG 0x1ba2b40a, RSG 0xf766, KSG 0x1ba2b40a)
May 19 02:22:51 usroom06 kernel: , 24(CTRL 0x60 ID 0x17 N 0xff40 N2 0xffcf SG 0x1ba3000a, RSG 0x4005761, KSG 0x1ba3000a)
May 19 02:22:51 usroom06 kernel: , 44(CTRL 0x60 ID 0x17 N 0xff40 N2 0xffa2 SG 0x1ba2900a, RSG 0x1005765, KSG 0x1ba2900a)
May 19 02:22:51 usroom06 kernel: , 61(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffcf SG 0x1ba2440a, RSG 0x400575d, KSG 0x1ba2440a)
May 19 02:22:51 usroom06 kernel: , 69(CTRL 0x60 ID 0x17 N 0xff00 N2 0xffa2 SG 0x11cef40a, RSG 0x4005759, KSG 0x11cef40a)
May 19 02:22:51 usroom06 kernel: , 59(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffcf SG 0x1ba25c0a, RSG 0x3002f56, KSG 0x1ba25c0a)
May 19 02:22:51 usroom06 kernel: , 92(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffa2 SG 0x19fc00a, RSG 0x4002f52, KSG 0x19fc00a)
May 19 02:22:51 usroom06 kernel: , 67(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffcf SG 0x1ba22c0a, RSG 0x4002f4e, KSG 0x1ba22c0a)
May 19 02:22:51 usroom06 kernel: , 29(CTRL 0x60 ID 0x17 N 0xff40 N2 0xffcf SG 0x1ba2f40a, RSG 0x4002f4a, KSG 0x1ba2f40a)
May 19 02:22:51 usroom06 kernel: , 89(CTRL 0x60 ID 0x17 N 0xff40 N2 0xffcf SG 0x19fd40a, RSG 0x4002f46, KSG 0x19fd40a)
May 19 02:22:51 usroom06 kernel: , 40(CTRL 0x60 ID 0x17 N 0xff80 N2 0xffcf SG 0x1ba2a00a, RSG 0x4002f42, KSG 0x1ba2a00a)
```

I totally had better things to do today than debug SCSI "issues". I spent the day reproducing this failure and watching what happened. After the software RAID kicked out one of the two drives, I continued to try to load the remaining drive via rsync. It found a sequence of "hardware failures" like before, but under this circumstance the driver thought that the following would be a good idea:

...complaints about hardware errors and target failures deleted...

```
May 19 16:08:29 usroom06 kernel: (scsi0:A:0:0): Locking max tag count at 64
```

The rsync completed after this point with no further errors. With the tag count still locked at 64, I deleted and rsynced again to see whether this more conservative setting helps. This second complete try finished with no further errors. So I conclude that the fundamental system is half-clever, and that my problems might be fixed if I can set this value to 64 or 32 right away: at boot time, in the hardware BIOS, or worst case at compile time.

I did not remember or realize that I need to run mkinitrd if I modify the SCSI part of /etc/modules.conf with something like:

```
options aic79xx aic79xx=global_tag_depth:32
```

I get the mkinitrd thing now. Is the syntax right? I did not get it before, so 20 failed reboots later I gave up and tried something else. Why does Gibbs' latest 1.3.8 driver package contain a bug that makes it unusable? Something about too many arguments passed to something that scrolled past the screen way too fast, and I did not bother to try to remember how to make it scroll more slowly.
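A minimal sketch of the change described above, assuming a Red Hat 7.x/8.0-era system; the option syntax is the one quoted in the comment, while the initrd filename and kernel version are illustrative, taken from the kernel mentioned in this report:

```shell
# /etc/modules.conf: pass a conservative global tag depth to the aic79xx
# driver (syntax as quoted in the comment above).
options aic79xx aic79xx=global_tag_depth:32

# The SCSI module is loaded from the initrd, so the initrd must be rebuilt
# after editing modules.conf or the option never takes effect:
mkinitrd -f /boot/initrd-2.4.20-13.8smp.img 2.4.20-13.8smp
# Then make sure the bootloader points at the rebuilt image and reboot.
```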
1.3.7 installs and reboots fine, and is configured to use 32 as a conservative default just as its documentation says it will, unlike the default values compiled into Red Hat's kernel. What does Red Hat know that the author of the Adaptec driver does not? And vice versa? You kids should talk to each other more. Anyway, I hope that the fixes in the 1.3.7 driver, or simply the conservative tag count settings, will be good enough to let me proceed.

Let me summarize my best guess about what is happening here, so that someone doing a search will discover this and your reply to it. Some aspect of the aic79xx driver, the Adaptec card, and the interaction between the two is unstable in some respect on some systems with some mix of old and new hardware, some of the time. Or all of the time. I'm not surprised. Users might want to use an updated driver, supplied by Gibbs but by neither Red Hat nor Adaptec, or set the Red Hat driver to more conservative settings. Or I could have recompiled the module and/or kernel with these conservative settings, which is another two HOWTOs.

The intriguing problem in this case is that the RAID folks don't talk with the SCSI folks, and neither talk to Red Hat. The error in SCSI causes the RAID to kick out a disk, when ordinarily the SCSI layer would just decide it was using unreasonably aggressive settings and back off before any corruption occurred. But running on one disk, the RAID does not kick things out, so finally the SCSI driver realizes that 253 is an unreasonable setting for the tag queue on this combination of hardware and resets it to something sane. This happens only after losing one or more disks, though, which gives the sysadmins a heart attack, so it is too late. In other words, maybe the SCSI part is nicely self-correcting, but the software RAID part does not know about that. Does this logic also apply to at least three other bugs here in Bugzilla, plus the several more I found through Google?
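The self-correcting behaviour speculated about above can be illustrated with a small sketch. This is purely illustrative: the class and method names are invented, and the real aic79xx driver logic is far more involved; the sketch only models "start at an aggressive tag depth, lock it down to a conservative value after repeated command failures", as the "Locking max tag count at 64" log line suggests:

```python
class TagDepthGovernor:
    """Toy model (not driver code) of a SCSI driver that backs off its
    tagged-queue depth after repeated command failures."""

    def __init__(self, initial_depth=253, locked_depth=64, error_threshold=3):
        self.depth = initial_depth          # aggressive default, as in the report
        self.locked_depth = locked_depth    # conservative value the driver locks to
        self.error_threshold = error_threshold
        self.errors = 0

    def on_command_error(self):
        """Count a failed command; lock the depth once errors accumulate."""
        self.errors += 1
        if self.errors >= self.error_threshold and self.depth > self.locked_depth:
            self.depth = self.locked_depth


gov = TagDepthGovernor()
for _ in range(3):
    gov.on_command_error()
print(gov.depth)  # backs off from 253 to 64
```

The commenter's point is that by the time this back-off fires, the software RAID layer has already interpreted the earlier failures as a dead disk and kicked it out of the array.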
But please note that I am not the gentleman who started this thread, so this bug might not be fixed yet. And at this point, it looks like I had different problems. Now if only I can apply all of this to the latest errata 2.4.20 kernel, so that I can have my SCSI RAID and my security patch at the same time. Maybe I can get the mkinitrd thing to work, or Gibbs will update his driver. In 100 minutes, the recovery of the two RAID disks will have completed, and I can try to stress it again in its new, enlightened state.

I have fixed my original bug. It was caused by a combination of a bug in the aic7xxx SCSI driver and a bug in the Seagate ST336607LW drive. Under a certain condition the aic7xxx driver would cause the Seagate ST336607LW drive to lock up solid, with no SCSI reset possible. I have fixed this by updating to the aic7xxx 1.3.4 driver. This has been running for about a month now with no problems.

Terry

Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/