Bug 37633

Summary:	[aic7xxx] SCSI bus hangs on tape access
Product:	[Retired] Red Hat Linux	Reporter:	Kevin Range <range006>
Component:	kernel	Assignee:	Doug Ledford <dledford>
Status:	CLOSED WONTFIX	QA Contact:	Brock Organ <borgan>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.1	CC:	mrjones
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2003-04-12 04:27:58 UTC	Type:	---

Description Kevin Range 2001-04-25 16:25:17 UTC

I have the following devices on an  Adaptec AHA-2940U2/W / 7890:

0,0,0     0) 'QUANTUM ' 'ATLAS IV 18 WLS ' '9312' Disk
0,1,0     1) 'SEAGATE ' 'ST51080N        ' '0958' Disk
0,2,0     2) 'PLEXTOR ' 'CD-R   PX-W124TS' '1.04' Removable CD-ROM
0,3,0     3) *
0,4,0     4) 'TECMAR  ' 'TRAVAN NS8      ' 'S267' Removable Tape

The Quantum drive is on the Ultra chain by itself.  The Seagate, the tape
drive, and the external CD-ROM are all on the narrow chain.

I can read from the Tecmar okay, and I can restore from the Tecmar to the
Quantum just fine.  But when I try to write from the Tecmar to the Seagate,
the system hangs and a message appears to the effect that writes to the
Quantum drive (which holds / and all of that good stuff) are timing out.  This
is not logged, of course, since the system can't write to the log files.

For now, I am just avoiding using any two devices on the narrow interface
at the same time.  Any ideas?

Oh, the kernel is the stock 7.1 one: kernel-2.4.2-2

Comment 1 Arjan van de Ven 2001-04-25 16:29:23 UTC

We ship an alternative aic7xxx module called aic7xxx_mod.o.
Could you try using that instead?  (Look in /etc/modules.conf and add "_mod" to
the appropriate place if your root filesystem isn't on SCSI.)
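
A minimal sketch of that edit, assuming the driver is named on a line such as
"alias scsi_hostadapter aic7xxx" (check your own file first).  The helper name
use_aic7xxx_mod and the ".orig" backup suffix are made up for illustration and
are not part of any Red Hat tool.

```shell
# Point the scsi_hostadapter alias at aic7xxx_mod instead of aic7xxx.
# Assumption: the file contains a line like "alias scsi_hostadapter aic7xxx".
# use_aic7xxx_mod is a hypothetical helper name, not a shipped command.
use_aic7xxx_mod() {
    conf="${1:-/etc/modules.conf}"
    # Save a copy of the file as it was before this run.
    cp "$conf" "$conf.orig" || return 1
    # Rewrite only alias lines that end in exactly "aic7xxx";
    # lines already naming aic7xxx_mod are left alone.
    sed 's/^\(alias[[:space:]]\{1,\}scsi_hostadapter[0-9]*[[:space:]]\{1,\}\)aic7xxx[[:space:]]*$/\1aic7xxx_mod/' \
        "$conf.orig" > "$conf"
}
```

Called with no argument, the function edits /etc/modules.conf (as root); pass a
path to try it on a scratch copy first.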

Comment 2 Kevin Range 2001-04-26 14:24:57 UTC

My root disk is SCSI (the Quantum).  I suppose I could try to make a new initial
root|ram disk with the _old module on it in place of the default aic driver.
Any hints on how I could do that?  I have some ideas, but maybe you know a
better way.

Comment 3 Kevin Range 2001-04-26 14:26:25 UTC

I mean _mod, not _old...

Comment 4 Doug Ledford 2001-05-02 18:58:50 UTC

Change the name of the aic7xxx module in /etc/modules.conf, then use the
mkinitrd command to make a new boot initrd image.  mkinitrd automatically uses
the module named in modules.conf, so once you've changed that, every initrd
image it builds will use the correct module.  Optionally, add a new stanza to
/etc/lilo.conf so that you can keep both initrd images under two different
names and boot to either one.  After you've made the new initrd image, re-run
lilo to activate the changes.
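
The sequence might look roughly like the sketch below.  It assumes modules.conf
has already been edited and that the kernel version is the 2.4.2-2 from the
original report.  The image name, the lilo label, and the root device in the
commented stanza are placeholders, not values taken from this system.  By
default the helper only prints what it would run.

```shell
# Sketch of the rebuild steps.  KVER comes from the report; IMG is a
# made-up name for the second initrd image.  Adjust both to your system.
KVER="${KVER:-2.4.2-2}"
IMG="${IMG:-/boot/initrd-$KVER-mod.img}"
DRY_RUN="${DRY_RUN:-1}"    # set DRY_RUN=0 (as root) to actually run the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

rebuild_initrd() {
    # mkinitrd <image> <kernel-version>; it reads the SCSI driver name
    # from /etc/modules.conf.
    run /sbin/mkinitrd "$IMG" "$KVER" &&
    # Re-run lilo so the boot map knows about the new image.
    run /sbin/lilo
}

# Example /etc/lilo.conf stanza to add before running lilo
# (label and root device are assumptions):
#   image=/boot/vmlinuz-2.4.2-2
#       label=linux-mod
#       initrd=/boot/initrd-2.4.2-2-mod.img
#       read-only
#       root=/dev/sda1
```

Keeping the original stanza and image untouched means the old driver is still
selectable at the lilo prompt if the new image fails to boot.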

Comment 5 Kevin Range 2001-05-07 15:02:02 UTC

I made the initrds like you said, but now I can't reproduce the problem.  Maybe
it only happens under load...  I don't know.  Perhaps you want to close the bug
report until I can figure out how to reproduce it all of the time (it happened
three or four times in a row when I wasn't trying, and now when I want it to
happen... argh!)

Comment 6 Kevin Range 2001-05-07 21:44:16 UTC

New news.  I had given up on reproducing this bug and gone back to work.
Another user ssh'd in and did some "heavy lifting" in /tmp (mounted on sdb1,
which is on the narrow chain with the tape drive).  Lo and behold, lock-up.  So
I rebooted with the aic7xxx_mod initrd I had made this morning.  I had to do a
manual fsck.  Then on that reboot (also selecting aic7xxx_mod) I got a kernel
panic (something about interrupt handlers...) when init was "updating
/etc/fstab".  That scared me back to the default kernel.

I had the user who crashed the machine login and run his job again.  Everything
went off without a hitch.  It seems to me that the problem only manifests itself
on the narrow interface after the machine has been in use for a while (a few
hours) or I have xmms open (not necessarily playing anything, just open).
Unfortunately these last two conditions are usually found together (if I have
been using my machine for a while chances are I have xmms open).

I will look into this some more.  (I really don't see how xmms could have
anything to do with it...)

Comment 7 Michael May 2001-06-07 07:07:20 UTC

I have had similar problems with the system corrupting access to the tape drive.
I have tried the old and new drivers, and have compiled them into the kernel
instead of using loadable modules.  Still causing problems.  I can clear the
fault by rebooting the computer (power off).  Tried 2.4.5, still no go.
Changed hardware.  Still crook.
When you back up and verify, it's fine.  The next time you do it, the verify
fails on around 40% of the files, up to 50-60%.
The entire config was working flawlessly under version 7.0.

Now I have to go sort out a problem with Windows clients using FTP through
7.1.  They get out okay but time out once they log into the FTP site.

Comment 8 Kevin Range 2003-04-12 04:29:03 UTC

I got rid of all of my Travan drives, so I don't care and can't test this
anymore anyway.