38608 – Periodic system "hangs" with aic7xxx driver

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Doug Ledford
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2001-05-01 17:20 UTC by William W. Austin
Modified:	2007-04-18 16:32 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2003-06-09 17:40:26 UTC
Embargoed:

Description William W. Austin 2001-05-01 17:20:35 UTC

This was a fresh install of 7.1 on a former 7.0 system.

I had no problems with the install -- this showed up immediately after the
installation.

This box has a Tyan S1834 (Tiger 133) mb with 1 833 MHz P3 & 384 MB memory
Cards:  Matrox G450 (32 MB)
        Adaptec 2740 U/W
        Adaptec 2740 U
        3Com 3c594 (I think 594 -- not sure -- from memory) 10/100 MB network card
        Digi PC-Xem 16-port serial board
        Soundblaster AWE64

On the 2 Adaptec cards I have:
        2740 U/W:       id 00:  QUANTUM  Model: QM318000TD-SW   18GB HD (fast/wide scsi2)
                        id 03:  PLEXTOR  Model: CD-ROM PX-32TS  CDrom (narrow)
                        id 04:  HP       Model: HP35480A        tape drive (narrow)
                        id 05:  YAMAHA   Model: CRW8824S        CD-RW (narrow)
                        id 08:  IBM      Model: DDRS-39130W     9 GB HD (fast/wide scsi2)
                        id 09:  IBM OEM  Model: DFHSS4W         4.3 GB HD (fast/wide scsi2)
        2740 U:         id 00:  UMAX     Model: Astra 1200S     scanner
                        id 05:  HP       Model: C1553A          tape drive (narrow)

On the udma66 controller I have one drive, an IBM 46 GB udma100 hd
running as a udma66

This is the same hardware on which I was running 7.0 with no problems.

I am experiencing frequent "system hangs" which remind me of the delay
you see with a scsi bus reset.

I am running no disk-mirroring or raid s/w, BTW.

However, the delays are shorter -- about 6 seconds instead of the
15-20 which I would get from a reset.  Also after checking log files,
I am getting NO error messages indicating any scsi problems at all.

I *had* thought these delays occurred only when I did a file copy
of > 4MB.  However, today I started looking for them and saw them
with file copies as small as 120 KB -- not every time, just intermittently.
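
One way to put numbers on pauses like these is to time repeated copies and report only the slow ones. The following is a minimal sketch, not something taken from this report: the function name time_copies, its arguments, and the idea of using a whole-second threshold are all assumptions for illustration. It has one-second resolution, which is coarse but enough for stalls of several seconds.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): copy a file repeatedly and
# report any copy that takes at least LIMIT whole seconds.
# Usage: time_copies SOURCE_FILE COUNT LIMIT_SECONDS
time_copies() {
    src="$1"
    count="$2"
    limit="$3"
    slow=0
    i=1
    while [ "$i" -le "$count" ]; do
        start=$(date +%s)
        # sync forces the data out to disk so a stall is not hidden by caching
        cp "$src" "$src.copy.$i" && sync
        end=$(date +%s)
        elapsed=$((end - start))
        if [ "$elapsed" -ge "$limit" ]; then
            echo "copy $i stalled for ${elapsed}s"
            slow=$((slow + 1))
        fi
        rm -f "$src.copy.$i"
        i=$((i + 1))
    done
    echo "slow copies: $slow of $count"
}
```

For example, time_copies /tmp/testfile 50 5 would flag any of 50 copies that took five seconds or longer; the path and the numbers are placeholders.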

The same "hangs" occur if I disconnect the ide drive and disable its
controller on the MB (it's built in, so I can't just pull the board).

So I tried an experiment -- I have another system running 7.0 with
a slightly smaller scsi drive.  I pulled that drive and put it in
this system.  The system came up and ran 7.0 without the pauses.  So I
reattached the other drives, tape drives, ide drive (re-enabling the
controller) and still the system behaves.

Likewise putting the drive with my 7.1 system (along with the u/w
controller) in that other machine and booting 7.1 there, I get the same
"pauses".  BTW, this drive worked fine under 7.0.

So I finally put these two systems "back in order" (right drives,
controllers in each machine), backed up EVERYTHING onto tape, and
re-installed 7.0 on my 7.1 box (fresh install) -- the pauses disappeared
altogether.  Reinstall 7.1 and they're back.

I suspect that the often-mentioned problem(s) with the aic7xxx driver
may be the root cause, although it could be something else in the kernel.

Comment 1 Arjan van de Ven 2001-05-01 17:24:02 UTC

Then could you please try the other scsi driver, aic7xxx_mod?

Comment 2 William W. Austin 2001-05-03 13:33:46 UTC

I tried the aic7xxx_mod driver doing the following procedure:
A) edit /etc/modules.conf
B) change the line reading
        alias scsi_hostadapter aic7xxx
    to read
        alias scsi_hostadapter aic7xxx_mod
C) do a depmod -a
D) reboot.

The machine reboots, recognizes the drives, goes to run level 3, then I get the
messages:

Entering non-interactive startup
updating /etc/fstab 

And then the system hangs for 10-20 seconds, does a stack trace (which goes off
the screen too fast to read), and then I get a repeating loop:

+ the repeated series of error messages on scsi0:-1:-1:-1 (too many to count --
  several screenfuls at least):
  scsi 0:-1:-1:-1 Referenced SCB 255 not valid during SELTO
  SCSISEQ=0x5a SEQ ADDR=0x18 SSTAT0=0x10 SSTAT1=0x8a
+ then another error message which goes off the screen too fast to read.
  
which continues until I hit the reboot switch. 

So either I did the wrong procedure to try the driver OR it has a problem with
my system as well.

Is there a procedure to specify the module at boot up instead of putting it into 
/etc/modules.conf?

Comment 3 Doug Ledford 2001-05-08 19:33:39 UTC

Did you remember to do the mkinitrd and lilo steps when changing the SCSI
module?  Just doing a depmod -a doesn't activate the change, you have to make a
new initrd and you have to run the lilo command so lilo can map the new initrd
image.  Then, when you reboot, you should have the new driver.
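
As a rough sketch of the full sequence described above (these are not commands quoted from this report): the first helper rewrites the scsi_hostadapter alias in a modules.conf-style file, and the second prints, rather than runs, the rebuild steps so they can be reviewed first. The function names and the initrd file name pattern are assumptions for illustration; the real image name must match the initrd= line in /etc/lilo.conf, and the kernel version must match the installed kernel.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the report).

# Rewrite the "alias scsi_hostadapter ..." line in the given file so it
# names NEW_DRIVER. Try it on a copy of /etc/modules.conf first.
# Usage: switch_scsi_driver CONF_FILE NEW_DRIVER
switch_scsi_driver() {
    conf="$1"
    new_driver="$2"
    sed "s/^\(alias[[:space:]]\{1,\}scsi_hostadapter[[:space:]]\{1,\}\).*/\1${new_driver}/" \
        "$conf" > "$conf.new" && mv "$conf.new" "$conf"
}

# Print the follow-up steps for kernel version KVER without running them.
# The /boot/initrd-KVER.img name is an assumed convention; use whatever
# image name lilo.conf actually references.
# Usage: print_rebuild_steps KVER
print_rebuild_steps() {
    kver="$1"
    echo "/sbin/depmod -a"
    echo "/sbin/mkinitrd -f /boot/initrd-$kver.img $kver"
    echo "/sbin/lilo -v"
}
```

Printing the commands instead of executing them keeps the sketch harmless; the step that matters, and the one that was missed here, is regenerating the initrd and re-running lilo after editing the alias.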

Comment 4 Mike Kinney 2001-05-13 19:03:40 UTC

FWIW, I've got a Netfinity 5000 that has AIC 7895 Ultra SCSI. I tried the 
following, which did NOT work:

- boot 7.1 cdrom
- boot XFS boot.img
- boot XFS bootnet.img
- boot XFS cd (created from ISO image)

What *DID* work:
I installed RH71 on another computer, copied the RPMS, base from CD's 1 & 2 
to /usr/ftp/pub/rh71, changed ftp user to mount /usr/ftp instead of /var/ftp, 
added my uid/gid to /etc/ftpaccess, restarted xinetd by /sbin/service xinetd 
restart (not sure if I needed to), made sure ftp anonymous was working, booted 
with the 7.1 CDROM, did an FTP install from the other RH71 box.

Hopefully that will help others!

BTW, I, too, am a little disenchanted with RH. They have done an excellent job 
over the years, but the 7.0 release really upset me. Couldn't compile the 
kernel out of the box?!?! I bought RH70 and was disappointed that I did. When 
7.1 was released, I felt it'd be better to download the ISO's. Boy, am I *GLAD* 
that I did. I didn't want to WASTE more money... Hopefully, RH7.2 (or whatever 
the next one will be...) will install smoother...

Also, I want to thank the person that suggested to do ALT-F2, ALT-F3, ALT-F4 on 
the installs. I had no idea that there was more detailed info avail during the 
install. Now I have *SOMETHING* more entertaining to watch. :-)

Comment 5 Alexander Gd_ler 2001-05-20 01:59:56 UTC

I have installed an old pc system as server running RH7.1.
PC system:  DEC, Digital Celebris 560
            96 MByte RAM
            Onboard VGA, Keyboard, Mouse
            Adaptec AHA-2940 Ultra/Ultra W
            3COM Etherlink XL, 3C900 Combo
            Harddisk IBM DCAS-34330W and Seagate ST39173LW (connected on wide scsi)
            CD-ROM Plextor PX-6XCS (connected on wide scsi via adapter)
            LILO in /boot partition
            e2fs filesystem
            running DHCP

The system was newly installed with RH7.1, for use as an experimental 
nfs/ftp/web/kickstart server. For normal/console operation (without network 
connections) the system works properly.

With a second pc I tried a network based kickstart installation. During 
this installation the server hung completely and stayed hung until a manual 
hard reset.
In most cases (>90%) the server hangs during the loading of the selected 
packages, sometimes after 3 minutes (or a few mbytes) and sometimes after 30 
minutes (or more than 1 gbyte) (tested with both nfs and ftp based installation).
While the server is hung, it cannot be reached with the ping command from a 
third pc, but the second pc can.

After searching bugzilla I found several problems regarding the aic7xxx scsi 
driver. I switched to the described driver aic7xxx_mod (/etc/modules.conf, 
aic7xxx_mod; depmod -a; mkinitrd; second section in /etc/lilo.conf, lilo) and 
the server is running well.
Switching back to the original driver as a test always reproduces the 
problem.

Comment 6 William W. Austin 2001-05-20 03:39:06 UTC

dledford@redhat wrote:
Did you remember to do the mkinitrd and lilo steps when changing the SCSI
module?  Just doing a depmod -a doesn't activate the change, you have to make a
new initrd and you have to run the lilo command so lilo can map the new initrd
image.  

Yes, I did the new mkinitrd.  Unfortunately I misspelled the output file in 
/boot and didn't catch it before the reboots.  After I saw your note, I 
checked and fixed my blunder -- I have been running the aic7xxx_mod module 
with improved performance ever since.  Thanks.
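
A misspelled initrd file name like the one described above can be caught before rebooting by checking that every initrd= entry in the LILO configuration points at a file that exists. This is only a sketch under assumptions: the function name check_initrd_paths and its optional root-prefix argument are invented for illustration, and it assumes one initrd= setting per line with no spaces in the path.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): verify that each initrd=
# path named in a lilo.conf-style file exists on disk.
# Usage: check_initrd_paths LILO_CONF [ROOT_PREFIX]
# ROOT_PREFIX is prepended to each path; it is empty for a live system.
check_initrd_paths() {
    conf="$1"
    root="${2:-}"
    rc=0
    # Extract the value of every "initrd=" line (leading whitespace allowed)
    for path in $(sed -n 's/^[[:space:]]*initrd[[:space:]]*=[[:space:]]*//p' "$conf"); do
        if [ -f "$root$path" ]; then
            echo "ok: $path"
        else
            echo "MISSING: $path"
            rc=1
        fi
    done
    return "$rc"
}
```

Running check_initrd_paths /etc/lilo.conf before the lilo step would print one line per entry and return non-zero if any image is missing.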
