Bug 50001

Summary: (SCSI AIC7XXX) Machine dies with SCSI AIC7xxx problems
Product: [Retired] Red Hat Linux
Reporter: gsr.bugs
Component: kernel
Assignee: Doug Ledford <dledford>
Status: CLOSED CURRENTRELEASE
QA Contact: Brock Organ <borgan>
Severity: high
Priority: high
Version: 7.1
CC: chedong, djuran, gibbs
Target Milestone: ---
Target Release: ---
Hardware: i586
OS: Linux
Doc Type: Bug Fix
Last Closed: 2004-09-30 15:39:05 UTC

Description gsr.bugs 2001-07-25 21:18:43 UTC
Description of Problem:

The machine hangs after detecting the previous install. It reports where
the previous install is (/dev/sda2), then goes into "Finding packages to
upgrade" and I have to press reset after a while, because the disk only
makes noises and VT4 fills up with "resetting SCSI bus" messages.

How Reproducible:

It happens every time I try to use the RH7.1 CDs to upgrade from previous
versions.

Steps to Reproduce:

1. Set the machine to boot from CD (ISO from the Internet, md5sum passed).
2. Enter "linux expert text" when prompted.
3. Select options (language, kbd, update installation) and wait for
Finding packages.

Actual Results:

The upgrade does not work. On the next reboot I have to check all the ext2
partitions, including the data one that only has personal things and
no OS files. Luckily, no data loss seems to have happened.

Expected Results:

Upgrade to 7.1. It was my test machine; I have another two with similar
SCSI cards and I would like to move them to 7.1. :]

Additional Information:

I had 6.1 on that machine and tried to upgrade to 7.1, which failed,
so I tried step-by-step upgrades. 6.2 and 7.0 worked fine; I upgraded
one after another without caring about config changes (such as
checking for new options in servers or reapplying my personalizations to
bashrc), just checking that they worked. It now has an (unconfigured) 7.0.
The step from 7.0 to 7.1 failed in the same way as the direct one
from 6.1, so the problem seems to be in something that the 7.1 kernel
does.

The computer has an AMD K6-II 400MHz CPU, an Asus P5A-B motherboard, 64 MB
RAM, an Adaptec 2940UW PCI card, a 4GB Samsung SCSI HD, a Plextor CDR, an S3
Virge PCI video card, an ISA NE2000 Ethernet clone and a SB16ASP ISA.
lspci (with the 7.0) shows:

00:00.0 Host bridge: Acer Laboratories Inc. [ALi] M1541 (rev 04)
00:01.0 PCI bridge: Acer Laboratories Inc. [ALi] M5243 (rev 04) (prog-if 00
[Normal decode])
00:03.0 Bridge: Acer Laboratories Inc. [ALi] M7101 PMU
00:07.0 ISA bridge: Acer Laboratories Inc. [ALi] M1533 PCI to ISA Bridge
[Aladdin IV] (rev c3)
00:0a.0 SCSI storage controller: Adaptec AIC-7881U (rev 01)
00:0b.0 VGA compatible controller: S3 Inc. ViRGE/DX or /GX (rev 01)
(prog-if 00 [VGA])
00:0f.0 IDE interface: Acer Laboratories Inc. [ALi] M5229 IDE (rev c1)
(prog-if 8a [Master SecP PriP])

In VT4, when the disk makes noise, I can see lots of (hand copied,
maybe typos):
<4>SCSI host 0 abort (pid 0) timed out - reseting
<4>SCSI bus is being reset for host 0 channel 0.

From time to time it shows something about an error in the clock or a
spurious IRQ 7 interrupt. But mainly it just dumps errors about SCSI,
resetting again and again, with some "trying harder" ones.

I read some bug reports, but they are about Intel chipsets. Anyway, I
tried the "noprobe" parameter once to load the aic7xxx_mod module and
got a kernel hang, with CPU registers dumped to the screen and only VT
switching working, so that was even worse.

More info available upon request, of course.

Comment 1 gsr.bugs 2001-07-28 14:34:55 UTC
I tried one more time and switched to VT4 as soon as possible, and I also saw
the following lines:

<4> (scsi0) BRKADDRINT error  (0x20):
<4>  Scratch Ram/SCB Array Ram Parity Error
<4> (scsi0) SEQADDR=0x69

It repeats, changing the 0x69 to other two-digit hex numbers like 0x68 or
0x6a, and after some seconds of that the resetting messages appear.
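
If these messages were captured (for example over a serial console) instead of copied by hand, the SEQADDR values could be tallied to see whether the parity error sticks to one sequencer address or moves around. The following is only a minimal sketch: the sample text is illustrative, built from the values mentioned in this comment, and the function name `tally_seqaddr` is hypothetical.

```python
import re
from collections import Counter

# Matches both "SEQADDR=0x69" (old driver) and "seqaddr = 0x196" (new driver).
SEQADDR_RE = re.compile(r"SEQADDR\s*=\s*(0x[0-9a-fA-F]+)", re.IGNORECASE)

def tally_seqaddr(log_text):
    """Count how often each sequencer address appears in a captured log."""
    return Counter(int(m.group(1), 16) for m in SEQADDR_RE.finditer(log_text))

# Illustrative sample only, not a verbatim capture from the machine.
SAMPLE = """\
<4> (scsi0) BRKADDRINT error  (0x20):
<4>  Scratch Ram/SCB Array Ram Parity Error
<4> (scsi0) SEQADDR=0x69
<4> (scsi0) SEQADDR=0x68
<4> (scsi0) SEQADDR=0x6a
<4> (scsi0) SEQADDR=0x69
"""

counts = tally_seqaddr(SAMPLE)
for addr, n in sorted(counts.items()):
    print(f"0x{addr:02x}: {n}")
```

A tally spread over several neighbouring addresses, as described here, would at least suggest the error is not tied to a single sequencer instruction.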

Comment 2 Doug Ledford 2001-08-02 18:15:32 UTC
This appears to be a general problem with the 2.4 kernel and your machine that
is merely making the aic7xxx driver and card look suspect.  Try going into the
BIOS on your motherboard and changing the settings related to PCI delayed
transactions, PCI Write Posting, and any PCI caching.  Hopefully, one of those
settings will get your machine working with the 2.4 kernel.  Outside of that, I
have no way to reproduce your problem here.

Comment 3 gsr.bugs 2001-08-02 22:21:28 UTC
The BIOS only seems to have Passive Release (default Enabled) and Delayed
Transaction (default Disabled). I tried with both Enabled and both Disabled,
with the same results, except that now the partitions need manual checking, so
I postponed the fourth test (Disabled & Enabled) until tomorrow. It also
has PnP OS, which is set to its default value of No. There is nothing about
Write Posting or Caching (except the CPU cache). Should I change the PnP OS
option too? The defaults worked pretty well with 2.2.x kernels.

When you say general, do you mean general as in "many people", or something that
is inside the kernel and is hard to trace? If you point me to the instructions,
I could try to capture all the output since bootup via serial console to another
computer, or dump it to floppy (if the errors do not damage it), so you can
get a better idea of what the kernel is doing. Could this have been fixed in a
newer kernel? Are there any kernel parameters I should use?

Comment 4 gsr.bugs 2001-08-11 15:48:02 UTC
OK, I tried the other possibility, Disabled & Enabled, and it still has
the problems with the SCB and the resetting. I am getting a nice collection
of files in lost+found. :[

All seem to be Berkeley DB files. The errors start after the installer detects
where the previous install is, during Finding packages (probably while making the
backup copy of the DB, because the RPM DB still works but I am getting lots
of anaconda-rebuilddb dirs in /var/lib/), so I guess this error is of the
"aic7xxx has problems under load" kind, as noted in the gotchas page.

Other things to test?

Comment 5 Doug Ledford 2001-08-23 22:55:40 UTC
Try the new aic7xxx driver instead of the old (default) one.  To do that, boot
with the command line "linux noprobe" at the boot disk's boot: prompt.  Then
when it asks for drivers, select SCSI driver, then select the new aic7xxx driver
(not the standard aic7xxx driver) and see if it makes a difference.

Comment 6 gsr.bugs 2001-08-26 22:06:05 UTC
OK, I booted with "linux text expert noprobe" and used the module marked as
new experimental, aic7xxx_mod. I created a device for the disk and tried
a dd if=/dev/sda of=/dev/null, which made the LED go red without errors.

Then I deleted the device and continued with the install. Now it dies with
something like the following text when trying to detect the installed
packages so it can perform the update (= same place, different text):

<4> scsi0: brkaddr, Scratch or SCB Memory Parity Error at seqaddr = 0x196
<4> scsi0: Dumping Card State at seqaddr 0x196
<4> scsiseq = 0x12, sblkctl = 0x2, sstatq 0x0
<4> SCB count = 28
<4> kernel NEXTQSCB = 3
<4> Card NEXTQSCB = 22
<4> QINFIFO entries: 22 23 16 17 18 19 12 13
<4> Waiting Queue entries:
<4> Disconnected Queue entries: 14:20 15:27
<4> QOUTFIFO entries
<4> Sequencer Free SCB List: 7 3 6 5 12 8 9 11 2 0 1 10 4
<4> Pending List: 13 12 19 18 17 16 23 22 21 20 27
<4> Kernel Free SCB List: 0 6 11 2 10 14 8 15 7 1 9 5 4 26 25 24
<4> DevQ(0:0:0): 0 waiting
<4> DevQ(0:1:0): 0 waiting

A few seconds after that, I get two of those CPU register dumps from the kernel,
saying it is unable to handle a kernel NULL pointer dereference at virtual
address 00000000 (I was tired of copying numbers).

I had to power off and then on, because with a warm reboot the disk was not
detected and the SCSI BIOS was not installed. On booting RH 7.0, I had to run a
manual fsck, and a directory was connected to lost+found.
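
The card state dump in this comment can be given a rough consistency check. The sketch below assumes that every SCB index is either on the Pending List, on the Kernel Free SCB List, or held as the kernel NEXTQSCB; that is an assumption about the driver's bookkeeping, not something stated in this report. The function name `parse_dump` is hypothetical, and only the dump lines needed for the check are reproduced.

```python
import re

# Subset of the hand-copied dump from this comment.
DUMP = """\
<4> SCB count = 28
<4> kernel NEXTQSCB = 3
<4> Card NEXTQSCB = 22
<4> QINFIFO entries: 22 23 16 17 18 19 12 13
<4> Pending List: 13 12 19 18 17 16 23 22 21 20 27
<4> Kernel Free SCB List: 0 6 11 2 10 14 8 15 7 1 9 5 4 26 25 24
"""

def parse_dump(text):
    """Map each 'name = n' or 'name: n n n' line to a list of integers."""
    fields = {}
    for line in text.splitlines():
        line = re.sub(r"^<\d+>\s*", "", line)  # strip the kernel log level
        m = re.match(r"(.+?)\s*[=:]\s*([\d\s]*)$", line)
        if m:
            fields[m.group(1)] = [int(tok) for tok in m.group(2).split()]
    return fields

fields = parse_dump(DUMP)
scb_count = fields["SCB count"][0]
pending = set(fields["Pending List"])
free = set(fields["Kernel Free SCB List"])
next_q = fields["kernel NEXTQSCB"][0]

# Assumed invariant: pending + free + NEXTQSCB covers every SCB index.
unaccounted = set(range(scb_count)) - pending - free - {next_q}
# Anything queued to the card should also be pending in the kernel.
queued_not_pending = set(fields["QINFIFO entries"]) - pending

print("unaccounted SCBs:", sorted(unaccounted))
print("QINFIFO entries not pending:", sorted(queued_not_pending))
```

For the dump as copied here, both lists come out empty, so under that assumption the kernel-side lists look self-consistent. This does not explain the parity error itself.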

Comment 7 Doug Ledford 2001-08-27 14:24:14 UTC
I don't have the hardware to duplicate your problem (and I'm unable to cause it
on any of my systems here).  However, this obviously isn't just a problem with
the old aic7xxx driver, it also affects the new aic7xxx driver.  So, I've Cc:ed
Justin Gibbs on this report in case he knows what the problem is or in case he
has the necessary hardware to reproduce the problem.

Comment 8 Need Real Name 2002-02-05 22:22:28 UTC
I'm trying to install Red Hat 7.1 on a PC with an Adaptec 7896 SCSI
driver and an Intel 440 motherboard. I've read through bug
report #29555. Thanks to Doug Ledford's solution, I was able to
successfully install the package onto that PC. (FYI, here's
what I did:
   1. install with "linux apic" option
   2. Select "kernel-smp" package during installation, and
   3. Boot with "default=linux-smp" option.)

Then I tried to recompile the kernel (you know, to make it smaller
and faster), and got the following error during boot:

....
blk: queue c7dc7e18 I/O limit 4095Mb
(scsi0:0:0:0) Synchornous at 80.0 Mbyte/sec, offset 15
SCSI device sda: 17783240 512-byte hdwr sectors 9105MB
Partition check
  /dev/scsi/host0/bas0/target0/lun0: p1 p2 <p5 p6>
...
mkrootdev: mknod failed 17
mount error 16 mounting ext2
pivot_root: pivot_root (sysroot, /sysroot/initrd) failed: 2
...

Kernel panic ...


I compared it with the boot messages of a healthy kernel
(the default one downloaded by ftp), and noticed that the difference
is during "Partition check", where the healthy one gives the following
message:

...
Partition check
  sda: sda1 sda2 < sda5 sda6 >
....


It seems that the SCSI driver can't map the file system correctly onto the
hard disk. I've tried both the aic7xxx.o and aic7xxx_old.o drivers, and
both give me the same result. Is it an aic bug or a file system
bug? Or maybe I forgot to include some option when compiling the kernel?

Comment 9 Doug Ledford 2002-02-13 22:33:14 UTC
You enabled devfs support in your kernel by default, which we don't do, and
which changes how drives are mapped.
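
One way to check for this before rebooting is to look at the devfs options in the kernel `.config` used for the rebuild. The sketch below is hedged: it assumes the 2.4-era option names `CONFIG_DEVFS_FS` and `CONFIG_DEVFS_MOUNT`, which should be verified against the kernel tree actually in use, and the sample config text and the function name `devfs_settings` are hypothetical.

```python
import re

# Assumed 2.4-era option names; verify against your own kernel tree.
DEVFS_OPTIONS = ("CONFIG_DEVFS_FS", "CONFIG_DEVFS_MOUNT")

def devfs_settings(config_text):
    """Return the value of each devfs option, or 'n' if it is not set."""
    settings = {name: "n" for name in DEVFS_OPTIONS}
    for line in config_text.splitlines():
        # Lines such as '# CONFIG_X is not set' start with '#' and are skipped.
        m = re.match(r"(CONFIG_\w+)=(\S+)", line.strip())
        if m and m.group(1) in settings:
            settings[m.group(1)] = m.group(2)
    return settings

# Hypothetical excerpt of a .config, for illustration only.
SAMPLE_CONFIG = """\
CONFIG_EXT2_FS=y
CONFIG_DEVFS_FS=y
CONFIG_DEVFS_MOUNT=y
# CONFIG_DEVFS_DEBUG is not set
"""

settings = devfs_settings(SAMPLE_CONFIG)
for name, value in settings.items():
    print(f"{name}={value}")
```

If both options report `y`, the rebuilt kernel would use devfs-style names such as the /dev/scsi/host0/... path seen in the failing boot, rather than the sda1-style names the stock initrd expects.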

Comment 10 Need Real Name 2002-12-06 03:12:58 UTC
I tried the following versions of Red Hat on my machine (dual PIII 866 with SCSI aha-2940):
6.2  hangs at Disk Druid
7.1  hangs at probing the mouse
7.2  hangs at Disk Druid
7.3  hangs at Disk Druid
8.0  hangs at loading the aic7xxx driver (sometimes it gets past this)



Comment 11 Bugzilla owner 2004-09-30 15:39:05 UTC
Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/