Bug 92129

Summary: (SCSI AACRAID) kernel: aacraid: Host adapter reset request. SCSI hang ?
Product: Red Hat Enterprise Linux 3
Reporter: Javier Rodriguez <jlrodriguez>
Component: kernel
Assignee: Doug Ledford <dledford>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Version: 3.0
CC: andykinney, anielsen, asousa, bbrock, bryan, bugzilla-redhat, cpbarton, david_gardi, diego_leccardi, elliot, gregj, honermeyer, jacob_liberman, james, jcpeck, jdenny, jharrop, jparsons-redhat, kevin.mcmahon, k.georgiou, magnus.ahl, manuel.wenger, mark_salyzyn, nospam, pd, petrides, redhat.com, rknigh, roy.olsen, sturolla, sysadmin, tao, timh, viggiani, wenthe
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2004-11-07 03:51:42 UTC
Attachments:
- latest driver version from Mark_Salyzyn@adaptec.com
- logfile and system information from web1 (PowerEdge 2650)
- Redhat Kernel debugging pdf

Description Javier Rodriguez 2003-06-03 00:57:52 UTC
Description of problem:

We are looking for assistance in resolving the following problem.
 
We recently purchased two Dell PowerEdge 2650 servers with PERC3/Di 
controllers. Both servers were loaded with RedHat Linux 9.0. Both servers are 
encountering the following error with RedHat's kernel-smp-2.4.20-9 and kernel-
smp-2.4.20-13.9:
 
<<< Portion of server message log >>>
May 31 16:14:07 server1 kernel: aacraid: Host adapter reset request. SCSI 
hang ?
May 31 16:14:17 server1 kernel: scsi: device set offline - command error 
recover failed: host 0 channel 0 id 0 lun 0
May 31 16:14:17 server1 kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 
return code = 6000000
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 83200
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 13568
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 13616
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 83200
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 22030904
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 88348712
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 72976
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 13624
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 13752
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 13768
May 31 16:14:17 server1 kernel:  I/O error: dev 08:03, sector 72976
<<< I/O error messages continue until the server is rebooted >>>
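
(A note on reading the log: the 2.4 SCSI layer prints "return code" in hex, 
packed as driver, host, message and status bytes, so the 6000000 above is 
0x06000000 - a driver byte of 0x06, which the 2.4 headers call 
DRIVER_TIMEOUT if memory serves. A minimal sketch of that decode, with the 
packing taken as an assumption:

#include <stdio.h>

int main(void)
{
	unsigned int rc = 0x06000000;	/* "return code" value from the log */
	printf("driver=0x%02x host=0x%02x msg=0x%02x status=0x%02x\n",
	       (rc >> 24) & 0xff, (rc >> 16) & 0xff,
	       (rc >> 8) & 0xff, rc & 0xff);
	return 0;
}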
 
 
Here are a few notes regarding the error and operating environment:
 
- The error occurs with RedHat's kernel-smp-2.4.20-9 and kernel-smp-2.4.20-
13.9. The problem occurs more frequently with 2.4.20-13.9. We are testing 
kernel-2.4.20-9 to determine if the problem occurs under a non-smp environment.
- The servers are configured for RAID 5 with 3 disks each.
- The time between failures varies from several hours to several days.
- The failures occur both during light and heavy system loads.
- PowerEdge 2650 BIOS is at 1.10 A10
- Backplane firmware is at 1.01
- PERC3/Di BIOS is at V2.7-1 (build 3170)
- Full system diagnostics have been executed successfully on both servers.
- The RAID media has been successfully 'verified' on both servers.
- Dell hardware support has been contacted, and so far no hardware problems 
have been uncovered.
 
Thank you in advance for your assistance to get this problem resolved.


Version-Release number of selected component (if applicable):


How reproducible:

This problem cannot be reproduced at will. It shows up sometimes after 
several hours of operation, and sometimes after several days.


Steps to Reproduce:
1.
2.
3.
    
Actual results:


Expected results:


Additional info:

In all cases, the server needs to be rebooted to resolve the problem. All data 
with pending disk writes is lost. At times, the system remains up long 
to allow for the capture of information such as the portion of the message log 
showing the error.

Comment 1 Alan Cox 2003-06-05 20:00:18 UTC
Currently working with adaptec on updating aacraid and fixing bugs in it


Comment 2 Manuel Wenger 2003-07-24 10:45:04 UTC
Same problem here with the same hardware, but on Redhat 7.3 Kernel 2.4.20-19.7.

Comment 3 Stefano Turolla 2003-08-26 12:32:05 UTC
We have the same problem with PowerEdge 1650 and 2650 machines.
We tried different kernel versions and Red Hat releases (7.3 and 9).
Kernels tried:
2.4.18-18.7.x
2.4.18-24.7.x
2.4.18-26.7.x
2.4.20-13.7
2.4.20-19.7
2.4.21.ac2-rc2
As a workaround I removed the RAID controller from some 1650s and
reinstalled the machines with only the SCSI interface connected. We haven't
had any more crashes in the last month!

Comment 4 Javier Rodriguez 2003-08-26 15:18:57 UTC
Does anyone developing the aacraid driver have an update regarding the problem 
below? Disabling HyperThreading (Logical Processor) within the Dell 2650 BIOS 
has without a doubt circumvented the problem for us (as well as a few others), 
but it would be nice to reenable the feature.
 
For reference, with HyperThreading disabled, we've been able to successfully 
execute Red Hat's distribution of Linux kernel-2.4.20-9, kernel-smp-2.4.20-9, 
kernel-2.4.20-13.9, kernel-smp-2.4.20-13.9, kernel-2.4.20-18.9 and kernel-smp-
2.4.20-18.9. We currently have two Dell 2650s executing kernel-smp-2.4.20-18.9 
for 74 days without incident. Prior to disabling HyperThreading, our systems 
would normally crash within 24 hours (no longer than 48 hours) with both the 
smp and non-smp version of the kernel.
 

Comment 5 Stefano Turolla 2003-08-27 08:28:27 UTC
For most of our machines (1650s) disabling hyperthreading makes no sense,
as they have one CPU (Pentium III, from 1.4 to 1.7 GHz) with no hyperthreading,
of course.
On the other hand we have other 2650s that have been running for two or three
months without problems, some of them with hyperthreading disabled.
A couple of other 2650s had several crashes when they were used as FTP servers.
I don't know what it really means, but it seems like something not really
related to hyperthreading, only to high sustained I/O.

Comment 6 Stefano Turolla 2003-09-24 10:23:13 UTC
Any good news on the aacraid driver?
The problem is still there for us.
Thanks



Comment 7 Alan Cox 2003-09-24 11:07:36 UTC
None really. I've got some possible test patches but not much else.


Comment 8 Javier Rodriguez 2003-09-24 14:41:15 UTC
Based on the many posts I've been reading, it appears that the aacraid driver 
problems have been occurring for some time now, and although people are 
working on it, a formal resolution is not available. What can we do to help 
expedite a resolution for the ongoing aacraid problems? If there are possible 
patches available, can Red Hat provide an RPM or procedure to allow us to 
install and test the patches? Is there specific debug or configuration 
information that we can provide to help the aacraid developers better isolate 
the problem? Thanks.

Comment 9 Alan Cox 2003-09-24 14:54:37 UTC
I'm currently doing some testing in upstream kernels (2.4.22-ac4 patches). I
guess whoever takes over from me will pick up on that, or Adaptec will
eventually fix this in their driver updates, since they at least have firmware
source/docs and so can trace such problems in detail.

Mark Salyzyn <mark_salyzyn> is probably the person who can tell you
best about any updated drivers and Adaptec recommended updates as well as give
you test driver bits


Comment 10 Alan Cox 2003-09-24 14:55:54 UTC
(removing from Cc as I'll be away for some time from tomorrow)


Comment 11 Stefano Turolla 2003-09-24 15:02:05 UTC
We have tried to involve DELL (Germany) several times to help us with this problem.
After a lot of suggestions like "update the firmware here and there", their
final conclusion was to subscribe to RedHat Enterprise Linux (ES) standard edition 
and ask RedHat for support :-)
I hope someone can find a solution, otherwise we will be forced to stop buying DELL
servers and to disable the RAID controller on the existing ones.
I don't know if it would be of some help to involve Adaptec, but who better than
Alan would know this. 
I totally agree with the previous comment: if there is something we can do to help
the developers, just let us know.

Comment 12 acount closed by user 2003-09-24 21:22:49 UTC
There is an aacraid devel mailing list at
http://lists.us.dell.com/pipermail/linux-aacraid-devel
There are DELL and ADAPTEC guys, and others, who can help you.

Comment 13 acount closed by user 2003-10-16 00:41:19 UTC
Created attachment 95216 [details]
latest driver version from Mark_Salyzyn

Comment 14 Mimmus 2003-10-20 10:41:14 UTC
I can confirm that under 2.4.20-8smp the problems persist.
Now I'm running 2.4.20-8 (not smp) and the machine is still up after 3 days.
I'm planning to update to 2.4.20-20.9 smp/not-smp and retry.

Comment 15 Anders Nielsen 2003-11-30 20:16:52 UTC
I am having the same problem on a server running RHELv3. 

Any news on this bug?

Comment 16 Mimmus 2003-12-01 08:12:47 UTC
I had 2 system disks in RAID1 (mirror) and 2 single disks configured as
'VOLUME'.
Now I have reconfigured my disks from 'Volume' to 'Raid0' in the Perc 3/Di
setup, and my proxy server has been up for 40 days.

Can anyone confirm this?

Comment 17 Kevin McMahon 2003-12-01 17:12:21 UTC
I had my disks in hardware Raid0.  I've now turned off the RAID in 
BIOS, and I'm using a Linux software Raid0.  That hasn't crashed yet, 
and has been up for 24 days, which I think is a record for us with 
these machines.

Comment 18 Anders Nielsen 2003-12-02 14:09:32 UTC
The problem has also been discussed at Dell's Poweredge mailing list at

http://lists.us.dell.com/pipermail/linux-poweredge/

On October 20th Mark Salyzyn mentioned that he has a driver
workaround. Can anybody confirm this? And is redhat going to include it?

Comment 19 acount closed by user 2003-12-03 02:03:14 UTC
Not the latest code, but close to it, is here:

http://www.adaptec.com/worldwide/support/driverdetail.html?sess=no&language=English+US&cat=%2fOperating+System%2fLinux&filekey=aacraid-drv_1.1.4-rh9.rpm

and don't forget to update the firmware of the RAID board.

Comment 20 Anders Nielsen 2003-12-15 10:58:24 UTC
If you have Seagate disks this article describes a timeout problem
that can be fixed by updating the firmware of the _disks_.

http://www.seagate.com/support/disc/u320_firmware.html


Comment 21 Peter Dieth 2003-12-16 21:35:16 UTC
I can confirm that we are having the same problem with a brand new Dell 
PowerEdge 2650 system, 2 x 2.8 GHz Xeon, Perc 3/Di, 4 Maxtor Atlas
73GB drives configured as one RAID5 container, running RedHat Enterprise
Linux 3 with the latest patches installed (the kernel is
vmlinux-2.4.21-4.0.1.ELsmp).

Comment 22 acount closed by user 2003-12-16 23:17:35 UTC
with HW RAID boards it's essential to have the _latest_ firmware releases
on the hard disks and the RAID board. And the latest drivers ;)

Comment 23 Mimmus 2003-12-17 14:33:53 UTC
I confirm that the problems disappeared after changing the container from
'Volume' to 'Raid0'. No experience with RAID5.

Comment 24 Peter Dieth 2003-12-19 16:59:42 UTC
Created attachment 96638 [details]
logfile and system information from web1 (PowerEdge 2650)

Comment 25 Peter Dieth 2003-12-19 17:11:41 UTC
We are still experiencing the same SCSI problems after 
making & running a decent static RHEL3-ES kernel with the
new 1.1.4 driver from Mark Salyzyn posted here (attachment from 
2003-10-15). The uptime was about 1 day.

I/O error: dev 08:02, sector 0
I/O error: dev 08:02, sector 3944840
I/O error: dev 08:02, sector 3942371
EXT3-fs error (device sd(8,2)) in ext3_reserve_inode_write: IO failure
I/O error: dev 08:02, sector 3881238
...

Please see attachment "logfile and system information from web1
(PowerEdge 2650)" for details.

If you need more information, please contact me directly.

Cheers,
Peter


PS: I also opened a service call with Dell Germany, but they told me
that this is not a hardware but a driver problem, and that RedHat is
responsible for fixing it. 

Comment 26 Arjan van de Ven 2003-12-19 17:13:14 UTC
for support on RHEL bugs, bugzilla is the wrong medium, contact your
RH support contact instead.

Comment 27 Peter Dieth 2004-01-06 11:16:47 UTC
Arjan:
We are currently evaluating RHEL3-ES (x86) on the Dell 2650 platform, and
we were not able to file a bug against RHEL because we have not activated
the product yet. This is because we fear that we may have to replace the 
hardware platform (with other PC-based servers) in the future due to the
pending aacraid problem.
Can you tell me how we can file a bug against RedHat ES without 
product activation?

Comment 28 Theron Toomey 2004-01-20 22:46:05 UTC
I am experiencing this problem on two systems, a 2650
w/hyperthreading/smp and a 1650 (non-smp). Both systems run RH 7.3
with latest available kernel.

Has anyone else seen this bug on a non-smp kernel besides me?

Thanks

Comment 29 Stefano Turolla 2004-01-21 08:50:21 UTC
The 1650s on which we are still experiencing this problem
have only one CPU and no hyperthreading.
My understanding is that this bug is strictly related to the
aacraid driver for the PERC 3/Di and not to smp/hyperthreading.

Comment 30 Mimmus 2004-01-21 12:58:27 UTC
Bug still persists even if it is not easily reproducible.
On Jan 19, Matt Domsch from Dell said on the AACRAID mailing-list:

"aacraid was not updated in [Red Hat 3 AS] Update 1.  We're still
trying to duplicate the failure in our labs, and we've got a failing
customer system on its way back to us for exactly that purpose.  Once
we have root-caused the issues, we'll work with Red Hat and kernel.org
to get any driver-side fixes included."

Comment 31 Brian Whitehead 2004-01-22 16:34:59 UTC
Well I'd say you can have mine, but being a production system I have
to format the hard drive and install another OS.  Looking at all of
the information on the net about it, it seems to be affecting
primarily Dell servers with Adaptec based RAID controllers, mostly
PERC3.  I can send all of the logs that I sent RHES tech support and
Dell Gold Support, who both said they couldn't help.  It also appears
to be only affecting systems based on the later 2.4.x kernels, because
I had no problem with RH7.2, but many people with 7.3 and newer are
experiencing it.  So, what has changed between these versions?
This driver was rewritten around that time.  This problem has been
going on for at least a year from looking at this bug report and the
Adaptec and Dell lists.  I hate to be the critic here, but somebody
needs to take ownership of this problem and work with the vendors to
get it resolved, and stop pointing fingers, making excuses and waiting
for someone else to fix it.

Comment 32 Norman Elton 2004-01-27 05:53:08 UTC
At the risk of making Bugzilla into a forum, I figured I'd chime in to
say our 2650, running RHEL AS 3, just had its first crash after
running for approx 2 months without incident. The utilization is very
light, but we are in the process of deploying it as a NFS/NIS server.
We purchased a production level Dell box (as opposed to a home-built
server) for its reliability. Oops.

Are there any tests (seti@home, bonnie, etc.) that are known to
trigger a crash? The "once every two months" is almost more disturbing
than "once a night".

Hopefully this will get addressed very soon.

Comment 33 Brian Whitehead 2004-01-28 17:10:08 UTC
I have moved my production system to an old Compaq server to allow for
testing this issue on the Dell that is experiencing the problem. 

First, I disabled the postgres database on this server and it appears
that the crashing has not happened over the last few days.  But I had
set up a cron job to reboot the server to try to keep it from crashing
too often.  I've removed this reboot and will see if stopping the
postgres database truly has an effect on the server.  With it
running the system wouldn't stay up more than 8-12 hours.  Also, a
contact at Dell had me check the cache setting on the  RAID
controller.  The write caching was disabled, so I've now enabled it. 
If anyone out there is experiencing this same 'scsi hang' issue with
their RAID controller and the aacraid driver, I would suggest checking
to see if postgres is running.  If it is and you don't need it then
shut it down and see what happens.

Comment 34 Eric D. Hendrickson 2004-01-28 22:42:00 UTC
This has been happening to me on almost a daily basis.  I have two 
dual 2.4GHz Dell 2650's running Redhat 8.0 with kernel 2.4.23, with 
two drives in RAID-1 configuration.  I have tried disabling 
Hyperthreading to no avail.

We are looking at using more of these systems as we migrate from 
Windows 2000 to Linux (specifically RedHat Enterprise 3.0 and Fedora) 
but as of now our production Linux systems are still on Redhat 8.0.  
So far I have two registered AS 2.1 systems and one AS 3.0 system 
that do not have this problem but they are different hardware (Dell 
6650).

Any advice would be appreciated.

Comment 35 Mark Salyzyn 2004-01-30 13:44:07 UTC
This is the root of the fix that is part of the source code posted in 
this thread on October 15, 2003. The 2.6 tree has had this fix for some 
time; the 2.4 tree is still cogitating over accepting this patch. 
Please apply this patch *NOW* to the drivers/scsi/aacraid/linit.c 
file. The original was taken from 2.4.21-9.EL.

Sincerely -- Mark Salyzyn

--- linit.c.orig	Fri Jan 30 05:38:19 2004
+++ linit.c	Fri Jan 30 05:38:38 2004
@@ -605,8 +805,99 @@
 
 static int aac_eh_reset(Scsi_Cmnd* cmd)
 {
-	printk(KERN_ERR "aacraid: Host adapter reset request. SCSI hang ?\n");
-	return FAILED;
+	Scsi_Device * dev = cmd->device;
+	struct Scsi_Host * host = dev->host;
+	Scsi_Cmnd * command;
+	int count;
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0))
+	unsigned long flags;
+#endif
+
+	printk(KERN_ERR "%s: Host adapter reset request. SCSI hang ?\n", AAC_DRIVER_NAME);
+	if (nblank(dprintk(x))) {
+		int active = 0;
+
+		active = active;
+		dprintk((KERN_ERR "%s: Outstanding commands on (%d,%d,%d,%d):\n",
+		  AAC_DRIVER_NAME, host->host_no, dev->channel, dev->id, dev->lun));
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0))
+		spin_lock_irqsave(&dev->list_lock, flags);
+		list_for_each_entry(command, &dev->cmd_list, list)
+#else
+		for(command = dev->device_queue; command; command = command->next)
+#endif
+		{
+			dprintk((KERN_ERR "%4d %c%c %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x\n",
+			  active++,
+			  (command->serial_number) ? 'A' : 'C',
+			  (cmd == command) ? '*' : ' ',
+			  command->cmnd[0], command->cmnd[1], command->cmnd[2],
+			  command->cmnd[3], command->cmnd[4], command->cmnd[5],
+			  command->cmnd[6], command->cmnd[7], command->cmnd[8],
+			  command->cmnd[9]));
+		}
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0))
+		spin_unlock_irqrestore(&dev->list_lock, flags);
+#endif
+	}
+
+	if (aac_adapter_check_health((struct aac_dev *)host->hostdata)) {
+		printk(KERN_ERR "%s: Host adapter appears dead\n", AAC_DRIVER_NAME);
+		return -ENODEV;
+	}
+
+	/*
+	 * Wait for all commands to complete to this specific
+	 * target (block maximum 60 seconds).
+	 */
+	for (count = 60; count; --count) {
+		int active = 0;
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0))
+		__shost_for_each_device(dev,host) {
+			spin_lock_irqsave(&dev->list_lock, flags);
+			list_for_each_entry(command, &dev->cmd_list, list) {
+				if (command->serial_number) {
+					++active;
+					break;
+				}
+			}
+			spin_unlock_irqrestore(&dev->list_lock, flags);
+			if (active)
+				break;
+		}
+#else
+		for (dev = host->host_queue; dev != (Scsi_Device *)NULL; dev = dev->next) {
+			for(command = dev->device_queue; command; command = command->next) {
+				if (command->serial_number) {
+					++active;
+					break;
+				}
+			}
+		}
+#endif
+		if (active == 0)
+			return SUCCESS;
+#if (defined(SCSI_HAS_HOST_LOCK) || (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)))
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,4,21)) && ((LINUX_VERSION_CODE > KERNEL_VERSION(2,4,21) || !defined(CONFIG_CFGNAME)))
+		spin_unlock_irq(host->host_lock);
+#else
+		spin_unlock_irq(host->lock);
+#endif
+#else
+		spin_unlock_irq(&io_request_lock);
+#endif
+		scsi_sleep(HZ);
+#if (defined(SCSI_HAS_HOST_LOCK) || (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)))
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,4,21)) && ((LINUX_VERSION_CODE > KERNEL_VERSION(2,4,21) || !defined(CONFIG_CFGNAME)))
+		spin_lock_irq(host->host_lock);
+#else
+		spin_lock_irq(host->lock);
+#endif
+#else
+		spin_lock_irq(&io_request_lock);
+#endif
+	}
+	printk(KERN_ERR "%s: SCSI bus appears hung\n", AAC_DRIVER_NAME);
+	return -ETIMEDOUT;
 }
 
 /**

Comment 36 J. Parsons 2004-01-31 19:44:51 UTC
Note to RedHat:
 Dell is shipping systems with RHEL 3.0 pre-installed, and this bug present.  This is a 
big problem for the Dell/RedHat relationship.  I strongly suggest RedHat management 
escalation.

Mark:
 This patch seems to address how we reset the scsi bus after it locks up.  Any idea 
why it's locking up in the first place?  Is that an expected thing?

Thank you.

Comment 37 J. Parsons 2004-01-31 19:47:19 UTC
... also, I have a system that is suffering these lockups as often as once per week.  If 
you need any debugging information from this system, feel free to contact me at 
jparsons dash redhat at saffron dot net.

Comment 38 Brian Whitehead 2004-02-01 03:53:34 UTC
I just spent a week working with Dell engineers, and their final
response is that they haven't signed off on RHES 3.0 yet and that the
vendor of the Lyris ListManager software is reporting bugs with RHES 3.0.
 Well the last part is bogus and just an excuse for no one to take
responsibility again.  They do have issues, but I spoke with their
engineers and it's simply an interface issue to do with Tcl/Tk.  I did
check my PERC3/Di configuration and found the caching turned off.  I
turned it on and it's been running for 3 days without crashing,
whereas before it wouldn't go 8-12 hours without locking up with a
scsi hang.  I would like to see if anyone else could check this
setting and see if it makes any difference.  I have yet to apply the
patch that Mark Salyzyn posted above.  I will probably do that this
week, before I rebuild the system with a different OS.

Comment 39 J Mora 2004-02-01 04:01:10 UTC
I just "had my first" this evening (hang, that is). Please let me know
if you need debugging information. Had to power-cycle the box to bring
it back to life. Error/debug messages are all identical to previous
reports.

Comment 40 Tim Harter 2004-02-01 05:04:02 UTC
My 2650 has also been suffering from this problem.  I have rebuilt it 
a few times, each time scrubbing the RAID 5 3-disk array; it will run 
fine for up to 3 months.  Then the SCSI hangs start to occur, 
first at 1-week intervals, until I am waking up every day at 3:00 am 
to reset it.  I have tried turning off hyperthreading, and will also 
try turning caching off.  I cannot believe all the people reporting 
this identical problem, yet Dell and RedHat have not taken any 
ownership.  I have also seen posts with similar Windows errors on 
this same system.  Unfortunately this is already a production system, 
so it looks as if I will be forced to replace it.

Comment 41 Arjan van de Ven 2004-02-01 15:31:48 UTC
This bug report is less than useful to both RH and Dell actually. It is
a mishmash of different reports about different OSes and different
hardware. The bug is filed against Red Hat Linux 9 and not Red Hat
Enterprise Linux or Red Hat Linux 8 or ...
Bugs in different products are handled by different people, and
bugzilla is *not* a support mechanism where Dell or RH "owe" anyone a
response within a certain amount of time. Red Hat appreciates
bug reports, but in several cases, especially with hardware/firmware
interactions, bugzilla is not the best medium since it bypasses all
regular mechanisms to get joint problem solving going with partners
such as Dell.

Comment 42 Brian Whitehead 2004-02-01 18:08:17 UTC
Unfortunately, from your comments above and the responses that I and
several other users have received from both Redhat and Dell, it is
obvious that neither of you really care about this problem.  The
problem does affect several versions of Redhat.  From reading the
posts here in bugzilla and the posts on both Dell's and Adaptec's forums,
this problem has been around for over a year and is affecting every
version of Redhat from version 7.3 to 9.0 and ES 3.0.  I have ES
Standard support and Dell Gold support, and the response I received
from both of those "support mediums" is essentially 'you're out of luck'
and 'we really don't care'.  So there is a good reason for this to be
posted in Bugzilla, because contacting RHEL support did no good. 
Well, what is paid support for if it doesn't do any good?  While I
agree that a bug tracking system is not the appropriate place for
unhelpful comments, yours included, this is the only place to share
the different experiences that everyone is having with the same
problem.  You being the all-smart one, if you would like to point out
a REAL way to get this resolved then please shed some light on this
for the rest of the community; otherwise keep your comments (Arjan)
about telling everyone to go somewhere else for support to yourself.

Comment 43 Arjan van de Ven 2004-02-01 20:43:52 UTC
If you are unsatisfied with the support Red Hat provides, that is very
unfortunate, and I can try to escalate problems with that within our
organisation, but I need some specifics for that (eg who did you talk
to about what, etc). 
(Obviously I cannot escalate problems with the support Dell provides)

As for the problem; this *really* looks like a hardware/firmware issue
since it happens with such a broad spectrum of drivers. I just looked
at the third level support queue and there is at least one similar
complaint there where both Red Hat and Dell are investigating and
where the firmware route is also being investigated.

This really is the sort of issue that needs investigation by all the
parties involved (RH, Dell, Adaptec), not just Red Hat, and for that
bugzilla is the wrong tool. I wanted to point that out to make
sure no false expectations about bugzilla arise.


Comment 44 Brian Whitehead 2004-02-02 07:44:33 UTC
If you want to provide me with a better place to post/send this 
information I will be glad to use it.  The person I spoke with is 
ckloiber.  The response I received is that I would just have to wait 
for the downlevel developer to fix the problem.  (ie.. Alan Cox or 
Adaptec since they developed the most recent version)  I don't 
understand what you mean by 'such a broad spectrum of drivers'.  All 
of the problems that I've seen here and on the Adaptec and Dell sites 
all center around the aacraid driver.  Dell was actually trying to 
help me for over a week, and Mark from Adaptec has posted a possible 
patch above.  These are both more effort than I received from Redhat, 
I'm sorry to say.   Dell ultimately came back and said they haven't 
signed off on version 3.0 of ES.  Again that's really not the issue 
since the problem exists in 7.3/8/9, but not 2.1 since it is based on 
7.2.  Right now testing the patch that Mark provided and confirming 
the issue with aacraid and then getting it committed to the source 
tree would be the best route that Redhat/Dell/Adaptec could take.  
Then if it is a firmware issue a team effort to help Dell fix it 
would be next step.  Again, this problem looks centered around 
aacraid included in ALL RH versions after 7.2.  I really just want to 
help find a solution to this issue, so any help I can provide just 
ask.

Comment 45 Eric D. Hendrickson 2004-02-02 18:03:46 UTC
We have all seen reports of this occurring on RedHat 8.0, 9.0 and 
unregistered AS3.0 systems, and AFAICT the aacraid driver revision is 
the same across these platforms.  RedHat support isn't available for 
8.0, 9.0 or unregistered AS3.0 systems, but if RedHat wants AS3.0 to 
work on *registered* systems then RedHat needs to pay attention to 
all such reports, not just those for which they are directly being 
paid.

Someone, somewhere (at Redhat, Dell or Adaptec, or all three) needs 
to install AS 3.0 on a 2650 on hardware RAID 1, wait for this problem 
to occur, and fix it.  And then a patch for the driver against the 
stock 2.4 kernel would be very helpful.

As far as expectations with Bugzilla go, Bugzilla is an issue-
tracking tool, yes?  This issue is being tracked here because it was 
first reported here.  Perhaps the version of RedHat affected by this 
bug entry above needs to be changed from "9" to "AS3.0" - however 
that's not even an option(!)

All that matters is that we are evaluating RedHat AS3.0 and we aren't 
necessarily going to do it on a registered system.  The code is the 
same either way, no?  And I'm sure you can imagine how bad it will 
look to my management when I tell them I'm having a problem in our 
test environment with the AS 3.0 RAID device driver on the hardware 
we currently use in production, such that the system needs to be 
rebooted every day or two.  So far I haven't had to make a final 
statement on the suitability of this for production use, but I can't 
continue delaying that forever.

One aside: how could RedHat release AS3.0 without signoff from all 
the major hardware vendors?  Dell is a pretty big hardware vendor for 
RedHat, no?

These 2650's have version A15 of the BIOS - is it possible that a 
BIOS update could affect this?  (I see a revision A17 available at 
dell.com.)  I might try that.

Also, I'm also going to take a look at the caching for the RAID 
controller, as someone above mentioned.  And I'll try patching the 
driver with the patch included above in my next kernel build.

Or I might have to see about disabling the hardware RAID entirely and 
using Linux software RAID.  But that of course, will degrade the 
performance of the system and defeats the purpose of purchasing 
Dell's hardware RAID in the first place.  I wonder if Dell even sells 
these systems without that controller - and how much cheaper it would 
be...  (and how much Dell would hate that).

Anyway, thank you all.

Comment 46 Bryan Field-Elliot 2004-02-10 02:29:08 UTC
Can anyone confirm that Mark Salyzyn's patch on Jan 30th has had a
positive effect? Our production server (Dell 2650/RHES 3.0/RAID 1) is
hard locking about once a week and I'm desparate for a solution before
we experience any data loss.


Comment 47 Brian Whitehead 2004-02-10 02:47:35 UTC
I still have not applied the patch, but after simply changing the cache
setting on the controller to enabled, the machine has not locked up in
over a week.  It was locking every 8-12 hours with zero load.  Make
sure you check this setting.  The default setting from Dell was
disabled and had never been changed until now.

Comment 48 Mark Salyzyn 2004-02-10 13:52:49 UTC
Response to "Additional Comment #36 From J. Parsons (jparsons-
redhat) on 2004-01-31 14:44":

Sorry if I am not a frequenter of this `list' ...

The root cause is that the Adapter places a high priority on flushing its 
internal cache, high enough that it is reluctant to accept new commands for 
periods longer than a minute. The trigger for this cache flush 
varies, and is usually tied to high load and less than perfect 
activities over the SCSI bus.

The real fix is for the firmware to reduce this cache flush priority, 
but we must deal with the combinations of firmware out in the field. 
The change I proposed in the driver is to wait for the Adapter to 
`come back' and complete all outstanding commands when the timeout 
mechanism is detected by the SCSI layer and calls the bus reset 
handler of the driver.

This fix also differentiates between adapters or busses that are in 
fact dead, and this temporary condition. Unfortunately I left in the 
code to call for adapter health which requires more extensive changes 
in the hardware interface layers. This section of the patch can be 
dropped. However, I still suggest that Adaptec's latest driver code 
be used as the driver in the kernel tree is full of bitrot (new patch 
submissions are automatically rejected because they are too large to 
go through the lists).

Please feel free to contact me directly for latest sources.

Sincerely -- Mark Salyzyn <Mark_Salyzyn>
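
(To make the shape of that workaround concrete, here is a minimal userspace 
sketch of the pattern the patch implements in aac_eh_reset(): poll once per 
second for up to 60 seconds, and report success as soon as nothing is 
outstanding. outstanding_commands() is a hypothetical stand-in for walking 
the per-device command lists; the kernel code uses scsi_sleep(HZ) where 
this uses sleep(1).)

#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in: pretend three commands drain, one per poll. */
static int outstanding_commands(void)
{
	static int pending = 3;
	return pending ? pending-- : 0;
}

static int wait_for_adapter(void)
{
	int count;
	for (count = 60; count; --count) {
		if (outstanding_commands() == 0)
			return 0;	/* adapter came back: success */
		sleep(1);		/* analogous to scsi_sleep(HZ) */
	}
	return -1;			/* still busy: bus appears hung */
}

int main(void)
{
	printf("reset handler result: %d\n", wait_for_adapter());
	return 0;
}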

Comment 49 Bryan Field-Elliot 2004-02-11 02:58:28 UTC
I've just installed Mark Salyzyn's latest RPM. For those using RHEL
3.0, Perl is evidently slightly broken in this distribution. Here is
what I had to do to install the latest Adaptec RPM:

1. Install the RPM - lots of error messages during execution of
install.sh will occur.
2. cd /opt/Adaptec/aacraid.rhel3
3. export LANG=en_US
4. export LANGVAR=en_US
5. ./install.sh

I still got one "sed" error during execution of install.sh, but
everything seems ok anyway. Now running aacraid 1.1-4[2323].


Comment 50 Alan Cox 2004-02-14 00:01:39 UTC
There is one visible error in this fix - the return of -ETIMEDOUT from
the scsi reset handler isn't a valid return for scsi reset handlers.

(2.6 and 2.4 eh are a bit different, the 2.6 code is way saner as it
can do stuff like sleeping properly). 2.4 new_eh can be persuaded to
do so but needs a bit more work.

Otherwise this looks a reasonable quickfix.
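
(A sketch of the correction Alan describes, assuming SUCCESS and FAILED are 
the midlayer codes from the kernel's scsi.h; this wrapper is illustrative, 
not a drop-in patch, and aac_eh_reset() is the patched handler from 
comment 35:)

/* Map the patch's errno-style returns onto midlayer codes: a 2.4 reset
 * handler must return SUCCESS or FAILED, never a negative errno. */
static int aac_eh_reset_checked(Scsi_Cmnd *cmd)
{
	int rc = aac_eh_reset(cmd);
	if (rc == -ENODEV || rc == -ETIMEDOUT)
		return FAILED;	/* dead or hung adapter: report failure */
	return rc;		/* SUCCESS passes through unchanged */
}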


Comment 51 Dave Jones 2004-02-18 18:38:57 UTC
patch above doesn't compile in a current 2.4 tree. (No nblank, no
aac_adapter_check_health).


Comment 52 Andrew Kinney 2004-02-21 22:05:37 UTC
aacraid 1.1-4[2323] does not appear to fix this problem for us.  We 
have the new driver installed on *two* PE2500 servers with PERC3/DI 
RAID controllers and we're still seeing the driver mark the host 
controller as dead periodically.  We're using a 2.4.20 kernel with a 
bunch of security patches.  In our case, since our primary storage is 
attached to this controller, we get crashes when this happens.  

We didn't start having trouble until we upgraded from RedHat 7.1 to 
RedHat 9.0.  These two systems ran perfectly for a year and a half 
with no crashes, so I'm not entirely sure I buy the explanation about 
the firmware problem.  It may have contributed, but right now, I 
suspect an overly zealous code performance optimizer got in there in 
the 2.4 kernel tree and introduced disk subsystem timing issues that 
result in this problem.  In the research I've done on this problem, 
this appears to affect all 2.4.2x kernels, no matter which 
distribution.  Those who have had this problem and changed to a 
different OS (FreeBSD, for instance) have not had any more trouble.

We have a growing suspicion that a certain pattern of disk activity 
can actually trigger the problem almost at will.  We suspect that 
when PostGreSQL is compiled with the right set of options and is fed 
a high volume of queries that it triggers the problem.  This is still 
a preliminary theory and we still have a lot of testing to do to 
verify it, but so far all evidence is pointing us in that direction.

Since we have two servers that are affected, we have the dubious 
honor of being able to move certain activities to certain servers.  
In this case, we're running some "virtual server" software (kind of 
like a chroot jail under FreeBSD) and we've managed to narrow down 
which virtual servers are causing the trouble for us.  We had no 
crashes on RedHat 9.0 for several months.  Then, out of the blue, we 
started having crashes after certain virtual servers were added to 
the system.  When those virtual servers are migrated to the other 
PE2500, the crashes follow them.  Their predominant activity is 
PostGreSQL usage.  I know that that doesn't necessarily mean 
PostGreSQL is the agitator, but as we dig deeper into this, the 
correlation continues to increase.  

For now, we have the suspect virtual servers running on an IDE based 
system so I can get some sleep this weekend.

I welcome any feedback, even if it is just to pick apart my 
theories.  I'm willing to endure most anything as long as we're 
collectively able to arrive at some kind of solution.

Comment 53 Scott Walker 2004-02-23 15:52:51 UTC
This was posted to a Dell mailing list, might be something to look into:

-------- Original Message --------
Subject: 	aacraid problems summary
Date: 	Wed, 18 Feb 2004 12:25:21 +0000
From: 	Jose Celestino <japc.pt>
To: 	linux-poweredge


Could anyone summarize the latest developments in the Dell vs aacraid
Perc 3/Di problems? I've been out of this list for a while.

We recently had some Dell dudes coming here and they did a
s/tg3/bcm5700/ on the network driver and also changed the driver we had
for our intel etherexpress quad port nics. They said the instability on
the raid controller was due to the nic drivers.


Comment 54 Brian Huffman 2004-02-23 16:08:25 UTC
I hate to make this comment, but the resolution to this bug is long
overdue.  My company has over 160 linux boxes that I've been thinking
of migrating to RedHat Enterprise server (they are currently 7.3).  As
it stands, our main monitoring box continues to crash with all of the
latest updates and it's due to this bug.   There is no way that
management is going to go for such a sum of money when I can't even
prove that these 7.3 boxes are stable (and as far as I know, the bug
still exists in RHEL 3.0).

Comment 55 Javier Rodriguez 2004-02-23 19:54:37 UTC
We recently loaded RedHat Enterprise Linux ES v.3 (from distribution 
CDs) and updated it to current levels using RedHat's 'up2date' 
facility. We ran the server in test for about two weeks without any 
failures. We are now trying it in production and hoping that 
everything will continue to remain stable. Here is the environment:

   Server: Dell PowerEdge 2650 with PERC/Di
   System BIOS: A17 (upgraded from A10)
   Backplane firmware: 1.01
   PERC3/Di firmware: 2.7-1 (build 3170)
   ERA firmware: 3.0 (upgraded from 2.20)
   Hyperthreading: Enabled
   Linux OS version: 2.4.21-9.0.1.ELsmp
   Aacraid driver version: 1.1.2 - This is based on the
      AAC_DRIVER_VERSION number in 
      /usr/src/linux-2.4.21-9.0.1.EL/drivers/scsi/aacraid/linit.c
	
Prior to this upgrade, we were executing successfully for 6 months on 
two servers using Linux loaded from the RedHat 9.0 ISO distribution 
plus its corresponding updates. Here is the environment:

   Server: Dell PowerEdge 2650 with PERC/Di
   System BIOS: A10
   Backplane firmware: 1.01
   PERC3/Di firmware: 2.7-1 (build 3170)
   ERA firmware: 2.20
   Hyperthreading: Disabled
   Linux OS version: 2.4.20-18.9smp 
   Aacraid driver version: 0.9.9ac6-TEST - This is based on the
      AAC_DRIVER_VERSION number in 
      /usr/src/linux-2.4.20-18.9/drivers/scsi/aacraid/linit.c

Please note that in the older configuration, disabling hyperthreading 
(upon Dell's recommendation) circumvented the original problem we 
posted in this trouble ticket on 06/02/2003. In our case, the problem 
would normally show up within 24 hours, regardless of load on the 
server. With the current configuration, we are once again 
experimenting with enabling hyperthreading.

Hopefully we will continue to execute trouble free. I will keep this 
list posted on my results. I agree that a formal resolution for the 
various aacraid problems needs to materialize, especially since 
RedHat and Dell both continue to advertise that RedHat Linux releases 
are 100% compatible with the PowerEdge 2650 and PERC3/DI controllers.


Comment 56 Bryan Field-Elliot 2004-02-25 04:07:42 UTC
I'm ready to report that aacraid 1.1-4[2323] does seem to have fixed
the problem for me. My server was hanging approximately once per week
previously, now it's been running for just over two weeks on the new
RAID driver version. Since then, normal disk activity has increased
(heavier utilization), and I just banged it through some disk trashing
tests (copying a 3GB file over and over again) without a hiccup. I
know it's only been 2X the average previous hangspan, and I won't feel
totally safe until a month or so passes. But there are many on this
list eager to hear some positive news on this bug, so here's some
(tentatively).

Comment 57 James Oliver 2004-02-25 04:15:28 UTC
Unfortunately after installing the new aacraid driver (aacraid
1.1-4[2323]) we have not had the same success.  The server was OK for
under a week and exhibited a very similar problem to what we
previously encountered.  The main difference with the new driver was
that we could log in and shut the box down cleanly (previously the
only option was to power cycle).  BTW - the box is a PE 2650 running
the 2.4.20-28.8smp kernel.

Comment 58 Scott Walker 2004-02-25 20:33:22 UTC
Created attachment 98049 [details]
Redhat Kernel debugging pdf

Redhat Kernel debugging capture process pdf

Comment 59 Scott Walker 2004-02-25 20:34:45 UTC
Attaching a .pdf file that explains how to capture memory and CPU
information after mysterious freezes/hangs.  We just had another
crash about 30 mins ago and these did not work for us... the system
hangs too hard; kernel logs were not even produced after this crash.
We are still running with Red Hat/Adaptec aacraid driver (1.1.2 
Feb 9 2004 22:40:31).   Look at section #2 in the pdf.

Maybe these will work with the 1.1-4 raid driver.

Comment 60 acount closed by user 2004-02-26 22:03:12 UTC
Scott Walker:
/usr/src/linux-XX/Documentation/nmi_watchdog.txt is also interesting

Comment 61 Mark Salyzyn 2004-02-27 14:59:12 UTC
James Oliver's (james) case is perplexing. We have had a 
high degree of reported success with the driver workaround from many 
others.

Please note that there are several other real-world sources of 
possible SCSI bus lockups. The Driver fix is not a panacea for the 
problems of the world ;-/

These include Hardware problems and Drive Firmware issues. Most of 
which can not be worked around in Firmware or Driver. Fortunately due 
to al lthe testing that has gone into the controllers and hardware, 
they are rare.

This driver patch is used to mask an issue that surfaced in 3170 FW, 
where Adapter Cache Flush caused commands to time out. The best 
combination *I* can recomment today is 3/Di 3170 Firmware and the 
mentioned driver.

Any other issue will need the involvement of Dell Technical Support 
to track down the root cause. This does not mean I have abandoned 
other lockup issues, just that realistically other uncovered issues 
may require a finer level detailed analysis.

Sincerely -- Mark Salyzyn


Comment 62 Scott Walker 2004-02-27 15:16:55 UTC
This might be helpful as well:
http://www.redhat.com/support/wpapers/redhat/netdump/

netdump is similar to Sun Solaris core dumps.  We have a RHEL
support case open and are waiting for the next crash to catch
a core for analysis.  I will post any information that comes
forth.

Comment 63 Andrew Kinney 2004-02-27 23:44:52 UTC
For what it's worth, we couldn't reproduce this in a non-production 
environment using PostGreSQL.  However, I was not aware that the 
updated driver code worked best with a specific version of firmware, 
so I'll update the firmware on our systems to build 3170.  Hopefully 
that will solve it for us.  If we have any further crashes after 
updating the firmware (we're already using the new driver), I'll 
report back here and try some of the suggestions for getting useful 
debug info out of the system.

Comment 64 Javier Rodriguez 2004-03-01 12:15:50 UTC
It's been a week that we've been executing successfully under the 
environment described in comment #55 posted on 2004-02-23. Under 
previous releases of the OS and driver, we normally would have 
encountered the original problem for this post within 24 hours and no 
longer than 48 hours.

We are hopeful that the problem has been resolved, and for now, we 
plan to continue using the standard RedHat Enterprise Linux ES v.3 
distribution and updates as described in comment #55.

Mark, RedHat, and/or Dell support folks... are there any significant 
enhancements to the aacraid driver that should cause us to consider 
manually upgrading from version 1.1-2 to 1.1-4 rather than waiting 
for RedHat to distribute the latest version as part of their update 
process?


Comment 65 Mark Salyzyn 2004-03-01 15:25:22 UTC
Andrew Kinney, I may have been misunderstood. 3170 is the latest 
Firmware, it contains improvements over 3157 or it would never have 
been released. The driver workaround was not meant to function best 
specifically with any Firmware, but was a workaround for a new issue 
that surfaced with 3170. The workaround permits 3170 to function, and 
allows one to take advantage of the other improvements in the 3170 
Firmware.

Some are perplexed by this issue. From my view there are several 
problems, of which 3170's reticence during an Adapter Cache Flush 
accounts for a *substantial* majority of the experiences. Until the 
remaining issues are duplicated at Dell for Firmware Engineers to 
view, I might not be part of the loop ...

Comment 66 Javier Rodriguez 2004-03-01 19:46:28 UTC
Well, I have to take my optimistic news back. After one week, I had 
to reload the server via a power cycle as the entire server was 
completely locked up. I had no console access, and I was not even 
able to access the ERA module to perform a remote server restart. 
After the reload, the various system logs do not contain any 
information regarding the problem.

I turned off hyperthreading in hopes that it will circumvent the 
problem as in the past.


Comment 67 Alex Sousa 2004-03-01 22:39:06 UTC
We have a DELL PE 4600 and started experiencing this bug since
upgrading from RH 8.0 to RHEL 3.0 (kernel 2.4.21-9ELsmp), at the time
with controller firmware build #3157. The problem was invariably
triggered during daily backup jobs, which impose heavy I/O loads on
the Perc3/Di controller. 
Following suggestions on this forum and from DELL Technical Support,
we have updated BIOS(A10), ESM(A31) and Perc3/Di(2-7-1, build #3170)
firmware as well as the aacraid driver (1.1-4[2323]). We have been running
with hyperthreading disabled for a week now. Since then, no change has been
seen in the server's behavior, with daily crashes at backup time. The
backup script is identical to the one that worked smoothly with RH 8.0.
I can reproduce the problem interactively from the console. I enabled
the nmi_watchdog as suggested above, but the lockup is always severe
enough that no messages are produced. I am happy that the new driver
seems to attenuate the PE 2650 problems, and therefore is a step in
the right direction, but unfortunately it had no effect on our case.
Will keep looking...
Many thanks for all the helpful posts I have found here.
 

Comment 68 Javier Rodriguez 2004-03-02 13:31:52 UTC
The ERA access issue in comment #67 was the result of an MTU sizing 
problem with the VPN I was using to attempt to restart the server 
remotely via the ERA. The rest of the problems are similar to what 
others are reporting. I will try executing without hyperthreading 
enabled for a bit, and then I will also try driver version 1.1-4.

Comment 69 Bryan Field-Elliot 2004-03-04 18:48:21 UTC
Following up on my comment from Feb 24th, my server just experienced a
hard lock. Running RHEL 3.0 with aacraid 1.1-4[2323] on a Dell 2650,
RAID 1. Unfortunately I can't see the console as it is managed by
Rackspace, but their tech indicates that he saw this message on the
console:

ext3_fs error (device sd(8,2)): ext3_get_inode_loc: unable to read
inode block

Obviously it would be nice to know what other messages were on screen
- such as the "SCSI hung?" message - but unfortunately it's gone now.

For a couple of minutes the server was accepting connections (e.g.
SSH), but would not complete the login sequence (no bash prompt). Then
two or three minutes into this, the server hard locked and would not
accept any connections at all.

I suspect this is the same issue again, though I can't be certain
(what else would cause ext3-fs to fail like this?). Total uptime: 3
1/2 weeks.

Comment 70 Andrew Kinney 2004-03-05 00:34:39 UTC
Bryan Field-Elliot, which firmware are you using?  Is it the build 
3170 or later?  The reason I ask is that we're just getting ready to 
put our two misbehaving PE2500's back into the usage scenario that 
was precipitating our lockups and we're using build 3170 of the 
firmware and aacraid 1.1-4[2323] in the hopes that the specific 
combination will be stable.  If I can avoid a crash on these 
production machines, I'd like to do so until a better fix is 
available if your crash was on the build 3170 firmware.

Comment 71 Andrew Kinney 2004-03-08 22:49:52 UTC
One workaround we're going to experiment with is increasing the stripe 
size from 32KB (the default) to the largest possible (128KB I 
believe).  Since we're using RAID 5 and that requires 4 I/O 
operations per write, we're hoping that minimizing the number of 
times the host controller has to issue writes to the array will also 
minimize the amount of time the controller takes to come back from a 
cache flush or heavy I/O peak.  We also turned on write caching on 
the drives so that the controller spends less time waiting on the 
drives.  That should also help the controller become available again 
faster.  

Of course, what would make more sense to me is a firmware/driver 
combo that doesn't allow the cache to get crammed to the point that 
the OS has to wait on the controller.  Let some of those disk hungry 
processes spend time blocking instead.  The OS is better able to 
handle the disk I/O backlog than the controller and it sure beats a 
crash with file system corruption.  

For what it's worth, I say this without any actual knowledge of how 
to rewrite the firmware, card driver, or SCSI subsystem driver. ;-) I 
was hoping someone with that knowledge would pick up on this.
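
(As a back-of-envelope for the stripe-size idea above, a sketch that counts 
the partial-stripe chunks one large write breaks into and applies the 
stated 4-I/Os-per-write read-modify-write cost (read data + read parity + 
write data + write parity). Illustrative only; real controllers coalesce 
writes, and full-stripe writes avoid the read-modify-write entirely.)

#include <stdio.h>

int main(void)
{
	const long write_kb = 256;		/* one 256KB logical write */
	const long stripe_kb[] = { 32, 128 };	/* default vs. maximum */
	int i;
	for (i = 0; i < 2; i++) {
		long chunks = write_kb / stripe_kb[i];
		printf("stripe %3ldKB: %ld chunks -> ~%ld disk I/Os\n",
		       stripe_kb[i], chunks, chunks * 4);
	}
	return 0;
}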

Comment 72 Andrew Kinney 2004-03-10 20:00:14 UTC
A question for all that have had the problem:

Are you all using ext3?  Anyone using anything other than ext3?

The reason I ask is that it would seem possible that the existing 
problems are only exposed when the added overhead of journaling is 
brought into the picture.  Normally, journaling replaces typical 
filesystem table updating, but in the case of ext3, it is really just 
added activity over and above what already happens with ext2 (unless 
I misunderstood how ext3 works).  This is still just a theory until I 
get some feedback on what filesystems are in use for those that see 
this problem.

As for us, we didn't start having trouble until we upgraded from 
RedHat 7.1 to RedHat 9.0.  We were using an older kernel (2.4.1), 
reiserfs (a true journaling filesystem, not a hybrid like ext3), and 
ext2.  Now we're using a newer kernel (2.4.20) and ext3.  It's 
entirely possible that the filesystem and not the kernel is to 
blame.  The likelihood of that being the case increases if nobody can 
cite an instance where this happened on a non-ext3 filesystem.

Anecdotally, it would appear that most instances (if not all) of this 
problem are when ext3 is in use.

So, can anyone cite an instance of this problem where ext3 was not in 
use?

Comment 73 David Kelertas 2004-03-10 23:13:46 UTC
We have 3 Dell PE 2650 running RedHat Linux 8.0 (kernel 2.4.20-
20.8smp and 2.4.20-28.8smp) with SCSI hang problem since Nov 2003.
We did not experience any problem when running RH 7.2/3 using ext3 
file systems, hyperthreading and smp kernels for a whole year. 

When we upgraded to  RH 8.0 (ext3), the SCSI hang occurs on average
once per week.  We have other Dell servers (2400s, 2500s) running
RH 8.0 without any problem.  We also have a Dell 2650 server running
RH 7.3 (kernel 2.4.20-19.7smp) with full ext3 filesystems and 
hyperthreading without problem.

We have already applied Mark Salyzyn's aacraid 1.1.4 and we are also
running Dell firmware build 3170.  We have tried various kernels 
from 2.4.18-14 to 2.4.20-28.8 without any success. We don't seem 
to have much choice now except to try the megaraid driver on the 
LSI Logic PERC 3/DC PCI RAID card. The problem is we need to migrate 
the RAID configuration from the 3/Di to the 3/DC; the 3/DC does not
support drive/RAID roaming, so we will probably have to install from
scratch.

Will let others know if we are successful, if anyone has any other
solution, please let us know.

Steven Ng (stevenng)

Comment 74 Greg Johnson 2004-03-10 23:23:00 UTC
We just experienced this error for the first time. We have been
running with this configuration for over 3 months with no errors. We
are running JFS fs. We are also running Postgresql.

Server: Dell PowerEdge 2650 with PERC/Di
Hyperthreading: Disabled

Linux OS version: Linux version 2.4.21-9.0.1.ELsmp
Red Hat/Adaptec aacraid driver (1.1.2 Feb  9 2004 22:40:31)
AAC0: kernel 2.7.4 build 3170
AAC0: monitor 2.7.4 build 3170
AAC0: bios 2.7.0 build 3170
AAC0: serial 474c61d3fafaf001



Comment 75 Andrew Kinney 2004-03-11 19:10:50 UTC
Found something that might be of use, but you'll need to check to see 
if your kernel already has these patches applied.

Apparently, these bugs have been around for a very long time and 
require patches to the 2.4.20 kernel to fix them.  The patches didn't 
make it into the mainstream kernel until 2.4.21, but they still fixed 
a couple more ext3 bugs in 2.4.22. Basically, the ones of concern 
relate to how the kernel interacts at the OS buffer level with a 
journalling filesystem.

Found them from here:
http://www.zip.com.au/~akpm/linux/ext3/

They're located here:
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.20/

This one is of special interest:
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.20/ext3-scheduling-storm.patch

The sync_fs series of patches also could be of interest for this 
problem.

I noted that several of those patches were not specific to ext3 and 
thus might have an impact on those using other journalling 
filesystems.

I'll be checking with our vendor to see if our 2.4.20smp kernels have 
these patches installed. I could see how they might get committed to 
the uniprocessor code and the smp/hyperthreading specific code may 
have been overlooked or additional work was required to get them 
working on smp/hyperthreading, so it didn't happen until later.

So far, from all the cases I've seen, this issue is isolated to 
systems with the PERC3/DI card, journalling filesystems, an SMP 
kernel by way of hyperthreading or real SMP, and disk activity with 
many small files or database use that has frequent small I/Os.  I'm 
not sure if this is relevant or not, but the majority of cases also 
appear to be with kernels later than 2.4.20-pre5 but earlier than 
2.4.22, which roughly corresponds to the period of time between when 
the bug was created and when it was discovered and fixed.

Comment 76 J Mora 2004-03-11 21:36:17 UTC
PE2550 with PERC3/DI
reiserfs: format "3.6" with standard journal
Kernel.org 2.4.24 SMP (Kills Uptimes Dead)
PostgreSQL 7.x

Kernel.org 2.4.9 did not exhibit this problem and ran for 462 days.

Other info:
<http://lists.us.dell.com/pipermail/linux-poweredge/2004-February/018302.html>

dmesg output using 2.4.9:
percraid device detected
Device mapped to virtual address 0xf8800000
percraid:0 device initialization successful
percraid:0 AacHba_ClassDriverInit complete
scsi0 : percraid
  Vendor: DELL      Model: PERCRAID Mirror   Rev: 0001
  Type:   Direct-Access                      ANSI SCSI revision: 02
  Vendor: DELL      Model: PERCRAID Mirror   Rev: 0001
  Type:   Direct-Access                      ANSI SCSI revision: 02

dmesg output using 2.4.24:
Red Hat/Adaptec aacraid driver (1.1-3 Jan  7 2004 23:58:20)
AAC0: kernel 2.5.4 build 2991
AAC0: monitor 2.5.4 build 2991
AAC0: bios 2.5.0 build 2991
AAC0: serial 556c01d2fafaf001
scsi0 : percraid
  Vendor: DELL      Model: PERCRAID Mirror   Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
  Vendor: DELL      Model: PERCRAID Mirror   Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02



Comment 77 jcpeck 2004-03-26 20:43:16 UTC
I have encountered this bug when running NFS benchmark SpecSFS3.0 
(see www.specbench.org), using PowerEdge 2650 running RH9.0 kernel 
2.4.20-8smp as an NFS server.

Here is the configuration that leads to reproducing the problem 
reliably:

1a.  Configure two U320 SCSI disks as RAID0 stripe container
1b.  Enable RAID controller write cache memory
2.  PE is running NFS server and exports mount point to container 
created in #1
3.  Run standard config file for SpecSFS on a RH9.0 client that 
targets machine in #2

Result:

Fails with ext3 errors like those observed in this report 100% of the 
time.


Comment 78 Roy Olsen 2004-04-06 11:13:13 UTC
I've encountered and reproduced this problem on our PE2650 systems 
running Red Hat AS 3.0 with the 2.4.21-9.0.1ELsmp kernel. Our 3/Di 
controllers have firmware version 2.8-0.

When the error occurs, all disk activity on all RAID containers 
stops. Nothing is written to the log, but an error message like "I/O 
error: dev 08:02, sector 83200" can be seen on the console.

Copying or rsyncing a large number of files from an NFS share, over 
gigabit ethernet, to a local RAID 1 container causes the PE2650 to 
crash within minutes. Using rsync through ssh does not trigger the 
problem, even with 300,000 files totalling 25GB.

Comment 79 Jason Harrop 2004-04-10 02:08:37 UTC
For what it is worth, this problem still occurs even with kernel
2.4.25 (not RedHat's) and aacraid 1.1.5.

PERC3/Di firmware 3170
ext3 filesystem

It crashed less than 2 days after rebooting, which (for this
statistically small sample) is less uptime than with RedHat kernel
2.4.18-14smp.

Logs:

Apr 10 04:03:51 promo kernel: aacraid: Host adapter reset request. SCSI hang ?
Apr 10 04:04:51 promo kernel: aacraid: SCSI bus appears hung
:
:
Apr 10 04:15:31 promo kernel: aacraid: Host adapter reset request. SCSI hang ?
Apr 10 04:15:31 promo kernel: aacraid: SCSI bus appears hung
Apr 10 04:15:31 promo kernel: scsi: device set offline - command error recover failed: host 0 channel 0 id 0 lun 0
Apr 10 04:15:31 promo kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 6000000
Apr 10 04:15:31 promo kernel:  I/O error: dev 08:03, sector 11014520
Apr 10 04:15:31 promo kernel:  I/O error: dev 08:03, sector 12058624
Apr 10 04:15:31 promo kernel:  I/O error: dev 08:03, sector 12058856
:
Apr 10 04:15:31 promo kernel:  I/O error: dev 08:03, sector 39846008
Apr 10 04:15:31 promo kernel:  I/O error: dev 08:03, sector 262368
:


Comment 80 Mark Salyzyn 2004-04-12 13:28:15 UTC
Your firmware has hung solid (the driver cannot do anything about 
that). Please confirm that the 1.1.5 driver is actually loaded 
(cat /proc/scsi/aacraid/?); the default 2.4.25 driver does not report 
"SCSI bus appears hung", so I am just covering the bases.

I wanted to send you an instrumented driver that reports the adapter 
health in more detail, but since your driver did not report any 
`health check' problems (the "Host adapter appears dead" message did 
not show), I fear that such a driver would just report the same.

You may be able to get more details about why the adapter or SCSI bus 
is hung on your system by enabling the AAC_DETAILED_STATUS_INFO line 
in aacraid.h (around line 43). There is probably a set of reports 
just prior to the failure that may indicate which target is the 
problem.
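
To spell that out: enabling it means the header contains an active 
define like the one below before the module is rebuilt (a sketch; the 
exact line number, and whether it ships commented out, vary between 
driver versions):

#define AAC_DETAILED_STATUS_INFO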

I strongly urge you to report your problem to Dell technical support. 
They have an Adaptec firmware engineer embedded on staff to try to 
trace this problem down. In addition, they may be able to trace your 
problem to other possible sources (one known cause is a bad power 
supply). The fact that this problem appears to be getting worse 
across additional samples may point to deteriorating hardware.

Sincerely -- Mark Salyzyn


Comment 81 Jean-Philippe Houde 2004-04-22 01:50:47 UTC
Hi,
is this thread still alive? I have the problem and still haven't 
found any solution. Today, doing a search (again) about this problem, 
I found this big thread. I have the same problem on a PE 2650 with 
PERC3/Di and Red Hat AS 2.1, running Oracle and ext3.

BUT, here is another thing: I also get the same errors on an IBM 
xSeries 255 with a ServeRAID-6M (also with Adaptec silicon, if I'm 
not mistaken), running Oracle and ext3.

One thing I have noticed: our other Dell servers (not PE2650), which 
are on ext2, are not crashing. We also have 2 other IBM xSeries 255. 
One is not crashing (ServeRAID-4).

And the other one, with ext2 for / and /boot and ext3 for the rest of 
the partitions, got the error messages in the log files, but NO 
crash!

In all those servers I had to add a 1Gb NIC and change the tg3 
module for the bcm5700.

I don't remember the versions by heart, but I will check that and the 
thread again in depth tomorrow.

Here is a portion of the log from the IBM:
---------------------------------------------------------
Apr 21 09:19:49 ibmfin kernel: (ips0) Reset Request - Flushed Cache
Apr 21 09:21:05 ibmfin kernel: scsi: device set offline - not ready or command retry failed after host reset: host 0 channel 0 id 2 lun 0
Apr 21 09:21:07 ibmfin kernel:  I/O error: dev 08:21, sector 37576
Apr 21 09:21:07 ibmfin kernel:  I/O error: dev 08:21, sector 37600
Apr 21 09:21:07 ibmfin kernel: SCSI disk error : host 0 channel 0 id 2 lun 0 return code = 30000
Apr 21 09:21:07 ibmfin kernel:  I/O error: dev 08:22, sector 22922064
Apr 21 09:21:07 ibmfin kernel:  I/O error: dev 08:22, sector 22922072
Apr 21 09:21:07 ibmfin kernel:  I/O error: dev 08:22, sector 22922064

...

Apr 21 09:21:07 ibmfin kernel: EXT3-fs error (device sd(8,33)): ext3_get_inode_loc: unable to read inode block - inode=4816905, block=9633799
Apr 21 09:21:07 ibmfin kernel:  I/O error: dev 08:21, sector 0
Apr 21 09:21:07 ibmfin kernel: EXT3-fs error (device sd(8,33)) in ext3_reserve_inode_write: IO failure
Apr 21 09:21:07 ibmfin kernel:  I/O error: dev 08:21, sector 0
Apr 21 09:21:07 ibmfin kernel:  I/O error: dev 08:21, sector 77070392
Apr 21 09:21:07 ibmfin kernel: EXT3-fs error (device sd(8,33)): ext3_get_inode_loc: unable to read inode block - inode=4816905, block=9633799
----------------------------------------------------------------
  
So far, this is the biggest discussion I have found about this 
problem.
  
Thanks,  
Jean-Philippe Houde  
  

Comment 82 James Oliver 2004-04-22 04:15:05 UTC
For the record we installed a PERC3/DC card on the systems.  These use
the megaraid driver as opposed to the aacraid driver.  This has solved
(worked around) our problem (though we had to shell out cash for the
other cards).

<rant>
I feel compelled to say that the lack of support from Dell on this
(and RedHat for those who have been experiencing it under RHEL) is
disappointing.  Especially as these servers are reportedly certified
to run with this hardware and Dell and RedHat purport to have a
partnership arrangement.
</rant>

Aside from that, Mark Salyzyn from Adaptec was very helpful in trying
to assist with the PERC3/Di driver problem, even though the effort
was ultimately unsuccessful - thanks Mark.

Comment 83 Andrew Kinney 2004-04-22 22:10:10 UTC
James Oliver,

Agreed regarding rant.  The irony is that while I probably will buy 
more Dells, I'll probably avoid Adaptec chipsets.  As for Red Hat, 
this is just reinforcing my belief that it isn't worth it to pay a 
company to support Linux because they rarely do better than the open 
source community at large.

For what it's worth, we haven't had any more problems.  We took a 
shotgun approach to solving this because I didn't have time to fiddly-
dink with trying out individual solutions that *might* work.  Here's 
what we did:

1. Moved all active PostgreSQL users with low buffer and cache 
settings to different hardware.

2. Turned on write-caching on the drives themselves in addition to 
the writeback controller cache.

3. Updated the RAID firmware.

4. Updated the RAID driver.

5. Updated the kernel.

6. Crossed fingers.

7. Crossed toes.

8. Tried to convince our important customers that we were doing 
everything possible to ensure their important systems stayed up and 
running.

9. Revisited this bugzilla thread frequently in the hopes that 
someone (a kernel developer in particular) might actually fix this.

10. Mourned the fact that a $7500 server is now as useful as a five-
year-old P*ckard-Bell desktop. (OK, that's a big exaggeration, but 
it's painful to realize that a $600 generic server is more reliable.)

For what it's worth, I also appreciate the effort that Adaptec put 
into this, though I still firmly believe the problem lies in the 
kernel buffering code and not the driver or firmware code.

Comment 84 Mimmus 2004-04-26 08:42:47 UTC
Andrew,
how can I turn on write-caching on the drives, as you described in
item 2 of your previous message?

Comment 85 Andrew Kinney 2004-04-26 16:38:47 UTC
It's a setting in the RAID BIOS screens.  I don't remember which one 
since it's been several months since I looked at those screens.  It's 
likely in the documentation, though.

Comment 86 Todd Davenport 2004-04-26 19:03:04 UTC
How have you guys been seeing the error messages?  It looks like in 
the source it's kern.err.  Are you writing kern.err out to some text 
file?  Or do you have a VT100/320 terminal plugged into the serial 
port?

My systems (Dell 2650, RHEL 3.0WS) keep locking up, but I'm seeing 
exactly zero error messages.  Perhaps because the RAID write buffer 
is not flushing when the SCSI device hangs?

Thanks.

Comment 87 Javier Rodriguez 2004-04-26 20:33:22 UTC
Todd, in our situation we are using "netdump" to send the syslog 
messages and a dump of physical memory to an external netdump server. 
At times the problem causes a slow loss of services rather than an 
instant crash; if we notice the problem in time, we are able to SSH 
into the server and capture the logs. Once the server locks up and is 
rebooted, the log information is lost, since the original problem 
noted in this report prevents the error information from being 
written to disk.
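
For anyone wanting the same setup: the Red Hat netdump client is 
configured roughly as below (a sketch; the server address is a 
placeholder, and the separate netdump-server package must be running 
on that host, where crash output lands under /var/crash):

NETDUMPADDR=192.168.1.10     (in /etc/sysconfig/netdump)
chkconfig netdump on
service netdump start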

Have you tried disabling hyperthreading? It's not a fix for the 
problem, but in our situation it has allowed our servers to run for 
over 6 months under RedHat 9, and now over a month under RedHat ES 
3.0.
 

Comment 88 Jean-Philippe Houde 2004-04-27 18:19:09 UTC
Is it possible that the problem occurs not only on the Dell PE 2650 
but also on the IBM xSeries 255, which has a ServeRAID-6M (with an 
Adaptec chip)? We have the same problem on both the Dell PE 2650 and 
the IBM xSeries 255. The ServeRAID uses the ips.o module.

IBM told us that the fix was in these patches:
https://listman.redhat.com/archives/ext3-users/2002-December/msg00123.html

I doubt it is, but we will try them anyway.

Javier, I have tried disabling hyperthreading and unfortunately it 
didn't work for us.

Comment 89 Andrew Kinney 2004-04-30 21:47:46 UTC
Dell just released a new PERC3/DI firmware that supposedly addresses 
this problem, at least on the PowerEdge 2500 systems.  For those with 
Dell systems, you'll want to login to the Dell support site and go to 
the downloads area for your system to see if you can get your hands 
on the new firmware.  Just be aware it requires the latest system 
BIOS be installed before updating the RAID controller firmware.  It 
might be just the thing we've all been needing to kick this problem 
completely.

Comment 90 Sean Hussey 2004-05-03 19:44:17 UTC
Is this new Perc3/DI firmware the one from April 14-16, or are you 
referring to another update? I don't see one more recent than that, 
but I'm on 2650s, not 2500s.

Add another voice to the fray. Dell had us upgrade two packages 
(aacraid-1.1.4-2302dkms.noarch.rpm and dkms-1.00-1.noarch.rpm), but 
neither seems to have taken effect (of note: we installed 1.1.4 of 
aacraid, but 1.1.2 still loads). After that, I upgraded the RAID 
BIOS, but still no difference.

Have any RH EL admins found anything good from an upgrade from 
kernel-2.4.21-9.EL-i686 to kernel-2.4.21-9.0.3.EL-i686? I know little 
about kernels or whether anything in the changelogs describes a fix 
for any of these issues.

Comment 92 Lance French 2004-05-04 23:13:29 UTC
The new firmware for the 2650 is version 2.8-0 build 6089 and is
provided on Dell's site with the aforementioned aacraid driver rpm
packages. "Cache flush loop decreased to prevent extensive spin-lock
hold during high I/O" is mentioned as a specific fix in this release.   

The driver packages did not work for me either (the RHEL kernel has
been upgraded since their release) and I have written Dell support
regarding that.  

However, you can use the dkms rpm to install your own kernel module
using the aacraid source files on adaptec's site: 
http://www.adaptec.com/worldwide/support/driverindex.jsp?sess=no type
"aacraid source" in the search box.

If you are not faint of heart, feel free to email me and I will send
you instructions on how I went about installing the kernel module.

Anyway, I have never been able to achieve the lockup condition, so it
would be nice if someone would stress test the new Dell BIOS and most
recent aacraid driver:

Red Hat/Adaptec aacraid driver (1.1-5[2326])
AAC0: kernel 2.8-0 build 6089
AAC0: monitor 2.8-0 build 6089
AAC0: bios 2.8-0 build 6089
AAC0: serial 635441d3
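
If anyone wants to try, a crude way to generate the kind of 
many-small-file I/O that seems to trigger this is an unpack/delete 
loop; a sketch, with the tarball a placeholder for any large tree of 
small files:

while true; do
    tar xzf linux-2.4.26.tar.gz   # unpack a big tree of small files
    rm -rf linux-2.4.26           # delete it and go again
done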


Comment 93 Mimmus 2004-05-10 14:50:51 UTC
Found in latest linux kernel changelog
(http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.6):

<markh>
[PATCH] aacraid reset handler update
This is an update from the Adaptec version of the driver to the
aacraid reset handler.  The current code has a logic error that is
fixed by this version.  This builds against 2.6.5-rc1.

Comment 94 John Denny 2004-05-13 16:48:30 UTC
Sean - I had the same issue going from kernel-2.4.21-9.EL-i686 to 
kernel-2.4.21-9.0.3.EL-i686.  I am now running 2.4.21-15.ELsmp, 30 
hours and counting; I never made it past 15 hours before.

As for using dkms and the latest aacraid driver from Dell, I am 
running both.  There is a typo in the 
/var/dkms/aacraid/1.1.4-2302/source/dkms.conf file, however.

The line

PACKAGE_VERSION="1.1.4.2302"

should read

PACKAGE_VERSION="1.1.4-2302"

After changing that you can do a manual build/install for your kernel
by doing:

dkms build -m aacraid -v 1.1.4-2302
dkms install -m aacraid -v 1.1.4-2302
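
Afterwards it is worth confirming which build the running kernel 
actually picked up, since several reports above saw an old driver 
still loading; roughly (the host number and exact output vary):

modinfo aacraid
cat /proc/scsi/aacraid/0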

I have my fingers crossed that this driver and this new kernel resolve
the issue.

Comment 95 John Denny 2004-05-17 13:47:34 UTC
Just following up.  My system has been up and running since 5-13
without issue.  End users and I have been using it pretty hard all
weekend as well.



Comment 96 jcpeck 2004-05-21 23:22:47 UTC
After installing the latest firmware and aacraid driver as noted in 
an earlier addendum, I find that the problem still exists, at least 
for the case of a RAID0 container configured with 4 disks.  I have 
not observed the problem with RAID5, however.

For me, this bug is triggered when the PowerEdge 2650 is configured 
as an NFS server and a client sources a workload using the SpecSFS 
benchmark utility.

Here is a summary of all testing using SpecSFS to date.

RAID level  # disks  Status  Test cnt
RAID5       3        PASS    1
RAID5       4        PASS    2
RAID0       4        FAIL    3

The messages in /var/log/messages are of the form:

aacraid:  Host adapter reset request.  SCSI hang?
aacraid:  SCSI bus appears hung
scsi:  device set offline - command error recovery failed
I/O error:  dev 08:10, sector xxx

The system becomes unusable.


Comment 97 Igor Trofimov 2004-05-28 12:32:05 UTC
We have a server with an Adaptec 2120S controller running RedHat ES 
3, in a RAID 10 configuration.  We tried controller BIOS builds 6008, 
6013, and 7224, and tested with kernels 2.4.21-4 and 2.4.21-9.0.3 
(RH), vanilla 2.4.26 with (and without) Adaptec driver 1.1.5, and the 
2.6.6 kernel.  With all kernels and all firmware we see:
aacraid:SCSI bus reset issued on channel 0
aacraid: Host adapter reset request. SCSI hang ?
aacraid:Drive 0:1:0 online on container 2:
aacraid:Drive 0:1:0 online on container 63:
aacraid:Drive 0:3:0 online on container 2:
aacraid:Drive 0:3:0 online on container 62:
aacraid:Drive 0:4:0 online on container 2:
aacraid:Drive 0:4:0 online on container 63:
aacraid:Drive 0:5:0 online on container 2:
aacraid:Drive 0:5:0 online on container 62:
The server works, but is very unstable: after 2 days one machine 
needed a reboot, and another simply stopped.  With vanilla kernel 
2.4.26 the server ran for 30 minutes without errors, then hung dead.
On another server (a 2200S, RAID 10) with the 7224 BIOS and the 2.6.6 
kernel there is no problem.
Is this a problem with Adaptec or with Linux?

Comment 98 Matt Domsch 2004-06-17 15:24:51 UTC
http://lists.us.dell.com/pipermail/linux-poweredge/2004-
May/020104.html

describes the workaround we are encouraging all customers with 
Adaptec ROMBs to use.

Comment 99 Eric D. Hendrickson 2004-07-09 21:07:50 UTC
I made this change 3-4 weeks ago, using the above afacli rpm, on 8 
Dell 2650's here running RedHat 8.0, Fedora Core 1 and 2, and AS 3.0. 
The problem has gone away since then.

What impact, if any, does this have on performance?

Thanks!!

Eric

Comment 100 Doug Ledford 2004-07-09 21:15:54 UTC
Matt, any word on whether or not the latest aacraid driver solves 
this issue, or whether the workaround might still be needed even 
after a driver update?

Comment 101 Joshua M. Thompson 2004-07-12 14:40:53 UTC
This problem has happened for us since 2.6.6. Backing up to 2.6.5
makes the problem go away. This is on six identical 2650s with all the
latest firmware updates, running FC1 and custom kernel RPMs built from
stock 2.6 sources without any patches applied.

What's very strange about this is that it does not seem to crash as
long as Apache isn't running. My MX servers (running Postfix 2.0) have
been up for 20+ days whereas my POP servers (which also run
SquirrelMail) and my Cpanel server crash at least once per day. As a
test I shut off Apache on one of my POP servers and it's now been up
for 5 days without incident. Cpanel runs Apache 1.3 and the POP
servers run Apache 2.0.

Comment 102 Matt Domsch 2004-07-14 16:39:49 UTC
Doug, it's not a driver issue, it's a firmware issue.  We've got a 
newer firmware available for testing, and hope to have it released on 
support.dell.com "soon".

Comment 103 Matthias Wenthe 2004-07-22 10:27:48 UTC
I discovered the "kernel I/O error" three days ago on a brand new 
Dell PowerEdge 2650 system, after 14 hours of kernel-compile tests 
under Debian Woody with kernel 2.4.26 and aacraid 1.1-3.  As Matt 
Domsch recommended, I disabled the Perc's read/write cache, and the 
machine has now been stable for 48 hours, compiling kernels and 
writing tar files in an endless loop to the edge of the disk.  The 
impact on performance, measured with bonnie, seems to be small:

Dell Power Edge 2650, 2 x 2.8 Xeon, 2GByte RAM, PERC 3Di 2x73 GByte Raid 1
Debian Woody 3.0
Kernel 2.4.26 Red Hat/Adaptec aacraid driver 1.1-3


Read Write Cache enabled

Version 1.02b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ns1              4G 22077  82 21839  13 13008   4 23723  69 55881  11 444.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2835  99 +++++ +++ +++++ +++  2860  99 +++++ +++  6415  98
ns1,4G,22077,82,21839,13,13008,4,23723,69,55881,11,444.3,1,16,2835,99,+++++,+++,+++++,+++,2860,99,+++++,+++,6415,98


Read Write Cache disabled

Version 1.02b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ns1              4G 26513  99 33153  20 17939   6 23534  71 61521  11 422.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2815  98 +++++ +++ +++++ +++  2845  98 +++++ +++  6246  95
ns1,4G,26513,99,33153,20,17939,6,23534,71,61521,11,422.3,0,16,2815,98,+++++,+++,+++++,+++,2845,98,+++++,+++,6246,95
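
For reference, both tables are plain bonnie++ (1.02b) output; the 
invocation would have been something like the following, with the 
test directory a placeholder:

bonnie++ -d /srv/test -s 4096 -u root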

Overall, the results indicate that disabling the cache even slightly
increases throughput, at the cost of a higher CPU load.

I can easily live with that, but some doubts remain whether it's wise
to put a production system on unreliable hardware. Like Matt, I
encourage all victims of this bug to disable the controller cache and
*please* do publish your results! From my point of view two questions
remain:

1. In case of further errors, would it be a *safe* solution to
disable the Perc3 and upgrade to a Perc4 LSI controller? Any
experiences with this hardware on PowerEdge servers under Linux?

2. When will the new firmware be available :-) ?


Comment 104 Doug Ledford 2004-07-22 15:21:49 UTC
Matthias,

One correction to your post.  With caching disabled you are getting
increased throughput, but it is *not* at the cost of increased CPU. 
There is a certain amount of system overhead for every page of data
read or written.  When you increase the number of pages per second,
then your CPU use goes up.  Not because it costs you any more CPU time
per page, but because you are processing more pages.  So, in
meaningful terms, there is no CPU cost, just that since the pages are
coming in faster the CPUs have more data to work on and are sitting
around waiting on data less.  In a perfect disk setup, you would be
able to get 100% CPU usage in these sorts of tests, which would mean
the disks can feed the CPUs fast enough that the CPUs never sit idle
waiting on them.  A step backward, then, is when you get the same
throughput but burn more CPU.  That's one of the reasons I like the
tiobench program for benchmarks: it divides throughput by CPU usage
to get a cost per page processed, which is a *far* more meaningful
CPU metric than total CPU usage.
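
To make that concrete with the bonnie figures above (rough 
arithmetic): sequential block writes moved 21839 K/sec at 13% CPU 
with the cache enabled, about 1680 K/sec per CPU%, and 33153 K/sec at 
20% CPU with it disabled, about 1658 K/sec per CPU%.  The cost per 
page is essentially unchanged; the faster run is simply processing 
more pages.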


Comment 105 Matthias Wenthe 2004-07-22 17:16:00 UTC
Doug,

fair enough, and thank you for the explanation. Here are the tiobench
values for the same machine:

Dell Power Edge 2650, 2 x 2.8 Xeon, 2GByte RAM, PERC 3Di 2x73 GByte Raid 1
Debian Woody 3.0
Kernel 2.4.26 Red Hat/Adaptec aacraid driver 1.1-3

Read Write Cache enabled

tiobench
No size specified, using 1792 MB
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec

         File   Block  Num  Seq Read    Rand Read   Seq Write  Rand Write
  Dir    Size   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
   .     1792   4096    1  35.72 8.39% 12.05 1.54% 21.33 14.4% 8.687 3.89%
   .     1792   4096    2  19.68 4.89% 14.54 1.86% 20.24 21.2% 8.767 4.48%
   .     1792   4096    4  16.71 4.39% 16.27 3.47% 20.36 31.5% 8.438 6.30%
   .     1792   4096    8  15.43 4.44% 18.14 3.77% 20.36 37.0% 8.532 7.23%


Read Write Cache disabled

tiobench
No size specified, using 1792 MB
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec


         File   Block  Num  Seq Read    Rand Read   Seq Write  Rand Write
  Dir    Size   Size   Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
------- ------ ------- --- ----------- ----------- ----------- -----------
   .     1792   4096    1  57.55 13.9% 8.152 2.08% 27.80 19.1% 1.712 0.65%
   .     1792   4096    2  21.13 5.32% 9.246 1.77% 27.40 29.2% 1.701 0.92%
   .     1792   4096    4  16.80 4.36% 9.980 1.91% 24.46 37.8% 1.705 1.34%
   .     1792   4096    8  15.23 4.17% 10.53 2.19% 22.70 41.0% 1.709 1.50%


As expected, random reads and, even more impressively, random writes
benefit from the cache, whereas the sequential values are slightly
better with the cache disabled, which I find hard to explain.

Maybe you can help me with the interpretation?

Comment 106 Andrew Kinney 2004-07-23 23:48:20 UTC
On our systems, which are heavily multi-threaded (VPS servers) and do 
a lot of random read/writes, I'm not ready to take a 40% to 400% disk 
performance reduction.  With our disks running near their I/O 
capacity during backups, such a performance hit would be very 
noticeable.

Matt Domsch, once that new firmware is out of testing, please let us 
know here if you have a spare moment to do so.  It would help me and 
I'm sure it would help others as well.

Oh, and FWIW, since I made the changes I detailed earlier in this 
thread (turned on all caching, including on-drive) plus installed the 
newer firmware (from April 2004), I've moved our "bug triggers" (as I 
like to call them) back to the hardware that was having the trouble, 
and in the month-plus since then we've had no more trouble.  My 
fingers are still crossed that it stays that way. :-)  Lately we've 
been pressing the disk systems quite hard on the two systems that had 
the trouble (backups on top of normal activity), so I would expect 
that if it were going to have trouble, it would have done so by now.


Comment 107 Alex Sousa 2004-07-26 19:37:50 UTC
As described in Comment #67, we have a Dell PE4600 that has been 
falling victim to this bug since December '03. The reproducible 
trigger for the crash was our daily backup jobs, which had to be 
given up to prevent user madness and admin lynching. A few weeks ago 
I finally got the time to do firmware upgrades on the machine, 
bringing it to BIOS A11, ESM A31, and Perc3/Di 2-8-0 (build #6089). I 
kept the aacraid 1.1-4 driver (build #2323) with kernel 
2.4.21-9ELsmp. After turning the daily backups back on, I got 3 days 
of uptime, as opposed to 24 hours. I then followed Matt's advice, 
disabled the read/write cache on all containers, and am now up to 17 
days without a crash. It looks like the problem is finally solved if 
one can live without the r/w cache. I probably can, but will be 
eagerly awaiting the next firmware upgrades. Thanks for all the work!

Comment 108 Andrew Kinney 2004-07-28 23:15:00 UTC
As seems to be typical of issues like this, I spoke too soon.  Within 
hours of posting that the problem appeared resolved with the newer 
firmware, the bug was triggered again.  Ironically, bugzilla was down 
at the time as well.  It probably doesn't mean anything, but the 
number of context switches (we graph a lot of things on our servers) 
went from an average of 12,000 per second to almost 8 million per 
second just before the system became completely unresponsive.  I 
suspect that may have just been a symptom of running processes losing 
access to the disks, though.

At any rate, the bug is still alive and well in our machines.  As 
much as I hate to do it, I may just break down and turn off caching.  
Since we're running RAID5, we're probably going to take a big 
performance hit and overall system load will increase (waiting on 
disks), but I suppose that's better than a 2 hour boot sequence after 
a crash.

if ( $rock>$hardplace )
{
   $trash=cutoff ( $arm );
   print "$expletive. That hurt.\n";
}
elsif ( $hardplace>$rock )
{
   $trash=cutoff ( $leg );
   print "$expletive. That hurt.\n";
}
else
{
   $trash=cutoff( $leg ) + cutoff( $arm);
   print "$expletive. That hurt.\n";
}


Andrew

Comment 109 Robert Becker 2004-08-02 23:35:56 UTC
In addition, I believe the patch is corrupting the Model string 
reported in /proc/scsi/scsi; it shows up as a blank string.  I 
checked against the prebuilt 2.4.22-686 .o and noticed the same 
effect.  I tracked the vendor/model string into the 
aac_get_container_name->aac_fib_send() call, where it gets corrupted.

Comment 110 Andrew Kinney 2004-08-26 22:16:13 UTC
For what it's worth, the latest firmware for the PERC3/DI (2.8.0 
build 6092) doesn't fix the problem for us.  Within hours of 
installing the new firmware and booting up, our crasholicious PE2500 
took yet another nosedive with the exact same symptoms as before 
(device I/O error and a whole slew of ext3 errors after the 
controller was marked dead).  All software, firmware, and BIOS are 
the latest available versions.

It is possible that we simply have a different problem, so I've taken 
Dell's advice and disabled the controller cache.  As expected, the 
system load quadrupled and performance is sluggish.  To prove that we 
are having the problem the firmware was supposed to fix, we'll have 
to witness an absence of crashes after turning off the controller 
cache.  That can be problematic when the crash may take an hour or a 
month to manifest itself.  If we do see another crash with the cache 
off, then we probably aren't dealing with the issue the new firmware 
was supposed to fix, and I'll have to start harassing Dell about it.

Comment 111 Jean-Philippe Houde 2004-08-27 12:21:08 UTC
Andrew,

on my side, I had that problem on a PE2650 and the new firmware fixed 
it.  But I also had the same problem on an IBM xSeries 255 (which is 
obviously not using the PERC 3/DI controller).  There, it seems the 
problem was a defective disk.  All disks had tested OK, but one day I 
came in and noticed the little amber LED on one disk.  Since I 
changed that disk, the problem has never come back.  Hope this can 
help you!

Comment 112 Andrew Kinney 2004-09-01 02:17:56 UTC
Jean-Philippe Houde,

I appreciate the friendly advice, but this is a systemic problem and 
not isolated to a single machine.  We have two identical PE2500 
servers that both exhibit the trouble and neither has any failed 
disks.  

I always check for a failed drive first when a server comes back up 
from a hard reboot after one of these crashes, since one of the 
symptoms is that the controller always marks drive 1 (the second 
drive of the five-disk array) as bad, but then immediately goes into 
rebuild mode once it is rebooted after a crash.

Now, before anyone tells me that it is obviously a bad drive, the 
clincher is that the exact same drive (drive 1, the second drive in 
the five-disk RAID5 array) gets marked bad on both servers when the 
problem occurs, the SMART logs on the drive show no grown defects, 
and changing the drive does not help.  It's almost as if the 
controller just automatically fails the second disk in the array 
under heavy cached write I/O.  I can't explain it.

At any rate, after turning off the write cache (I left the controller 
read cache on), both servers have seen some brutal activity (clueless 
web hosting customers with poorly programmed database driven web 
sites) and the only result has been really poor disk performance, as 
you might expect.  

We have been running this way for a few days now and not had any more 
trouble with crashes, so my suspicion at this point is that the 
problem we're having is indeed related to the controller write cache 
and it is not a defect with just a single machine.

I'll be spending most of the night on the phone with Dell until they 
can replace the RAID controller with one that works in both of these 
machines or find an acceptable work-around that doesn't involve 
destroying disk performance.

Andrew

Comment 113 Lance French 2004-09-03 23:41:57 UTC
FYI, RHEL 3 Update 3 was released September 2nd and includes the
aacraid 1.1.5.2340 driver, or a variation of it.  It's hard to say
because I haven't seen any U3 release notes around yet.  

AFA0> controller details
...
Component Revisions
-------------------
                CLI: 2.8-0 (Build #6076)
                API: 2.8-0 (Build #6076)
    Miniport Driver: 1.1-5 (Build #2340)
Controller Software: 2.8-0 (Build #6092)
    Controller BIOS: 2.8-0 (Build #6092)
Controller Firmware: (Build #6092)

Comment 114 Joshua M. Thompson 2004-09-07 01:45:43 UTC
For what it's worth, the 6092 build has made the problem less
frequent on my servers, but we are still seeing crashes across three
different 2650s. What is interesting is that to this day I can
*still* get a rock-solid system by building my kernels with the
version of the aacraid driver from 2.6.5. With the driver from
2.6.6+, average uptime is about 2 days with pre-6092 firmware and 7
days with the 6092 firmware.

My 2.6.5 machines never crashed; they topped out at about 60 days
uptime which is when we rebooted to try the new firmware build and
newer kernel driver. Are we sure there isn't a driver problem here as
well?


Comment 115 Andrew Kinney 2004-09-08 06:44:44 UTC
We've now gone almost two weeks without a crash after turning off our 
write cache (we left read cache on, though it doesn't do much in our 
case), so I'm inclined to believe that we are indeed still having 
trouble with this controller.  Whether it be driver or firmware is 
anyone's guess at this point (developers seem convinced it's 
firmware), though I'm still leaning towards driver, kernel, or ext3 
issues since one does not see this issue with FreeBSD or RedHat 7.1 
or earlier installed (those OS's don't use ext3, they use ext2 or 
UFS).  It's only with RedHat 7.3 or later that these issues surface, 
and, as Joshua mentioned, some of the newer kernels don't appear to 
show the problem.  We never had the problem on these exact same 
machines for about a year that we ran them with RedHat 7.1 with ext2 
and reiserfs partitions under the same loads that are now causing 
crashes.

Just for kicks, has anyone tried reformatting with ext2 instead of 
ext3 to see if the problem goes away?  Wouldn't that just boggle 
everyone if this issue is specific to ext3 due to its added I/O 
overhead? :-)
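
For anyone who wants to try that without a full reformat, an ext3 
filesystem can be dropped back to plain ext2 by removing its journal; 
a sketch, with the device and mount point as placeholders:

umount /mnt/data
tune2fs -O ^has_journal /dev/sda3   # strip the journal; plain ext2 again
e2fsck -f /dev/sda3                 # check it before remounting
mount -t ext2 /dev/sda3 /mnt/data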

Our kernel (virtuozzo from SWSoft):
2.4.20-020stab009.24.777-smp #1 SMP Tue Aug 17 13:42:53 MSD 2004 i686 
i686 i386 GNU/Linux

Our controller info:
Component Revisions
-------------------
                CLI: 2.8-0 (Build #6076)
                API: 2.8-0 (Build #6076)
    Miniport Driver: 1.1-4 (Build #9999)
Controller Software: 2.8-0 (Build #6092)
    Controller BIOS: 2.8-0 (Build #6092)
Controller Firmware: (Build #6092)

The driver we're using is essentially aacraid 1.1-4 (build #2323) 
ported to the virtuozzo kernel (hence the build #9999).  We're 
currently using ext3 for the filesystems.

Andrew

Comment 116 Christopher Barton 2004-09-08 18:51:45 UTC
For us, the problem was there in RHL7.2 as well.  However, it 
happened a lot less often, and it resulted in a panic (completely 
fatal) rather than the I/O error (mostly fatal).

We've back-leveled our driver to 0.9.9ac4-rel.

Comment 117 Andrew Kinney 2004-09-09 00:12:58 UTC
Christopher,

> For us, the problem was there in RHL7.2 as well.

What filesystem were you using? ext2 or ext3? I think ext2 was 
default until RH 7.3, but I'm pretty sure it was a prominent option 
in RH 7.2.

> We've back-leveled our driver to 0.9.9ac4-rel.

Has this worked for you?  If so, what filesystem, kernel, and 
controller firmware are you using?

Andrew

Comment 118 Christopher Barton 2004-09-09 01:31:58 UTC
We were using ext3 in RHL7.2.  The 0.9.9ac4-rel driver works as well 
as any version of RHL ever did.  These days we are running the RHEL3 
kernels with aacraid modifications, ext3, and firmware ~6089.  The 
I/O error problem started with the initial release of RHL8.0.  RHL7.3 
was later infected via the errata kernels, but RHL7.3 didn't 
initially have the I/O problem.  But again, the occasional aacraid 
panics were there in RHL7.2/7.3 with ext3.

We plan to apply the 6092 firmware during our next patching cycle.

Comment 119 Andrew Kinney 2004-09-09 03:20:56 UTC
Thanks for the reply, Christopher.  Others can draw what conclusions 
they will, but I think the combined experiences shown here and in 
other discussions elsewhere point the finger pretty squarely at ext3. 
While it may not be the underlying root cause of the problem, it 
appears to be what causes the problem to manifest itself.  At this 
point, all I care about is getting rid of the problem.  If a man with 
heart disease can't cure it, he'll still avail himself of whatever he 
can to keep his heart beating.

One thing I did find peculiar is that ext3 chose the "ordered data" 
mode of journaling operation as its default, while most other 
journaling filesystems chose "writeback" as their default journaling 
method.

From 'man mount' regarding fs mount options for ext3:

data=journal / data=ordered / data=writeback
Specifies the journalling  mode  for  file  data.   Metadata  is
always journaled.

journal
   All  data  is  committed  into the journal prior to being
   written into the main file system.

ordered
   This is the default mode.  All data  is  forced  directly
   out  to  the main file system prior to its metadata being
   committed to the journal.

writeback
   Data ordering is not preserved - data may be written into
   the  main file system after its metadata has been commit-
   ted to the journal.  This is rumoured to be the  highest-
   throughput  option.   It  guarantees internal file system
   integrity, however it can allow old  data  to  appear  in
   files after a crash and journal recovery.


I'm going to perform an experiment (unless someone has already tried 
this and it still failed) that involves changing to "writeback" 
journaling mode for the non-root filesystems where all the data 
activity resides.  My theory, for the moment, is that the mechanism 
in ext3's ordered data mode that 'forces' data to disk before the 
metadata is somehow jamming the write cache hard in certain 
circumstances, and the controller goes unresponsive.  By changing to 
"writeback" journaling mode, the hope is that the OS will be able to 
use 'lazy' writes, much like FreeBSD's UFS with softupdates, which 
doesn't exhibit this problem on this controller.
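
For anyone repeating the experiment, the mode is selected per 
filesystem at mount time; a sketch, with the device and mount point 
as placeholders:

/dev/sda3  /mnt/data  ext3  defaults,data=writeback  1 2   (/etc/fstab)

or, after unmounting:

mount -t ext3 -o data=writeback /dev/sda3 /mnt/data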

Any comments?  Am I nuts or on to something?

Andrew

Comment 120 J. Parsons 2004-09-09 03:57:20 UTC
For what it's worth, I found that disabling hyperthreading worked 
around this issue for me.  I believe you're on the right track that 
anything causing fast, synchronous disk writes will tickle this bug.  
My guess is that disabling hyperthreading just slowed the filesystem 
code down enough that it couldn't overload the controller.  I would 
*guess* that running writeback journaling will decrease the incidence 
of the bug, but won't make it go away.

The only solution I found was to get rid of our PERC3/DCs and replace 
them with PERC3/DIs.  We haven't had a single problem since upgrading 
to the DI.  If you argue a bit, Dell might give you a discount on 
trading in the DC for a DI.

For the archives, I encountered this problem on at least three 
separate systems, with write caching both enabled and disabled, and 
with the ext3 filesystem.  I *do not* think this is ext3's fault -- I 
just think that ext3's journaling may be more likely to actually 
*use* the controller.

Hope this helps.
 - Jason Parsons

Comment 121 Andrew Kinney 2004-09-14 19:16:44 UTC
Quoting myself:
>I'm going to perform an experiment (unless someone has already tried
>this and it still failed) that involves changing to "writeback" 
>journaling mode for non-root filesystems where all the data activity 
>resides.

The experiment failed.  Within two days of turning on write cache 
after changing to "writeback" journaling mode, the crash occurred 
again.  So, it sounds like Jason Parsons was right about ext3 just 
being an agitator and not the cause.  

We've turned off write cache again.  We're currently working to get 
the 1.1-5 driver compiled into our kernel at the suggestion of Dell 
in the hopes that when/if that doesn't fix it, we'll have more 
effective options presented to us by Dell.

Comment 122 Andrew Kinney 2004-09-16 20:30:00 UTC
For what it's worth, our problem is likely not a firmware problem, 
though it is manifesting itself very similarly.  This morning, one of 
the affected systems crashed with the same symptoms even with the 
RAID cache turned off.  I'm on the phone with Dell at the moment to 
see what they can tell me.  I'll probably duck off this thread until 
I can find a solution, to avoid cluttering otherwise useful 
information.

Comment 123 J. Parsons 2004-09-16 20:34:23 UTC
You should be aware that there is a bug in some versions of the 
firmware that causes the RAID cache not to be turned off even when it 
is configured to be off (and the UIs all say that it is off).  The 
workaround was to set the cache to 'enable if protected', then pull 
the battery off the controller.  Ugh.

I don't recall any further details, unfortunately.

Comment 124 Becky Sander 2004-09-27 18:36:17 UTC
I am seeing this problem on a RedHat 9 box with an Adaptec 2200S RAID 
card and JBOD.  I have tried both the smp and non-smp kernels with 
the same results.  I have verified the media with the Adaptec 
software.  As mentioned in comment #123, I have the cache set to 
"enable if protected" and there is no battery installed.  I'm pretty 
sure I have the latest RAID card firmware and driver.  I can recreate 
this crash easily, just by launching a full backup.  My system does 
not hang, but the filesystem becomes unusable: df shows the fs 
mounted, but ls -l shows zero files, and I can't umount it or fdisk 
the device.  I'd sure appreciate some help; this system is unusable 
in this condition.  Here are my firmware, driver, and kernel info and 
logfile portions:


Adaptec Raid Controller 1.1-4[2322]
Vendor: ADAPTEC  Model: Adaptec 2200S
kernel: 4.1-0[7244]
monitor: 4.1-0[7244]
bios: 4.1-0[7244]

                CLI: 4.1-0 (Build #6151)
                API: 4.1-0 (Build #6151)
    Miniport Driver: 1.1-4 (Build #2322)
Controller Software: 4.1-0 (Build #7244)
    Controller BIOS: 4.1-0 (Build #7244)
Controller Firmware: (Build #7244)
Controller Hardware: 2.64

Red Hat Linux release 9 (Shrike)

uname -a
Linux ladon 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686
i386 GNU/Linux

Sep 25 07:13:22 ladon kernel: aacraid: Host adapter reset request. SCSI hang ?
Sep 25 07:14:22 ladon kernel: aacraid: SCSI bus appears hung
Sep 25 07:14:32 ladon kernel: scsi: device set offline - command error recover failed: host 1 channel 0 id 0 lun 0
Sep 25 07:14:32 ladon kernel:  I/O error: dev 08:21, sector 217841696
Sep 25 07:14:32 ladon kernel:  I/O error: dev 08:21, sector 218890256
Sep 25 07:14:32 ladon kernel: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 6000000
Sep 25 07:14:32 ladon kernel:  I/O error: dev 08:21, sector 218995960
Sep 25 07:14:32 ladon kernel:  I/O error: dev 08:21, sector 218995968
Sep 25 07:14:32 ladon kernel:  I/O error: dev 08:21, sector 218995960
Sep 25 07:14:32 ladon kernel:  I/O error: dev 08:21, sector 218890256
Sep 25 07:14:32 ladon kernel: EXT3-fs error (device sd(8,33)): ext3_get_inode_loc: unable to read inode block - inode=13680655, block=27361282
Sep 25 07:14:32 ladon kernel:  I/O error: dev 08:21, sector 0
Sep 25 07:14:32 ladon kernel: EXT3-fs error (device sd(8,33)) in ext3_reserve_inode_write: IO failure
Sep 25 07:14:32 ladon kernel:  I/O error: dev 08:21, sector 0
Sep 25 07:14:33 ladon kernel:  I/O error: dev 08:21, sector 13072
Sep 25 07:14:33 ladon kernel: journal_bmap_Rsmp_e68c71a3: journal block not found at offset 2511 on sd(8,33)
Sep 25 07:14:33 ladon kernel: Aborting journal on device sd(8,33).
Sep 25 07:14:33 ladon kernel:  I/O error: dev 08:21, sector 4776
Sep 25 07:14:34 ladon kernel:  I/O error: dev 08:21, sector 218894352
Sep 25 07:14:34 ladon last message repeated 98 times

Comment 125 Andrew Kinney 2004-09-28 06:18:00 UTC
Our problem is likely triggered by bad hardware causing a single 
drive in our array to go unresponsive, but the handling of the 
unresponsive drive by the firmware and/or driver is ultimately what 
causes the controller to be unresponsive long enough that the OS 
marks it dead.

I noticed something about the controller log from our last crash.  
Some of the log messages carry times, given as the number of seconds 
since controller boot.  This gives us a way to calculate how long the 
controller took to finish its fiddly-dinking with the unresponsive 
drive.  From that log:

[50]: ID(0:01:0) Cmd[0x28] Fail: Block Range 3424717 : 3424718 at
[51]:  509184 sec
.....
[78]: ID(0:01:0) Cmd[0x28] Fail: Block Range 0 : 0 at 509262 sec


From those timestamps, 78 seconds (509262 - 509184) elapsed between 
the two log messages.  The whole operation likely took longer, since 
other things were done before and after those times were logged.

I'm making an educated guess that due to problems with the firmware 
and/or driver code, the controller doesn't respond to the OS during 
that entire period.

Correct me if I'm wrong, but I believe the SCSI subsystem in Linux 
waits a maximum of 60 seconds before it determines that a controller 
is unresponsive and issues a reset.  I'm also guessing that the 
reason the controller doesn't respond to the reset is that it is 
already jammed with commands queued up during the prolonged period it 
spent trying to reset the unresponsive drive.

Of course, this raises the question: why is the drive going 
unresponsive?  We're working with Dell to get that answer. 

It also raises the question: why does the aacraid firmware and/or 
driver take so long to determine the drive is unresponsive?  That's 
where firmware and driver programmers come in.  If Linux expects a 
controller to respond in under 60 seconds, wouldn't it stand to 
reason that the firmware should either finish uninterruptible work in 
less than 60 seconds or make the work interruptible?

I'm speaking in general terms because I'm not a programmer by trade 
(only by necessity), but the logic should be clear enough.

Can anybody else confirm similar lengths of time in their controller 
logs?  You can get the controller log by going into afacli (mine is 
version 2.8-0, Build #6076), opening your controller, and running the 
following command (assuming this is your first controller boot after 
the crash):

diagnostic show history /old
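
The full session looks roughly like this (the controller name is 
whatever your system reports; afa0 is typical for the first 
controller):

afacli
AFA0> open afa0
AFA0> diagnostic show history /old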

Andrew


Comment 126 Andrew Kinney 2004-10-05 00:40:21 UTC
LSI has apparently identified a kernel issue (note I said *kernel*, 
not driver) in which the kernel writes "phantom" commands to the 
controller.  Granted, that is a different driver, but could a similar 
thing be happening with other drivers?  If I'm way off, that's fine; 
it's just an idea that hasn't been looked at in this thread before.

Reference:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=118432#c10

Comment 127 Javier Rodriguez 2004-10-21 01:00:47 UTC
On October 10th, we loaded the following configuration on our Dell 
2650:

   Server: Dell PowerEdge 2650 with PERC3/DI
   System BIOS: A19
   Backplane firmware: 1.01
   PERC3/DI firmware: 2.8.0 (build 6092)
   ERA firmware: 3.14
   Hyperthreading: Enabled
   Linux OS Version:  2.4.21-20.ELsmp (RedHat Enterprise Linux 3.0)
   Aacraid driver version: 1.1.5 (installed with RedHat)
   
So far, the system has been running error free and performing much 
better than with prior firmware and software releases. Previously, 
the server would usually fail within 48 hours and the longest 
availability time we experienced was two weeks. We'll provide another 
update next week.


Comment 128 Javier Rodriguez 2004-10-24 12:34:05 UTC
We are now at the two-week mark (reference post #127) and we are 
still running error free. A correction to post #127: the longest 
availability time we previously experienced was one week, not two 
weeks as noted in the post.

Comment 129 José Morelli Neto 2004-10-28 17:39:50 UTC
Javier,

had you turned off the write cache in the BIOS?

Thanks,
 Neto.

Comment 130 Javier Rodriguez 2004-10-30 03:51:35 UTC
As per our discussion, the write cache parameter is set to disabled. 
That matches the Dell documentation you referenced, which states that 
this is both the default setting and the recommendation.

Comment 131 Javier Rodriguez 2004-10-30 03:55:50 UTC
We are now reaching the three-week mark and are still running error 
free. We are going to implement the same maintenance on our second 
server. It appears that the configuration in post #127 has corrected 
the problem. Thank you to everyone who helped to resolve it.

Comment 132 Javier Rodriguez 2004-11-07 03:51:42 UTC
We are considering this item resolved based on the implementation of 
the items in comment #127.

Comment 133 Mimmus 2004-11-17 14:53:51 UTC
Any news about this problem on the newer Dell PE2850 servers?

Comment 134 Andrew Kinney 2004-11-17 17:17:50 UTC
IIRC, the PE2850 uses a different SCSI/RAID chipset, so any problems 
it has will be completely different from what was represented here.

For posterity's sake, our problem was resolved by some combination of 
the following, all done simultaneously:

1. Replaced the drive that the system marked as bad, though it had no 
visible operational problems and came back fine after a power cycle. 
It was replaced with a newer, different-brand U320 10K RPM drive, 
though we're only using it in U160 mode.  The newer drive seems a bit 
snappier at processing SCSI commands, but that could just be my 
imagination.

2. Replaced the SCSI backplane.

3. Replaced the cables.

4. Replaced the mainboard since the RAID controller was on the 
mainboard.

5. Changed from RAID/SCSI (channel 1, channel 2) mode to RAID/RAID 
mode since that is the default and probably best supported, though 
we're not using channel 2.

6. Rebuilt the entire array using a 64KB stripe size instead of a 
32KB stripe size since bigger appears to be better when it comes to 
minimizing physical I/O requests during large transfers.

7. Used the latest BIOS and firmware on the mainboard and controller. 
A07 and 2.8-0 (Build #6092).

8. Used the newest driver I could get my hands on for my commercial 
2.4 series kernel (1.1-4 Build 2323).  From what I could tell from 
the changelog (which is a bit sketchy), there weren't many changes 
between that version and 1.1-5, so I couldn't see a need to go 
through the hassle of getting 1.1-5 on my kernel.

At any rate, I'm not sure if my experience will help anyone else, but 
there it is anyway.

Andrew