Bug 92129
Summary: (SCSI AACRAID) kernel: aacraid: Host adapter reset request. SCSI hang ?
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Javier Rodriguez <jlrodriguez>
Assignee: Doug Ledford <dledford>
CC: andykinney, anielsen, asousa, bbrock, bryan, bugzilla-redhat, cpbarton, david_gardi, diego_leccardi, elliot, gregj, honermeyer, jacob_liberman, james, jcpeck, jdenny, jharrop, jparsons-redhat, kevin.mcmahon, k.georgiou, magnus.ahl, manuel.wenger, mark_salyzyn, nospam, pd, petrides, redhat.com, rknigh, roy.olsen, sturolla, sysadmin, tao, timh, viggiani, wenthe
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2004-11-07 03:51:42 UTC
Description
Javier Rodriguez
2003-06-03 00:57:52 UTC
Currently working with Adaptec on updating aacraid and fixing bugs in it.

Same problem here with the same hardware, but on RedHat 7.3, kernel 2.4.20-19.7.

We have the same problem with PowerEdge 1650 and 2650. We tried different kernel versions and RedHat releases (7.3 and 9); kernels tried: 2.4.18-18.7.x, 2.4.18-24.7.x, 2.4.18-26.7.x, 2.4.20-13.7, 2.4.20-19.7 and 2.4.21.ac2-rc2. As a workaround I removed the RAID controller from some 1650s and reinstalled the machines with only the SCSI interface connected. We haven't had any more crashes in the last month!

Does anyone developing the aacraid driver have an update regarding the problem below? Disabling HyperThreading (Logical Processor) within the Dell 2650 BIOS has without a doubt circumvented the problem for us (as well as a few others), but it would be nice to reenable the feature. For reference, with HyperThreading disabled, we've been able to successfully run Red Hat's distribution of kernel-2.4.20-9, kernel-smp-2.4.20-9, kernel-2.4.20-13.9, kernel-smp-2.4.20-13.9, kernel-2.4.20-18.9 and kernel-smp-2.4.20-18.9. We currently have two Dell 2650s running kernel-smp-2.4.20-18.9 for 74 days without incident. Prior to disabling HyperThreading, our systems would normally crash within 24 hours (no longer than 48 hours) with both the smp and non-smp versions of the kernel.

For most of our machines (1650s), disabling hyperthreading makes no sense as they have one CPU (Pentium III from 1.4 to 1.7 GHz) with no hyperthreading, of course. On the other hand, we have other 2650s that have been running for two or three months without problems, some of them with hyperthreading disabled. A couple of other 2650s had several crashes when they were used as FTP servers. I don't know what it really means, but it seems like something not really related to hyperthreading, only to high sustained I/O.

Any good news on the aacraid driver? The problem is still there for us. Thanks.

None really. I've got some possible test patches but not much else.

Based on the many posts I've been reading, it appears that the aacraid driver problems have been occurring for some time now, and although people are working on it, a formal resolution is not available. What can we do to help expedite a resolution for the ongoing aacraid problems? If there are possible patches available, can Red Hat provide an RPM or procedure to allow us to install and test the patches? Is there specific debug or configuration information that we can provide to help the aacraid developers better isolate the problem? Thanks.

I'm currently doing some testing in upstream kernels (2.4.22-ac4 patches). I guess whoever takes over from me will pick up on that, or Adaptec will eventually fix this in their driver updates, since they at least have firmware source/docs and so can trace such problems in detail. Mark Salyzyn <mark_salyzyn> is probably the person who can tell you best about any updated drivers and Adaptec-recommended updates, as well as give you test driver bits (removing myself from Cc as I'll be away for some time from tomorrow).

We have tried to involve DELL (Germany) several times to help us with this problem. After a lot of suggestions like "update the firmware here and there", their final conclusion was to subscribe to RedHat Enterprise Linux (ES) standard edition and ask RedHat for support :-) I hope someone can find a solution; otherwise we will be forced not to buy DELL servers anymore and to disable the RAID controller on the existing ones. I don't know if it can be of some help to involve Adaptec, but who better than Alan can know this.
I totally agree with the previous comment; if there is something we can do to help the developers, just let us know. There is an aacraid devel mailing list at http://lists.us.dell.com/pipermail/linux-aacraid-devel with DELL and ADAPTEC people, and others, who can help you.

Created attachment 95216 [details]
latest driver version from Mark_Salyzyn
I can confirm that under 2.4.20-8smp the problems persist. Now I'm running 2.4.20-8 (not smp) and the machine is still up after 3 days. I'm planning to update to 2.4.20-20.9smp/not-smp and retry.

I am having the same problem on a server running RHEL v3. Any news on this bug?

I had 2 system disks in RAID1 (mirror) and 2 single disks configured as 'Volume'. I have now reconfigured my disks from 'Volume' to 'Raid0' in the PERC 3/Di setup and my proxy server has been up for 40 days. Can anyone confirm this?

I had my disks in hardware Raid0. I've now turned off the RAID in the BIOS, and I'm using a Linux software Raid0. That hasn't crashed yet, and has been up for 24 days, which I think is a record for us with these machines.

The problem has also been discussed on Dell's PowerEdge mailing list at http://lists.us.dell.com/pipermail/linux-poweredge/ . On October 20th, Mark Salyzyn mentions that he has a driver workaround. Can anybody confirm this? And is RedHat going to include it?

Not the latest code, but close, is here: http://www.adaptec.com/worldwide/support/driverdetail.html?sess=no&language=English+US&cat=%2fOperating+System%2fLinux&filekey=aacraid-drv_1.1.4-rh9.rpm . And don't forget to update the firmware of the RAID board. If you have Seagate disks, this article describes a timeout problem that can be fixed by updating the firmware of the _disks_: http://www.seagate.com/support/disc/u320_firmware.html

I can confirm that we are having the same problem with a brand new Dell PowerEdge 2650 system, 2 x 2.8 GHz Xeon, PERC 3/Di, 4 Maxtor Atlas 73GB drives configured as one RAID5 container, running RedHat Enterprise Linux 3 with the latest patches installed (kernel is vmlinux-2.4.21-4.0.1.ELsmp).

With HW RAID boards it's essential to have the _latest_ firmware releases in the hard disks and RAID board. And the latest drivers ;)

I confirm that the problems disappeared after changing the container from 'Volume' to 'Raid0'. No experience with RAID5.

Created attachment 96638 [details]
logfile and system information from web1 (PowerEdge 2650)
We are still experiencing the same SCSI problems after building and running a decent static RHEL3-ES kernel with the new 1.1.4 driver from Mark Salyzyn posted here (attachment from 2003-10-15). The uptime was about 1 day.

I/O error: dev 08:02, sector 0
I/O error: dev 08:02, sector 3944840
I/O error: dev 08:02, sector 3942371
EXT3-fs error (device sd(8,2)) in ext3_reserve_inode_write: IO failure
I/O error: dev 08:02, sector 3881238
...

Please see the attachment "logfile and system information from web1 (PowerEdge 2650)" for details. If you need more information, please contact me directly. Cheers, Peter. PS: I also opened a service call with Dell Germany, but they told me that this is not a hardware but a driver problem and that RedHat is responsible for fixing it.

For support on RHEL bugs, bugzilla is the wrong medium; contact your RH support contact instead.

Arjan: We are currently evaluating RHEL3-ES (x86) on the Dell 2650 platform and we were not able to file a bug against RHEL because we have not activated the product yet. This is because we fear that we will have to replace the hardware platform (other PC-based servers) in the future due to the pending aacraid problem. Can you tell me how we can file a bug against RedHat ES without product activation?

I am experiencing this problem on two systems, a 2650 w/hyperthreading/smp and a 1650 (non-smp). Both systems run RH 7.3 with the latest available kernel. Has anyone else seen this bug on a non-smp kernel besides me? Thanks.

The 1650s on which we are still experiencing this problem have only one CPU and no hyperthreading. My understanding is that this bug is strictly related to the aacraid driver for the PERC 3/Di and not to smp/hyperthreading.

The bug still persists even if it is not easily reproducible. On Jan 19, Matt Domsch from Dell said on the AACRAID mailing list: "aacraid was not updated in [Red Hat 3 AS] Update 1. We're still trying to duplicate the failure in our labs, and we've got a failing customer system on its way back to us for exactly that purpose. Once we have root-caused the issues, we'll work with Red Hat and kernel.org to get any driver-side fixes included."

Well, I'd say you can have mine, but being a production system I have to format the hard drive and install another OS. Looking at all of the information on the net about it, it seems to be affecting primarily Dell servers with Adaptec-based RAID controllers, mostly PERC3. I can send all of the logs that I sent RHES tech support and Dell Gold Support, who both said they couldn't help. It also appears to be affecting only systems based on the later 2.4.x kernels, because I had no problem with RH 7.2, but many people with 7.3 and newer are experiencing it. So, what has changed between these versions? This driver was rewritten around that time. This problem has been going on for at least a year judging from this bug report and the Adaptec and Dell lists. I hate to be the critic here, but somebody needs to take ownership of this problem and work with the vendors to get it resolved, and stop pointing fingers, making excuses and waiting for someone else to fix it.

At the risk of making Bugzilla into a forum, I figured I'd chime in to say our 2650, running RHEL AS 3, just had its first crash after running for approx 2 months without incident. The utilization is very light, but we are in the process of deploying it as an NFS/NIS server. We purchased a production-level Dell box (as opposed to a home-built server) for its reliability. Oops.
Are there any tests (seti@home, bonnie, etc.) that are known to trigger a crash? The "once every two months" is almost more disturbing than "once a night". Hopefully this will get addressed very soon.

I have moved my production system to an old Compaq server to allow for testing this issue on the Dell that is experiencing the problem. First, I disabled the postgres database on this server and it appears that the crashing has not happened over the last few days. But I had set up a cron job to reboot the server to try to keep it from crashing too often. I've removed this reboot and will see if stopping the postgres database truly does have an effect on the server. With it running, the system wouldn't stay up more than 8-12 hours. Also, a contact at Dell had me check the cache setting on the RAID controller. The write caching was disabled, so I've now enabled it. If anyone out there is experiencing this same 'scsi hang' issue with their RAID controller and the aacraid driver, I would suggest checking to see if postgres is running. If it is and you don't need it, then shut it down and see what happens.

This has been happening to me on almost a daily basis. I have two dual 2.4GHz Dell 2650's running RedHat 8.0 with kernel 2.4.23, with two drives in a RAID-1 configuration. I have tried disabling Hyperthreading to no avail. We are looking at using more of these systems as we migrate from Windows 2000 to Linux (specifically RedHat Enterprise 3.0 and Fedora), but as of now our production Linux systems are still on RedHat 8.0. So far I have two registered AS 2.1 systems and one AS 3.0 system that do not have this problem, but they are different hardware (Dell 6650). Any advice would be appreciated.

This is the root of the fix that is part of the source code in this thread on October 15 2003. The 2.6 tree has had this fix for some time; the 2.4 tree is still cogitating over accepting this patch. Please apply this patch *NOW* to the drivers/scsi/aacraid/linit.c file. The original was taken from 2.4.21-9.EL. Sincerely -- Mark Salyzyn

--- linit.c.orig    Fri Jan 30 05:38:19 2004
+++ linit.c Fri Jan 30 05:38:38 2004
@@ -605,8 +605,99 @@
 static int aac_eh_reset(Scsi_Cmnd* cmd)
 {
-    printk(KERN_ERR "aacraid: Host adapter reset request. SCSI hang ?\n");
-    return FAILED;
+    Scsi_Device * dev = cmd->device;
+    struct Scsi_Host * host = dev->host;
+    Scsi_Cmnd * command;
+    int count;
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0))
+    unsigned long flags;
+#endif
+
+    printk(KERN_ERR "%s: Host adapter reset request. SCSI hang ?\n", AAC_DRIVER_NAME);
+    if (nblank(dprintk(x))) {
+        int active = 0;
+
+        active = active;
+        dprintk((KERN_ERR "%s: Outstanding commands on (%d,%d,%d,%d):\n", AAC_DRIVER_NAME, host->host_no, dev->channel, dev->id, dev->lun));
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0))
+        spin_lock_irqsave(&dev->list_lock, flags);
+        list_for_each_entry(command, &dev->cmd_list, list)
+#else
+        for(command = dev->device_queue; command; command = command->next)
+#endif
+        {
+            dprintk((KERN_ERR "%4d %c%c %02x %02x %02x %02x %02x %02x %02x %02x %02x %02x\n",
+                active++,
+                (command->serial_number) ? 'A' : 'C',
+                (cmd == command) ? '*' : ' ',
+                command->cmnd[0], command->cmnd[1], command->cmnd[2],
+                command->cmnd[3], command->cmnd[4], command->cmnd[5],
+                command->cmnd[6], command->cmnd[7], command->cmnd[8],
+                command->cmnd[9]));
+        }
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0))
+        spin_unlock_irqrestore(&dev->list_lock, flags);
+#endif
+    }
+
+    if (aac_adapter_check_health((struct aac_dev *)host->hostdata)) {
+        printk(KERN_ERR "%s: Host adapter appears dead\n", AAC_DRIVER_NAME);
+        return -ENODEV;
+    }
+
+    /*
+     * Wait for all commands to complete to this specific
+     * target (block maximum 60 seconds).
+     */
+    for (count = 60; count; --count) {
+        int active = 0;
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0))
+        __shost_for_each_device(dev,host) {
+            spin_lock_irqsave(&dev->list_lock, flags);
+            list_for_each_entry(command, &dev->cmd_list, list) {
+                if (command->serial_number) {
+                    ++active;
+                    break;
+                }
+            }
+            spin_unlock_irqrestore(&dev->list_lock, flags);
+            if (active)
+                break;
+        }
+#else
+        for (dev = host->host_queue; dev != (Scsi_Device *)NULL; dev = dev->next) {
+            for(command = dev->device_queue; command; command = command->next) {
+                if (command->serial_number) {
+                    ++active;
+                    break;
+                }
+            }
+        }
+#endif
+        if (active == 0)
+            return SUCCESS;
+#if (defined(SCSI_HAS_HOST_LOCK) || (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)))
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,4,21)) && ((LINUX_VERSION_CODE > KERNEL_VERSION(2,4,21) || !defined(CONFIG_CFGNAME)))
+        spin_unlock_irq(host->host_lock);
+#else
+        spin_unlock_irq(host->lock);
+#endif
+#else
+        spin_unlock_irq(&io_request_lock);
+#endif
+        scsi_sleep(HZ);
+#if (defined(SCSI_HAS_HOST_LOCK) || (LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)))
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,4,21)) && ((LINUX_VERSION_CODE > KERNEL_VERSION(2,4,21) || !defined(CONFIG_CFGNAME)))
+        spin_lock_irq(host->host_lock);
+#else
+        spin_lock_irq(host->lock);
+#endif
+#else
+        spin_lock_irq(&io_request_lock);
+#endif
+    }
+    printk(KERN_ERR "%s: SCSI bus appears hung\n", AAC_DRIVER_NAME);
+    return -ETIMEDOUT;
 }
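For anyone wanting to try this against a stock Red Hat kernel source tree, the mechanics are roughly as follows. This is only a sketch: the source path matches the 2.4.21-9.EL kernel mentioned above, the patch filename is illustrative, and the hunk may need fuzz on other trees:

    # save the hunk above as /tmp/aac_eh_reset.patch first
    cd /usr/src/linux-2.4.21-9.EL/drivers/scsi/aacraid
    # dry-run to check the hunk still applies to your tree, then apply it
    patch -p0 --dry-run < /tmp/aac_eh_reset.patch && patch -p0 < /tmp/aac_eh_reset.patch
    # rebuild and reinstall the modules (2.4 trees want 'make dep' after source changes)
    cd /usr/src/linux-2.4.21-9.EL
    make dep && make modules && make modules_install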
Note to RedHat: Dell is shipping systems with RHEL 3.0 pre-installed, and this bug present. This is a big problem for the Dell/RedHat relationship. I strongly suggest RedHat management escalation.

Mark: This patch seems to address how we reset the SCSI bus after it locks up. Any idea why it's locking up in the first place? Is that an expected thing? Thank you. ... also, I have a system that is suffering these lockups as often as once per week. If you need any debugging information from this system, feel free to contact me at jparsons dash redhat at saffron dot net.

I just spent a week working with Dell engineers and their final response is that they haven't signed off on RHES 3.0 yet and that the vendor of the Lyris Listmanager software is reporting bugs with RHES 3.0. Well, the last part is bogus and just an excuse for no one to take responsibility again. They do have issues, but I spoke with their engineers and it's simply an interface issue to do with Tcl/Tk. I did check my PERC3/Di configuration and found the caching turned off. I turned it on and it's been running for 3 days without crashing, whereas before it wouldn't go 8-12 hours without locking up with a SCSI hang. I would like to see if anyone else could check this setting and see if it makes any difference. I have yet to apply the patch that Mark Salyzyn posted above. I will probably do that this week, before I rebuild the system with a different OS.

I just "had my first" this evening (hang, that is).
Please let me know if you need debugging information. I had to power-cycle the box to bring it back to life. The error/debug messages are all identical to previous reports.

My 2650 has also been suffering from this problem. I have rebuilt it a few times, every time scrubbing the 3-disk RAID 5 array, and it will run fine for up to 3 months. Then the SCSI hangs start to occur, first at 1-week intervals, then to the point of me waking up every day at 3:00 am to reset it. I have tried turning off hyperthreading, and will also try turning caching off. I cannot believe that with all the people reporting this identical problem, Dell and RedHat have not taken any ownership. I have also seen posts with similar Windows errors on this same system. Unfortunately this is already a production system, so it looks as if I will be forced to replace it.

This bug report is less than useful to both RH and Dell, actually. It is a mishmash of different reports about different OSes and different hardware. The bug is filed against Red Hat Linux 9 and not Red Hat Enterprise Linux or Red Hat Linux 8 or ... Bugs in different products are handled by different people, and bugzilla is also *not* a support mechanism where Dell or RH "owe" anyone a response within a certain amount of time. Red Hat appreciates bug reports, but in several cases, especially with hardware/firmware interactions, bugzilla is not the best medium since it bypasses all the regular mechanisms for getting joint problem-solving going with partners such as Dell.

Unfortunately, from your comments above and the responses that I and several other users have received from both RedHat and Dell, it is obvious that neither of you really cares about this problem. The problem does affect several versions of RedHat. From reading the posts here in bugzilla and the posts on both Dell's and Adaptec's forums, this problem has been around for over a year and is affecting every version of RedHat from 7.3 to 9.0 and ES 3.0. I have ES Standard support and Dell Gold support, and the response I received from both of those "support mediums" is essentially 'you're out of luck' and 'we really don't care'. So there is a good reason for this to be posted in Bugzilla, because contacting RHEL support did no good. Well, what is paid support for if it doesn't do any good? While I agree that a bug tracking system is not the appropriate place for unhelpful comments, yours included, this is the only place to share the different experiences that everyone is having with the same problem. You being the all-smart one, if you would like to point out a REAL way to get this resolved, then please shed some light on this for the rest of the community; otherwise keep your comments (Arjan) about telling everyone to go somewhere else for support to yourself.

If you are unsatisfied with the support Red Hat provides, that is very unfortunate, and I can try to escalate problems with that within our organisation, but I need some specifics for that (e.g. who did you talk to, about what, etc.). (Obviously I cannot escalate problems with the support Dell provides.) As for the problem: this *really* looks like a hardware/firmware issue since it happens with such a broad spectrum of drivers. I just looked at the third-level support queue and there is at least one similar complaint there where both Red Hat and Dell are investigating, and where the firmware route is also being investigated.
This really is the sort of issue that needs investigation by all the parties involved (RH, Dell, Adaptec), not just Red Hat, and for that bugzilla is the wrong tool. I wanted to point that out up front to make sure no false expectations about bugzilla arise.

If you want to provide me with a better place to post/send this information, I will be glad to use it. The person I spoke with is ckloiber. The response I received is that I would just have to wait for the downlevel developer to fix the problem (i.e. Alan Cox or Adaptec, since they developed the most recent version). I don't understand what you mean by 'such a broad spectrum of drivers'. All of the problems that I've seen here and on the Adaptec and Dell sites center around the aacraid driver. Dell was actually trying to help me for over a week, and Mark from Adaptec has posted a possible patch above. Both are more effort than I received from RedHat, I'm sorry to say. Dell ultimately came back and said they haven't signed off on version 3.0 of ES. Again, that's really not the issue, since the problem exists in 7.3/8/9, but not in 2.1, since it is based on 7.2. Right now, testing the patch that Mark provided, confirming the issue with aacraid and then getting it committed to the source tree would be the best route that RedHat/Dell/Adaptec could take. Then, if it is a firmware issue, a team effort to help Dell fix it would be the next step. Again, this problem looks centered around the aacraid driver included in ALL RH versions after 7.2. I really just want to help find a solution to this issue, so if there is any help I can provide, just ask.

We have all seen reports of this occurring on RedHat 8.0, 9.0 and unregistered AS3.0 systems, and AFAICT the aacraid driver revision is the same across these platforms. RedHat support isn't available for 8.0, 9.0 or unregistered AS3.0 systems, but if RedHat wants AS3.0 to work on *registered* systems then RedHat needs to pay attention to all such reports, not just those for which they are directly being paid. Someone, somewhere (at RedHat, Dell or Adaptec, or all three) needs to install AS 3.0 on a 2650 on hardware RAID 1, wait for this problem to occur, and fix it. And then a patch for the driver against the stock 2.4 kernel would be very helpful. As far as expectations with Bugzilla go, Bugzilla is an issue-tracking tool, yes? This issue is being tracked here because it was first reported here. Perhaps the version of RedHat affected by this bug entry needs to be changed from "9" to "AS3.0" - however, that's not even an option(!) All that matters is that we are evaluating RedHat AS3.0 and we aren't necessarily going to do it on a registered system. The code is the same either way, no? And I'm sure you can imagine how bad it will look to my management when I tell them I'm having a problem in our test environment with the AS 3.0 RAID device driver on the hardware we currently use in production, such that the system needs to be rebooted every day or two. So far I haven't had to make a final statement on the suitability of this for production use, but I can't continue delaying that forever. One aside: how could RedHat release AS3.0 without signoff from all the major hardware vendors? Dell is a pretty big hardware vendor for RedHat, no? These 2650's have version A15 of the BIOS - is it possible that a BIOS update could affect this? (I see a revision A17 available at dell.com.) I might try that. I'm also going to take a look at the caching for the RAID controller, as someone above mentioned.
And I'll try patching the driver with the patch included above in my next kernel build. Or I might have to see about disabling the hardware RAID entirely and using Linux software RAID. But that, of course, will degrade the performance of the system and defeats the purpose of purchasing Dell's hardware RAID in the first place. I wonder if Dell even sells these systems without that controller - and how much cheaper it would be... (and how much Dell would hate that). Anyway, thank you all.

Can anyone confirm that Mark Salyzyn's patch from Jan 30th has had a positive effect? Our production server (Dell 2650/RHES 3.0/RAID 1) is hard-locking about once a week and I'm desperate for a solution before we experience any data loss.

I still have not applied the patch, but after simply changing the cache setting on the controller to enabled, the machine has not locked up in over a week. It was locking up every 8-12 hours with zero load. Make sure you check this setting. The default setting from Dell was disabled and had never been changed until now.

Response to "Additional Comment #36 From J. Parsons (jparsons-redhat) on 2004-01-31 14:44": Sorry if I am not a frequenter of this `list' ... The root cause is that the Adapter places a high priority on flushing its internal cache, high enough that it is reticent to accept new commands for periods longer than a minute. The trigger for this cache flush varies, and is usually tied to high load and less-than-perfect activities over the SCSI bus. The real fix is for the firmware to reduce this cache flush priority, but we must deal with the combinations of firmware out in the field. The change I proposed in the driver is to wait for the Adapter to `come back' and complete all outstanding commands when the timeout mechanism is detected by the SCSI layer and it calls the bus reset handler of the driver. This fix also differentiates between adapters or busses that are in fact dead, and this temporary condition. Unfortunately I left in the code to check for adapter health, which requires more extensive changes in the hardware interface layers. This section of the patch can be dropped. However, I still suggest that Adaptec's latest driver code be used, as the driver in the kernel tree is full of bitrot (new patch submissions are automatically rejected because they are too large to go through the lists). Please feel free to contact me directly for the latest sources. Sincerely -- Mark Salyzyn <Mark_Salyzyn>

I've just installed Mark Salyzyn's latest RPM. For those using RHEL 3.0, Perl is evidently slightly broken in this distribution. Here is what I had to do to install the latest Adaptec RPM:

1. Install the RPM - lots of error messages during execution of install.sh will occur.
2. cd /opt/Adaptec/aacraid.rhel3
3. export LANG=en_US
4. export LANGVAR=en_US
5. ./install.sh

I still got one "sed" error during execution of install.sh, but everything seems OK anyway. Now running aacraid 1.1-4[2323].

There is one visible error in this fix - the return of -ETIMEDOUT from the SCSI reset handler isn't a valid return for SCSI reset handlers. (2.6 and 2.4 eh are a bit different; the 2.6 code is way saner as it can do stuff like sleeping properly.) 2.4 new_eh can be persuaded to do so but needs a bit more work. Otherwise this looks a reasonable quickfix.

The patch above doesn't compile in a current 2.4 tree. (No nblank, no aac_adapter_check_health.)

aacraid 1.1-4[2323] does not appear to fix this problem for us.
We have the new driver installed on *two* PE2500 servers with PERC3/DI RAID controllers and we're still seeing the driver mark the host controller as dead periodically. We're using a 2.4.20 kernel with a bunch of security patches. In our case, since our primary storage is attached to this controller, we get crashes when this happens. We didn't start having trouble until we upgraded from RedHat 7.1 to RedHat 9.0. These two systems ran perfectly for a year and a half with no crashes, so I'm not entirely sure I buy the explanation about the firmware problem. It may have contributed, but right now I suspect an overly zealous code performance optimization got into the 2.4 kernel tree and introduced disk subsystem timing issues that result in this problem. In the research I've done on this problem, it appears to affect all 2.4.2x kernels, no matter which distribution. Those who have had this problem and changed to a different OS (FreeBSD, for instance) have not had any more trouble.

We have a growing suspicion that a certain pattern of disk activity can actually trigger the problem almost at will. We suspect that when PostGreSQL is compiled with the right set of options and is fed a high volume of queries, it triggers the problem. This is still a preliminary theory and we still have a lot of testing to do to verify it, but so far all evidence is pointing us in that direction. Since we have two servers that are affected, we have the dubious honor of being able to move certain activities to certain servers. In this case, we're running some "virtual server" software (kind of like a chroot jail under FreeBSD) and we've managed to narrow down which virtual servers are causing the trouble for us. We had no crashes on RedHat 9.0 for several months. Then, out of the blue, we started having crashes after certain virtual servers were added to the system. When those virtual servers are migrated to the other PE2500, the crashes follow them. Their predominant activity is PostGreSQL usage. I know that doesn't necessarily mean PostGreSQL is the agitator, but as we dig deeper into this, the correlation continues to increase. For now, we have the suspect virtual servers running on an IDE-based system so I can get some sleep this weekend. I welcome any feedback, even if it is just to pick apart my theories. I'm willing to endure most anything as long as we're collectively able to arrive at some kind of solution.

This was posted to a Dell mailing list; might be something to look into:

-------- Original Message --------
Subject: aacraid problems summary
Date: Wed, 18 Feb 2004 12:25:21 +0000
From: Jose Celestino <japc.pt>
To: linux-poweredge

Could anyone summarize the latest developments in the Dell vs aacraid Perc 3/Di problems? I've been out of this list for a while. We recently had some Dell dudes coming here and they did a s/tg3/bcm5700/ on the network driver and also changed the driver we had for our Intel EtherExpress quad-port NICs. They said the instability on the RAID controller was due to the NIC drivers.

I hate to make this comment, but the resolution to this bug is long overdue. My company has over 160 Linux boxes that I've been thinking of migrating to RedHat Enterprise server (they are currently 7.3). As it stands, our main monitoring box continues to crash with all of the latest updates, and it's due to this bug. There is no way that management is going to go for such a sum of money when I can't even prove that these 7.3 boxes are stable (and as far as I know, the bug still exists in RHEL 3.0).
We recently loaded RedHat Enterprise Linux ES v.3 (from distribution CDs) and updated it to current levels using RedHat's 'up2date' facility. We ran the server in test for about two weeks without any failures. We are now trying it in production and hoping that everything will continue to remain stable. Here is the environment:

Server: Dell PowerEdge 2650 with PERC/Di
System BIOS: A17 (upgraded from A10)
Backplane firmware: 1.01
PERC3/Di firmware: 2.7-1 (build 3170)
ERA firmware: 3.0 (upgraded from 2.20)
Hyperthreading: Enabled
Linux OS version: 2.4.21-9.0.1.ELsmp
Aacraid driver version: 1.1.2 (based on the AAC_DRIVER_VERSION number in /usr/src/linux-2.4.21-9.0.1.EL/drivers/scsi/aacraid/linit.c)

Prior to this upgrade, we were running successfully for 6 months on two servers using Linux loaded from the RedHat 9.0 ISO distribution plus its corresponding updates. Here is the environment:

Server: Dell PowerEdge 2650 with PERC/Di
System BIOS: A10
Backplane firmware: 1.01
PERC3/Di firmware: 2.7-1 (build 3170)
ERA firmware: 2.20
Hyperthreading: Disabled
Linux OS version: 2.4.20-18.9smp
Aacraid driver version: 0.9.9ac6-TEST (based on the AAC_DRIVER_VERSION number in /usr/src/linux-2.4.20-18.9/drivers/scsi/aacraid/linit.c)

Please note that in the older configuration, disabling hyperthreading (upon Dell's recommendation) circumvented the original problem we posted in this trouble ticket on 06/02/2003. In our case, the problem would normally show up within 24 hours, regardless of load on the server. With the current configuration, we are once again experimenting with enabling hyperthreading. Hopefully we will continue to run trouble-free. I will keep this list posted on my results. I agree that a formal resolution for the various aacraid problems needs to materialize, especially since RedHat and Dell both continue to advertise that RedHat Linux releases are 100% compatible with the PowerEdge 2650 and PERC3/DI controllers.

I'm ready to report that aacraid 1.1-4[2323] does seem to have fixed the problem for me. My server was hanging approximately once per week previously; now it's been running for just over two weeks on the new RAID driver version. Since then, normal disk activity has increased (heavier utilization), and I just banged it through some disk-thrashing tests (copying a 3GB file over and over again) without a hiccup. I know it's only been 2X the average previous hangspan, and I won't feel totally safe until a month or so passes. But there are many on this list eager to hear some positive news on this bug, so here's some (tentatively).

Unfortunately, after installing the new aacraid driver (aacraid 1.1-4[2323]) we have not had the same success. The server was OK for under a week and then exhibited a very similar problem to what we previously encountered. The main difference with the new driver was that we could log in and shut the box down cleanly (previously the only option was to power cycle). BTW - the box is a PE 2650 running the 2.4.20-28.8smp kernel.

Created attachment 98049 [details]
Redhat Kernel debugging pdf
Redhat Kernel debugging capture process pdf
Attaching a .pdf file that explains how to capture memory and CPU information after mysterious freezes/hangs. We just had another crash about 30 mins ago and these did not work for us... the system hangs too hard; kernel logs were not even produced after this crash. We are still running with the Red Hat/Adaptec aacraid driver (1.1.2 Feb 9 2004 22:40:31). Look at section #2 in the pdf. Maybe these will work with the 1.1-4 raid driver.

Scott Walker: /usr/src/linux-XX/Documentation/nmi_watchdog.txt is also interesting.
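For those who want to try the NMI watchdog mentioned above, enabling it on these kernels is roughly a one-line boot parameter change; a minimal sketch (the grub.conf entry shown is illustrative, and nmi_watchdog.txt describes which mode suits your hardware):

    # /boot/grub/grub.conf - append nmi_watchdog=1 to the kernel line, then reboot
    kernel /vmlinuz-2.4.21-9.0.1.ELsmp ro root=/dev/sda2 nmi_watchdog=1

    # after rebooting, the NMI count in /proc/interrupts should be increasing
    grep NMI /proc/interrupts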
James Oliver's (james) case is perplexing. We have had a high degree of reported success with the driver workaround from many others. Please note that there are several other real-world sources of possible SCSI bus lockups. The driver fix is not a panacea for the problems of the world ;-/ These include hardware problems and drive firmware issues, most of which cannot be worked around in firmware or driver. Fortunately, due to all the testing that has gone into the controllers and hardware, they are rare. This driver patch is used to mask an issue that surfaced in 3170 FW, where an Adapter Cache Flush caused commands to time out. The best combination *I* can recommend today is the 3/Di 3170 firmware and the mentioned driver. Any other issue will need the involvement of Dell Technical Support to track down the root cause. This does not mean I have abandoned other lockup issues, just that realistically other uncovered issues may require a finer level of detailed analysis. Sincerely -- Mark Salyzyn

This might be helpful as well: http://www.redhat.com/support/wpapers/redhat/netdump/ netdump is similar to Sun Solaris core dumps. We have a RHEL support case open and are waiting for the next crash to catch a core for analysis. I will post any information that comes forth.

For what it's worth, we couldn't reproduce this in a non-production environment using PostGreSQL. However, I was not aware that the updated driver code worked best with a specific version of firmware, so I'll update the firmware on our systems to build 3170. Hopefully that will solve it for us. If we have any further crashes after updating the firmware (we're already using the new driver), I'll report back here and try some of the suggestions for getting useful debug info out of the system.

It's been a week that we've been successfully running under the environment described in comment #55 posted on 2004-02-23. Under previous releases of the OS and driver, we normally would have encountered the original problem for this post within 24 hours and no longer than 48 hours. We are hopeful that the problem has been resolved, and for now we plan to continue using the standard RedHat Enterprise Linux ES v.3 distribution and updates as described in comment #55. Mark, RedHat, and/or Dell support folks... are there any significant enhancements to the aacraid driver that should cause us to consider manually upgrading from version 1.1-2 to 1.1-4 rather than waiting for RedHat to distribute the latest version as part of their update process?

Andrew Kinney, I may have been misunderstood. 3170 is the latest firmware; it contains improvements over 3157 or it would never have been released. The driver workaround was not meant to function best with any specific firmware, but was a workaround for a new issue that surfaced with 3170. The workaround permits 3170 to function, and allows one to take advantage of the other improvements in the 3170 firmware. Some are perplexed by this issue. From my view there are several problems, of which 3170's reticence during an Adapter Cache Flush accounts for a *substantial* majority of the experiences. Until the remaining issues are duplicated at Dell for firmware engineers to view, I might not be part of the loop ...

Well, I have to take my optimistic news back. After one week, I had to reload the server via a power cycle, as the entire server was completely locked up. I had no console access, and I was not even able to access the ERA module to perform a remote server restart. After the reload, the various system logs do not contain any information regarding the problem. I turned off hyperthreading in hopes that it will circumvent the problem as in the past.

We have a Dell PE 4600 and started experiencing this bug after upgrading from RH 8.0 to RHEL 3.0 (kernel 2.4.21-9ELsmp), at the time with controller firmware build #3157. The problem was invariably triggered during daily backup jobs, which impose heavy I/O loads on the Perc3/Di controller. Following suggestions on this forum and from Dell Technical Support, we have updated the BIOS (A10), ESM (A31) and Perc3/Di (2-7-1, build #3170) firmwares, as well as the aacraid driver (1.1-4[2323]). We have been running with hyperthreading disabled for a week now. Since then, no change has been seen in the server behavior, with daily crashes at backup time. The backup script is identical to the one that worked smoothly with RH 8.0. I can reproduce the problem interactively from the console. I enabled the nmi_watchdog as suggested above, but the lockup is always severe enough that no messages are produced. I am happy that the new driver seems to attenuate the PE 2650 problems, and is therefore a step in the right direction, but unfortunately it had no effect on our case. Will keep looking... Many thanks for all the helpful posts I have found here.

The ERA access issue in comment #67 was the result of an MTU sizing problem with the VPN I was using to attempt to restart the server remotely via the ERA. The rest of the problems are similar to what others are reporting. I will try running without hyperthreading enabled for a bit, and then I will also try driver version 1.1-4.

Following up on my comment from Feb 24th: my server just experienced a hard lock. Running RHEL 3.0 with aacraid 1.1-4[2323] on a Dell 2650, RAID 1. Unfortunately I can't see the console as it is managed by Rackspace, but their tech indicates that he saw this message on the console:

ext3_fs error (device sd(8,2)): ext3_get_inode_loc: unable to read inode block

Obviously it would be nice to know what other messages were on screen - such as the "SCSI hang ?" message - but unfortunately it's gone now. For a couple of minutes the server was accepting connections (e.g. SSH), but would not complete the login sequence (no bash prompt). Then, two or three minutes into this, the server hard-locked and would not accept any connections at all. I suspect this is the same issue again, though I can't be certain (what else would cause ext3-fs to fail like this?). Total uptime: 3 1/2 weeks.

Bryan Field-Elliot, which firmware are you using? Is it build 3170 or later? The reason I ask is that we're just getting ready to put our two misbehaving PE2500's back into the usage scenario that was precipitating our lockups, and we're using build 3170 of the firmware and aacraid 1.1-4[2323] in the hopes that the specific combination will be stable.
If I can avoid a crash on these production machines, I'd like to do so until a better fix is available - hence the question of whether your crash was on the build 3170 firmware.

One workaround we're going to experiment with is increasing the stripe size from 32KB (the default) to the largest possible (128KB, I believe). Since we're using RAID 5, and that requires 4 I/O operations per write, we're hoping that minimizing the number of times the host controller has to issue writes to the array will also minimize the amount of time the controller takes to come back from a cache flush or heavy I/O peak. We also turned on write caching on the drives so that the controller spends less time waiting on the drives. That should also help the controller become available again faster. Of course, what would make more sense to me is a firmware/driver combo that doesn't allow the cache to get crammed to the point that the OS has to wait on the controller. Let some of those disk-hungry processes spend time blocking instead. The OS is better able to handle the disk I/O backlog than the controller, and it sure beats a crash with file system corruption. For what it's worth, I say this without any actual knowledge of how to rewrite the firmware, card driver, or SCSI subsystem driver. ;-) I was hoping someone with that knowledge would pick up on this.

A question for all who have had the problem: Are you all using ext3? Anyone using anything other than ext3? The reason I ask is that it seems possible that the existing problems are only exposed when the added overhead of journaling is brought into the picture. Normally, journaling replaces typical filesystem table updating, but in the case of ext3 it is really just added activity over and above what already happens with ext2 (unless I misunderstood how ext3 works). This is still just a theory until I get some feedback on what filesystems are in use by those who see this problem. As for us, we didn't start having trouble until we upgraded from RedHat 7.1 to RedHat 9.0. We were using an older kernel (2.4.1), reiserfs (a true journaling filesystem, not a hybrid like ext3), and ext2. Now we're using a newer kernel (2.4.20) and ext3. It's entirely possible that the filesystem and not the kernel is to blame. The likelihood of that being the case increases if nobody can cite an instance where this happened on a non-ext3 filesystem. Anecdotally, it would appear that most instances (if not all) of this problem are when ext3 is in use. So, can anyone cite an instance of this problem where ext3 was not in use?

We have 3 Dell PE 2650s running RedHat Linux 8.0 (kernels 2.4.20-20.8smp and 2.4.20-28.8smp) with the SCSI hang problem since Nov 2003. We did not experience any problem when running RH 7.2/3 using ext3 file systems, hyperthreading and smp kernels for a whole year. Since we upgraded to RH 8.0 (ext3), the SCSI hang occurs on average once per week. We have other Dell servers (2400s, 2500s) running RH 8.0 without any problem. We also have a Dell 2650 server running RH 7.3 (kernel 2.4.20-19.7smp) with full ext3 filesystems and hyperthreading, without problems. We have already applied Mark Salyzyn's aacraid 1.1.4 and we are also running Dell firmware build 3170. We have tried various kernels from 2.4.18-14 to 2.4.20-28.8 without any success. We don't seem to have much choice now except to try the megaraid driver on the LSI Logic PERC/3DC PCI RAID card.
The problem is we need to migrate the RAID configuration from the 3/Di to the 3/DC; the 3/DC does not support drive/raid roaming, so we will probably have to install from scratch. Will let others know if we are successful; if anyone has any other solution, please let us know. Steven Ng (stevenng)

We just experienced this error for the first time. We have been running with this configuration for over 3 months with no errors. We are running a JFS filesystem. We are also running Postgresql.

Server: Dell PowerEdge 2650 with PERC/Di
Hyperthreading: Disabled
Linux OS version: Linux version 2.4.21-9.0.1.ELsmp
Red Hat/Adaptec aacraid driver (1.1.2 Feb 9 2004 22:40:31)
AAC0: kernel 2.7.4 build 3170
AAC0: monitor 2.7.4 build 3170
AAC0: bios 2.7.0 build 3170
AAC0: serial 474c61d3fafaf001

Found something that might be of use, but you'll need to check whether your kernel already has these patches applied. Apparently, these bugs have been around for a very long time and require patches to the 2.4.20 kernel to fix them. The patches didn't make it into the mainstream kernel until 2.4.21, and a couple more ext3 bugs were still being fixed in 2.4.22. Basically, the ones of concern relate to how the kernel interacts at the OS buffer level with a journalling filesystem. Found them from here: http://www.zip.com.au/~akpm/linux/ext3/ They're located here: http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.20/ This one is of special interest: http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.20/ext3-scheduling-storm.patch The sync_fs series of patches could also be of interest for this problem. I noted that several of those patches were not specific to ext3 and thus might have an impact on those using other journalling filesystems. I'll be checking with our vendor to see if our 2.4.20smp kernels have these patches installed. I could see how they might get committed to the uniprocessor code while the smp/hyperthreading-specific code was overlooked, or additional work was required to get them working on smp/hyperthreading, so it didn't happen until later.

So far, from all the cases I've seen, this issue is isolated to systems with the PERC3/DI card, journalling filesystems, an SMP kernel by way of hyperthreading or real SMP, and disk activity with many small files or database use that has frequent small I/Os. I'm not sure if this is relevant or not, but the majority of cases also appear to be with kernels later than 2.4.20-pre5 but earlier than 2.4.22, which roughly corresponds to the period of time between when the bug was created and when it was discovered and fixed.

PE2550 with PERC3/DI
reiserfs: format "3.6" with standard journal
Kernel.org 2.4.24 SMP (Kills Uptimes Dead)
PostgreSQL 7.x
Kernel.org 2.4.9 did not exhibit this problem and ran for 462 days.
Other info: <http://lists.us.dell.com/pipermail/linux-poweredge/2004-February/018302.html>

dmesg output using 2.4.9:

percraid device detected
Device mapped to virtual address 0xf8800000
percraid:0 device initialization successful
percraid:0 AacHba_ClassDriverInit complete
scsi0 : percraid
  Vendor: DELL   Model: PERCRAID Mirror   Rev: 0001
  Type:   Direct-Access                   ANSI SCSI revision: 02
  Vendor: DELL   Model: PERCRAID Mirror   Rev: 0001
  Type:   Direct-Access                   ANSI SCSI revision: 02

dmesg output using 2.4.24:

Red Hat/Adaptec aacraid driver (1.1-3 Jan 7 2004 23:58:20)
AAC0: kernel 2.5.4 build 2991
AAC0: monitor 2.5.4 build 2991
AAC0: bios 2.5.0 build 2991
AAC0: serial 556c01d2fafaf001
scsi0 : percraid
  Vendor: DELL   Model: PERCRAID Mirror   Rev: V1.0
  Type:   Direct-Access                   ANSI SCSI revision: 02
  Vendor: DELL   Model: PERCRAID Mirror   Rev: V1.0
  Type:   Direct-Access                   ANSI SCSI revision: 02

I have encountered this bug when running the NFS benchmark SpecSFS 3.0 (see www.specbench.org), using a PowerEdge 2650 running RH 9.0 kernel 2.4.20-8smp as an NFS server. Here is the configuration that leads to reproducing the problem reliably:

1a. Configure two U320 SCSI disks as a RAID0 stripe container
1b. Enable the RAID controller write cache memory
2. The PE runs an NFS server and exports a mount point to the container created in #1
3. Run the standard config file for SpecSFS on a RH 9.0 client that targets the machine in #2

Result: fails with ext3 errors like those observed in this report 100% of the time.
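For reference, the server side of a setup like step 2 above is roughly the following; a sketch only, with the export path and client subnet being illustrative:

    # /etc/exports on the PowerEdge - export the mount point on the RAID0 container
    /mnt/raid0  192.168.1.0/255.255.255.0(rw,sync)

    # then activate the export and make sure nfsd is running
    exportfs -ra
    service nfs start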
I've encountered and reproduced this problem on our PE2650 systems running Red Hat AS 3.0 with the 2.4.21-9.0.1ELsmp kernel. Our 3/Di controllers have firmware version 2.8-0. When the error occurs, all disk activity on all RAID containers stops. Nothing is written to the log, but an error message like "I/O error: dev 08:02, sector 83200" can be seen on the console. Copying or rsyncing a large number of files from an NFS share, on gigabit ethernet, to a local RAID 1 container causes the PE2650 to crash within minutes. Using rsync through ssh does not trigger the problem, even with 300,000 files summing up to 25GB.

For what it is worth, this problem still occurs even with kernel 2.4.25 (not RedHat's) and aacraid 1.1.5. PERC3/Di firmware 3170, ext3 filesystem. It crashed less than 2 days after rebooting, which (for this statistically small sample) is less uptime than with the RedHat kernel 2.4.18-14smp. Logs:

Apr 10 04:03:51 promo kernel: aacraid: Host adapter reset request. SCSI hang ?
Apr 10 04:04:51 promo kernel: aacraid: SCSI bus appears hung
:
:
Apr 10 04:15:31 promo kernel: aacraid: Host adapter reset request. SCSI hang ?
Apr 10 04:15:31 promo kernel: aacraid: SCSI bus appears hung
Apr 10 04:15:31 promo kernel: scsi: device set offline - command error recover failed: host 0 channel 0 id 0 lun 0
Apr 10 04:15:31 promo kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 6000000
Apr 10 04:15:31 promo kernel: I/O error: dev 08:03, sector 11014520
Apr 10 04:15:31 promo kernel: I/O error: dev 08:03, sector 12058624
Apr 10 04:15:31 promo kernel: I/O error: dev 08:03, sector 12058856
:
Apr 10 04:15:31 promo kernel: I/O error: dev 08:03, sector 39846008
Apr 10 04:15:31 promo kernel: I/O error: dev 08:03, sector 262368
:

Your firmware has hung solid (the driver can not do anything about it ...). Please confirm that the 1.1.5 driver is loaded (cat /proc/scsi/aacraid/?) to be sure it is loaded (but the default 2.4.25 driver does not report "SCSI bus appears hung", so I am just covering the bases). I had a desire to send you an instrumented driver reporting the Adapter Health in more detail, but since your driver did not report any `health check' problems (the "Host adapter appears dead" message did not show), I fear that this driver would just report the same. You may be able to get more details about why the Adapter or SCSI bus is hung on your system by commenting out the AAC_DETAILED_STATUS_INFO line in aacraid.h line 43. There is probably a set of reports just prior to the failure that may indicate which target is the problem. I strongly urge you to report your problem to Dell technical support. They have an Adaptec firmware engineer embedded on staff to try to trace this problem down. In addition, they may be able to trace your problem down to other possible sources (one cause is a bad power supply). The appearance of this problem getting worse on additional samples may be a set of deteriorating hardware. Sincerely -- Mark Salyzyn
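As a quick way of doing the /proc check Mark suggests, assuming the controller is SCSI host 0 (the host number may differ on your system):

    cat /proc/scsi/aacraid/0    # driver banner and firmware build of the loaded module
    dmesg | grep -i aacraid     # the same banner from the kernel ring buffer at load time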
Hi, is this thread still alive? Because I have the problem and still haven't found any solution. Today, doing a search (again) about this problem, I found this big thread. I have the same problem on a PE 2650 with PERC 3/Di with Red Hat AS 2.1 running Oracle and ext3. BUT, here is another thing: I also have the same errors on an IBM xSeries 255 with a ServeRAID-6M (also with Oracle and ext3). The module used is aic7xxx (which is for Adaptec too, if I'm not mistaken). A little thing that I have noticed: our other Dell servers (not PE2650), which are on ext2, are not crashing. We also have 2 other IBM xSeries 255. One is not crashing (ServeRAID-4). And the other one, with ext2 for / and /boot and ext3 for the rest of the partitions, got the error messages in the log files, but NO crash! In all those servers I had to add a 1Gb NIC and change the tg3 module for the bcm5700. I don't remember the versions by heart, but I will check that and the thread again in depth tomorrow. Here is a portion of the log from the IBM:

Apr 21 09:19:49 ibmfin kernel: (ips0) Reset Request - Flushed Cache
Apr 21 09:21:05 ibmfin kernel: scsi: device set offline - not ready or command retry failed after host reset: host 0 channel 0 id 2 lun 0
Apr 21 09:21:07 ibmfin kernel: I/O error: dev 08:21, sector 37576
Apr 21 09:21:07 ibmfin kernel: I/O error: dev 08:21, sector 37600
Apr 21 09:21:07 ibmfin kernel: SCSI disk error : host 0 channel 0 id 2 lun 0 return code = 30000
Apr 21 09:21:07 ibmfin kernel: I/O error: dev 08:22, sector 22922064
Apr 21 09:21:07 ibmfin kernel: I/O error: dev 08:22, sector 22922072
Apr 21 09:21:07 ibmfin kernel: I/O error: dev 08:22, sector 22922064
...
Apr 21 09:21:07 ibmfin kernel: EXT3-fs error (device sd(8,33)): ext3_get_inode_loc: unable to read inode block - inode=4816905, block=9633799
Apr 21 09:21:07 ibmfin kernel: I/O error: dev 08:21, sector 0
Apr 21 09:21:07 ibmfin kernel: EXT3-fs error (device sd(8,33)) in ext3_reserve_inode_write: IO failure
Apr 21 09:21:07 ibmfin kernel: I/O error: dev 08:21, sector 0
Apr 21 09:21:07 ibmfin kernel: I/O error: dev 08:21, sector 77070392
Apr 21 09:21:07 ibmfin kernel: EXT3-fs error (device sd(8,33)): ext3_get_inode_loc: unable to read inode block - inode=4816905, block=9633799

For now, this is the biggest discussion that I have found about this problem. Thanks, Jean-Philippe Houde

For the record, we installed a PERC3/DC card in the systems. These use the megaraid driver as opposed to the aacraid driver. This has solved (worked around) our problem (though we had to shell out cash for the other cards). <rant> I feel compelled to say that the lack of support from Dell on this (and RedHat, for those who have been experiencing it under RHEL) is disappointing. Especially as these servers are reportedly certified to run with this hardware, and Dell and RedHat purport to have a partnership arrangement. </rant> Aside from that, Mark Salyzyn from Adaptec was very helpful in trying to assist with the PERC3/Di driver problem, even though it was ultimately unsuccessful - thanks Mark.

James Oliver, agreed regarding the rant. The irony is that while I probably will buy more Dells, I'll probably avoid Adaptec chipsets. As for Red Hat, this is just reinforcing my belief that it isn't worth it to pay a company to support Linux, because they rarely do better than the open source community at large. For what it's worth, we haven't had any more problems. We took a shotgun approach to solving this because I didn't have time to fiddly-dink with trying out individual solutions that *might* work. Here's what we did:

1. Moved all active PostGreSQL users with low buffer and cache settings to different hardware.
2. Turned on write-caching on the drives themselves in addition to the writeback controller cache.
3. Updated the RAID firmware.
4. Updated the RAID driver.
5. Updated the kernel.
6. Crossed fingers.
7. Crossed toes.
8. Tried to convince our important customers that we were doing everything possible to ensure their important systems stayed up and running.
9. Revisited this bugzilla thread frequently in the hopes that someone (a kernel developer in particular) might actually fix this.
10. Mourned the fact that a $7500 server is now as useful as a five-year-old P*ckard-Bell desktop. (OK, that's a big exaggeration, but it's painful to realize that a $600 generic server is more reliable.)

For what it's worth, I also appreciate the effort that Adaptec put into this, though I still firmly believe the problem lies in the kernel buffering code and not the driver or firmware code.

Andrew, how can I turn on write-caching on the drives, as you described in your previous message at item 2?

It's a setting in the RAID BIOS screens. I don't remember which one, since it's been several months since I looked at those screens. It's likely in the documentation, though.

How have you guys been seeing the error messages? It looks like in the source it's kern.err. Are you writing kern.err out to some text file? Or do you have a VT100/320 terminal plugged into the serial port? My systems (Dell 2650, RHEL 3.0WS) keep locking up, but I'm seeing exactly zero error messages. Perhaps because the RAID write buffer is not flushing when the SCSI device hangs? Thanks.

Todd, in our situation we are using 'netdump' to send the syslog messages and a dump of physical memory to an external netdump server. At times the problem causes a slow loss of services rather than an instant crash. If we notice the problem in time, we are able to SSH into the server and capture the logs. Once the server locks up and is rebooted, the log information is lost, since the original problem noted in this report prevents the error information from being written to disk. Have you tried disabling hyperthreading? It's not a fix for the problem, but in our situation it has allowed our servers to run for over 6 months under RedHat 9, and now over a month under RedHat ES 3.0.
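For anyone else who wants to capture logs this way, the client side of a netdump setup on RHEL 3 is roughly the following; a sketch only, where 192.168.1.10 stands in for a separate machine running the netdump-server package:

    # /etc/sysconfig/netdump on the crashing client
    NETDUMPADDR=192.168.1.10

    # exchange keys with the server, then enable and start the client service
    service netdump propagate
    chkconfig netdump on
    service netdump start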
Is it possible that the problem occurs not only on the Dell PE 2650 but also on the IBM xSeries 255, which has a ServeRAID-6M (with an Adaptec chip)? We have the same problem on both the Dell PE 2650 and the IBM xSeries 255. The ServeRAID uses the ips.o module. IBM told us that the fix was in these patches: https://listman.redhat.com/archives/ext3-users/2002-December/msg00123.html I doubt it is, but we will try them anyway.

Javier, I have tried disabling hyperthreading and unfortunately it didn't work for us.

Dell just released a new PERC3/DI firmware that supposedly addresses this problem, at least on the PowerEdge 2500 systems. For those with Dell systems, you'll want to log in to the Dell support site and go to the downloads area for your system to see if you can get your hands on the new firmware. Just be aware it requires the latest system BIOS be installed before updating the RAID controller firmware. It might be just the thing we've all been needing to kick this problem completely.

Is this new Perc3/DI firmware as of April 14-16? Or are you referring to another update? I don't see one more recent than that, but I'm on 2650s, not 2500s.

Add another voice to the fray. Dell had us upgrade two packages (aacraid-1.1.4-2302dkms.noarch.rpm and dkms-1.00-1.noarch.rpm), but neither of them seems to have loaded (of note is that we installed 1.1.4 of aacraid, but 1.1.2 still loads). After that, I upgraded the RAID BIOS, but, still, no difference. Have any RH EL admins found anything good from an upgrade to kernel-2.4.21-9.0.3.EL-i686 from kernel-2.4.21-9.EL-i686? I know little about kernels or whether anything in the changelogs describes a fix to any of these issues.

The new firmware for the 2650 is version 2.8-0 build 6089 and is provided on Dell's site with the aforementioned aacraid driver rpm packages. "Cache flush loop decreased to prevent extensive spin-lock hold during high I/O" is mentioned as a specific fix in this release. The driver packages did not work for me either (the RHEL kernel has been upgraded since their release) and I have written Dell support regarding that. However, you can use the dkms rpm to install your own kernel module using the aacraid source files on Adaptec's site: http://www.adaptec.com/worldwide/support/driverindex.jsp?sess=no (type "aacraid source" in the search box). If you are not faint of heart, feel free to email me and I will send you instructions on how I went about installing the kernel module. Anyway, I have never been able to achieve the lockup condition, so it would be nice if someone would stress test the new Dell BIOS and most recent aacraid driver:

Red Hat/Adaptec aacraid driver (1.1-5[2326])
AAC0: kernel 2.8-0 build 6089
AAC0: monitor 2.8-0 build 6089
AAC0: bios 2.8-0 build 6089
AAC0: serial 635441d3

Found in the latest Linux kernel changelog (http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.6):

<markh> [PATCH] aacraid reset handler update
This is an update from the Adaptec version of the driver to the aacraid reset handler. The current code has a logic error that is fixed by this version. This builds against 2.6.5-rc1.

Sean - I had the same issue going to kernel-2.4.21-9.0.3.EL-i686 from kernel-2.4.21-9.EL-i686. I have now been running 2.4.21-15.ELsmp for 30 hours and it is still running; I never made it past 15 hours before. As for using dkms and the latest aacraid driver from Dell, I am running both. There is a typo in the /var/dkms/aacraid/1.1.4-2302/source/dkms.conf file, however.
The line

PACKAGE_VERSION="1.1.4.2302"

should read

PACKAGE_VERSION="1.1.4-2302"

After changing that, you can do a manual build/install for your kernel by doing:

dkms build -m aacraid -v 1.1.4-2302
dkms install -m aacraid -v 1.1.4-2302

I have my fingers crossed that this driver and this new kernel resolve the issue.

Just following up. My system has been up and running since 5-13 without issue. End users and I have been using it pretty hard all weekend as well.

After installing the latest firmware and aacraid driver as noted in an earlier addendum, I find that the problem still exists, at least for the case of a RAID0 container configured with 4 disks. I have not observed the problem for RAID5, however. For me, this bug is sensitized when the PowerEdge 2650 is configured as an NFS server and a client sources a workload using the SpecSFS benchmark utility. Here is a summary of all testing using SpecSFS to date:

RAID level   # disks   Status   Test cnt
RAID5        3         PASS     1
RAID5        4         PASS     2
RAID0        4         FAIL     3

The messages in /var/log/messages are of the form:

aacraid: Host adapter reset request. SCSI hang?
aacraid: SCSI bus appears hung
scsi: device set offline - command error recovery failed
I/O error: dev 08:10, sector xxx

The system becomes unusable.

We have a server with an Adaptec 2120S controller running RedHat ES 3, configured as RAID 10. We have tried controller BIOS builds 6008, 6013, and 7224, and tested with kernels 2.4.21-4 and 2.4.21-9.0.3 (RH), vanilla kernel 2.4.26 with (and without) the Adaptec 1.1.5 driver, and the 2.6.6 kernel. With all kernels and all firmware builds:

aacraid: SCSI bus reset issued on channel 0
aacraid: Host adapter reset request. SCSI hang ?
aacraid: Drive 0:1:0 online on container 2:
aacraid: Drive 0:1:0 online on container 63:
aacraid: Drive 0:3:0 online on container 2:
aacraid: Drive 0:3:0 online on container 62:
aacraid: Drive 0:4:0 online on container 2:
aacraid: Drive 0:4:0 online on container 63:
aacraid: Drive 0:5:0 online on container 2:
aacraid: Drive 0:5:0 online on container 62:

The server works, but is very unstable. One needed a reboot after 2 days; another simply stopped. With vanilla kernel 2.4.26 the server ran for 30 minutes without errors, then hung dead. On another server (a 2200S, RAID 10) with the 7224 BIOS and the 2.6.6 kernel there is no problem. Is this an Adaptec problem or a Linux problem?

http://lists.us.dell.com/pipermail/linux-poweredge/2004-May/020104.html describes the workaround we are encouraging all customers with Adaptec ROMBs to use.

I made this change 3-4 weeks ago, using the above afacli rpm, on 8 Dell 2650's here, all running RedHat 8.0, Fedora Core 1 and 2, and AS 3.0. This problem has gone away since then.

What impact, if any, does this have on performance? Thanks!! Eric

Matt, any word on whether or not the latest aacraid driver solves this issue, or whether the workaround might still be needed even after a driver update?

This problem has happened for us since 2.6.6. Backing up to 2.6.5 makes the problem go away. This is on six identical 2650s with all the latest firmware updates, running FC1 and custom kernel RPMs built from stock 2.6 sources without any patches applied. What's very strange about this is that it does not seem to crash as long as Apache isn't running. My MX servers (running Postfix 2.0) have been up for 20+ days, whereas my POP servers (which also run SquirrelMail) and my Cpanel server crash at least once per day. As a test I shut off Apache on one of my POP servers and it's now been up for 5 days without incident. Cpanel runs Apache 1.3 and the POP servers run Apache 2.0.
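Pulling together the dkms.conf PACKAGE_VERSION fix quoted earlier in this run of comments, a scripted version might look like the following; the sed expression and paths assume the stock Dell 1.1.4-2302 package layout described above:

# fix the PACKAGE_VERSION typo, then rebuild and install the module
sed -i 's/PACKAGE_VERSION="1.1.4.2302"/PACKAGE_VERSION="1.1.4-2302"/' \
    /var/dkms/aacraid/1.1.4-2302/source/dkms.conf
dkms build -m aacraid -v 1.1.4-2302
dkms install -m aacraid -v 1.1.4-2302

Afterwards, checking the driver banner in dmesg (or `modinfo aacraid`) should confirm which driver version actually loads, since several posters above saw an older module keep loading after the upgrade.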
Doug, it's not a driver issue, it's a firmware issue. We've got a newer firmware available for testing, and hope to have it released on support.dell.com "soon".

I discovered the "kernel I/O error" three days ago on a brand new Dell PowerEdge 2650 system after 14 hours of kernel compiling tests under Debian Woody with testing kernel 2.4.26 and aacraid 1.1-3. As Matt Domsch recommended, I disabled the Perc's read/write cache, and the machine has now been stable for 48 hours compiling kernels and writing tar files in an endless loop right up to the edge of the disk. The impact on performance, measured with bonnie, seems to be small:

Dell PowerEdge 2650, 2 x 2.8 GHz Xeon, 2 GByte RAM, PERC 3Di, 2 x 73 GByte RAID 1
Debian Woody 3.0, kernel 2.4.26, Red Hat/Adaptec aacraid driver 1.1-3

Read/write cache enabled:

Version 1.02b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ns1              4G 22077  82 21839  13 13008   4 23723  69 55881  11 444.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2835  99 +++++ +++ +++++ +++  2860  99 +++++ +++  6415  98
ns1,4G,22077,82,21839,13,13008,4,23723,69,55881,11,444.3,1,16,2835,99,+++++,+++,+++++,+++,2860,99,+++++,+++,6415,98

Read/write cache disabled:

Version 1.02b       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ns1              4G 26513  99 33153  20 17939   6 23534  71 61521  11 422.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2815  98 +++++ +++ +++++ +++  2845  98 +++++ +++  6246  95
ns1,4G,26513,99,33153,20,17939,6,23534,71,61521,11,422.3,0,16,2815,98,+++++,+++,+++++,+++,2845,98,+++++,+++,6246,95

Overall, the results indicate that disabling the cache even slightly increases performance, at the cost of a higher CPU load. I can easily live with that, but some doubt remains whether it's wise to start a production system on insecure hardware. Like Matt, I encourage all victims of this bug to disable the controller cache, and *please* do publish your results! From my point of view two questions remain:

1. In case of further errors, would it be a *safe* solution to disable the Perc3 and upgrade to a Perc4 LSI controller? Any experiences with this hardware on PowerEdge servers under Linux?
2. When will the new firmware be available :-) ?

Matthias, one correction to your post. With caching disabled you are getting increased throughput, but it is *not* at the cost of increased CPU. There is a certain amount of system overhead for every page of data read or written. When you increase the number of pages per second, your CPU use goes up - not because it costs you any more CPU time per page, but because you are processing more pages. So, in meaningful terms, there is no CPU cost; it is just that since the pages are coming in faster, the CPUs have more data to work on and are sitting around waiting on data less. In a perfect disk setup, you would be able to get 100% CPU usage in these sorts of tests, which would mean that the disks are able to feed the CPUs fast enough that the CPUs are never waiting on the disks instead of doing work.
So, a step backward is when you get the same amount of throughput and use more CPU, or situations like that. That's one of the reasons I like the tiobench program for benchmarks, since it divides throughput by CPU usage to get a number for CPU usage per page processed. That's a *far* more meaningful CPU metric than just total CPU usage.

Doug, fair enough, thank you for the explanation. Here are the tiobench values for the same machine:

Dell PowerEdge 2650, 2 x 2.8 GHz Xeon, 2 GByte RAM, PERC 3Di, 2 x 73 GByte RAID 1
Debian Woody 3.0, kernel 2.4.26, Red Hat/Adaptec aacraid driver 1.1-3

Read/write cache enabled:

tiobench
No size specified, using 1792 MB
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec

         File   Block   Num  Seq Read     Rand Read    Seq Write    Rand Write
Dir      Size   Size    Thr  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)
-------  -----  ------  ---  -----------  -----------  -----------  -----------
.        1792   4096    1    35.72 8.39%  12.05 1.54%  21.33 14.4%  8.687 3.89%
.        1792   4096    2    19.68 4.89%  14.54 1.86%  20.24 21.2%  8.767 4.48%
.        1792   4096    4    16.71 4.39%  16.27 3.47%  20.36 31.5%  8.438 6.30%
.        1792   4096    8    15.43 4.44%  18.14 3.77%  20.36 37.0%  8.532 7.23%

Read/write cache disabled:

tiobench
No size specified, using 1792 MB
Size is MB, BlkSz is Bytes, Read, Write, and Seeks are MB/sec

         File   Block   Num  Seq Read     Rand Read    Seq Write    Rand Write
Dir      Size   Size    Thr  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)  Rate (CPU%)
-------  -----  ------  ---  -----------  -----------  -----------  -----------
.        1792   4096    1    57.55 13.9%  8.152 2.08%  27.80 19.1%  1.712 0.65%
.        1792   4096    2    21.13 5.32%  9.246 1.77%  27.40 29.2%  1.701 0.92%
.        1792   4096    4    16.80 4.36%  9.980 1.91%  24.46 37.8%  1.705 1.34%
.        1792   4096    8    15.23 4.17%  10.53 2.19%  22.70 41.0%  1.709 1.50%

As expected, random reads and, even more impressively, random writes benefit from the cache, whereas the sequential values are slightly better with the cache disabled, which I find hard to explain. Maybe you can help me with the interpretation?

On our systems, which are heavily multi-threaded (VPS servers) and do a lot of random reads/writes, I'm not ready to take a 40% to 400% disk performance reduction. With our disks running near their I/O capacity during backups, such a performance hit would be very noticeable. Matt Domsch, once that new firmware is out of testing, please let us know here if you have a spare moment to do so. It would help me and I'm sure it would help others as well. Oh, and FWIW, since I've made the changes I detailed earlier in this thread (turned on all caching, including on-drive) plus installed the newer firmware (from April 2004), I've moved our "bug triggers" (as I like to call them) back to the hardware experiencing the trouble and, in the month-plus since then, we've not had any more trouble. My fingers are still crossed that it stays that way. :-) Lately, we've been pressing the disk systems quite hard on these two systems that had the trouble (backups on top of normal activity), so I would expect that if it was going to have trouble it would have done so by now.

As described in Comment #67, we have a Dell PE4600 that has fallen victim to this bug since December 2003. The reproducible trigger for the crash was our daily backup jobs, which had to be given up to prevent user madness and admin lynching. A few weeks ago, I finally got the time to do firmware upgrades on the machine and installed the updated BIOS (A11), ESM (A31) and Perc3/Di firmware (2-8-0, build #6089). I kept the aacraid driver 1.1-4 (build #2323) with kernel 2.4.21-9ELsmp.
After turning the daily backups back on, I got 3 days of uptime, as opposed to 24 hours. I then followed Matt's advice and disabled the read/write cache on all containers and am now up to 17 days without a crash. It looks like the problem is finally solved if one can live without the r/w cache. I probably can, but will be eagerly awaiting the next firmware upgrades. Thanks for all the work!

As seems to be typical of issues like this, I spoke too soon. Within hours of posting that the problem appeared resolved with the newer firmware, the bug was triggered again. Ironically, bugzilla was down at the time as well. It probably doesn't mean anything, but the number of context switches (we graph a lot of things on our servers) went from an average of 12,000 per second to almost 8 million per second just before the system became completely unresponsive. I suspect that may have just been a symptom of running processes losing access to the disks, though. At any rate, the bug is still alive and well in our machines. As much as I hate to do it, I may just break down and turn off caching. Since we're running RAID5, we're probably going to take a big performance hit and overall system load will increase (waiting on disks), but I suppose that's better than a 2-hour boot sequence after a crash.

if ( $rock > $hardplace ) {
    $trash = cutoff( $arm );
    print "$expletive. That hurt.\n";
} elsif ( $hardplace > $rock ) {
    $trash = cutoff( $leg );
    print "$expletive. That hurt.\n";
} else {
    $trash = cutoff( $leg ) + cutoff( $arm );
    print "$expletive. That hurt.\n";
}

Andrew

In addition, I believe the patch is corrupting the Model string
reported in /proc/scsi/scsi. It shows up as a blank string. I
checked it against the 2.4.22-686 prebuilt .o and noticed the same
effect.
I tracked the vendor/model string into the aac_get_container_name->aac_fib_send() call where it gets corrupted.
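For anyone who wants to check for the blanked-out Model string described above, the standard 2.4 SCSI proc interface shows it directly; the sample output below is illustrative only, not taken from an affected system:

cat /proc/scsi/scsi
# a healthy container entry looks something like:
#   Vendor: DELL     Model: RAID 1 Volume    Rev: V1.0
# on an affected driver build, the Model field comes back blank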
For what it's worth, the latest firmware for the PERC3/DI (2.8.0 build 6092) doesn't fix the problem for us. Within hours of installing the new firmware and booting up, our crasholicious PE2500 took yet another nosedive with the exact same symptoms as before (device I/O error and a whole slew of ext3 errors after the controller was marked dead). All software, firmware, and BIOS are the latest available versions. It is possible that we simply have a different problem, so I've taken the advice from Dell and disabled the controller cache. As expected, the system load quadrupled and performance is sluggish. To prove that we are having the problem the firmware was supposed to fix, we'll have to witness the absence of crashes after turning off the controller cache. That can be problematic when the crash may take an hour or a month to manifest itself. If we do see another crash after turning off the cache, then we probably aren't dealing with the issue the new firmware was supposed to fix and I'll have to start harassing Dell about it.

Andrew, on my side I had this problem on a PE2650 and the new firmware fixed it. But I also had the same problem on an IBM xSeries 255 (which is obviously not using the PERC 3/DI controller). It seems that the problem there was a defective disk. All disks tested OK, but one day I came in and noticed the little amber LED on one disk. Since I changed that disk, the problem has never come back. Hope this can help you! Jean-Philippe Houde

Jean-Philippe Houde, I appreciate the friendly advice, but this is a systemic problem and not isolated to a single machine. We have two identical PE2500 servers that both exhibit the trouble and neither has any failed disks. I always check for a failed drive first after the server comes back up from a hard reboot after one of these crashes, since one of the symptoms is that the controller always marks drive 1 (the second drive of the five-disk array) as bad, but then immediately goes into rebuild mode once it is rebooted after a crash. Now, before anyone goes and tells me that it is an obviously bad drive, the clincher is that the same exact drive (drive 1, the second drive in the five-disk RAID5 array) gets marked bad on both servers when the problem occurs, the SMART logs on the drive show no grown defects, and changing the drive does not help. It's almost as if the controller just automatically fails the second disk in the array under heavy cached write I/O. I can't explain it. At any rate, after turning off the write cache (I left the controller read cache on), both servers have seen some brutal activity (clueless web hosting customers with poorly programmed database-driven web sites) and the only result has been really poor disk performance, as you might expect. We have been running this way for a few days now and have not had any more trouble with crashes, so my suspicion at this point is that the problem we're having is indeed related to the controller write cache and it is not a defect in just a single machine. I'll be spending most of the night on the phone with Dell until they can replace the RAID controller with one that works in both of these machines or find an acceptable workaround that doesn't involve destroying disk performance. Andrew

FYI, RHEL 3 Update 3 was released September 2nd and includes the aacraid 1.1.5.2340 driver, or a variation of it. It's hard to say, because I haven't seen any U3 release notes around yet.

AFA0> controller details
...
Component Revisions
-------------------
CLI:                 2.8-0 (Build #6076)
API:                 2.8-0 (Build #6076)
Miniport Driver:     1.1-5 (Build #2340)
Controller Software: 2.8-0 (Build #6092)
Controller BIOS:     2.8-0 (Build #6092)
Controller Firmware: (Build #6092)

For what it's worth, the 6092 build has made the problem on my servers less frequent, but we are still seeing crashes across three different 2650s. What is interesting is that to this day I can *still* get a rock-solid system by building my kernels with the version of the aacraid driver from 2.6.5. With the driver from 2.6.6+, average uptime with pre-6092 firmware is about 2 days, and 7 days with the 6092 firmware. My 2.6.5 machines never crashed; they topped out at about 60 days of uptime, which is when we rebooted to try the new firmware build and newer kernel driver. Are we sure there isn't a driver problem here as well?

We've now gone almost two weeks without a crash after turning off our write cache (we left read cache on, though it doesn't do much in our case), so I'm inclined to believe that we are indeed still having trouble with this controller. Whether it be driver or firmware is anyone's guess at this point (developers seem convinced it's firmware), though I'm still leaning towards driver, kernel, or ext3 issues, since one does not see this issue with FreeBSD or RedHat 7.1 or earlier installed (those OS's don't use ext3; they use ext2 or UFS). It's only with RedHat 7.3 or later that these issues surface, and, as Joshua mentioned, some of the newer kernels don't appear to show the problem. We never had the problem on these exact same machines for about a year while we ran them with RedHat 7.1 with ext2 and reiserfs partitions under the same loads that are now causing crashes. Just for kicks, has anyone tried reformatting with ext2 instead of ext3 to see if the problem goes away? Wouldn't that just boggle everyone if this issue is specific to ext3 due to its added I/O overhead? :-)

Our kernel (virtuozzo from SWSoft): 2.4.20-020stab009.24.777-smp #1 SMP Tue Aug 17 13:42:53 MSD 2004 i686 i686 i386 GNU/Linux

Our controller info:

Component Revisions
-------------------
CLI:                 2.8-0 (Build #6076)
API:                 2.8-0 (Build #6076)
Miniport Driver:     1.1-4 (Build #9999)
Controller Software: 2.8-0 (Build #6092)
Controller BIOS:     2.8-0 (Build #6092)
Controller Firmware: (Build #6092)

The driver we're using is essentially aacraid 1.1-4 (build #2323) ported to the virtuozzo kernel (hence the build #9999). We're currently using ext3 for the filesystems. Andrew

For us, the problem was there in RHL7.2 as well. However, the problem happened a lot less often, and the problem resulted in a panic (completely fatal) rather than the I/O error (mostly fatal). We've back-leveled our driver to 0.9.9ac4-rel.

Christopher,
> For us, the problem was there in RHL7.2 as well.
What filesystem were you using? ext2 or ext3? I think ext2 was the default until RH 7.3, but I'm pretty sure ext3 was a prominent option in RH 7.2.
> We've back-leveled our driver to 0.9.9ac4-rel.
Has this worked for you? If so, what filesystem, kernel, and controller firmware are you using? Andrew

We were using ext3 in RHL7.2. The 0.9.9ac4-rel driver works as well as any version of RHL ever did. These days we are running the RHEL3 kernels with aacraid modifications, ext3, and firmware ~6089. The I/O error problem started with the initial release of RHL8.0. RHL7.3 was later infected via the errata kernels, but RHL7.3 didn't initially have the I/O problem. But again, the occasional aacraid panics were there in RHL7.2/7.3 with ext3.
We plan to apply the 6092 firmware during our next patching cycle.

Thanks for the reply, Christopher. Others can draw what conclusions they will, but I think that the combined experiences shown here and in other discussions elsewhere pretty squarely point the finger at ext3. While it may not be the underlying root cause of the problem, it appears to be what is causing the problem to manifest itself. At this point, all I care about is getting rid of the problem. If a man with heart disease can't cure his heart disease, he'll still avail himself of whatever he can to ensure his heart keeps beating. One thing I did find peculiar is that ext3 chose the "ordered data" mode of journaling operation as its default, while most other journaling filesystems chose "writeback" as their default journaling method. From 'man mount' regarding fs mount options for ext3:

data=journal / data=ordered / data=writeback
    Specifies the journalling mode for file data. Metadata is always journaled.

    journal
        All data is committed into the journal prior to being written into the main file system.

    ordered
        This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.

    writeback
        Data ordering is not preserved - data may be written into the main file system after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal file system integrity, however it can allow old data to appear in files after a crash and journal recovery.

I'm going to perform an experiment (unless someone has already tried this and it still failed) that involves changing to "writeback" journaling mode for non-root filesystems where all the data activity resides. My theory, for the moment, is that the mechanism in ext3's ordered data mode that 'forces' data to disk before the metadata is somehow jamming the write cache hard in certain circumstances, and the controller goes unresponsive. By changing to "writeback" journaling mode, the hope is that the OS will be able to use 'lazy' writes, much like you'd find in FreeBSD's UFS with softupdates, which doesn't exhibit this problem on this controller. Any comments? Am I nuts or on to something? Andrew

For what it's worth, I found that disabling hyperthreading worked around this issue for me. I believe that you're on the right track that anything that causes fast, synchronous disk writes will tickle this bug. My guess is that disabling hyperthreading just slowed down the filesystem code enough that it couldn't overload the controller. I would *guess* that running writeback journaling will decrease the incidence of the bug, but won't make it go away. The only solution I found was to get rid of our PERC3/DCs and replace them with PERC3/DIs. We haven't had a single problem since upgrading to the DI. If you argue a bit, Dell might give you a discount on trading in the DC for a DI. For the archives, I encountered this problem on at least three separate systems, with write caching both enabled and disabled, and with the ext3 filesystem. I *do not* think this is ext3's fault -- I just think that ext3's journaling may be more likely to actually *use* the controller. Hope this helps. - Jason Parsons

Quoting myself:
>I'm going to perform an experiment (unless someone has already tried
>this and it still failed) that involves changing to "writeback"
>journaling mode for non-root filesystems where all the data activity
>resides.
The experiment failed. Within two days of turning on write cache
after changing to "writeback" journaling mode, the crash occurred
again. So, it sounds like Jason Parsons was right about ext3 just
being an agitator and not the cause.
We've turned off write cache again. We're currently working to get
the 1.1-5 driver compiled into our kernel at Dell's suggestion, so
that if that doesn't fix it, we'll have more effective options
presented to us by Dell.
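For reference, the journaling-mode change tried in that experiment is just an ext3 mount option. A minimal sketch, assuming a non-root data filesystem on /dev/sdb1 mounted at /data (hypothetical device and mount point):

# /etc/fstab entry selecting writeback journaling for a data filesystem
/dev/sdb1   /data   ext3   defaults,data=writeback   1 2

# the data= mode generally cannot be switched on a live remount,
# so unmount and remount to apply it:
umount /data
mount -o data=writeback /dev/sdb1 /data

As the follow-up above shows, the change did not cure the hangs here; at most it alters how hard the workload leans on the controller's write cache.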
For what it's worth, our problem is likely not a firmware problem, though it is manifesting itself very similarly. This morning, one of the affected systems crashed with the same symptoms even with the RAID cache turned off. I'm on the phone with Dell at the moment to see what they can tell me. I'll probably duck off this thread until I can find a solution, to avoid cluttering otherwise useful information.

You should be aware that there is a bug in some versions of the firmware that causes the RAID cache to not be turned off even when it is configured to be off (and the UIs all say that it is off). The workaround was to set the cache to 'enable if protected', then pull the battery off the controller. Ugh. I don't recall any further details, unfortunately.

I am seeing this problem on a RedHat 9 box with an Adaptec 2200S RAID card and JBOD. I have tried both the smp and non-smp kernel with the same results. I have verified the media with the Adaptec software. As mentioned in comment #123, I have the cache set to "enable if protected" and there is no battery installed. I'm pretty sure I have the latest RAID card firmware and driver. I can recreate this crash easily, just by launching a full backup. My system does not hang, but the filesystem becomes unusable, in that a df shows the fs mounted, but an ls -l shows zero files, and I can't umount it or fdisk the device. I'd sure appreciate some help. This system is unusable in this condition. Here is my firmware, driver, and kernel info, plus logfile portions, etc.:

Adaptec Raid Controller: 1.1-4[2322]
Vendor: ADAPTEC  Model: Adaptec 2200S
kernel: 4.1-0[7244]
monitor: 4.1-0[7244]
bios: 4.1-0[7244]
CLI: 4.1-0 (Build #6151)
API: 4.1-0 (Build #6151)
Miniport Driver: 1.1-4 (Build #2322)
Controller Software: 4.1-0 (Build #7244)
Controller BIOS: 4.1-0 (Build #7244)
Controller Firmware: (Build #7244)
Controller Hardware: 2.64

Red Hat Linux release 9 (Shrike)
uname -a
Linux ladon 2.4.20-8smp #1 SMP Thu Mar 13 17:45:54 EST 2003 i686 i686 i386 GNU/Linux

Sep 25 07:13:22 ladon kernel: aacraid: Host adapter reset request. SCSI hang ?
Sep 25 07:14:22 ladon kernel: aacraid: SCSI bus appears hung
Sep 25 07:14:32 ladon kernel: scsi: device set offline - command error recover failed: host 1 channel 0 id 0 lun 0
Sep 25 07:14:32 ladon kernel: I/O error: dev 08:21, sector 217841696
Sep 25 07:14:32 ladon kernel: I/O error: dev 08:21, sector 218890256
Sep 25 07:14:32 ladon kernel: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 6000000
Sep 25 07:14:32 ladon kernel: I/O error: dev 08:21, sector 218995960
Sep 25 07:14:32 ladon kernel: I/O error: dev 08:21, sector 218995968
Sep 25 07:14:32 ladon kernel: I/O error: dev 08:21, sector 218995960
Sep 25 07:14:32 ladon kernel: I/O error: dev 08:21, sector 218890256
Sep 25 07:14:32 ladon kernel: EXT3-fs error (device sd(8,33)): ext3_get_inode_loc: unable to read inode block - inode=13680655, block=27361282
Sep 25 07:14:32 ladon kernel: I/O error: dev 08:21, sector 0
Sep 25 07:14:32 ladon kernel: EXT3-fs error (device sd(8,33)) in ext3_reserve_inode_write: IO failure
Sep 25 07:14:32 ladon kernel: I/O error: dev 08:21, sector 0
Sep 25 07:14:33 ladon kernel: I/O error: dev 08:21, sector 13072
Sep 25 07:14:33 ladon kernel: journal_bmap_Rsmp_e68c71a3: journal block not found at offset 2511 on sd(8,33)
Sep 25 07:14:33 ladon kernel: Aborting journal on device sd(8,33).
Sep 25 07:14:33 ladon kernel: I/O error: dev 08:21, sector 4776
Sep 25 07:14:34 ladon kernel: I/O error: dev 08:21, sector 218894352
Sep 25 07:14:34 ladon last message repeated 98 times

Our problem is likely triggered by bad hardware causing a single drive in our array to go unresponsive, but the handling of the unresponsive drive by the firmware and/or driver is ultimately what causes the controller to be unresponsive for long enough that the OS marks it dead. I noticed something about the controller log from our last crash. There are times listed with some of the log messages in the form of the number of seconds since controller boot. This gives us a way to calculate how long the controller took to finish its fiddly-dinking with the unresponsive drive. From that log:

[50]: ID(0:01:0) Cmd[0x28] Fail: Block Range 3424717 : 3424718 at
[51]: 509184 sec
.....
[78]: ID(0:01:0) Cmd[0x28] Fail: Block Range 0 : 0 at 509262 sec

From this, we can tell it took 78 seconds between those two log messages. It likely took longer than that for the entire operation, since there were other things done before and after those times were logged. I'm making an educated guess that, due to problems with the firmware and/or driver code, the controller doesn't respond to the OS during that entire period. Correct me if I'm wrong, but I believe the SCSI subsystem in Linux only waits a maximum of 60 seconds before it determines that a controller is unresponsive and issues a reset. I'm also guessing that the reason the controller doesn't respond to the reset is that it is already jammed with commands from the prolonged period of commands being queued while it worked to try to reset the unresponsive drive. Of course, it begs the question: why is the drive going unresponsive? We're working with Dell to get that answer. It also begs the question: why does the aacraid firmware and/or driver take so long to determine the drive is unresponsive? That's where firmware programmers and driver programmers come in. If Linux thinks a controller should respond in under 60 seconds, wouldn't it stand to reason that the firmware should make an effort to get done with uninterruptible work in less than 60 seconds, or make the work interruptible? I'm speaking in general terms because I'm not a programmer by trade (only by necessity), but the logic should be clear enough. Can anybody else confirm similar lengths of time from their controller logs? You'd get the controller log by going into afacli (my afacli is version 2.8-0 Build #6076), opening your controller, and running the following command (assuming this is your first controller boot after your crash):

diagnostic show history /old

Andrew

LSI has apparently identified a kernel issue (note I said *kernel*, not driver) in which the kernel is writing "phantom" commands to the controller. Granted, that is a different driver, but could a similar thing be happening with other drivers? If I'm way wrong, that's fine. Just an idea that hasn't been looked at on this thread before.
Reference: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=118432#c10

On October 10th, we loaded the following configuration on our Dell 2650:

Server: Dell PowerEdge 2650 with PERC3/DI
System BIOS: A19
Backplane firmware: 1.01
PERC3/DI firmware: 2.8.0 (build 6092)
ERA firmware: 3.14
Hyperthreading: Enabled
Linux OS Version: 2.4.21-20.ELsmp (RedHat Enterprise Linux 3.0)
Aacraid driver version: 1.1.5 (installed with RedHat)

So far, the system has been running error-free and performing much better than with prior firmware and software releases. Previously, the server would usually fail within 48 hours, and the longest availability time we experienced was two weeks. We'll provide another update next week.

We are now at the two-week mark (see post #127) and we are still running error-free. A correction to post #127: the longest availability time we previously experienced was one week, not two weeks as noted in the post.

Javier, had you turned off the write cache in the BIOS? Thanks, Neto.

As per our discussion, the write cache parameter is set to disabled. This matches the Dell documentation you referenced, which states that this is the default and also the recommendation.

We are now reaching the three-week mark and we are still running error-free. We are going to implement the maintenance on our second server. It appears that the configuration in post #127 has corrected the problem. Thank you to everyone who helped to resolve this problem. We are considering this item resolved based on the implementation of the items in comment #127.

Any news about this problem on the newer Dell 2850 servers?

IIRC, the PE2850 uses a different SCSI/RAID chipset, so any problems it has will be completely different from what was represented here.

For posterity's sake, our problem was resolved by some combination of the following, all done simultaneously:

1. Replaced the drive that the system marked as bad, though it had no visible operational problems and came back fine after a power cycle. It was replaced with a newer, different-brand U320 10K RPM drive, though we're only using it in U160 mode. The newer drive seems to be a bit snappier in processing SCSI commands, but it could just be my imagination.
2. Replaced the SCSI backplane.
3. Replaced the cables.
4. Replaced the mainboard, since the RAID controller was on the mainboard.
5. Changed from RAID/SCSI (channel 1, channel 2) mode to RAID/RAID mode, since that is the default and probably best supported, though we're not using channel 2.
6. Rebuilt the entire array using a 64KB stripe size instead of a 32KB stripe size, since bigger appears to be better when it comes to minimizing physical I/O requests during large transfers.
7. Used the latest BIOS and firmware on the mainboard and controller: A07 and 2.8-0 (Build #6092).
8. Used the newest driver I could get my hands on for my commercial 2.4 series kernel (1.1-4 Build 2323). From what I could tell from the changelog (which is a bit sketchy), there weren't many changes between that version and 1.1-5, so I couldn't see a need to go through the hassle of getting 1.1-5 onto my kernel.

At any rate, I'm not sure if my experience will help anyone else, but there it is anyway. Andrew
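Several comments above rely on the afacli utility to pull controller-side evidence (component revisions and the pre-crash history log). A sketch of such a session, using only the commands quoted in this thread; the prompt strings and the controller name (afa0) are assumptions and may differ by CLI build and system:

afacli
FASTCMD> open afa0                    # attach to the first aacraid controller
AFA0> controller details              # firmware/BIOS/driver build numbers
AFA0> diagnostic show history /old    # controller log from before the last reboot
AFA0> exit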