From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040803 Firefox/0.9.3

Description of problem:
The new kernel from Update 3 panics during boot. The attached panic is from a Dell PE2650 using the latest BIOS (A18) and RAID firmware (2.8-0). A similar PE2650 with the same RAID firmware, and another with the previous firmware (2.7-1), don't have this problem, but we did experience it on a PE2550 using 2.7-1 firmware. This is somewhat strange, as the problem doesn't appear on every machine. On some boxes we don't see it at all, but on the boxes that have it, it is always reproducible. The latest kernel from Update 2 (2.4.21-15.0.4.ELsmp) works fine with either firmware release.

Version-Release number of selected component (if applicable):
kernel-2.4.21-20.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Upgrade to latest firmware on a Dell PE2650 running RHEL3
2. Install latest kernel from Red Hat
3. Reboot the new kernel

Actual Results: The system panics.

Expected Results: Normal bootup.

Additional info:
Created attachment 103432 [details] Example panic from console
We see this too, with both Dell 2550 and 2650 servers that have the PERC3/Di RAID controller, with firmware versions from 2.6 to 2.8 (current).
I receive a very similar error. A PowerEdge 2650 was updated with up2date and rebooted, and the system worked fine. Another reboot of the system brought up the error message above, and the system no longer boots with the kernel-2.4.21-20.ELsmp kernel.
Similar problems seen on a Dell PowerEdge 2650 (BIOS A17) with an Adaptec PERC 3/Di running firmware 2.80 (Build 6089)
Similar issue as above. PowerEdge 2650 (Bios A18) Adaptec PERC 3/Di firmware 2.80 (Build 6092) kernel-2.4.21-20.ELsmp from up2date panics. kernel-2.4.21-4.ELsmp runs perfectly fine.
Ditto.
PowerEdge 2650 two years old
kernel-2.4.21-20.ELsmp up2date -> kernel panic
Error message from aacraid: EIP aac_info[aacraid 0x12](2.4.21-20ELSMP/i668)
PowerEdge 2650 NEW
PowerEdge 2650 two years old
kernel-2.4.21-20.ELsmp up2date -> kernel panic
Error message from aacraid: EIP aac_info[aacraid 0x12](2.4.21-20ELSMP/i668)
Single kernels 2.4.21-20 running; firmware version not known.
We are investigating. In the meantime, the previous version of the aacraid driver is preserved in U3 as aacraid_00909.o. I expect that the older driver will not have this problem. If you are able to get the system up, you can change aacraid to aacraid_00909 in modules.conf, re-make the initrd, then boot the U3 kernel. We will develop a more complete solution once we determine the cause of the problem.
Sorry, aacraid_00909 is the alternate driver in RHEL 2.1 (aacraid v0.9.9). The alternate driver available in RHEL 3 U3 is aacraid_10102 (v1.1.2).
Additional information: Seems to be SMP related. 2.4.21-20.EL-smp on Dell PowerEdge 2550 (BIOS A09) with PERC3/Di (firmware 2.7-0 Build 3546) panics, but 2.4.21-20.EL works. Confirmed - aacraid_10102 does work in 2.4.21-20.EL-smp on this machine. Also note that a Dell PowerEdge 2650 (BIOS A18) with PERC/Di (firmware V2.8-0 Build 6089) does NOT have this issue.
Apply the following patch to drivers/scsi/aacraid/linit.c (only affects some 2.4.*, issue not present in 2.6.* trees with this driver):
--- linit.c.orig Thu Sep 9 05:15:41 2004
+++ linit.c Thu Sep 9 05:18:18 2004
@@ -413,7 +413,9 @@
 const char *aac_info(struct Scsi_Host *shost)
 {
 	struct aac_dev *dev = (struct aac_dev *)shost->hostdata;
-	return aac_drivers[dev->cardtype].name;
+	if (dev)
+		return aac_drivers[dev->cardtype].name;
+	return AAC_DRIVERNAME;
 }
 
 /**
What is triggering this problem? Which of our PowerEdge 2650 are safe to upgrade, and which one should we leave running the old kernel? How can we check if it is safe to run the latest RH kernel?
This is a bug in the Linux aacraid driver, Peter. No aacraid based card is `safe'. I have discovered that it is a layered onion and needs further refinement to the patch I just submitted for testing here:
--- linit.c.badinfo Thu Sep 9 05:15:41 2004
+++ linit.c Thu Sep 9 07:28:07 2004
@@ -412,7 +412,17 @@
 const char *aac_info(struct Scsi_Host *shost)
 {
+#if ((LINUX_VERSION_CODE <= KERNEL_VERSION(2,5,0)) && defined (MODULE))
+	struct aac_dev *dev;
+	if (shost == aac_dummy)
+		return AAC_DRIVERNAME;
+	dev = (struct aac_dev *)shost->hostdata;
+	if (!dev
+	 || (dev->cardtype >= (sizeof(aac_drivers)/sizeof(aac_drivers[0]))))
+		return AAC_DRIVERNAME;
+#else
 	struct aac_dev *dev = (struct aac_dev *)shost->hostdata;
+#endif
 	return aac_drivers[dev->cardtype].name;
 }
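For readers who want to see the shape of that guard outside the kernel tree, here is a minimal user-space sketch in C. It is not the driver code: the `model_*` types, the three-entry table and the `model_dummy` host are hypothetical stand-ins, and only the order of the checks (dummy host, missing device data, out-of-range card type) follows the patch above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's structures; not the real aacraid types. */
struct model_entry { const char *name; };
struct model_dev   { unsigned int cardtype; };
struct model_host  { struct model_dev *hostdata; };

/* Fallback string, playing the role of AAC_DRIVERNAME in the patch. */
const char model_fallback_name[] = "aacraid";

/* Illustrative table; the real aac_drivers[] has different contents. */
const struct model_entry model_drivers[] = {
    { "card type 0" },
    { "card type 1" },
    { "card type 2" },
};

#define MODEL_COUNT (sizeof(model_drivers) / sizeof(model_drivers[0]))

/* Placeholder host with no device behind it, like aac_dummy in the patch. */
struct model_host model_dummy;

/*
 * Return the table name only when the host is a real one, has device
 * data attached, and carries a card type that indexes inside the table.
 * Every other case falls back to the generic driver name.
 */
const char *model_info(const struct model_host *host)
{
    const struct model_dev *dev;

    assert(host != NULL);            /* the caller always passes a host */
    if (host == &model_dummy)
        return model_fallback_name;
    dev = host->hostdata;
    if (dev == NULL || dev->cardtype >= MODEL_COUNT)
        return model_fallback_name;
    return model_drivers[dev->cardtype].name;
}
```

Calling `model_info()` on a host whose `hostdata` is NULL, or whose `cardtype` holds a garbage value, returns the fallback name instead of reading past the end of the table, which is the failure the patch is meant to prevent.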
Created attachment 103861 [details] Hand-copied kernel panic messages Partial oops messages, some rolled off the console.
Also get this on 2650, Phoenix BIOS Revision A15, Dell PowerEdge Expandable RAID controller 3/Di BIOS v2.7-1, Dell PowerEdge Expandable RAID controller BIOS v3.31. Occurs 50% of the time using the 2.4.20-20ELsmp kernel, not on the 2.4.20-20EL kernel or the 2.4.20-15ELsmp kernel. See attachment #103861 [details].
Count me as another Dell 2650 user. Thanks for the heads-up regarding aacraid_10102. I was able to boot the -20 non-SMP kernel, and use that to copy the aacraid_10102.o from /lib/modules into the initrd, and the system was able to boot the SMP kernel. Quick summary of workaround:
zcat /boot/initrd-2.4.21-20.ELsmp.img > /tmp/smp.initrd
mount -o loop /tmp/smp.initrd /mnt/loop
cp /lib/modules/2.4.21-20.ELsmp/kernel/drivers/addon/aacraid_10102/aacraid_10102.o /mnt/loop/lib/aacraid.o
umount /mnt/loop
gzip /tmp/smp.initrd
cp /tmp/smp.initrd.gz /boot/initrd-2.4.21-20.ELsmp-test.img
and then point grub at that initrd.
That will work, but an easier way to do the workaround is:
edit /etc/modules.conf and replace "aacraid" with "aacraid_10102"
/sbin/mkinitrd /boot/initrd-2.4.21-20.ELsmp.img 2.4.21-20.ELsmp
Just remember to fix /etc/modules.conf when Red Hat releases a fixed kernel.
We also experience this exact issue after going from 2.4.21-15.0.4 to 2.4.21-20. System won't boot; kernel panic relating to aacraid module as described in comment #1. PE2650 with 3-disk RAID5 on PERC3/di (older firmware, don't have version handy).
We are planning to include Mark's fix (comment 12) in RHEL 3 U4. We are also going to put the fix in AS 2.1 U6, even though the bug does not exhibit itself there.
Tom, that is not a good idea, unless you preserve both aacraid_10102.o and the buggy version in U4. Otherwise there might be no stable aacraid driver in the U4 kernel. Personally I would prefer a patched kernel for U3.
Whoa. Not acceptable. This is a bug that prevents booting of a machine, not a FE. Many people *will* boot this config and get burned. Save yourself and us the headache of 3-4 months of dealing with workarounds and having to build unsupported driver modules and just fix it. Please? FWIW, the Dell PERC3/Di, an aacraid based card, has some serious issues, so Dell is saying: "Please upgrade to the most recent driver to try to fix your random lockups and freezes!" which people do because RHEL3-UPD3 has a newer version of the driver, plus support, and lo and behold it breaks your machine in a total and complete way. I know that respinning the kernel is a PITA, but it needs to be done.
I will propose to include the fix in the next feasible RHEL 3 U3 erratum release. In U4 I am inclined to keep aacraid_10102.o and the patched 1.1.5 version. No need to keep the buggy version.
Getting this fixed sooner rather than later is the right answer. My group has taken a bit of a black eye after updating our server to Update 3 and running into this problem head on. I'm not pleased to have to integrate a workaround into production systems. I would very much like to see this bug fixed ASAP.
A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.13.EL).
Ernie, does this mean that a fix won't be around until U4 is released? I wouldn't think this an acceptable solution, the PE2650 being one of the most mainstream server boxes out there... BTW, has anyone else experienced that kudzu hangs indefinitely when booting boxes that have the 2.4.20-20.ELsmp with workaround applied?
Trond, there are no planned releases before U4. Feel free to contact customer support to lobby for a hot-fix for this.
Hmm.. I think Tom's suggestion in comment #22 is a good one, why not include the fix in the next security update? Besides, as an academic institution we don't have support, but we'll cope anyway :)
It is slated to be included in any security errata that comes out before Update 4, but there is none pending right now and with luck there won't be. If you can't use the older included driver for some reason, you can use the 1.1.4.2302.1 driver in DKMS format on linux.dell.com.
I have two identical Dell 2650's with Perc 3/Di (their worst ever) cards. I get a kernel panic with:
AAC0: kernel 2.8.4 build 6089 <--- BIOS build no.
AAC0: monitor 2.8.4 build 6089
AAC0: bios 2.8.0 build 6089
AAC0: serial f5d881d3fafaf001
&& kernel 2.4.21-20.ELsmp
but not with:
AAC0: kernel 2.8-0 build 6082 <--- BIOS build no.
AAC0: monitor 2.8-0 build 6082
AAC0: bios 2.8-0 build 6082
AAC0: serial f5d881d3
AAC0: 64 Bit DAC enabled
&& kernel 2.4.21-20.ELsmp
My temporary solution will be to use kernel 2.4.21-15.0.4.ELsmp on the 2650 with the newer RAID BIOS. Both systems have A18 system BIOS.
I have two PE2650s (identical firmware setup). One of them panics with 2.4.21-20smp but is okay with the 2.4.21-20 non-SMP kernel. It appears to me that the system will panic if it has an IMPERFECT RAID status; otherwise, it will boot into 2.4.21-20smp just fine. Here are the situations that I encountered (I have mirror 0 for the boot disk, disk0 and disk1):
1. Disk1 was flashing orange-green light, booting failed for 2.4.21-20smp.
2. Re-inserted disk1, no flashing orange-green light, booting 2.4.21-20smp succeeded.
3. Disk1 flashed orange-green light again after some time, rebooting 2.4.21-20 failed again.
4. Pulled out disk1, booting 2.4.21-20smp failed.
5. Put in the new disk1 from Dell (RAID container rebuilding), booting 2.4.21-20smp worked.
6. Shut down the system while the RAID was rebuilding (RAID is imperfect), rebooting to 2.4.21-20smp failed.
7. Booted into the working kernel and finished the RAID rebuild, rebooting into 2.4.21-20smp worked again.
When will we have the kernel patch? I hate having to apply these workarounds.
> When will we have the kernel patch?
As stated earlier, the patch in comment 12 will be in U4, and it is also proposed to be in the first U3 errata, if there is one.
> I hated to do those work-around.
There are some better workarounds in comment 7/8, and 28.
*** Bug 134936 has been marked as a duplicate of this bug. ***
verify ident. behavior on 2550 + perc3/Di
scsi3 : percraid
AAC0: kernel 2.8-0 build 6092
AAC0: monitor 2.8-0 build 6092
AAC0: bios 2.8-0 build 6092
AAC0: serial c9ec01d2
AAC0: ROMB RAID/SCSI mode enabled
AAC0: Non-DASD support enabled
Unable to handle kernel paging request at virtual address 38385e84
 printing eip:
f88730d2
*pde = 8c554972
Oops: 0000
aacraid megaraid aic7xxx diskdumplib sd_mod scsi_mod
CPU: 0
EIP: 0060:[<f88730d2>] Not tainted
EFLAGS: 00010282
EIP is at aac_info [aacraid] 0x12 (2.4.21-20.ELsmp/i686)
eax: e7f60e80 ebx: f7fa6e80 ecx: f88730c0 edx: f887eec0
esi: c4e46c80 edi: f7fa6e47 ebp: f887c55b esp: c3757e88
ds: 0068 es: 0068 ss: 0068
Process insmod (pid: 23, stackpage=c3757000)
Stack: f880f336 c4e46c80 f7fa7900 ffffffff 00000000 c03f2324 00000246 f7fa7800
       f887e670 f887ef40 f7fa6e80 00000001 f7fa6e80 f887c562 f8810c19 c4e46c80
       00000020 00000001 00000007 00000001 c4e46c80 00000000 00000001 00000001
Call Trace: [<f880f336>] scsi_setup_host [scsi_mod] 0xb6 (0xc3757e88)
[<f887e670>] aac_pci_tbl [aacraid] 0x70 (0xc3757ea8)
[<f887ef40>] aac_pci_driver [aacraid] 0x0 (0xc3757eac)
[<f887c562>] .rodata.str1.1 [aacraid] 0x2a (0xc3757ebc)
[<f8810c19>] scsi_register_Rsmp_4853a9b7 [scsi_mod] 0x299 (0xc3757ec0)
[<f8873eb4>] init_module [aacraid] 0xc4 (0xc3757eec)
[<f887eec0>] aac_driver_template [aacraid] 0x0 (0xc3757ef0)
[<f887ee60>] aac_cfg_fops [aacraid] 0x0 (0xc3757ef8)
[<c012ab26>] sys_init_module [kernel] 0x5b6 (0xc3757f0c)
[<f887d4a8>] .kmodtab [aacraid] 0x0 (0xc3757f20)
[<f8873060>] aac_detect [aacraid] 0x0 (0xc3757f2c)
[<f887d2f0>] __ksymtab [aacraid] 0x0 (0xc3757f30)
[<f8873060>] aac_detect [aacraid] 0x0 (0xc3757f58)
Code: 8b 04 c5 84 ea 87 f8 c3 8d b6 00 00 00 00 8b 44 24 04 8d 04
Kernel panic: Fatal exception
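One way to read the oops above (an interpretation, not something stated in the thread): assuming the Code: bytes begin at the faulting EIP, which is how 2.4 i386 oopses print them, `8b 04 c5 84 ea 87 f8` decodes as `mov eax, [eax*8 + 0xf887ea84]`, a scaled table lookup followed by `c3` (ret). The short C sketch below redoes that address arithmetic with the eax value from the register dump. The helper names are hypothetical; only the three constants come from the oops.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective address of an i386 "[index*scale + disp32]" memory operand.
 * The sum wraps at 32 bits, as it does on the CPU.
 */
uint32_t sib_effective_address(uint32_t index, uint32_t scale, uint32_t disp32)
{
    /* The SIB byte can only encode these four scale factors. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return (uint32_t)(index * scale + disp32);
}

/* Values copied from the oops text. */
#define OOPS_EAX   0xe7f60e80u  /* eax in the register dump */
#define OOPS_DISP  0xf887ea84u  /* Code bytes 84 ea 87 f8, little-endian */
#define OOPS_FAULT 0x38385e84u  /* address in the "Unable to handle" line */

/* Returns 1 when the decoded operand reproduces the reported fault address. */
int oops_address_matches(void)
{
    return sib_effective_address(OOPS_EAX, 8, OOPS_DISP) == OOPS_FAULT;
}
```

Working it through by hand: 0xe7f60e80 * 8 wraps to 0x3fb07400, and adding 0xf887ea84 wraps to 0x38385e84, the address in the paging-request message. That is consistent with aac_info indexing a table with a garbage value in eax (presumably the card type), which is the case the bounds check in the later linit.c patch guards against.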
verify crash still happens on 2550 + 3/Di with Build 6092 + aacraid_10102
You should not get this crash with aacraid_10102. Were you using aacraid_10102, or the default aacraid (1.1.5-xxxx)?
Sorry, I'm being unclear. #33 comes from a system using 1.1.5 (i.e. the RHEL 3 UPD 3 default driver). The crash mentioned in #34 is the same system using aacraid_10102. The crash is a hard deadlock of the machine (no panic or oops). 1.1.5 + 6092 is supposed to fix the deadlock. Maybe. Both were the same hardware, so why was it running 2 drivers today, you ask? It was finally upgraded from UPD2 to UPD3 this morning and is unhappy with all the drivers. I tried on a fluke to see if 1.1.5 + 6092 would boot (it does for another identical machine) and it did not, so I backed down to aacraid_10102 to get the machine to boot, but it's still crashing.
*** Bug 135729 has been marked as a duplicate of this bug. ***
Nathan, try this pre-release kernel: http://people.redhat.com/coughlan/RHEL3-perf-test/ Warning: this is a pre-beta U4 test kernel. It has not been through QA. It must not be used in production. It is only to be used for early testing and feedback. This kernel has the 1.1.5 driver with the patch in comment 12. If you still have the deadlock with the latest firmware, then please open a new BZ. You have a different problem. Tom
I had the same problem, and it was fixed by re-creating the initial RAM disk. Not sure why. Booted to the previous kernel version, then:
cd /boot
mv initrd-2.4.21-20.ELsmp.img initrd-2.4.21-20.ELsmp.img.old
mv initrd-2.4.21-20.EL.img initrd-2.4.21-20.EL.img.old
mkinitrd initrd-2.4.21-20.ELsmp.img 2.4.21-20.ELsmp
mkinitrd initrd-2.4.21-20.EL.img 2.4.21-20.EL
Two Dell 2450 with Perc 3/Si doing the same thing. Will apply the patch tomorrow and see.
Based on Stefan Hudson's Additional Comment #16, I put together this script. I've run it on my existing 2650's and added it to my kickstart for the servers I'm building. HTH...
#!/bin/sh
# The aacraid driver released with Red Hat Enterprise
# Linux 3, Update 3 has problems that can prevent a Dell
# PowerEdge 2650 server from booting. The workaround is to
# use the older aacraid_10102 driver. Two changes are
# needed to implement this. The /etc/modules.conf file
# should specify the aacraid_10102 module. An initrd file
# containing the other driver needs to be in place to
# make the correct driver available at bootup.

# Modify the modules.conf
timestamp=$( date "+%y%m%d%H%M%S" )
cp /etc/modules.conf /etc/modules.conf.${timestamp}
patch /etc/modules.conf <<EOPATCH
3c3,6
< alias scsi_hostadapter aacraid
---
> # For RHEL 3 EL U3, there is a bug with the aacraid driver.
> # The workaround is to use the aacraid_10102 driver. Be sure
> # to change this back with future RHEL version upgrades.
> alias scsi_hostadapter aacraid_10102
EOPATCH

# Rebuild the initrd file
mv /boot/initrd-2.4.21-20.ELsmp.img /boot/initrd-2.4.21-20.ELsmp.img.${timestamp}
mkinitrd /boot/initrd-2.4.21-20.ELsmp.img 2.4.21-20.ELsmp
I see that the fix hasn't been included in 2.4.20-20.0.1.ELsmp, as suggested in comment #28 and comment #31. Whether this is a slip-up or a thought-through decision is unknown, but anyway it's a disappointment that Red Hat doesn't take this problem more seriously.
Just a note that I recently installed RHEL 3 on a Fujitsu-Siemens Primergy RX600 server (2 CPUs, Adaptec AIC-7902 U320 hardware RAID) and downloaded 222 (!) RPM updates, including the 2.4.21-20.0.1.ELsmp kernel and am seeing the same reboot crash that people are here, so it's not just restricted to Dell Poweredges. The crash seems to be intermittent and the latest one I got was during an "insmod" of the aacraid driver according to the console output (i.e. pretty well identical to Nathan's crash output in comment #33). I must say that releasing a new kernel on 2nd December without this problem fixed is very poor when the fix has been on this thread for over 2 months. Priority "high" and Severity "high" apparently aren't good enough to get this crucial fix in the kernel :-(
Just want to add my name to the list of people increasingly disappointed that this has yet to be fixed. This really is a big deal for us; the bug affects 3 production servers.
And me also. I have 20+ servers affected by this bug. It was very bad form for RedHat to release a security update kernel and not include a fix for this bug.
I apologize that Update 4 wasn't released last Wednesday as originally scheduled. It is now scheduled for release next week, and it will contain the fix you've been waiting for.
Ah... so the reason that the fix wasn't included in the latest errata is that U4 is just around the corner? Then I withdraw my criticism and look forward to the arrival of U4. Good work guys :) (Since we're counting.. I have 80+ pe2650s. I think I'm in the lead ;)
Just to add a "me too" I am having the same problem with an IBM x306 that has an Adaptec raid controller and is using the aacraid module. I am now able to boot using the workaround in comment #16 above. I am also looking forward to the release of U4 next week.
Hi, it's 'next week' and we are still waiting....
Calm down, calm down. U4 for RHEL 2.1 *did* release this week (I got all the e-mails). One can only assume that either U4 for 3.0 is imminent, or some information came to light during the 2.1 release that is delaying 3.0. Sux, since I could use the kernel patch, too, and also sux since we don't know why it's not out yet, but hey. That's life. Programming's hard. If U4 comes out next week, I'll be happy. If all software (and construction) projects were only a week late, I'd be f***ing ecstatic... --J P.S. But I *want* that kernel update... :-/
U4 is scheduled to hit RHN on Monday.
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html
I don't think the problem is in the aacraid driver! I have a PE2250 with BIOS version A09 and PERC 3/Di firmware version 2.8.0 build 6092. After a kernel update to 2.4.21-27ELsmp the system won't boot. I tried both the aacraid and aacraid_10102 drivers, but the result was a series of "Segmentation fault" messages on each attempted filesystem operation. In the end the system froze. aacraid_10102 is a preserved version of the old aacraid driver, and if the problem were in the driver the system should be able to boot with it, but it won't. I think the problem is in the interaction between the kernel and the driver. We had identical problems with a PE1650 server which has just a SCSI controller (not RAID). The only possible way to bring the machines back was to use the old kernel -- 2.4.21-15.0.4
Vlady, The problem you are describing is not the same as the problem reported in this bugzilla. Please open a new bugzilla. When you do, provide the console output showing the driver being loaded and the device configuration messages, and the subsequent error messages. Also, which driver is being used in the non-RAID PE1650 system? Tom