Description of problem:
The machine had already panicked when I got to it. After power-cycling the machine and booting it into single-user mode, I ran "cat /proc/mdstat" and the machine panicked again. We devised the method listed under "Steps to Reproduce" to stop the RAID devices, then tried adding them back one by one to see whether any specific md was failing. After 2 tries and 2 panics on different devices, we filed this bug.

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:
1) Boot the machine with init=/bin/sh
2) mount -o remount,rw /
3) mount /proc
4) echo 100 > /proc/sys/dev/raid/speed_limit_max
5) umount /proc
6) raidstop /dev/md{11..27}
7) mount /proc
8) echo $value > /proc/sys/dev/raid/speed_limit_max
9) raidstart /dev/md{11..27}, waiting for each md to resync before starting another. This is where the panics occurred.

Actual results:
Kernel panicked (see additional info)

Expected results:
Resync

Additional info:

First panic:

init-2.05# scsi : aborting command due to timeout : pid 0, scsi4, channel 0, id 9, lun 0 Read (10) 00 04 30 32 3f 00 00 68 00
scsi : aborting command due to timeout : pid 0, scsi4, channel 0, id 9, lun 0 Read (10) 00 04 30 32 a7 00 00 08 00
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
f882fcc8
*pde = 00104001
Oops: 0000
Kernel 2.4.9-e.40enterprise
CPU: 4
EIP: 0010:[<f882fcc8>] Not tainted
EFLAGS: 00010002
EIP is at aic7xxx_handle_scsiint [aic7xxx] 0x258
eax: 0000000d ebx: f67f4084 ecx: f884f000 edx: 00000000
esi: f884f000 edi: 00000000 ebp: 00000000 esp: f7ffde28
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=f7ffd000)
Stack: 00000000 f2f07200 f8807be4 00000246 f7f27200 f2f072b4 00000001 f2f07200
       f8807e99 f7f27218 00000000 f7f27218 00000008 00000000 00000000 f2f07200
       f8808241 f2f07200 0000000d 00000000 00000001 00000001 00000001 0981fd28
Call Trace:
[<f8807be4>] scsi_queue_next_request [scsi_mod] 0x64 (0xf7ffde30)
[<f8807e99>] __scsi_end_request [scsi_mod] 0x1b9 (0xf7ffde48)
[<f8808241>] scsi_io_completion_Rsmp_4dff857c [scsi_mod] 0x2a1 (0xf7ffde68)
[<f8821d8f>] rw_intr [sd_mod] 0x20f (0xf7ffdeb0)
[<f8830f96>] aic7xxx_isr [aic7xxx] 0x296 (0xf7ffdec8)
[<c010756a>] nmi [kernel] 0x1e (0xf7ffdee0)
[<f8831008>] do_aic7xxx_isr [aic7xxx] 0x68 (0xf7ffdf10)
[<c0108d1e>] handle_IRQ_event [kernel] 0x5e (0xf7ffdf34)
[<c0108f41>] do_IRQ [kernel] 0xc1 (0xf7ffdf54)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffdf68)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffdf74)
[<c0246d70>] call_do_IRQ [kernel] 0x5 (0xf7ffdf78)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffdf7c)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffdf90)
[<c010544e>] default_idle [kernel] 0x2e (0xf7ffdfa4)
[<c01054b2>] cpu_idle [kernel] 0x32 (0xf7ffdfb0)
[<c011ceb8>] printk [kernel] 0xd8 (0xf7ffdfd0)
[<c0265ebe>] .rodata.str1.1 [kernel] 0xd79 (0xf7ffdfe4)
Code: 8b 07 0f b6 40 19 eb 05 b8 ff 00 00 00 50 31 ff 6a ff 55 0f
 <0>Kernel panic: not continuing
In interrupt handler - not syncing
NMI Watchdog detected LOCKUP on CPU7, eip f880bf2d, registers:
Kernel 2.4.9-e.40enterprise
CPU: 7
EIP: 0010:[<f880bf2d>] Not tainted
EFLAGS: 00000086
EIP is at .text.lock [scsi_mod] 0x3d
eax: f67f416c ebx: 00000293 ecx: 00000000 edx: 0006ce0e
esi: 00002328 edi: f67f4000 ebp: f2eadc00 esp: f7ff7ce0
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=f7ff7000)
Stack: 00000000 f2eadc00 00000000 00000000 00000003 f8807263 f2eadc00 00080000
       f50a4a00 f67f4000 00000010 f67f4084 f884f000 00000000 00000000 f8827f28
       f2eadc00 f67f4084 f882fcef f67f4084 00000001 f67f4084 00000009 00000000
Call Trace:
[<f8807263>] scsi_old_done [scsi_mod] 0x5a3 (0xf7ff7cf4)
[<f8827f28>] aic7xxx_done_cmds_complete [aic7xxx] 0x28 (0xf7ff7d1c)
[<f882fcef>] aic7xxx_handle_scsiint [aic7xxx] 0x27f (0xf7ff7d28)
[<c018b7a9>] scrup [kernel] 0x69 (0xf7ff7db4)
[<c01c5fd2>] vgacon_cursor [kernel] 0x1b2 (0xf7ff7dd4)
[<f8830f96>] aic7xxx_isr [aic7xxx] 0x296 (0xf7ff7de8)
[<c0190554>] poke_blanked_console [kernel] 0x64 (0xf7ff7e00)
[<c018fa56>] vt_console_print [kernel] 0x2a6 (0xf7ff7e0c)
[<f88384ef>] aic7xxx_abort [aic7xxx] 0x5f (0xf7ff7e30)
[<c011cd5b>] call_console_drivers [kernel] 0xeb (0xf7ff7e48)
[<c011ceb8>] printk [kernel] 0xd8 (0xf7ff7e74)
[<f880741e>] scsi_abort [scsi_mod] 0xde (0xf7ff7e84)
[<f88148dd>] .LC93 [scsi_mod] 0x4e3 (0xf7ff7e88)
[<f8807437>] scsi_abort [scsi_mod] 0xf7 (0xf7ff7e9c)
[<f8806a60>] scsi_old_times_out [scsi_mod] 0x0 (0xf7ff7eac)
[<f8806a9d>] scsi_old_times_out [scsi_mod] 0x3d (0xf7ff7eb4)
[<c0124de1>] __run_timers [kernel] 0xd1 (0xf7ff7ec8)
[<c0125373>] run_all_timers [kernel] 0x33 (0xf7ff7ee0)
[<c01213fb>] bh_action [kernel] 0x4b (0xf7ff7ef4)
[<c01212ac>] tasklet_hi_action [kernel] 0x6c (0xf7ff7efc)
[<c012102b>] do_softirq [kernel] 0x7b (0xf7ff7f14)
[<c0239f9d>] stext_lock [kernel] 0xf9d (0xf7ff7f30)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ff7f50)
[<c0114318>] smp_apic_timer_interrupt [kernel] 0xb8 (0xf7ff7f54)
[<c0108f63>] do_IRQ [kernel] 0xe3 (0xf7ff7f5c)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ff7f60)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ff7f68)
[<c0247729>] call_apic_timer_interrupt [kernel] 0x5 (0xf7ff7f74)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ff7f7c)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ff7f90)
[<c0240018>] stext_lock [kernel] 0x7018 (0xf7ff7f98)
[<c010544e>] default_idle [kernel] 0x2e (0xf7ff7fa4)
[<c01054b2>] cpu_idle [kernel] 0x32 (0xf7ff7fb0)
[<c011ceb8>] printk [kernel] 0xd8 (0xf7ff7fd0)
[<c0265ebe>] .rodata.str1.1 [kernel] 0xd79 (0xf7ff7fe4)
Code: f3 90 7e f9 e9 3f 47 ff ff 80 38 00 f3 90 7e f9 e9 dd 47 ff

Second panic:

init-2.05# ./raidstart /dev/md21
md: autorun ...
md: considering sdbd1 ...
md: adding sdbd1 ...
md: adding sdai1 ...
md: adding sdab1 ...
md: adding sdg1 ...
md: created md21
md: running: <sdbd1><sdai1><sdab1><sdg1>
md: md21: raid array is not clean -- starting background reconstruction
md21: max total readahead window set to 1536k
md21: 3 data-disks, max readahead per data-disk: 512k
raid5: device sdbd1 operational as raid disk 3
raid5: device sdai1 operational as raid disk 2
raid5: device sdab1 operational as raid disk 1
raid5: device sdg1 operational as raid disk 0
raid5: allocated 4339kB for md21
raid5: raid level 5 set md21 active with 4 out of 4 devices, algorithm 0
raid5: raid set md21 not clean; reconstructing parity
RAID5 conf printout:
md: syncing RAID array md21
md: minimum _guaranteed_ reconstruction speed: 100 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 50000 KB/sec) for reconstruction.
md: using 124k window, over a total of 35559680 blocks.
 --- rd:4 wd:4 fd:0
 disk 0, s:0, o:1, n:0 rd:0 us:1 dev:sdg1
 disk 1, s:0, o:1, n:1 rd:1 us:1 dev:sdab1
 disk 2, s:0, o:1, n:2 rd:2 us:1 dev:sdai1
 disk 3, s:0, o:1, n:3 rd:3 us:1 dev:sdbd1
RAID5 conf printout:
 --- rd:4 wd:4 fd:0
 disk 0, s:0, o:1, n:0 rd:0 us:1 dev:sdg1
 disk 1, s:0, o:1, n:1 rd:1 us:1 dev:sdab1
 disk 2, s:0, o:1, n:2 rd:2 us:1 dev:sdai1
 disk 3, s:0, o:1, n:3 rd:3 us:1 dev:sdbd1
md: updating md21 RAID superblock on device
md: ... autorun DONE.
init-2.05# scsi : aborting command due to timeout : pid 0, scsi5, channel 0, id 4, lun 0 Read (10) 00 00 4f aa 3f 00 02 d0 00
scsi : aborting command due to timeout : pid 0, scsi5, channel 0, id 4, lun 0 Read (10) 00 00 4f ad 0f 00 00 08 00
scsi : aborting command due to timeout : pid 0, scsi5, channel 0, id 4, lun 0 Read (10) 00 00 4f ad 17 00 00 08 00
scsi : aborting command due to timeout : pid 0, scsi5, channel 0, id 4, lun 0 Read (10) 00 00 4f ad 1f 00 00 08 00
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
f882fcc8
*pde = 00104001
Oops: 0000
Kernel 2.4.9-e.40enterprise
CPU: 3
EIP: 0010:[<f882fcc8>] Not tainted
EFLAGS: 00010002
EIP is at aic7xxx_handle_scsiint [aic7xxx] 0x258
eax: 0000000d ebx: f67df084 ecx: f8851000 edx: 00000000
esi: f8851000 edi: 00000000 ebp: 00000000 esp: f7fffe28
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=f7fff000)
Stack: 00000000 f2e27200 f8807be4 00000246 f351f800 f2e272b4 00000001 f2e27200
       f8807e99 f351f818 00000000 f351f818 00000008 00000000 00000000 f2e27200
       f8808241 f2e27200 0000000d 00000000 00000001 00000001 00000001 04f84960
Call Trace:
[<f8807be4>] scsi_queue_next_request [scsi_mod] 0x64 (0xf7fffe30)
[<f8807e99>] __scsi_end_request [scsi_mod] 0x1b9 (0xf7fffe48)
[<f8808241>] scsi_io_completion_Rsmp_4dff857c [scsi_mod] 0x2a1 (0xf7fffe68)
[<c0125196>] update_wall_time [kernel] 0x16 (0xf7fffe90)
[<c012541b>] run_local_timers [kernel] 0x8b (0xf7fffe98)
[<c012556b>] do_timer [kernel] 0xb (0xf7fffea4)
[<c010c5ff>] get_cmos_time [kernel] 0x3f (0xf7fffea8)
[<f8821d8f>] rw_intr [sd_mod] 0x20f (0xf7fffeb0)
[<f8830f96>] aic7xxx_isr [aic7xxx] 0x296 (0xf7fffec8)
[<c010756a>] nmi [kernel] 0x1e (0xf7fffee0)
[<f8831008>] do_aic7xxx_isr [aic7xxx] 0x68 (0xf7ffff10)
[<c0108d1e>] handle_IRQ_event [kernel] 0x5e (0xf7ffff34)
[<c0108f41>] do_IRQ [kernel] 0xc1 (0xf7ffff54)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffff68)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffff74)
[<c0246d70>] call_do_IRQ [kernel] 0x5 (0xf7ffff78)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffff7c)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffff90)
[<c010544e>] default_idle [kernel] 0x2e (0xf7ffffa4)
[<c01054b2>] cpu_idle [kernel] 0x32 (0xf7ffffb0)
[<c011ceb8>] printk [kernel] 0xd8 (0xf7ffffd0)
[<c0265ebe>] .rodata.str1.1 [kernel] 0xd79 (0xf7ffffe4)
Code: 8b 07 0f b6 40 19 eb 05 b8 ff 00 00 00 50 31 ff 6a ff 55 0f
 <0>Kernel panic: not continuing
In interrupt handler - not syncing
NMI Watchdog detected LOCKUP on CPU3, eip c0113aea, registers:
Kernel 2.4.9-e.40enterprise
CPU: 3
EIP: 0010:[<c0113aea>] Not tainted
EFLAGS: 00000297
EIP is at smp_call_function [kernel] 0xaa
eax: 000000f3 ebx: 000000f7 ecx: c036e7e0 edx: 00000060
esi: f7ffe000 edi: 00000000 ebp: 00104001 esp: f7fffca4
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=f7fff000)
Stack: c024efa0 00000000 f7fffdf4 c0113b90 c0113b40 00000000 00000001 00000000
       c011c669 00000000 00000000 c0107987 c02657b6 00000001 f7fffdf4 00000000
       c0118540 c0267653 f7fffdf4 00000000 c0267642 00104001 c0266c51 f882fcc8
Call Trace:
[<c024efa0>] call_spurious_interrupt [kernel] 0x781f (0xf7fffca4)
[<c0113b90>] smp_send_stop [kernel] 0x10 (0xf7fffcb0)
[<c0113b40>] stop_this_cpu [kernel] 0x0 (0xf7fffcb4)
[<c011c669>] panic [kernel] 0x99 (0xf7fffcc4)
[<c0107987>] die [kernel] 0x77 (0xf7fffcd0)
[<c02657b6>] .rodata.str1.1 [kernel] 0x671 (0xf7fffcd4)
[<c0118540>] do_page_fault [kernel] 0x380 (0xf7fffce4)
[<c0267653>] .rodata.str1.1 [kernel] 0x250e (0xf7fffce8)
[<c0267642>] .rodata.str1.1 [kernel] 0x24fd (0xf7fffcf4)
[<c0266c51>] .rodata.str1.1 [kernel] 0x1b0c (0xf7fffcfc)
[<f882fcc8>] aic7xxx_handle_scsiint [aic7xxx] 0x258 (0xf7fffd00)
[<c0267632>] .rodata.str1.1 [kernel] 0x24ed (0xf7fffd04)
[<c0267617>] .rodata.str1.1 [kernel] 0x24d2 (0xf7fffd08)
[<f8807e99>] __scsi_end_request [scsi_mod] 0x1b9 (0xf7fffd14)
[<f8808241>] scsi_io_completion_Rsmp_4dff857c [scsi_mod] 0x2a1 (0xf7fffd34)
[<f8807be4>] scsi_queue_next_request [scsi_mod] 0x64 (0xf7fffd6c)
[<f8807e99>] __scsi_end_request [scsi_mod] 0x1b9 (0xf7fffd84)
[<f8808241>] scsi_io_completion_Rsmp_4dff857c [scsi_mod] 0x2a1 (0xf7fffda4)
[<f8837ba8>] aic7xxx_queue [aic7xxx] 0x118 (0xf7fffdc8)
[<c01181c0>] do_page_fault [kernel] 0x0 (0xf7fffde0)
[<c01074e0>] error_code [kernel] 0x38 (0xf7fffde8)
[<f882fcc8>] aic7xxx_handle_scsiint [aic7xxx] 0x258 (0xf7fffe1c)
[<f8807be4>] scsi_queue_next_request [scsi_mod] 0x64 (0xf7fffe30)
[<f8807e99>] __scsi_end_request [scsi_mod] 0x1b9 (0xf7fffe48)
[<f8808241>] scsi_io_completion_Rsmp_4dff857c [scsi_mod] 0x2a1 (0xf7fffe68)
[<c0125196>] update_wall_time [kernel] 0x16 (0xf7fffe90)
[<c012541b>] run_local_timers [kernel] 0x8b (0xf7fffe98)
[<c012556b>] do_timer [kernel] 0xb (0xf7fffea4)
[<c010c5ff>] get_cmos_time [kernel] 0x3f (0xf7fffea8)
[<f8821d8f>] rw_intr [sd_mod] 0x20f (0xf7fffeb0)
[<f8830f96>] aic7xxx_isr [aic7xxx] 0x296 (0xf7fffec8)
[<c010756a>] nmi [kernel] 0x1e (0xf7fffee0)
[<f8831008>] do_aic7xxx_isr [aic7xxx] 0x68 (0xf7ffff10)
[<c0108d1e>] handle_IRQ_event [kernel] 0x5e (0xf7ffff34)
[<c0108f41>] do_IRQ [kernel] 0xc1 (0xf7ffff54)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffff68)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffff74)
[<c0246d70>] call_do_IRQ [kernel] 0x5 (0xf7ffff78)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffff7c)
[<c0105420>] default_idle [kernel] 0x0 (0xf7ffff90)
[<c010544e>] default_idle [kernel] 0x2e (0xf7ffffa4)
[<c01054b2>] cpu_idle [kernel] 0x32 (0xf7ffffb0)
[<c011ceb8>] printk [kernel] 0xd8 (0xf7ffffd0)
[<c0265ebe>] .rodata.str1.1 [kernel] 0xd79 (0xf7ffffe4)
Code: 8b 41 08 39 d8 75 f7 eb 02 f3 90 8b 41 0c 39 d8 75 f7 c6 05
console shuts up ...
NMI Watchdog detected LOCKUP on CPU2, eip f880bf2f, registers:
I am currently trying to bring the system up by enabling the raid devices one by one with the rebuild speed limit set to 30000. That seems to be keeping things stable.

Also, the traces that are already in this ticket do not reflect the initial failure or the proper kernel that was running when the failure occurred. The system was actually using a test kernel given to us by engineering that fixed a problem where the kernel would panic whenever /proc/mdstat was cat'd. I am once again using that kernel and trying to get the system running again.

Here are the two traces we got last night when the system initially crashed. Note that times are all in GMT-7...

<Jul/18 08:25 pm>Unable to handle kernel NULL pointer dereference at virtual address 00000000
<Jul/18 08:25 pm>*pde = 00104001
<Jul/18 08:25 pm>Oops: 0000
<Jul/18 08:25 pm>Kernel 2.4.9-e.39.1.testenterprise
<Jul/18 08:25 pm>CPU: 3
<Jul/18 08:25 pm>EIP: 0010:[<f882ecc8>] Not tainted
<Jul/18 08:25 pm>EFLAGS: 00010002
<Jul/18 08:25 pm>EIP is at aic7xxx_handle_scsiint [aic7xxx] 0x258
<Jul/18 08:25 pm>eax: 0000000d ebx: f67dc084 ecx: f8852000 edx: 00000000
<Jul/18 08:25 pm>esi: f8852000 edi: 00000000 ebp: 00000000 esp: c9c69e28
<Jul/18 08:25 pm>ds: 0018 es: 0018 ss: 0018
<Jul/18 08:25 pm>Process swapper (pid: 0, stackpage=c9c69000)
<Jul/18 08:25 pm>Stack: 00000000 f2ebb200 f8807be4 00000246 f5b5d000 f2ebb2b4 00000001 f2ebb200
<Jul/18 08:25 pm> f8807e99 f5b5d018 00000000 f5b5d018 00000100 00000000 00000000 f2ebb200
<Jul/18 08:25 pm> f8808241 f2ebb200 0000000d 00000000 00000001 00000001 00000001 0b0b98b0
<Jul/18 08:25 pm>Call Trace: [<f8807be4>] scsi_queue_next_request [scsi_mod] 0x64 (0xc9c69e30)
<Jul/18 08:25 pm>[<f8807e99>] __scsi_end_request [scsi_mod] 0x1b9 (0xc9c69e48)
<Jul/18 08:25 pm>[<f8808241>] scsi_io_completion_Rsmp_4dff857c [scsi_mod] 0x2a1 (0xc9c69e68)
<Jul/18 08:25 pm>[<f8820d8f>] rw_intr [sd_mod] 0x20f (0xc9c69eb0)
<Jul/18 08:25 pm>[<f882ff96>] aic7xxx_isr [aic7xxx] 0x296 (0xc9c69ec8)
<Jul/18 08:25 pm>[<f8830008>] do_aic7xxx_isr [aic7xxx] 0x68 (0xc9c69f10)
<Jul/18 08:25 pm>[<c0108d1e>] handle_IRQ_event [kernel] 0x5e (0xc9c69f34)
<Jul/18 08:25 pm>[<c0108f41>] do_IRQ [kernel] 0xc1 (0xc9c69f54)
<Jul/18 08:25 pm>[<c0105420>] default_idle [kernel] 0x0 (0xc9c69f68)
<Jul/18 08:25 pm>[<c0105420>] default_idle [kernel] 0x0 (0xc9c69f74)
<Jul/18 08:25 pm>[<c02469a4>] call_do_IRQ [kernel] 0x5 (0xc9c69f78)
<Jul/18 08:25 pm>[<c0105420>] default_idle [kernel] 0x0 (0xc9c69f7c)
<Jul/18 08:25 pm>[<c0105420>] default_idle [kernel] 0x0 (0xc9c69f90)
<Jul/18 08:25 pm>[<c010544e>] default_idle [kernel] 0x2e (0xc9c69fa4)
<Jul/18 08:25 pm>[<c01054b2>] cpu_idle [kernel] 0x32 (0xc9c69fb0)
<Jul/18 08:25 pm>[<c011ceb8>] printk [kernel] 0xd8 (0xc9c69fd0)
<Jul/18 08:25 pm>[<c0264fe1>] .rodata.str1.1 [kernel] 0xd5c (0xc9c69fe4)
<Jul/18 08:25 pm>Code: 8b 07 0f b6 40 19 eb 05 b8 ff 00 00 00 50 31 ff 6a ff 55 0f
<Jul/18 08:25 pm> <0>Kernel panic: not continuing
<Jul/18 08:25 pm>In interrupt handler - not syncing
<Jul/18 08:25 pm> NMI Watchdog detected LOCKUP on CPU7, eip f880bf2f, registers:
<Jul/18 08:25 pm>Kernel 2.4.9-e.39.1.testenterprise
<Jul/18 08:25 pm>CPU: 7
<Jul/18 08:25 pm>EIP: 0010:[<f880bf2f>] Not tainted
<Jul/18 08:25 pm>EFLAGS: 00000082
<Jul/18 08:25 pm>EIP is at .text.lock [scsi_mod] 0x3f
<Jul/18 08:25 pm>eax: f67dc16c ebx: 00000293 ecx: 00000000 edx: 3a28a5fc
<Jul/18 08:25 pm>esi: 00002328 edi: f67dc000 ebp: f2e69200 esp: f7ff7d38
<Jul/18 08:25 pm>ds: 0018 es: 0018 ss: 0018
<Jul/18 08:25 pm>Process swapper (pid: 0, stackpage=f7ff7000)
<Jul/18 08:25 pm>Stack: 00000000 f2e69200 00000000 00000000 00000003 f8807263 f2e69200 00080000
<Jul/18 08:25 pm> f434d800 f67dc000 00000010 f67dc084 f8852000 00000000 00000000 f8826f28
<Jul/18 08:25 pm> f2e69200 f67dc084 f882ecef f67dc084 00000001 f67dc084 0000000b 00000000
<Jul/18 08:25 pm>Call Trace: [<f8807263>] scsi_old_done [scsi_mod] 0x5a3 (0xf7ff7d4c)
<Jul/18 08:25 pm>[<f8826f28>] aic7xxx_done_cmds_complete [aic7xxx] 0x28 (0xf7ff7d74)
<Jul/18 08:25 pm>[<f882ecef>] aic7xxx_handle_scsiint [aic7xxx] 0x27f (0xf7ff7d80)
<Jul/18 08:25 pm>[<f882ff96>] aic7xxx_isr [aic7xxx] 0x296 (0xf7ff7e40)
<Jul/18 08:25 pm>[<c022ca02>] vsnprintf [kernel] 0x2c2 (0xf7ff7e64)
<Jul/18 08:25 pm>[<f88374ef>] aic7xxx_abort [aic7xxx] 0x5f (0xf7ff7e88)
<Jul/18 08:25 pm>[<c011cd5b>] call_console_drivers [kernel] 0xeb (0xf7ff7ea0)
<Jul/18 08:25 pm>[<c011ceb8>] printk [kernel] 0xd8 (0xf7ff7ecc)
<Jul/18 08:25 pm>[<f880741e>] scsi_abort [scsi_mod] 0xde (0xf7ff7edc)
<Jul/18 08:25 pm>[<f881413d>] .LC93 [scsi_mod] 0x4e3 (0xf7ff7ee0)
<Jul/18 08:25 pm>[<f8807437>] scsi_abort [scsi_mod] 0xf7 (0xf7ff7ef4)
<Jul/18 08:25 pm>[<f8806a60>] scsi_old_times_out [scsi_mod] 0x0 (0xf7ff7f04)
<Jul/18 08:25 pm>[<f8806a9d>] scsi_old_times_out [scsi_mod] 0x3d (0xf7ff7f0c)
<Jul/18 08:25 pm>[<c0124de1>] __run_timers [kernel] 0xd1 (0xf7ff7f20)
<Jul/18 08:25 pm>[<c0125424>] run_local_timers [kernel] 0x94 (0xf7ff7f38)
<Jul/18 08:25 pm>[<c0105420>] default_idle [kernel] 0x0 (0xf7ff7f50)
<Jul/18 08:25 pm>[<c0114318>] smp_apic_timer_interrupt [kernel] 0xb8 (0xf7ff7f54)
<Jul/18 08:25 pm>[<c0108f63>] do_IRQ [kernel] 0xe3 (0xf7ff7f5c)
<Jul/18 08:25 pm>[<c0105420>] default_idle [kernel] 0x0 (0xf7ff7f68)
<Jul/18 08:25 pm>[<c024735d>] call_apic_timer_interrupt [kernel] 0x5 (0xf7ff7f74)
<Jul/18 08:25 pm>[<c0105420>] default_idle [kernel] 0x0 (0xf7ff7f7c)
<Jul/18 08:25 pm>[<c0105420>] default_idle [kernel] 0x0 (0xf7ff7f90)
<Jul/18 08:25 pm>[<c010544e>] default_idle [kernel] 0x2e (0xf7ff7fa4)
<Jul/18 08:25 pm>[<c01054b2>] cpu_idle [kernel] 0x32 (0xf7ff7fb0)
<Jul/18 08:25 pm>[<c011ceb8>] printk [kernel] 0xd8 (0xf7ff7fd0)
<Jul/18 08:25 pm>[<c0264fe1>] .rodata.str1.1 [kernel] 0xd5c (0xf7ff7fe4)
<Jul/18 08:25 pm>Code: 7e f9 e9 3f 47 ff ff 80 38 00 f3 90 7e f9 e9 dd 47 ff ff 80
<Jul/18 08:25 pm>console shuts up ...
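For reference, the one-by-one bring-up described at the top of this comment could be scripted roughly as below. This is only a sketch, not what was actually run on the box: the function names are made up for illustration, the "resync"/"recovery" keywords are an assumption about how /proc/mdstat reports a rebuild in progress, and the loop reads /proc/mdstat, which per this ticket is only safe on a kernel that has the mdstat fix.

```shell
#!/bin/bash
# Sketch only. Paths default to the ones named in this ticket and can be
# overridden through the environment for a dry run against plain files.
MDSTAT="${MDSTAT:-/proc/mdstat}"
SPEED_LIMIT_MAX="${SPEED_LIMIT_MAX:-/proc/sys/dev/raid/speed_limit_max}"

# Succeeds while $MDSTAT mentions a rebuild in progress.
# ASSUMPTION: an active rebuild shows up as "resync" or "recovery".
resync_in_progress() {
    grep -Eq 'resync|recovery' "$MDSTAT"
}

# Start one array with raidstart, then poll until no rebuild is reported.
# $1 = md device, $2 = poll interval in seconds (default 60).
start_and_wait() {
    local md="$1" poll="${2:-60}"
    raidstart "$md" || return 1
    while resync_in_progress; do
        sleep "$poll"
    done
}

# Cap the rebuild speed, then bring the arrays up strictly one at a time.
# $1 = value for speed_limit_max, remaining arguments = md devices.
bring_up_arrays() {
    local limit="$1"; shift
    echo "$limit" > "$SPEED_LIMIT_MAX" || return 1
    local md
    for md in "$@"; do
        start_and_wait "$md" || return 1
    done
}
```

Example call: bring_up_arrays 30000 /dev/md11 /dev/md12 (and so on for the remaining arrays). Stopping at the first raidstart failure is a deliberate choice here, so that a bad array is noticed before the next one starts rebuilding.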
Just saw the update...here is the raidtab...

[root@db1 root]# cat /etc/raidtab
raiddev /dev/md1
    raid-level 1
    nr-raid-disks 2
    chunk-size 64k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sda3
    raid-disk 0
    device /dev/sdb3
    raid-disk 1
raiddev /dev/md0
    raid-level 1
    nr-raid-disks 2
    chunk-size 64k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sda2
    raid-disk 0
    device /dev/sdb2
    raid-disk 1
raiddev /dev/md7
    raid-level 1
    nr-raid-disks 2
    chunk-size 64k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sda7
    raid-disk 0
    device /dev/sdb7
    raid-disk 1
raiddev /dev/md8
    raid-level 1
    nr-raid-disks 2
    chunk-size 64k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sda8
    raid-disk 0
    device /dev/sdb8
    raid-disk 1
raiddev /dev/md6
    raid-level 1
    nr-raid-disks 2
    chunk-size 64k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sda6
    raid-disk 0
    device /dev/sdb6
    raid-disk 1
raiddev /dev/md5
    raid-level 1
    nr-raid-disks 2
    chunk-size 64k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sda5
    raid-disk 0
    device /dev/sdb5
    raid-disk 1

##### raid5
raiddev /dev/md11
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdj1
    raid-disk 0
    device /dev/sdq1
    raid-disk 1
    device /dev/sdal1
    raid-disk 2
    device /dev/sdas1
    raid-disk 3
raiddev /dev/md12
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdk1
    raid-disk 0
    device /dev/sdr1
    raid-disk 1
    device /dev/sdam1
    raid-disk 2
    device /dev/sdat1
    raid-disk 3
raiddev /dev/md13
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdl1
    raid-disk 0
    device /dev/sds1
    raid-disk 1
    device /dev/sdan1
    raid-disk 2
    device /dev/sdau1
    raid-disk 3
raiddev /dev/md14
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdm1
    raid-disk 0
    device /dev/sdt1
    raid-disk 1
    device /dev/sdao1
    raid-disk 2
    device /dev/sdav1
    raid-disk 3
raiddev /dev/md15
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdn1
    raid-disk 0
    device /dev/sdu1
    raid-disk 1
    device /dev/sdap1
    raid-disk 2
    device /dev/sdaw1
    raid-disk 3
raiddev /dev/md16
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdo1
    raid-disk 0
    device /dev/sdv1
    raid-disk 1
    device /dev/sdaq1
    raid-disk 2
    device /dev/sdax1
    raid-disk 3
raiddev /dev/md17
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdc1
    raid-disk 0
    device /dev/sdx1
    raid-disk 1
    device /dev/sdae1
    raid-disk 2
    device /dev/sdaz1
    raid-disk 3
raiddev /dev/md18
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdd1
    raid-disk 0
    device /dev/sdy1
    raid-disk 1
    device /dev/sdaf1
    raid-disk 2
    device /dev/sdba1
    raid-disk 3
raiddev /dev/md19
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sde1
    raid-disk 0
    device /dev/sdz1
    raid-disk 1
    device /dev/sdag1
    raid-disk 2
    device /dev/sdbb1
    raid-disk 3
raiddev /dev/md20
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdf1
    raid-disk 0
    device /dev/sdaa1
    raid-disk 1
    device /dev/sdah1
    raid-disk 2
    device /dev/sdbc1
    raid-disk 3
raiddev /dev/md21
    raid-level 5
    nr-raid-disks 4
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdg1
    raid-disk 0
    device /dev/sdab1
    raid-disk 1
    device /dev/sdai1
    raid-disk 2
    device /dev/sdbd1
    raid-disk 3

##### raid1
raiddev /dev/md22
    raid-level 1
    nr-raid-disks 2
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdh1
    raid-disk 0
    device /dev/sdbe1
    raid-disk 1
raiddev /dev/md23
    raid-level 1
    nr-raid-disks 2
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdac1
    raid-disk 0
    device /dev/sdaj1
    raid-disk 1
raiddev /dev/md24
    raid-level 1
    nr-raid-disks 2
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdw1
    raid-disk 0
    device /dev/sdar1
    raid-disk 1
raiddev /dev/md25
    raid-level 1
    nr-raid-disks 2
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdi1
    raid-disk 0
    device /dev/sdbf1
    raid-disk 1
raiddev /dev/md26
    raid-level 1
    nr-raid-disks 2
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdp1
    raid-disk 0
    device /dev/sday1
    raid-disk 1
raiddev /dev/md27
    raid-level 1
    nr-raid-disks 2
    chunk-size 128k
    persistent-superblock 1
    nr-spare-disks 0
    device /dev/sdad1
    raid-disk 0
    device /dev/sdak1
    raid-disk 1
There are likely several issues in this ticket, only one of which, the overflow of /proc/mdstat causing a panic, is addressed in U5. I think the timeouts and panics during a resync would go away if we rate limit to 10000 as in RHEL, as mentioned by Phil in comment #2. As for the original cause of the panics, perhaps Doug might be interested in those, since he knows the aic driver pretty well.
The panics in comment #2 are specific to the old aic7xxx. Would you try the new aic7xxx (aic7xxx_mod.o) and see if the problem is fixed there?
The new aic driver is now in use on the box, and the max speed limit is set to 10000.
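A quick way to confirm both settings on a box might look like the following. It is a sketch under assumptions: the function names are hypothetical, and it assumes the new driver appears in /proc/modules under the name aic7xxx_mod (taken from the aic7xxx_mod.o file name mentioned earlier in this ticket).

```shell
#!/bin/bash
# Sketch only. Both paths can be overridden for testing against plain files.
MODULES="${MODULES:-/proc/modules}"
SPEED_LIMIT_MAX="${SPEED_LIMIT_MAX:-/proc/sys/dev/raid/speed_limit_max}"

# Succeeds if the module named in $1 is listed in the first column of $MODULES.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$MODULES"
}

# Succeeds if the new driver is loaded and the resync cap is at or below $1
# (default 10000). ASSUMPTION: the new driver is listed as "aic7xxx_mod".
check_workaround() {
    local want="${1:-10000}" have
    module_loaded aic7xxx_mod || { echo "aic7xxx_mod is not loaded" >&2; return 1; }
    have=$(cat "$SPEED_LIMIT_MAX") || return 1
    [ "$have" -le "$want" ] || { echo "speed_limit_max is $have, expected <= $want" >&2; return 1; }
}
```

Running check_workaround with no arguments checks against the 10000 cap mentioned above; it prints the reason to stderr and returns non-zero when either condition does not hold.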
Any updates?
We have not had any issues with the machines since then. You can resolve.