From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.4.9-6 i686)

Description of problem:
We have a cluster of 12 machines (Athlon TB 1.33GHz, Asus A7V133, 768MB PC-133 RAM (Infineon), 3c996-T, 3Com 3c17700 Gigabit Switch) for scientific calculations, running self-compiled scientific software that uses LAM-MPI. When running parallel jobs, some of the nodes crash frequently with a kernel panic, even after we configured the switch to run at 100 Mbit full duplex only. On the other hand, heavy network usage without high CPU load produced no problems.

Version-Release number of selected component (if applicable):
kernel-2.4.9-6

How reproducible:
Sometimes

Steps to Reproduce:
Unfortunately, we cannot give access to our program, and we have not yet found another way to reproduce the kernel panic.

Additional info:
Ksymoops output from a serial console:

cat ~/parker.oops
ksymoops 2.4.0 on i686 2.4.9-6.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.9-6/ (default)
     -m /boot/System.map-2.4.9-6 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Warning (compare_maps): mismatch on symbol partition_name , ksyms_base says c01b61e0, System.map says c0157120.  Ignoring ksyms_base entry
Warning (compare_maps): mismatch on symbol nlmsvc_grace_period , lockd says f093fa94, /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o says f093eefc.  Ignoring /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nlmsvc_ops , lockd says f093fa90, /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o says f093eef8.  Ignoring /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nlmsvc_timeout , lockd says f093fa98, /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o says f093ef00.  Ignoring /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nfs_debug , sunrpc says f0931b00, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317e0.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol nfsd_debug , sunrpc says f0931b04, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317e4.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol nlm_debug , sunrpc says f0931b08, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317e8.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_debug , sunrpc says f0931afc, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317dc.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_garbage_args , sunrpc says f0931adc, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317bc.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_success , sunrpc says f0931acc, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317ac.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_system_err , sunrpc says f0931ae0, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317c0.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol xdr_one , sunrpc says f0931ac4, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317a4.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol xdr_two , sunrpc says f0931ac8, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317a8.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol xdr_zero , sunrpc says f0931ac0, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317a0.  Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry

Unable to handle kernel paging request at virtual address 2faadb65
f09044c0
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<f09044c0>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000   ebx: ef160140   ecx: 2faadb65   edx: ef1498c0
esi: 00000182   edi: ef120000   ebp: c02e9fa8   esp: c02e9f44
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c02e9000)
Stack: ef160140 00000000 ef160000 f08fd550 ef160140 efa45b60 04000001 0000000c
       c010825a 0000000c ef160000 c02e9fa8 c02e9fa8 0000000c c0322a80 efa45b60
       c01083d8 0000000c c02e9fa8 efa45b60 c0105390 c02e8000 c02e8000 c0105390
Call Trace: [<f08fd550>] bcm5700_probe [bcm5700] 0xfb0
[<c010825a>] handle_IRQ_event [kernel] 0x3a
[<c01083d8>] do_IRQ [kernel] 0x68
[<c0105390>] default_idle [kernel] 0x0
[<c0105390>] default_idle [kernel] 0x0
[<c020eccc>] call_do_IRQ [kernel] 0x5
[<c0105390>] default_idle [kernel] 0x0
[<c0105390>] default_idle [kernel] 0x0
[<c01053b3>] default_idle [kernel] 0x23
[<c0105432>] cpu_idle [kernel] 0x52
[<c0105000>] stext [kernel] 0x0
Code: c7 01 00 00 00 00 0f b7 42 08 0f b7 c0 83 e8 04 89 41 04 0f

>>EIP; f09044c0 <[bcm5700]LM_ServiceInterrupts+160/2e0>   <=====
Trace; f08fd550 <[bcm5700]bcm5700_interrupt+f0/1d0>
Trace; c010825a <handle_IRQ_event+3a/70>
Trace; c01083d8 <do_IRQ+68/b0>
Trace; c0105390 <default_idle+0/30>
Trace; c0105390 <default_idle+0/30>
Trace; c020eccc <call_do_IRQ+5/d>
Trace; c0105390 <default_idle+0/30>
Trace; c0105390 <default_idle+0/30>
Trace; c01053b3 <default_idle+23/30>
Trace; c0105432 <cpu_idle+52/70>
Trace; c0105000 <_stext+0/0>
Code;  f09044c0 <[bcm5700]LM_ServiceInterrupts+160/2e0>
00000000 <_EIP>:
Code;  f09044c0 <[bcm5700]LM_ServiceInterrupts+160/2e0>   <=====
   0:   c7 01 00 00 00 00         movl   $0x0,(%ecx)   <=====
Code;  f09044c6 <[bcm5700]LM_ServiceInterrupts+166/2e0>
   6:   0f b7 42 08               movzwl 0x8(%edx),%eax
Code;  f09044ca <[bcm5700]LM_ServiceInterrupts+16a/2e0>
   a:   0f b7 c0                  movzwl %ax,%eax
Code;  f09044cd <[bcm5700]LM_ServiceInterrupts+16d/2e0>
   d:   83 e8 04                  sub    $0x4,%eax
Code;  f09044d0 <[bcm5700]LM_ServiceInterrupts+170/2e0>
  10:   89 41 04                  mov    %eax,0x4(%ecx)
Code;  f09044d3 <[bcm5700]LM_ServiceInterrupts+173/2e0>
  13:   0f 00 00                  sldt   (%eax)

<0>Kernel panic: Aiee, killing interrupt handler!

15 warnings issued.  Results may not be reliable.
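
For anyone reading the trace: the "Oops: 0002" value can be decoded by hand. The following is a minimal sketch, assuming the usual i386 page-fault error-code layout (bit 0: protection fault vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The function name decode_i386_page_fault is made up for illustration; it is not part of ksymoops or the kernel.

```python
def decode_i386_page_fault(error_code: int) -> dict:
    """Decode the low three bits of an i386 page-fault error code.

    Assumed bit layout (hypothetical helper, not a kernel or ksymoops API):
      bit 0: 0 = page not present, 1 = protection fault
      bit 1: 0 = read access,      1 = write access
      bit 2: 0 = kernel mode,      1 = user mode
    """
    return {
        "cause": "protection fault" if error_code & 0x1 else "page not present",
        "access": "write" if error_code & 0x2 else "read",
        "mode": "user" if error_code & 0x4 else "kernel",
    }


if __name__ == "__main__":
    # "Oops: 0002" from the report; the oops prints the value in hex.
    print(decode_i386_page_fault(int("0002", 16)))
```

For 0002 this gives a kernel-mode write to a page that is not present. That agrees with the faulting instruction, movl $0x0,(%ecx), with ecx = 2faadb65, which is the virtual address named in the first line of the oops.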
Tweaking the BIOS settings made the problem reliably go away. I changed:
- Spread Spectrum -> disabled
- Byte Merge -> enabled
- PCI Master Read Caching -> enabled
- Delayed Transaction -> enabled
- PCI to DRAM Prefetch -> enabled

We have now been running completely stable for weeks, and for the last two weeks even with the PCI bus overclocked to 37 MHz.