Bug 55511

Summary: kernel panic when using 3com 3c996-T gigabit (bcm5700 module) under full load
Product: [Retired] Red Hat Linux
Reporter: Axel Kohlmeyer <axel.kohlmeyer>
Component: kernel
Assignee: Arjan van de Ven <arjanv>
Status: CLOSED WORKSFORME
QA Contact: Brock Organ <borgan>
Severity: high
Priority: medium
Version: 7.1   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2001-11-01 15:32:56 UTC

Description Axel Kohlmeyer 2001-11-01 15:32:51 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.4.9-6 i686)

Description of problem:
We have a cluster of 12 machines (Athlon TB 1.33 GHz, Asus A7V133, 768 MB
PC-133 RAM (Infineon), 3c996-T, 3Com 3c17700 Gigabit Switch) used for
scientific calculations with self-compiled scientific software using LAM-MPI.

When running parallel jobs, some of the nodes frequently crash
with a kernel panic, even with the switch configured to run
only at 100 Mbit full duplex. Heavy network usage without high
CPU load, on the other hand, produced no problems.


Version-Release number of selected component (if applicable):
kernel-2.4.9-6

How reproducible:
Sometimes

Steps to Reproduce:
Unfortunately, we cannot give access to our program,
and we have not yet found an alternative way to reproduce
the kernel panic.
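
Since the panic only appeared when network traffic and high CPU load coincided, one possible starting point for a stand-alone reproducer is to generate both at once. The following is a minimal sketch under stated assumptions, not a confirmed reproducer: the function names (burn_cpu, start_sink, stress) are hypothetical, it uses plain TCP rather than LAM-MPI, and nothing in this report shows that it triggers the panic. On a real cluster the sink would run on a different node than the sender, and a compiled compute kernel would load the CPU more heavily than a Python thread can.

```python
import socket
import threading
import time


def burn_cpu(deadline):
    """Spin on floating-point arithmetic until the deadline passes."""
    x, iterations = 1.0001, 0
    while time.monotonic() < deadline:
        x = (x * x) % 1000.0 + 1.0001
        iterations += 1
    return iterations


def start_sink(host="127.0.0.1", port=0):
    """Listen on host:port and discard all received data; return the bound port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)

    def drain():
        # Accept a single connection and read until the sender closes it.
        conn, _ = server.accept()
        with conn, server:
            while conn.recv(65536):
                pass

    threading.Thread(target=drain, daemon=True).start()
    return server.getsockname()[1]


def stress(host, port, seconds):
    """Send TCP data to host:port while a second thread keeps the CPU busy.

    Returns the number of bytes sent.
    """
    deadline = time.monotonic() + seconds
    burner = threading.Thread(target=burn_cpu, args=(deadline,))
    burner.start()
    sent, chunk = 0, b"\0" * 65536
    with socket.create_connection((host, port)) as sock:
        while time.monotonic() < deadline:
            sock.sendall(chunk)
            sent += len(chunk)
    burner.join()
    return sent
```

For a local smoke test, `stress("127.0.0.1", start_sink(), 5)` runs sender and sink on the same machine; that exercises only the loopback device, so to involve the bcm5700 driver the sink has to be started on another node and its address passed to `stress`.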


Additional info:

Ksymoops output from a serial console:

cat ~/parker.oops
ksymoops 2.4.0 on i686 2.4.9-6.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.9-6/ (default)
     -m /boot/System.map-2.4.9-6 (default)

Warning: You did not tell me where to find symbol information.  I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc.  ksymoops -h explains the options.

Warning (compare_maps): mismatch on symbol partition_name  , ksyms_base
says c01b61e0, System.map says c0157120.  Ignoring ksyms_base entry
Warning (compare_maps): mismatch on symbol nlmsvc_grace_period  , lockd
says f093fa94, /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o says f093eefc. 
Ignoring /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nlmsvc_ops  , lockd says
f093fa90, /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o says f093eef8. 
Ignoring /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nlmsvc_timeout  , lockd says
f093fa98, /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o says f093ef00. 
Ignoring /lib/modules/2.4.9-6/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nfs_debug  , sunrpc says
f0931b00, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317e0. 
Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol nfsd_debug  , sunrpc says
f0931b04, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317e4. 
Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol nlm_debug  , sunrpc says
f0931b08, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317e8. 
Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_debug  , sunrpc says
f0931afc, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317dc. 
Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_garbage_args  , sunrpc says
f0931adc, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317bc. 
Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_success  , sunrpc says
f0931acc, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317ac. 
Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_system_err  , sunrpc says
f0931ae0, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317c0. 
Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol xdr_one  , sunrpc says f0931ac4,
/lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317a4.  Ignoring
/lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol xdr_two  , sunrpc says f0931ac8,
/lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317a8.  Ignoring
/lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol xdr_zero  , sunrpc says
f0931ac0, /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o says f09317a0. 
Ignoring /lib/modules/2.4.9-6/kernel/net/sunrpc/sunrpc.o entry
Unable to handle kernel paging request at virtual address 2faadb65
f09044c0
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<f09044c0>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000   ebx: ef160140   ecx: 2faadb65   edx: ef1498c0
esi: 00000182   edi: ef120000   ebp: c02e9fa8   esp: c02e9f44
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c02e9000)
Stack: ef160140 00000000 ef160000 f08fd550 ef160140 efa45b60 04000001 0000000c
       c010825a 0000000c ef160000 c02e9fa8 c02e9fa8 0000000c c0322a80 efa45b60
       c01083d8 0000000c c02e9fa8 efa45b60 c0105390 c02e8000 c02e8000 c0105390
Call Trace: [<f08fd550>] bcm5700_probe [bcm5700] 0xfb0 
[<c010825a>] handle_IRQ_event [kernel] 0x3a 
[<c01083d8>] do_IRQ [kernel] 0x68 
[<c0105390>] default_idle [kernel] 0x0 
[<c0105390>] default_idle [kernel] 0x0 
[<c020eccc>] call_do_IRQ [kernel] 0x5 
[<c0105390>] default_idle [kernel] 0x0 
[<c0105390>] default_idle [kernel] 0x0 
[<c01053b3>] default_idle [kernel] 0x23 
[<c0105432>] cpu_idle [kernel] 0x52 
[<c0105000>] stext [kernel] 0x0 
Code: c7 01 00 00 00 00 0f b7 42 08 0f b7 c0 83 e8 04 89 41 04 0f 
>>EIP; f09044c0 <[bcm5700]LM_ServiceInterrupts+160/2e0>   <=====
Trace; f08fd550 <[bcm5700]bcm5700_interrupt+f0/1d0>
Trace; c010825a <handle_IRQ_event+3a/70>
Trace; c01083d8 <do_IRQ+68/b0>
Trace; c0105390 <default_idle+0/30>
Trace; c0105390 <default_idle+0/30>
Trace; c020eccc <call_do_IRQ+5/d>
Trace; c0105390 <default_idle+0/30>
Trace; c0105390 <default_idle+0/30>
Trace; c01053b3 <default_idle+23/30>
Trace; c0105432 <cpu_idle+52/70>
Trace; c0105000 <_stext+0/0>
Code;  f09044c0 <[bcm5700]LM_ServiceInterrupts+160/2e0>
00000000 <_EIP>:
Code;  f09044c0 <[bcm5700]LM_ServiceInterrupts+160/2e0>   <=====
   0:   c7 01 00 00 00 00         movl   $0x0,(%ecx)   <=====
Code;  f09044c6 <[bcm5700]LM_ServiceInterrupts+166/2e0>
   6:   0f b7 42 08               movzwl 0x8(%edx),%eax
Code;  f09044ca <[bcm5700]LM_ServiceInterrupts+16a/2e0>
   a:   0f b7 c0                  movzwl %ax,%eax
Code;  f09044cd <[bcm5700]LM_ServiceInterrupts+16d/2e0>
   d:   83 e8 04                  sub    $0x4,%eax
Code;  f09044d0 <[bcm5700]LM_ServiceInterrupts+170/2e0>
  10:   89 41 04                  mov    %eax,0x4(%ecx)
Code;  f09044d3 <[bcm5700]LM_ServiceInterrupts+173/2e0>
  13:   0f 00 00                  sldt   (%eax)

 <0>Kernel panic: Aiee, killing interrupt handler!

15 warnings issued.  Results may not be reliable.
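
The oops above can be read mechanically. On i386, the low three bits of the page-fault error code say whether the page was present, whether the access was a write, and whether the CPU was in user mode. The sketch below decodes "Oops: 0002" and looks for the register that holds the faulting address. The helper names (decode_page_fault, registers_holding) are hypothetical, and the excerpt is copied from the dump above.

```python
import re

OOPS_EXCERPT = """\
Unable to handle kernel paging request at virtual address 2faadb65
Oops: 0002
eax: 00000000   ebx: ef160140   ecx: 2faadb65   edx: ef1498c0
esi: 00000182   edi: ef120000   ebp: c02e9fa8   esp: c02e9f44
"""


def decode_page_fault(code):
    """Decode the low three bits of an i386 page-fault error code."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def registers_holding(oops_text, address):
    """Return the names of dumped registers whose value equals address."""
    regs = re.findall(r"\b(e[a-z]{2}):\s+([0-9a-f]{8})", oops_text)
    return [name for name, value in regs if int(value, 16) == address]


# Pull the faulting address and the error code out of the excerpt.
fault_addr = int(re.search(r"virtual address ([0-9a-f]{8})", OOPS_EXCERPT).group(1), 16)
error_code = int(re.search(r"Oops: ([0-9a-f]{4})", OOPS_EXCERPT).group(1), 16)

print(decode_page_fault(error_code))
print(registers_holding(OOPS_EXCERPT, fault_addr))
```

Here the decoded fault is a kernel-mode write to a page that is not present, and the only register holding the faulting address is ecx. That is consistent with the faulting instruction, movl $0x0,(%ecx), in LM_ServiceInterrupts: the driver stored through a pointer that did not point at mapped memory.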

Comment 1 Axel Kohlmeyer 2002-02-19 12:33:57 UTC
Changing the BIOS settings made the problem go away reliably.

I changed: 
- Spread Spectrum          -> disabled
- Byte Merge               -> enabled
- PCI Master Read Caching  -> enabled
- Delayed Transaction      -> enabled
- PCI to DRAM Prefetch     -> enabled

We have now been running completely stable for weeks, and for
the last two weeks even with the PCI bus overclocked to 37 MHz.