Bug 174019

Summary: TG3 driver crashes with BCM4704C chipset with heavy traffic
Product: Red Hat Enterprise Linux 4
Reporter: ALan Jay <alanj>
Component: kernel
Assignee: John W. Linville <linville>
Status: CLOSED ERRATA
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 4.0
CC: clalance, davem, jbaron, wansink
Hardware: x86_64
OS: Linux
Fixed In Version: RHSA-2006-0575
Doc Type: Bug Fix
Last Closed: 2006-08-10 21:37:25 UTC
Bug Blocks: 181409

Attachments:

Description	Flags
Azalee crash dump 24th November 2005 10am	none
Margote crash dump 25-Nov-2005 - Test Kernel 19	none
Noon Panic Margote - TG3 using older kernel	none
TG3 error report - 2 Dec 2005 - Margote - running 24 Kernel	none
3rd December Crash (Azalee)	none
Azalee sysreport from 3rd December crash	none

Description ALan Jay 2005-11-23 19:05:08 UTC

Kernel - 2.6.9-22.0.1.ELsmp

Description of problem:
We have 3 machines with Tyan S2882 and S2882-D motherboards, which have the 
Broadcom BCM4704C gigabit chipset.  Under heavy load on this network interface 
the machine crashes.

No output is sent to the serial console when the machine crashes.  I hope to 
get some output from the console to determine what is crashing and panicking.

In tests with a test kernel (which I assume has more debug code) the crash 
always occurs in the tg3 driver.

Version-Release number of selected component (if applicable):


How reproducible:
Kernel - 2.6.9-22.0.1.ELsmp


Steps to Reproduce:
1. Set up one of the machines (A) against another machine (B) (not one of the 
affected ones) on a private network.
2. From machine "A", copy a 100Mb file to machine "B".  Create a script to do 
this (via scp) and repeat the process.  Place the script in the background and 
run a number of instances of it.
3. At the same time, repeat this from machine "B", copying the file to machine 
"A"; again run the script in multiple instances.
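
The steps above can be sketched as a small script. This is only an illustration, not the script actually used on these machines: the function names, the file path, the host name and the instance counts are all hypothetical, and it assumes passwordless scp between the two machines.

```python
import subprocess
import threading


def scp_command(local_file, remote_host, remote_path):
    """Build the scp argument list for one copy (-q suppresses the progress meter)."""
    return ["scp", "-q", local_file, f"{remote_host}:{remote_path}"]


def copy_loop(cmd, iterations, runner=subprocess.call):
    """Run the same copy command repeatedly; return how many runs failed."""
    failures = 0
    for _ in range(iterations):
        if runner(cmd) != 0:
            failures += 1
    return failures


def run_load(local_file, remote_host, remote_path, instances, iterations,
             runner=subprocess.call):
    """Start several copy loops in parallel, like backgrounding several scripts."""
    cmd = scp_command(local_file, remote_host, remote_path)
    results = [None] * instances

    def worker(index):
        results[index] = copy_loop(cmd, iterations, runner)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(instances)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results  # one failure count per instance
```

On machine "A" this would be called as, for example, `run_load("/tmp/test100M", "machineB", "/tmp/", instances=8, iterations=1000)`, with the mirror-image call running on machine "B" at the same time.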
  
Actual results:
In tests when nothing much else was running on the machine, it took about 
6-8 instances of this copy process to crash the machine.  When running 
an application we are trying to implement, this is reduced to just 2 instances.

The machine crashes every time after a certain level of access.  This chipset 
has 2 ethernet connections.  If we perform this on ETH0, then when the crash 
occurs something happens on ETH2 that causes other machines on the ETH2 network 
to think that the network is no longer available.  The application only thinks 
that the network is available again once the machine is rebooted (I have no 
idea what causes this side effect).


Expected results:
The machine shouldn't have crashed.

Additional info:
Machines are Tyan S2882 and S2782D motherboards with twin Opteron, 8Gb RAM, 
LSI320 RAID controller and BCM4704C gigabit ethernet controller.

An example of the type of panic we have had:

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at tg3:2864
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random shpchp e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-22.18.EL.jwltest.89smp
RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
RSP: 0000:00000101fff8be78  EFLAGS: 00010246
RAX: 00000000000001c8 RBX: 0000010037dddac0 RCX: 0000000000000001
RDX: 00000101efadd600 RSI: 00000000000005ea RDI: 0000000000000246
RBP: 0000000000000000 R08: 00000000000005ea R09: 0000000000000000
R10: 0000000000000000 R11: 00000100edc4ca40 R12: 0000010004955380
R13: 00000000000001c8 R14: 0000000000000000 R15: 00000000000000bc
FS:  0000000000f0c580(0000) GS:ffffffff804d4280(0000) knlGS:000000000852cfc0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002aab8214c8 CR3: 00000000fbfa2000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 00000101fff82000, task 000001010000a030)
Stack: 00000101fff8bea8 ffffffff80131623 000001010385fa60 000000000000000f
       00000100eb2c67f0 0000000000000001 0000010037d96000 00000101fff8bf1c
       0000010004955000 000001010385fa60
Call Trace:<IRQ> <ffffffff80131623>{activate_task+124} <ffffffff802aac4b>{net_rx_action+129}
       <ffffffff8013bbe8>{__do_softirq+88} <ffffffff8013bc91>{do_softirq+49}
       <ffffffff80112fb7>{do_IRQ+328} <ffffffff8011065b>{ret_from_intr+0}
        <EOI> <ffffffff8010e609>{default_idle+0} <ffffffff8010e629>{default_idle+32}
       <ffffffff8010e69c>{cpu_idle+26}

Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
 <0>Kernel panic - not syncing: Oops

---------------------------------
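
As a cross-check on the panic above, the "Code:" bytes can be decoded by hand. The sketch below assumes the x86_64 BUG() layout I believe this kernel generation used (a 2-byte ud2 opcode, then an 8-byte pointer to the file name string, then a 2-byte line number); that layout is my assumption and is not stated anywhere in this report. The function name is made up for illustration.

```python
import struct


def decode_bug_frame(code_line):
    """Decode a 'Code:' line that starts at a BUG() site.

    Assumed layout (not confirmed by this report): ud2 (0f 0b),
    then a little-endian 8-byte file name pointer and 2-byte line number.
    Returns (filename_pointer, line_number), or None if the bytes do not
    start with ud2 or the line has no 'Code:' marker.
    """
    if "Code:" not in code_line:
        return None
    hex_bytes = code_line.split("Code:", 1)[1].split()
    raw = bytes(int(b, 16) for b in hex_bytes)
    if len(raw) < 12 or raw[:2] != b"\x0f\x0b":
        return None
    filename_ptr, line = struct.unpack_from("<QH", raw, 2)
    return filename_ptr, line


result = decode_bug_frame(
    "Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98"
)
print(hex(result[0]), result[1])  # prints: 0xffffffffa0092838 2864
```

Under that assumption the decoded line number, 2864, agrees with the "Kernel BUG at tg3:2864" message, and the pointer falls in the same module address range as the tg3_poll RIP.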

^M^@CPU 1: Machine Check Exception:                4 Bank 4: b200000000070f0f
^M^@TSC 1e3b3d5fa02cc

^M^@CPU 0: Machine Check Exception:                4 Bank 4: b200000000070f0f
^M^@TSC 1e3b3d5fa1384
^M^@Kernel panic - not syncing: Machine check
^M^@ NMI Watchdog detected LOCKUP, CPU=1, registers:
^M^@CPU 1
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle 
arptable_filter arp_tables ip_queue md
5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket 
pcmcia_core ipt_REJECT ipt_state ip_conn
track iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd 
hw_random e100 mii bcm5700(U) ext3 jbd meg
araid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 2660, comm: dispatcher Tainted: G   M  2.6.9-22.0.1.ELsmp
^M^@RIP: 0010:[<ffffffff8011bcbe>] <ffffffff8011bcbe>{__smp_call_function+100}
^M^@RSP: 0000:00000100f7fa5cb8  EFLAGS: 00000097
^M^@RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
^M^@RDX: 0000ffff0000ffff RSI: 0000000000000000 RDI: 0000000000000002
^M^@RBP: 0000000000000000 R08: 0000000000000008 R09: 0000000000000000
^M^@R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8011bd63
^M^@R13: 0000000000000000 R14: 0001e3b3d5f9fa9b R15: ffffffff80317b57
^M^@FS:  0000002a96533320(0000) GS:ffffffff804d3100(005b) knlGS:00000000080c5b40
^M^@CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
^M^@CR2: 0000002aaa1e2000 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process dispatcher (pid: 2660, threadinfo 00000101fd7a8000, task 
00000101fc89c030)
^M^@Stack: ffffffff8011bd63 0000000000000000 0000000000000000 0000000000000000
^M^@       0000000000000012 0000000000000000 0000000000000900 00000000ffffffff
^M^@       ffffffff803ceba0 ffffffff8011bda0
^M^@Call Trace:<ffffffff8011bd63>{smp_really_stop_cpu+0} <ffffffff8011bda0>
{smp_send_stop+52}
^M^@<ffffffff801368a6>{panic+235} <ffffffff801176b4>{print_mce+136}
^M^@<ffffffff8011778c>{mce_available+0} <ffffffff80117adf>{do_machine_check+825}
^M^@<ffffffff801111db>{machine_check+127} <ffffffff801e9b4b>{__delay+7}
^M^@ <EOE> <ffffffffa008d82c>{:bcm5700:LM_ReadPhy+121}
^M^@       <ffffffffa0086697>{:bcm5700:bcm5700_ioctl+242} <ffffffff8013144f>
{activate_task+124}
^M^@       <ffffffff801346f7>{autoremove_wake_function+9} <ffffffff80132eaa>
{__wake_up_common+67}
^M^@       <ffffffff80132eff>{__wake_up+54} <ffffffff803006aa>{packet_rcv+873}
^M^@       <ffffffffa00848f8>{:bcm5700:bcm5700_start_xmit+1128}
^M^@       <ffffffff802a95e6>{dev_queue_xmit_nit+240} <ffffffff802b81e8>
{qdisc_restart+30}
^M^@       <ffffffff802a9a47>{dev_queue_xmit+525} <ffffffff80300e5a>
{packet_sendmsg+522}
^M^@       <ffffffff801313c1>{recalc_task_prio+337} <ffffffff802a0c37>
{sock_sendmsg+271}
^M^@       <ffffffff8010eb99>{__switch_to+289} <ffffffff80131c39>
{finish_task_switch+55}
^M^@       <ffffffff802a6e83>{datagram_poll+0} <ffffffff803032e8>
{thread_return+42}
^M^@       <ffffffff802aab10>{dev_ifsioc+1176} <ffffffff802aaeee>{dev_ioctl+975}
^M^@       <ffffffff80131c39>{finish_task_switch+55} <ffffffff802e8e90>
{inet_ioctl+166}
^M^@       <ffffffff802a1599>{sock_ioctl+699} <ffffffff801885b5>{sys_ioctl+853}
^M^@       <ffffffff8019bc44>{compat_sys_ioctl+235} <ffffffff8012515d>
{ia32_sysret+0}
^M^@

^M^@Code: 39 d8 74 02 eb f6 85 ed 74 0a 8b 44 24 14 39 d8 74 02 eb f6
^M^@Kernel panic - not syncing: nmi watchdog

-------------------------------------------------------------------

----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 1
^M^@Modules linked in: arpt_mangle arptable_filter arp_tables ip_queue md5 ipv6 
parport_pc lp parport autofs4 sunrpc
ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter 
ip_tables dm_mirror dm_mod button batter
y ac ohci_hcd hw_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm 
sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.16.EL.jwltest.86smp
^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:00000101fff8be78  EFLAGS: 00010246
^M^@RAX: 0000000000000197 RBX: 00000101fcabd628 RCX: 0000010004b95000
^M^@RDX: 0000000000000197 RSI: 00000101fff8bf1c RDI: 00000101fe479000
^M^@RBP: 0000000000000000 R08: 00000101fff82000 R09: 0000000000000082
^M^@R10: 0000000000000082 R11: 0000000000000002 R12: 00000101fe479380
^M^@R13: 0000000000000197 R14: 00000101fff83e98 R15: 0000000000000000
^M^@FS:  0000000000d7f700(0000) GS:ffffffff804d4200(0000) knlGS:0000000008c848c0
^M^@CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7b65060 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo 00000101fff82000, task 000001010000a030)
^M^@Stack: 00000101fff8bea8 00000101fe47942c 00000001fc821812 000000010000000f
^M^@       0000022302237030 0000003f00010000 0000010004b95000 00000101fff8bf1c
^M^@       00000101fe479000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802aabdb>{net_rx_action+129} <ffffffff8013bbd4>
{__do_softirq+88}
^M^@       <ffffffff8013bc7d>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
^M^@       <ffffffff8011065b>{ret_from_intr+0}  <EOI> <ffffffff8010e609>
{default_idle+0}
^M^@       <ffffffff8010e629>{default_idle+32} <ffffffff8010e69c>{cpu_idle+26}
^M^@

^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
^M^@ <0>Kernel panic - not syncing: Oops

----------------------------------------------------------------------------

----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 1
^M^@Modules linked in: arpt_mangle arptable_filter arp_tables ip_queue md5 ipv6 
parport_pc lp parport autofs4 sunrpc
ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter 
ip_tables dm_mirror dm_mod button batter
y ac ohci_hcd hw_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm 
sata_sil libata sd_mod scsi_mod
^M^@Pid: 2795, comm: mysqld Not tainted 2.6.9-22.17.EL.jwltest.87smp
^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:00000101fff8be78  EFLAGS: 00010246
^M^@RAX: 00000000000000b9 RBX: 00000100f7b7c158 RCX: 0000010100000000
^M^@RDX: 0000000000000206 RSI: 00000100f5f1b978 RDI: 0000000000000206
^M^@RBP: 0000000000000000 R08: 00000100f5f1b978 R09: 0000000000000720
^M^@R10: 0000000100000000 R11: ffffffff8011de40 R12: 00000101ff69a380
^M^@R13: 00000000000000b9 R14: 0000000000000000 R15: 0000000000000000
^M^@FS:  0000000000b3a800(005b) GS:ffffffff804d4200(0000) knlGS:000000000852cfc0
^M^@CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
^M^@CR2: 00000000f7b65060 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process mysqld (pid: 2795, threadinfo 00000100f7102000, task 
00000100f7b687f0)
^M^@Stack: 00000101fff8bea8 00000101ff69a42c 00000001e437f012 000000010000000f
^M^@       000000ee00ee3030 0000003f00010000 00000100f7b0c000 00000101fff8bf1c
^M^@       00000101ff69a000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802aabfb>{net_rx_action+129} <ffffffff8013bbe8>
{__do_softirq+88}
^M^@       <ffffffff8013bc91>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
^M^@       <ffffffff8011065b>{ret_from_intr+0}  <EOI>

^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
^M^@ <0>Kernel panic - not syncing: Oops
^M^@rtc: lost some interrupts at 1024Hz.

----------------------------------------------------------------------

^M^@CPU 1: Machine Check Exception:                4 Bank 4: b200000000070f0f
^M^@TSC 73f188bc502

^M^@CPU 0: Machine Check Exception:                4 Bank 4: b200000000070f0f
^M^@TSC 73f188bc649
^M^@Kernel panic - not syncing: Machine check
^M^@ NMI Watchdog detected LOCKUP, CPU=1, registers:
^M^@CPU 1
^M^@Modules linked in: arpt_mangle arptable_filter arp_tables ip_queue md5 ipv6 
parport_pc lp parport autofs4 sunrpc
ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter 
ip_tables dm_mirror dm_mod button batter
y ac ohci_hcd hw_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm 
sata_sil libata sd_mod scsi_mod
^M^@Pid: 11397, comm: ssh Tainted: G   M  2.6.9-22.18.EL.jwltest.89smp
^M^@RIP: 0010:[<ffffffff8011bceb>] <ffffffff8011bceb>{__smp_call_function+106}
^M^@RSP: 0018:00000100f7fa5cb8  EFLAGS: 00000097
^M^@RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
^M^@RDX: 0000ffff0000ffff RSI: 0000000000000000 RDI: 0000000000000002
^M^@RBP: 0000000000000000 R08: 0000000000000008 R09: 0000000000000000
^M^@R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8011bd8e
^M^@R13: 0000000000000000 R14: 0000073f188bbe25 R15: ffffffff803187f1
^M^@FS:  0000002a96708040(0000) GS:ffffffff804d4280(0000) knlGS:00000000080c1080
^M^@CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
^M^@CR2: 0000002a9642d8c0 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process ssh (pid: 11397, threadinfo 00000100f6fb6000, task 00000101f9526030)
^M^@Stack: ffffffff8011bd8e 0000000000000000 0000000000000000 0000000000000000
^M^@       0000000000000016 0000000000000000 0000000000000900 00000000ffffffff
^M^@       ffffffff803cfc20 ffffffff8011bdcb
^M^@Call Trace:<ffffffff8011bd8e>{smp_really_stop_cpu+0} <ffffffff8011bdcb>
{smp_send_stop+52}
^M^@<ffffffff80136e0e>{panic+235} <ffffffff801176b8>{print_mce+136}
^M^@<ffffffff80117790>{mce_available+0} <ffffffff80117ae3>{do_machine_check+825}
^M^@<ffffffff8011121b>{machine_check+127} <ffffffffa00891b8>
{:tg3:tg3_start_xmit+1707}
^M^@ <EOE> <ffffffffa00e8ef4>{:ip_conntrack:__ip_conntrack_confirm+448}
^M^@       <ffffffff802b8dd8>{qdisc_restart+254} <ffffffff802c5cf3>
{ip_finish_output2+0}
^M^@       <ffffffff802aa3c2>{dev_queue_xmit+228} <ffffffff802b3619>
{nf_hook_slow+184}
^M^@       <ffffffff802c607b>{ip_finish_output+478} <ffffffff802c59b0>
{dst_output+0}
^M^@       <ffffffff802c59c6>{dst_output+22} <ffffffff802b3619>
{nf_hook_slow+184}
^M^@       <ffffffff802c6479>{ip_queue_xmit+1011} <ffffffff802c1baf>
{__ip_route_output_key+1972}
^M^@       <ffffffff802c1baf>{__ip_route_output_key+1972} <ffffffff8018f087>
{update_atime+147}
^M^@       <ffffffff802d5b11>{tcp_transmit_skb+2037} <ffffffff802d7f3f>
{tcp_connect+727}
^M^@       <ffffffff802dac68>{tcp_v4_connect+2275} <ffffffff802e947a>
{inet_stream_connect+170}
^M^@       <ffffffff8017836c>{fget+75} <ffffffff802a29ac>{sys_connect+114}
^M^@       <ffffffff80176d52>{fd_install+42} <ffffffff802a144f>{sock_map_fd+59}
^M^@       <ffffffff80110092>{system_call+126}

^M^@Code: eb f4 85 ed 74 0c 8b 44 24 14 39 d8 74 04 f3 90 eb f4 48 83
^M^@Kernel panic - not syncing: nmi watchdog
^M^@
--------------------------------------------------------------------

Comment 1 ALan Jay 2005-11-24 13:04:26 UTC

Created attachment 121451 [details]
Azalee crash dump 24th November 2005 10am

This is today's crash dump. I am running one of John Linville's test kernels at
the moment; should I switch back to the default kernel?

Comment 2 John W. Linville 2005-11-24 22:38:33 UTC

Alan, please ensure that you are using the current/latest kernel from here:

   http://people.redhat.com/linville/kernels/rhel4/

I would presume that you are doing so, but it is worth asking... :-)

Comment 3 ALan Jay 2005-11-24 23:07:42 UTC

I have been using these and will do so again. I was told by support, however, 
to go back to the earlier 2.6.9-22.0.1.ELsmp, but I am very happy to return to 
the latest version of the test kernel if that is of most use to you.

I currently have one machine running the 19 kernel and 2 running 
2.6.9-22.0.1.ELsmp. I will see what happens overnight and report back if there 
are any more crashes.

For some reason I get reports on the serial console much more frequently with 
the test kernels than I do with the release one.

Comment 4 ALan Jay 2005-11-25 07:21:39 UTC

Created attachment 121476 [details]
Margote crash dump 25-Nov-2005 - Test Kernel 19

Another crash overnight: machine Margote crashed after being up for 4 days.
I also noticed the following in the serial console output:


<ConMan> Console [margote] log at 2005-11-24 18:10:00 GMT.
1249
ip_queue: full at 1024 entries, dropping packet(s).
ip_queue: full at 1024 entries, dropping packet(s).
ip_queue: full at 1024 entries, dropping packet(s).
ip_queue: full at 1024 entries, dropping packet(s).
ip_queue: full at 1024 entries, dropping packet(s).
ip_queue: full at 1024 entries, dropping packet(s).
ip_queue: full at 1024 entries, dropping packet(s).
ip_queue: full at 1024 entries, dropping packet(s).
ip_queue: full at 1024 entries, dropping packet(s).
ip_queue: full at 1024 entries, dropping packet(s).
printk: 525 messages suppressed.
ip_queue: full at 1024 entries, dropping packet(s).
printk: 577 messages suppressed.
ip_queue: full at 1024 entries, dropping packet(s).

[root@margote 18:15:26 ] ~ # 1250

<ConMan> Console [margote] log at 2005-11-24 18:20:00 GMT.
1251

This was at 6pm yesterday. I don't think it is significant but include it for
completeness.  Some network maintenance was going on at the time, so it could
have been caused by that.
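
For anyone wanting to size the burst in a console log like the one above, here is a rough sketch that tallies the ip_queue drop messages. The function name is made up, and note that the "messages suppressed" counts come from printk rate limiting in general, so treating them all as ip_queue drops is an assumption.

```python
import re

SUPPRESSED_RE = re.compile(r"printk: (\d+) messages suppressed")


def count_ip_queue_drops(log_text):
    """Return (printed, suppressed) counts for a console log excerpt.

    printed    - number of 'ip_queue: full' lines that reached the console
    suppressed - sum of the printk rate-limit counters in the excerpt
                 (assumed, not known, to be ip_queue messages as well)
    """
    printed = 0
    suppressed = 0
    for line in log_text.splitlines():
        if "ip_queue: full at" in line:
            printed += 1
        match = SUPPRESSED_RE.search(line)
        if match:
            suppressed += int(match.group(1))
    return printed, suppressed
```

Each message only says that one or more packets were dropped, so the totals are a lower bound on the number of packets lost.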

Comment 5 ALan Jay 2005-11-25 18:15:19 UTC

Our machines are now running the 19 test kernel and I will report back any 
additional problems.

I know debugging this is hard. What we have found is that when a machine 
crashes, our application will not restart until the crashed machine is 
rebooted, as though the crash has caused some disturbance on the second 
network.  

Is there a way to tell if the crash is in eth0 or eth1?

We are using both networks: eth0 is used for application data access and 
eth1 is used for communication between the servers and for a heartbeat 
connection (this is why we find it odd that our application won't use the 
eth1 network after a crash until the offending machine has been rebooted).

As I said, I have no idea if this is of any help, but I know that sometimes 
the smallest things help :)

I'll report any more errors if we get any.

Regards
ALan

Comment 6 ALan Jay 2005-11-26 23:46:21 UTC

Just had a crash but not much in the way of output on the serial console:

343

CPU 0: Machine Check Exception:                4 Bank 4: f200000000070f0f
TSC df2f64d64b50
Kernel panic - not syncing: Machine check

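
The bank status word in a Machine Check Exception line can be split into its architectural flag bits. The sketch below decodes only the top flag bits and the low 16-bit error code as defined by the x86 machine-check architecture; the function and field names are my own, and bits 16-31 are returned raw because their meaning is processor-specific and I have not tried to interpret them here.

```python
# Architectural flag bits of an MCi_STATUS register (bit number, name).
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # an earlier error was overwritten (overflow)
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting was enabled for this error
    (59, "MISCV"),  # MISC register holds valid extra data
    (58, "ADDRV"),  # ADDR register holds a valid address
    (57, "PCC"),    # processor context may be corrupt
]


def decode_mci_status(status_hex):
    """Split a 64-bit bank status value (hex string) into flags and codes."""
    status = int(status_hex, 16)
    flags = [name for bit, name in MCI_STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,             # bits 0-15
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 16-31, left raw
    }


print(decode_mci_status("f200000000070f0f")["flags"])
# prints: ['VAL', 'OVER', 'UC', 'EN', 'PCC']
```

Compared with the b200000000070f0f value in the original description, the only difference in this crash is that the OVER bit is also set.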

Comment 7 ALan Jay 2005-11-27 17:20:16 UTC

Another crash report:

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at tg3:2864
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: arpt_mangle arptable_filter arp_tables ip_queue md5 ipv6 
parport_pc l
p parport autofs4 sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state 
ip_conntrack i
ptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random 
e100 mii t
g3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod
Pid: 7188, comm: mysqld Not tainted 2.6.9-22.19.EL.jwltest.90smp
RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
RSP: 0000:00000101fff8be78  EFLAGS: 00010246
RAX: 0000000000000174 RBX: 00000100f7a752e0 RCX: 0000010000011000
RDX: 0000000000000206 RSI: 00000101eb60fbb8 RDI: 0000000000000206
RBP: 0000000000000000 R08: 00000101eb60fbb8 R09: 0000002aa8610501
R10: 0000000100000000 R11: ffffffffa008428c R12: 00000100f7f01380
R13: 0000000000000174 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000c2eba0(005b) GS:ffffffff804d4400(0000) knlGS:0000000008c848c0
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002aa8d59000 CR3: 00000000f7fa2000 CR4: 00000000000006e0
Process mysqld (pid: 7188, threadinfo 00000101ece28000, task 00000100f4df9030)
Stack: 00000000fff8bea8 00000100f7f0142c 00000001fe4d6812 0000000100000001
       0000024202426100 0000003f00010000 00000100f7ad1000 00000101fff8bf1c
       00000100f7f01000 0000000000000202
Call Trace:<IRQ> <ffffffff802aadb3>{net_rx_action+129} <ffffffff8013bc30>
{__do_softirq+8
8}
       <ffffffff8013bcd9>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
       <ffffffff8011065b>{ret_from_intr+0}  <EOI>

Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
 <0>Kernel panic - not syncing: Oops

------------------------------------------------------------------------------
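
Since the same panic keeps recurring across kernels, a small parser helps compare the reports. This is a sketch only: the function name and the returned field names are made up, and the patterns are written against the oops format pasted in this bug, not any general specification.

```python
import re

BUG_RE = re.compile(r"Kernel BUG at (\S+):(\d+)")
# Matches e.g. "RIP: 0010:[<addr>] <addr>{:tg3:tg3_poll+177}"; module is optional.
RIP_RE = re.compile(r"RIP: \S+\s+<[0-9a-f]+>\{:?(?:(\w+):)?(\w+)\+(\d+)\}")


def summarize_oops(text):
    """Extract BUG location, process name, kernel version and RIP symbol."""
    summary = {}
    for line in text.splitlines():
        # Drop the literal "^M^@" serial-console debris seen in these logs.
        line = line.replace("^M^@", "").strip()
        match = BUG_RE.search(line)
        if match:
            summary["bug_at"] = (match.group(1), int(match.group(2)))
        match = RIP_RE.search(line)
        if match and "rip" not in summary:
            summary["rip"] = (match.group(1), match.group(2), int(match.group(3)))
        if line.startswith("Pid:"):
            parts = line.split()
            summary["comm"] = parts[3]     # process name after "comm:"
            summary["kernel"] = parts[-1]  # kernel version is the last field
    return summary
```

Run over the report above, it would give the BUG site tg3:2864, process mysqld, kernel 2.6.9-22.19.EL.jwltest.90smp and RIP tg3_poll+177 in module tg3.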

Comment 8 ALan Jay 2005-11-27 20:33:21 UTC

And another:

<ConMan> Console [margote] log at 2005-11-27 20:10:00 GMT.
146
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 0
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle 
arptable_
filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_dev 
i2c_core sunrp
c ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter 
ip_tables
 dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd 
megaraid_mb
ox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.19.EL.jwltest.90smp
^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:ffffffff8044ba78  EFLAGS: 00010246
^M^@RAX: 0000000000000153 RBX: 00000101fe7fcfc8 RCX: 0000010000011000
^M^@RDX: 0000000000000206 RSI: 0000000000000042 RDI: 0000000000000206
^M^@RBP: 0000000000000000 R08: 0000000000000042 R09: 0000000000000000
^M^@R10: 0000000000000000 R11: 0000000000000001 R12: 00000101fff96380
^M^@R13: 0000000000000153 R14: 0000000000000000 R15: 0000000000000182
^M^@FS:  0000002aa808c140(0000) GS:ffffffff804d4380(0000) knlGS:00000000080ca240
^M^@CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f49be59c CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo ffffffff804d8000, task ffffffff803cb880)
^M^@Stack: ffffffff8044baa8 00000101fff9642c 00000001ff54a012 000000010000000f
^M^@       000001820182c030 0000003f00010000 00000100f7b9d000 ffffffff8044bb1c
^M^@       00000101fff96000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802aadb3>{net_rx_action+129} <ffffffff8013bc30>
{__do_softi
rq+88}
^M^@       <ffffffff8013bcd9>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
^M^@       <ffffffff8011065b>{ret_from_intr+0}  <EOI> <ffffffff8010e609>
{default_idle+0}

^M^@       <ffffffff8010e629>{default_idle+32} <ffffffff8010e69c>{cpu_idle+26}
^M^@       <ffffffff804db67b>{start_kernel+470} <ffffffff804db1d5>
{_sinittext+469}
^M^@

^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <ffffffff8044ba78>
^M^@ <0>Kernel panic - not syncing: Oops
rtc: lost some interrupts at 1024Hz.
Message from syslogd@margote at Sun Nov 27 20:13:59 2005 ...
margote kernel: invalid operand: 0000 [1] SMP

<ConMan> Console [margote] log at 2005-11-27 20:20:00 GMT.

Comment 9 ALan Jay 2005-11-28 06:27:38 UTC

ALan again:-
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 1
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle 
arptable_
filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_dev 
i2c_core sunrp
c ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter 
ip_tables
 dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd 
megaraid_mb
ox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.19.EL.jwltest.90smp
^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:00000101fff8be78  EFLAGS: 00010246
^M^@RAX: 0000000000000162 RBX: 00000101fdcc5130 RCX: 0000010100000000
^M^@RDX: 0000000000000206 RSI: 00000101fa2b2c78 RDI: 0000000000000206
^M^@RBP: 0000000000000000 R08: 00000101fa2b2c78 R09: 0000000000000010
^M^@R10: 0000000100000000 R11: 0000000000000002 R12: 00000100f7cf8380
^M^@R13: 0000000000000162 R14: 0000000000000000 R15: 0000000000000000
^M^@FS:  0000002aaab44ec0(0000) GS:ffffffff804d4400(0000) knlGS:000000000852cfc0
^M^@CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7b5d060 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo 00000101fff82000, task 000001010000a030)
^M^@Stack: 00000001fff8bea8 0000000000000001 0000000000000012 0000000000000001
^M^@       ffffffff803f6100 0000000000000000 00000100f5ab2000 00000101fff8bf1c
^M^@       00000100f7cf8000 ffffffff8013324d
^M^@Call Trace:<IRQ> <ffffffff8013324d>{__wake_up_common+67} <ffffffff802aadb3>
{net_rx_a
ction+129}
^M^@       <ffffffff8013bc30>{__do_softirq+88} <ffffffff8013bcd9>{do_softirq+49}
^M^@       <ffffffff80112fb7>{do_IRQ+328} <ffffffff8011065b>{ret_from_intr+0}
^M^@        <EOI> <ffffffff8010e609>{default_idle+0} <ffffffff8010e629>
{default_idle+32}

^M^@       <ffffffff8010e69c>{cpu_idle+26}

^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>

--------------

^M^@ ----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel panic - not syncing: Oops
^M^@ <1>Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [2] SMP
^M^@CPU 0
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle 
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4
i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state 
ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac 
ohci_hcd
 hw_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata 
sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.19.EL.jwltest.90smp
^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:ffffffff8044ba78  EFLAGS: 00010246
^M^@RAX: 00000000000001c9 RBX: 00000100f7565ad8 RCX: 0000000000000001
^M^@RDX: 00000100f15bc600 RSI: 000000000000004d RDI: 0000000000000246
^M^@RBP: 0000000000000000 R08: 000000000000004d R09: 0000000000000008
^M^@R10: 0000000000000008 R11: 00000100efc631c0 R12: 00000101fff96380
^M^@R13: 00000000000001c9 R14: 0000000000000000 R15: 0000000000000000
^M^@FS:  0000000000bcd8c0(0000) GS:ffffffff804d4380(0000) knlGS:00000000080c81e0
^M^@CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7ec2000 CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo ffffffff804d8000, task ffffffff803cb880)
^M^@Stack: ffffffff8044baa8 00000101fff9642c 0000000004b37012 000000010000000f
^M^@       000003460346d030 0000003f00010000 00000100f6238000 ffffffff8044bb1c
^M^@       00000101fff96000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802aadb3>{net_rx_action+129} <ffffffff8013bc30>
{__do_softirq+88}
^M^@       <ffffffff8013bcd9>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
^M^@       <ffffffff8011065b>{ret_from_intr+0}  <EOI> <ffffffff8010e609>
{default_idle+0}
^M^@       <ffffffff8010e629>{default_idle+32} <ffffffff8010e69c>{cpu_idle+26}
^M^@       <ffffffff804db67b>{start_kernel+470} <ffffffff804db1d5>
{_sinittext+469}
^M^@

^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <ffffffff8044ba78>
^M^@Badness in do_unblank_screen at drivers/char/vt.c:2876

^M^@Call Trace:<IRQ> <ffffffff802324f6>{do_unblank_screen+61} <ffffffff80123008>
{bust_spinlocks+28}
^M^@       <ffffffff80111874>{oops_end+18} <ffffffff801119a1>{die+54}
^M^@       <ffffffff80111d64>{do_invalid_op+145} <ffffffffa0087529>
{:tg3:tg3_poll+177}
^M^@       <ffffffff80112f79>{do_IRQ+266} <ffffffff8011065b>{ret_from_intr+0}
^M^@       <ffffffff80110b2d>{error_exit+0} <ffffffffa0087529>
{:tg3:tg3_poll+177}
^M^@       <ffffffffa008761d>{:tg3:tg3_poll+421} <ffffffff802aadb3>
{net_rx_action+129}
^M^@       <ffffffff8013bc30>{__do_softirq+88} <ffffffff8013bcd9>{do_softirq+49}
^M^@       <ffffffff80112fb7>{do_IRQ+328} <ffffffff8011065b>{ret_from_intr+0}
^M^@        <EOI> <ffffffff8010e609>{default_idle+0} <fffff
<ConMan> Console [margote] log at 2005-11-28 01:40:00 GMT.

-----------------------------------------------------------------------

Comment 10 ALan Jay 2005-11-28 12:48:11 UTC

Created attachment 121542 [details]
Noon Panic Margote - TG3 using older kernel

I was asked by our software supplier to try an older kernel from which this
crash was reported.

Comment 11 ALan Jay 2005-11-28 18:10:24 UTC

And Again :- (this time with your kernel 19)

Red Hat Enterprise Linux ES release 4 (Nahant Update 2)
Kernel 2.6.9-22.19.EL.jwltest.90smp on an x86_64

[root@margote 17:50:12 ] ~ # 2
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at tg3:2864
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle 
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 
i2c_dev
 i2c_core sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack 
iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random
 e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-22.19.EL.jwltest.90smp
RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
RSP: 0000:ffffffff8044ba78  EFLAGS: 00010246
RAX: 00000000000000d5 RBX: 00000100f79443f8 RCX: 0000000000000001
RDX: 00000100f1573500 RSI: 0000000000000411 RDI: 0000000000000246
RBP: 0000000000000000 R08: 0000000000000411 R09: 0000000000000000
R10: 0000000000000000 R11: 00000100f7a41bc0 R12: 00000101ff2fa380
R13: 00000000000000d5 R14: 0000000000000000 R15: 0000000000000000
FS:  000000000187d280(0000) GS:ffffffff804d4380(0000) knlGS:00000000080c81e0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002aac233000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff804d8000, task ffffffff803cb880)
Stack: ffffffff8044baa8 00000101ff2fa42c 00000001fe3b2012 000000010000000f
       000002f402f4f7f0 0000003f00010000 0000010004a1f000 ffffffff8044bb1c
       00000101ff2fa000 0000000000000202
Call Trace:<IRQ> <ffffffff802aadb3>{net_rx_action+129} <ffffffff8013bc30>
{__do_softirq+88}
       <ffffffff8013bcd9>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
       <ffffffff8011065b>{ret_from_intr+0}  <EOI> <ffffffff8010e609>
{default_idle+0}
       <ffffffff8010e629>{default_idle+32} <ffffffff8010e69c>{cpu_idle+26}
       <ffffffff804db67b>{start_kernel+470} <ffffffff804db1d5>{_sinittext+469}


Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <ffffffff8044ba78>
 <0>Kernel panic - not syncing: Oops
rtc: lost some interrupts at 1024Hz.

Comment 12 John W. Linville 2005-11-29 18:43:56 UTC

Those oopses seem reasonably consistent -- I probably don't need any more at  
the moment... :-)  
  
It looks like we are hitting a BUG() in tg3.c on line 2864:  
  
static void tg3_tx(struct tg3 *tp)  
{  
        u32 hw_idx = tp->hw_status->idx[0].tx_consumer;  
        u32 sw_idx = tp->tx_cons;  
  
        while (sw_idx != hw_idx) {  
                struct tx_ring_info *ri = &tp->tx_buffers[sw_idx];  
                struct sk_buff *skb = ri->skb;  
                int i;  
  
                if (unlikely(skb == NULL))  
                        BUG();  
...<cut>...  
  
Now, what does this mean?  I'll have to get back to you...

Comment 13 David Miller 2005-11-29 22:34:57 UTC

This always means that the PCI chipset is illegally reordering transactions
on the bus.  Try to get the chipset in use by this system, and then add it
to the "write_reorder_chipsets[]" array.  That will fix the bug.

This problem always happens on some x86_64-based platform, it's unfortunate
that generating an exhaustive list of prone chipsets is so difficult.
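
To make that failure mode concrete, here is a toy user-space model of it. It is only a sketch under assumptions, not the driver's code: the `toy_*` names, the 512-entry ring and the simplified NIC behaviour (it consumes up to whatever producer index it last saw, so a smaller value looks like a wrapped ring) are all made up for illustration. It shows how a stale producer-index write that arrives late can leave the hardware consumer index pointing at a slot the driver never filled, which is the condition the `skb == NULL` check in the quoted tg3_tx() loop traps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 512u /* illustrative ring size, must be a power of two */

struct toy_ring {
    const char *skb[TOY_RING_SIZE]; /* stand-in for tx_buffers[i].skb */
    uint32_t sw_prod;               /* next slot the driver will fill */
    uint32_t sw_cons;               /* next slot the driver will reclaim */
    uint32_t hw_cons;               /* consumer index reported by the toy NIC */
};

static uint32_t toy_next(uint32_t idx)
{
    return (idx + 1u) & (TOY_RING_SIZE - 1u);
}

/* Driver: queue one packet and return the producer index to post. */
static uint32_t toy_queue(struct toy_ring *r, const char *pkt)
{
    assert(pkt != NULL); /* NULL is reserved to mean "empty slot" */
    r->skb[r->sw_prod] = pkt;
    r->sw_prod = toy_next(r->sw_prod);
    return r->sw_prod;
}

/* Toy NIC: a producer-index write arrives; everything up to it is consumed. */
static void toy_mailbox_arrives(struct toy_ring *r, uint32_t prod_idx)
{
    r->hw_cons = prod_idx;
}

/* Driver: same walk as the quoted tg3_tx() loop; false marks the BUG() case. */
static bool toy_reclaim(struct toy_ring *r)
{
    while (r->sw_cons != r->hw_cons) {
        if (r->skb[r->sw_cons] == NULL)
            return false; /* hardware claims a slot we never filled */
        r->skb[r->sw_cons] = NULL;
        r->sw_cons = toy_next(r->sw_cons);
    }
    return true;
}

/* Queue two packets, deliver the two mailbox writes in or out of order,
 * and reclaim after each delivery (as if an interrupt fired each time). */
bool toy_run(bool reordered)
{
    struct toy_ring r = { .sw_prod = 0 };
    uint32_t first = toy_queue(&r, "pkt0");  /* mailbox value 1 */
    uint32_t second = toy_queue(&r, "pkt1"); /* mailbox value 2 */

    toy_mailbox_arrives(&r, reordered ? second : first);
    if (!toy_reclaim(&r))
        return false;
    toy_mailbox_arrives(&r, reordered ? first : second);
    return toy_reclaim(&r);
}
```

In this model `toy_run(false)` returns true, while `toy_run(true)` returns false: the stale index 1 arrives after the driver has already reclaimed slots 0 and 1, so the reclaim walk starts at slot 2, which is empty.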

Comment 14 ALan Jay 2005-11-30 14:33:04 UTC

Not sure exactly which chipset you need; the boards are:

Tyan S2882 and S2882-D
http://www.tyan.com/products/html/thunderk8spro_spec.html

Chipset 
 • AMD-8131™ HyperTransport™ PCI-X Tunnel
 • AMD-8111™ HyperTransport™ I/O Hub
 • Winbond™ W83627HF Super I/O ASIC
 • Analog Devices ADM1027 Hardware Monitoring IC

http://www.tyan.com/products/html/thunderk8sdpro_spec.html 

Chipset 
 • AMD-8131™ HyperTransport™ PCI-X Tunnel
 • AMD-8111™ HyperTransport™ I/O Hub
 • Winbond™ W83627HF Super I/O ASIC
 • Analog Devices ADM1027 Hardware Monitoring IC

Does that help?

Regards
ALan
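
For reference, the kind of lookup being discussed can be sketched as below. Everything here is an assumption for illustration: the `toy_*` names are hypothetical, this is not the driver's actual table, and the ID pair 1022:7450 is assumed to be what the AMD-8131 PCI-X bridge reports (worth checking against `lspci -n` on the affected boards). The idea is simply that if any listed bridge is present in the system, the driver can switch on its workaround for reordered writes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* Hypothetical table in the spirit of write_reorder_chipsets[]. */
static const struct toy_pci_id toy_reorder_chipsets[] = {
    { 0x1022, 0x7450 }, /* assumed IDs for the AMD-8131 PCI-X bridge */
};

/* Return true if any device present in the system appears in the table. */
bool toy_has_reorder_chipset(const struct toy_pci_id *present, size_t count)
{
    size_t n = sizeof(toy_reorder_chipsets) / sizeof(toy_reorder_chipsets[0]);

    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < n; j++) {
            if (present[i].vendor == toy_reorder_chipsets[j].vendor &&
                present[i].device == toy_reorder_chipsets[j].device)
                return true;
        }
    }
    return false;
}
```

A caller would pass the list of PCI devices found on the machine; a true result is the signal to treat the platform as prone to write reordering.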

Comment 15 John W. Linville 2005-11-30 18:33:13 UTC

I have test kernels available here: 
 
   http://people.redhat.com/linville/kernels/rhel4/ 
 
Please give those a try and post the results here.  If they don't work for 
you, then please attach the output of running sysreport as well...thanks!

Comment 16 ALan Jay 2005-11-30 18:52:36 UTC

OK thanks I will load them now and run some tests later and let you know how I 
get on.

Comment 17 ALan Jay 2005-11-30 18:58:34 UTC

On one of the machines when I boot this kernel I get:

ACPI wakeup devices:
PCI1 USB0 USB1 UAR1 UAR2 GOLA GLAN GOLB SMBC AC97 MODM PWRB
Freeing unused kernel memory: 188k freed
Red Hat nash version 4.2.1.6 staSCSI subsystem initialized
rting
Mounted /ACPI: PCI interrupt 0000:03:05.0[A] -> GSI 19 (level, low) -> IRQ 169
Unable to handle kernel paging request
 at 0000000000004c40 RIP:             Mounting sysfs

Creating /dev
<ffffffffa003d461>{:sata_sil:sil_init_one+583}Starting udev
L
oading scsi_mod.PML4 f7d68067 ko module
LoadiPGD f7d73067 ng sd_mod.ko modPMD 0 ule
Loading lib
ata.ko module
LOops: 0002 [1] oading sata_sil.SMP ko module

CPU 0
Modules linked in: sata_sil libata sd_mod scsi_mod
Pid: 202, comm: insmod Not tainted 2.6.9-23.EL.jwltest.92smp
RIP: 0010:[<ffffffffa003d461>] <ffffffffa003d461>{:sata_sil:sil_init_one+583}
RSP: 0000:00000101ffd63e48  EFLAGS: 00010206
RAX: 0000000000000003 RBX: 0000000000004c00 RCX: 00000010feafe000
RDX: 00000000feafe000 RSI: 0000000000000246 RDI: ffffffff803eca60
RBP: 0000010004910f80 R08: 0000000000000001 R09: 00000101ffd63e14
R10: 0000000000000028 R11: ffffffff802a0b70 R12: 0000000000000000
R13: 0000000000000004 R14: 00000101051a2180 R15: ffffffffa003d5e0
FS:  0000000000000000(0000) GS:ffffffff804d8080(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000004c40 CR3: 0000000000101000 CR4: 00000000000006e0
Process insmod (pid: 202, threadinfo 00000101ffd62000, task 00000101fffa27f0)
Stack: 10000100f7d0ed00 00000000ffffffed ffffffffa003f5a8 00000101051a2180
       00000101051a21f0 ffffffffa003f560 000000000056c710 ffffffff801f124c
       ffffffffa003f5a8 00000101051a21f0
Call Trace:<ffffffff801f124c>{pci_device_probe+110} <ffffffff80244c01>
{bus_match+57}
       <ffffffff80244cff>{driver_attach+68} <ffffffff8024501b>
{bus_add_driver+143}
       <ffffffff801f0fbc>{pci_register_driver+119} <ffffffffa004100e>
{:sata_sil:sil_init+14}
       <ffffffff8014f21f>{sys_init_module+278} <ffffffff801101c6>
{system_call+126}


Code: 88 43 40 88 43 41 88 43 44 88 43 45 49 83 7f 18 02 75 66 88
RIP <ffffffffa003d461>{:sata_sil:sil_init_one+583} RSP <00000101ffd63e48>
CR2: 0000000000004c40
 <0>Kernel panic - not syncing: Oops

Comment 18 ALan Jay 2005-11-30 18:59:12 UTC

Actually I get it on both of them :)

powernow-k8: Found 2 AMD Athlon 64 / Opteron processors (version 1.50.04-rh)
powernow-k8: MP systems not supported by PSB BIOS structure
powernow-k8: init not cpu 0
ACPI: (supports S0 S1 S5)
ACPI wakeup devices:
PCI1 USB0 USB1 UAR1 UAR2 GOLA GLAN GOLB SMBC AC97 MODM PWRB
Freeing unused kernel memory: 188k freed
Red Hat nash version 4.2.1.6 staSCSI subsystem initialized
rting
Mounted /ACPI: PCI interrupt 0000:03:05.0[A] -> GSI 19 (level, low) -> IRQ 169
Unable to handle kernel paging request
 at 0000000000004c40 RIP:             Mounting sysfs

Creating /dev
<ffffffffa003d461>{:sata_sil:sil_init_one+583}Starting udev
L
oading scsi_mod.PML4 497c067 ko module
LoadiPGD 37e53067 ng sd_mod.ko modPMD 0 ule
Loading lib
ata.ko module
LOops: 0002 [1] oading sata_sil.SMP ko module

CPU 0
Modules linked in: sata_sil libata sd_mod scsi_mod
Pid: 202, comm: insmod Not tainted 2.6.9-23.EL.jwltest.92smp
RIP: 0010:[<ffffffffa003d461>] <ffffffffa003d461>{:sata_sil:sil_init_one+583}
RSP: 0000:00000101ffd57e48  EFLAGS: 00010206
RAX: 0000000000000003 RBX: 0000000000004c00 RCX: 00000010feafe000
RDX: 00000000feafe000 RSI: 0000000000000246 RDI: ffffffff803eca60
RBP: 0000010037e36f80 R08: 0000000000000001 R09: 00000101ffd57e14
R10: 0000000000000028 R11: ffffffff802a0b70 R12: 0000000000000000
R13: 0000000000000004 R14: 00000101051a2180 R15: ffffffffa003d5e0
FS:  0000000000000000(0000) GS:ffffffff804d8080(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000004c40 CR3: 0000000000101000 CR4: 00000000000006e0
Process insmod (pid: 202, threadinfo 00000101ffd56000, task 00000101fffa37f0)
Stack: 10000100049a5c00 00000000ffffffed ffffffffa003f5a8 00000101051a2180
       00000101051a21f0 ffffffffa003f560 000000000056c710 ffffffff801f124c
       ffffffffa003f5a8 00000101051a21f0
Call Trace:<ffffffff801f124c>{pci_device_probe+110} <ffffffff80244c01>
{bus_match+57}
       <ffffffff80244cff>{driver_attach+68} <ffffffff8024501b>
{bus_add_driver+143}
       <ffffffff801f0fbc>{pci_register_driver+119} <ffffffffa004100e>
{:sata_sil:sil_init+14}
       <ffffffff8014f21f>{sys_init_module+278} <ffffffff801101c6>
{system_call+126}


Code: 88 43 40 88 43 41 88 43 44 88 43 45 49 83 7f 18 02 75 66 88
RIP <ffffffffa003d461>{:sata_sil:sil_init_one+583} RSP <00000101ffd57e48>
CR2: 0000000000004c40
 <0>Kernel panic - not syncing: Oops

Comment 19 John W. Linville 2005-11-30 20:16:12 UTC

Looks like there is a SATA-related problem in recent kernels...I'll respin the 
test code once the base kernels have the fix...probably tomorrow...  Thanks 
for your patience!

Comment 20 ALan Jay 2005-11-30 22:45:42 UTC

OK, thanks. I am not too expert in these things, but we don't use the SATA at 
the moment; is there a way to not load the driver?  Otherwise I will wait until 
tomorrow or Friday, hopefully, to test.

Thanks

ALan

Comment 21 ALan Jay 2005-12-02 12:45:19 UTC

I have this morning loaded:

Kernel 2.6.9-24.EL.jwltest.93smp on an x86_64

It doesn't have the same issues as the previous kernel: it loads correctly and 
the machines run. The issue with the tg3 driver crashing as above seems to have 
stopped.

BUT I am seeing occasional reboots from the machines (without messages to the 
serial console) when pumping multiple multi-megabyte files through the Broadcom 
ethernet interfaces.

Comment 22 ALan Jay 2005-12-02 14:16:51 UTC

Just to reiterate: the problems we were seeing with the TG3 driver crashing 
in this particular manner appear to be fixed. Reboots / crashes no longer 
cause the same side effects that we were seeing on other machines connected 
to the same network.

However, both machines have spontaneously rebooted and we have had one crash 
(without auto reboot) so far under very heavy load. At no time in any of these 
incidents was anything output to the serial console, and the machine is 
sufficiently hung that it does not respond to Alt-SysRq-t etc.

If there is anything more I can do please let me know.

Regards
ALan

Comment 23 John W. Linville 2005-12-02 15:14:54 UTC

ALan, would you characterize the current situation as an improvement?  Are the 
crashes less frequent and/or less problematic? 
 
In other words, is this taking us in the right direction?

Comment 24 ALan Jay 2005-12-02 15:40:37 UTC

John,

Yes very much improved :)

It now only seems to crash when I abuse it :) i.e. I can crash it, but I doubt 
that in normal use it will be stressed to that level. Even more importantly, 
the way it crashes no longer affects other machines on the network.

Previously the crash caused the other machines on the network to stop 
functioning correctly. We are running a cluster application in which 3 machines 
serve our front-end customer environment.  With the previous bug (now 
apparently fixed, thanks) when one node went down the others went offline.

With the new kernel, if one node dies only that one dies. This is survivable 
(if annoying) as users won't see the change in status, because there are other 
machines taking over the load.

Our supplier (of the cluster software) has suggested removing any rules in 
iptables and I am now testing it in that configuration.

But so far this is a definite improvement (so thanks); well done for the quick 
turnaround.

As I concluded in the last note, there are still issues. They are not so 
serious, but they would be good to fix. I don't know what is going wrong, 
and there is no console output to help you.

Regards
ALan

Comment 25 ALan Jay 2005-12-02 15:50:29 UTC

Created attachment 121758 [details]
TG3 error report - 2 Dec 2005 - Margote - running 24 Kernel

Comment 26 ALan Jay 2005-12-02 15:52:31 UTC

There is still some issue with the TG3. As the above attachment shows, the TG3 
failed in a similar way to previously, except that the machine did not crash or 
reboot. It is still accessible via the serial console, though the ssh sessions 
all reset.

<ConMan> Console [margote] log at 2005-12-02 15:40:00 GMT.
40
tg3: eth0: transmit timed out, resetting
DEBUG: PCI status [82b0] TG3PCI state[000030e2]
DEBUG: MAC_MODE[00e04c08] MAC_STATUS[00400003]
..............................
DEBUG: NIC RXD_JUMBO(5)[0][c675ddc5:8c4809d6:bf2a5895:e72d1ede]
DEBUG: NIC RXD_JUMBO(5)[1][915a39ee:afd09df9:b8a6d9be:2532107f]
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2

[root@margote 15:49:13 ] ~ # 41

<ConMan> Console [margote] log at 2005-12-02 15:50:00 GMT

Comment 27 ALan Jay 2005-12-02 15:58:12 UTC

And then the machine crashed:

tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2

[root@margote 15:49:13 ] ~ # 41
42
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at tg3:2864
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: iptable_filter ip_tables w83627hf lm85 i2c_sensor i2c_isa 
i2c_amd756 arp
t_mangle arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport 
autofs4 i2c_dev
 i2c_core sunrpc ds yenta_socket pcmcia_core dm_mirror dm_mod button battery ac 
ohci_hcd hw
_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod 
scsi_mod
Pid: 21698, comm: adsd Not tainted 2.6.9-24.EL.jwltest.93smp
RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
RSP: 0000:00000100ca7d78b8  EFLAGS: 00010246
RAX: 00000000000001df RBX: 00000101fe18dce8 RCX: 0000010000011000
RDX: 0000000000000206 RSI: 0000000000000042 RDI: 0000000000000206
RBP: 0000000000000000 R08: 0000000000000042 R09: 0000000000000001
R10: 0000000000000000 R11: 00000101fb261a80 R12: 00000101fecb0380
R13: 00000000000001df R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000bcd8c0(0000) GS:ffffffff804d8180(005b) knlGS:000000000852b960
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f7b5d060 CR3: 00000000f7fa2000 CR4: 00000000000006e0
Process adsd (pid: 21698, threadinfo 00000100ca7d6000, task 00000100d70ee7f0)
Stack: 00000000000001df 0000000000000304 00000101fecb0380 ffffffffffffffc1
       ffffffffa0086281 0000000000000010 00000100f5c66000 00000100ca7d795c
       00000101fecb0000 ffffffffa008b1a8
Call Trace:<ffffffffa0086281>{:tg3:tg3_write32_tx_mbox+30} <ffffffffa008b1a8>
{:tg3:tg3_star
t_xmit+1691}
       <ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>{__do_softirq+88}
       <ffffffff8013beb9>{do_softirq+49} <ffffffff802ab57b>{dev_queue_xmit+525}
       <ffffffff802c7091>{ip_finish_output+356} <ffffffff802c6a40>{dst_output+0}
       <ffffffff802c6a56>{dst_output+22} <ffffffff802b46a9>{nf_hook_slow+184}
       <ffffffff802c7509>{ip_queue_xmit+1011} <ffffffff801ea152>
{copy_user_generic_c+8}
       <ffffffff802d6ba1>{tcp_transmit_skb+2037} <ffffffff802cd6a8>
{tcp_recvmsg+1790}
       <ffffffff802a5dd1>{sock_common_recvmsg+48} <ffffffff802a28f4>
{sock_recvmsg+284}
       <ffffffff80131759>{recalc_task_prio+337} <ffffffff80134e12>
{autoremove_wake_function
+0}
       <ffffffff802a24f7>{sockfd_lookup+16} <ffffffff802a3d27>{sys_recvfrom+182}
       <ffffffff80183088>{pipe_writev+726} <ffffffff802b8253>
{compat_sys_socketcall+258}
       <ffffffff8012555d>{ia32_sysret+0}

Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <00000100ca7d78b8>
 <0>Kernel panic - not syncing: Oops

Comment 28 ALan Jay 2005-12-02 23:18:13 UTC

A couple more crashes with output:

<ConMan> Console [margote] log at 2005-12-02 19:50:00 GMT.
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 1
^M^@Modules linked in: iptable_filter ip_tables w83627hf lm85 i2c_sensor 
i2c_isa arpt_mangle i2c_amd756 arptable_filter arp_tables ip_queue md5 ipv6 
parport_p
c lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core 
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd 
megaraid_
mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 4422, comm: adsd Not tainted 2.6.9-24.EL.jwltest.93smp
^M^@RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:00000101fff8be78  EFLAGS: 00010246
^M^@RAX: 0000000000000014 RBX: 000001016c03b1e0 RCX: 00000100f4a6f000
^M^@RDX: 0000000000000014 RSI: 00000101fff8bf1c RDI: 0000010037c5a000
^M^@RBP: 0000000000000000 R08: ffffffffffffffc1 R09: 00000000ffffc3d0
^M^@R10: 00000000ffffc3d0 R11: 00000000f6fbb898 R12: 0000010037c5a380
^M^@R13: 0000000000000014 R14: 00000101b09a5f58 R15: 0000000000000000
^M^@FS:  0000000000c2cd00(0000) GS:ffffffff804d8180(005b) knlGS:000000000852cfc0
^M^@CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
^M^@CR2: 00000000f7ffb000 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process adsd (pid: 4422, threadinfo 00000101b09a4000, task 00000100f5da47f0)
^M^@Stack: 0000000000000001 0000010037c5a42c 000000016bcf1812 0000000100000001
^M^@       000002ff02ff7980 0000003f00010000 00000100f4a6f000 00000101fff8bf1c
^M^@       0000010037c5a000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>
{__do_softirq+88}
^M^@       <ffffffff8013beb9>{do_softirq+49} <ffffffff801130eb>{do_IRQ+328}
^M^@       <ffffffff8011078f>{ret_from_intr+0}  <EOI>

^M^@Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
^M^@ ----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel panic - not syncing: Oops
^M^@ <1>Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [2] SMP
^M^@CPU 0
^M^@Modules linked in: iptable_filter ip_tables w83627hf lm85 i2c_sensor 
i2c_isa arpt_mangle i2c_amd756 arptable_filter arp_tables ip_queue md5 ipv6 
parport_p
c lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core 
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd 
megaraid_
mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-24.EL.jwltest.93smp
^M^@RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:ffffffff8044d5f8  EFLAGS: 00010246
^M^@RAX: 00000000000001d5 RBX: 000001016cdadbf8 RCX: 0000010100000000
^M^@RDX: 0000000000000206 RSI: 0000000000000042 RDI: 0000000000000206
^M^@RBP: 0000000000000000 R08: 0000000000000042 R09: 0000000000000060
^M^@R10: ffffffffa008628c R11: ffffffffa008628c R12: 00000101fff96380
^M^@R13: 00000000000001d5 R14: 0000000000000000 R15: 0000000000000000
^M^@FS:  0000000000b3a800(0000) GS:ffffffff804d8100(0000) knlGS:00000000080c9740
^M^@CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7b5d060 CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo ffffffff804dc000, task ffffffff803cd100)
^M^@Stack: ffffffff8044d628 00000101fff9642c 00000001f4a78012 0000000100000001
^M^@       0000012a012a27f0 0000003f00010000 00000100f4960000 ffffffff8044d69c
^M^@       00000101fff96000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>
{__do_softirq+88}
^M^@       <ffffffff8013beb9>{do_softirq+49} <ffffffff801130eb>{do_IRQ+328}
^M^@       <ffffffff8011078f>{ret_from_intr+0}  <EOI> <ffffffff8010e749>
{default_idle+0}
^M^@       <ffffffff8010e769>{default_idle+32} <ffffffff8010e7dc>{cpu_idle+26}
^M^@       <ffffffff804df67b>{start_kernel+470} <ffffffff804df1d5>
{_sinittext+469}
^M^@

^M^@Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <ffffffff8044d5f8>
^M^@Badness in do_unblank_screen at drivers/char/vt.c:2876
^M^@Call Trace:<IRQ> <ffffffff80232c8a>{do_unblank_screen+61} <ffffffff801231c4>
{bust_spinlocks+28}
^M^@       <ffffffff801119a8>{oops_end+18} <ffffffff80111ad5>{die+54}
^M^@       <ffffffff80111e98>{do_invalid_op+145} <ffffffffa0089529>
{:tg3:tg3_poll+177}
^M^@       <ffffffffa0086236>{:tg3:_tw32_flush+12} <ffffffffa008628c>
{:tg3:tg3_read32+0}
^M^@       <ffffffffa0086236>{:tg3:_tw32_flush+12} <ffffffffa008628c>
{:tg3:tg3_read32+0}
^M^@       <ffffffffa00865a0>{:tg3:tg3_readphy+141} <ffffffffa0086236>
{:tg3:_tw32_flush+12}
^M^@       <ffffffffa008628c>{:tg3:tg3_read32+0} <ffffffff80110c61>
{error_exit+0}
^M^@       <ffffffffa008628c>{:tg3:tg3_read32+0} <ffffffffa008628c>
{:tg3:tg3_read32+0}
^M^@       <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffffa008961d>
{:tg3:tg3_poll+421}
^M^@       <ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>
{__do_softirq+88}
^M^@       <ffffffff8013beb9>{do_softirq+49} <ffffffff801130eb>{do_IRQ+328}
^M^@       <ffffffff8011078f>{re
<ConMan> Console [margote] log at 2005-12-02 20:00:00 GMT.

<ConMan> Console [margote] disconnected from <lontht1:2037> at 12-02 20:00.

Comment 29 ALan Jay 2005-12-02 23:22:07 UTC

And:

<ConMan> Console [azalee] log at 2005-12-02 21:50:00 GMT.
41

CPU 0: Machine Check Exception:                4 Bank 4: f200000000070f0f
TSC 1d868f5d0bc7
Kernel panic - not syncing: Machine check

<ConMan> Console [azalee] log at 2005-12-02 22:00:00 GMT.

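The status word in that machine check line can be picked apart with a few shifts. The sketch below assumes the 16-digit value after "Bank 4:" is that bank's machine-check status register, and uses the architectural x86 flag positions (bits 63 down to 57). The `toy_*` names are hypothetical, and no attempt is made to interpret the vendor-specific error code fields.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_mce_status {
    bool valid;       /* bit 63: the register holds a logged error */
    bool overflow;    /* bit 62: another error arrived before this was cleared */
    bool uncorrected; /* bit 61: the hardware could not correct the error */
    bool enabled;     /* bit 60: reporting was enabled for this error */
    bool misc_valid;  /* bit 59: the MISC register holds extra information */
    bool addr_valid;  /* bit 58: the ADDR register holds an address */
    bool pcc;         /* bit 57: processor context may be corrupt */
    uint16_t model_specific; /* bits 31:16, left uninterpreted */
    uint16_t error_code;     /* bits 15:0, left uninterpreted */
};

/* Split a raw 64-bit status word into its generic flag and code fields. */
struct toy_mce_status toy_decode_mce(uint64_t status)
{
    struct toy_mce_status s;

    s.valid = (status >> 63) & 1u;
    s.overflow = (status >> 62) & 1u;
    s.uncorrected = (status >> 61) & 1u;
    s.enabled = (status >> 60) & 1u;
    s.misc_valid = (status >> 59) & 1u;
    s.addr_valid = (status >> 58) & 1u;
    s.pcc = (status >> 57) & 1u;
    s.model_specific = (uint16_t)((status >> 16) & 0xffffu);
    s.error_code = (uint16_t)(status & 0xffffu);
    return s;
}
```

For 0xf200000000070f0f this reports valid, overflow, uncorrected, enabled and processor-context-corrupt set, with the address-valid and misc-valid flags clear; the b2... values logged later in this bug differ only in the overflow bit.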
Comment 30 ALan Jay 2005-12-03 08:00:10 UTC

Created attachment 121794 [details]
3rd December Crash (Azalee)

As you can see there is still an issue.  The way the machine crashes is less
fatal to other machines on the network, but the machine still crashes. The crash
dump is attached and the header is below.

The machine here is running MySQL and an application from Continuent; the
iptables firewall is running but has no rules in it other than those installed
by Continuent's software.  Running sql-bench from a separate machine querying
the server causes this to happen repeatably.

If you need anything else let me know.


Regards
ALan

 ----------- [cut here ] --------- [please bite here ] ---------
Kernel panic - not syncing: Oops
 <1>Kernel BUG at tg3:2864
invalid operand: 0000 [2] SMP
CPU 0

<ConMan> Console [azalee] log at 2005-12-03 07:50:00 GMT.
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at tg3:2864
invalid operand: 0000 [1] SMP
CPU 1

Comment 31 ALan Jay 2005-12-05 07:46:07 UTC

Under low loading it is slightly more stable (and again only the crashed 
machine went offline, so an improvement), but another crash :(

<ConMan> Console [margote] log at 2005-12-05 01:10:00 GMT.
599
600

<ConMan> Console [margote] log at 2005-12-05 01:20:00 GMT.
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 1
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle 
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_
dev i2c_core sunrpc iptable_filter ip_tables ds yenta_socket pcmcia_core 
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd 
megaraid_
mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-24.EL.jwltest.93smp
^M^@RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:00000101fff8be78  EFLAGS: 00010246
^M^@RAX: 0000000000000184 RBX: 00000101fd2dd460 RCX: 0000010000011000
^M^@RDX: 0000000000000206 RSI: 00000100d7f4aa78 RDI: 0000000000000206
^M^@RBP: 0000000000000000 R08: 00000100d7f4aa78 R09: 0000000000000040
^M^@R10: 0000000100000000 R11: 0000000000000002 R12: 0000010004ab2380
^M^@R13: 0000000000000184 R14: 0000000000000000 R15: 0000000000000000
^M^@FS:  0000000000bcd8c0(0000) GS:ffffffff804d8180(0000) knlGS:00000000080af8c0
^M^@CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7b5d060 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo 00000101fff82000, task 000001010000a030)
^M^@Stack: 00000101fff8bea8 0000010004ab242c 00000001ee6aa812 0000000100000002
^M^@       0000004b004b27f0 0000003f00010000 00000100f7b75000 00000101fff8bf1c
^M^@       0000010004ab2000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>
{__do_softirq+88}
^M^@       <ffffffff8013beb9>{do_softirq+49} <ffffffff801130eb>{do_IRQ+328}
^M^@       <ffffffff8011078f>{ret_from_intr+0}  <EOI> <ffffffff8010e749>
{default_idle+0}
^M^@       <ffffffff8010e769>{default_idle+32} <ffffffff8010e7dc>{cpu_idle+26}
^M^@

^M^@Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
^M^@ <0>Kernel panic - not syncing: Oops
^M^@
<ConMan> Console [margote] log at 2005-12-05 01:30:00 GMT.

<ConMan> Console [margote] disconnected from <lontht1:2037> at 12-05 01:33.

Comment 32 ALan Jay 2005-12-05 11:41:49 UTC

And the other machine :)  Again under relatively light load into a MySQL 
database.

<ConMan> Console [azalee] log at 2005-12-05 10:40:00 GMT.
553
554
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 0
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle 
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_
dev i2c_core sunrpc iptable_filter ip_tables ds yenta_socket pcmcia_core 
dm_mirror dm_mod button battery ac ohci_hcd hw_random shpchp e100 mii tg3 ext3 
jbd me
garaid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-24.EL.jwltest.93smp
^M^@RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:ffffffff8044d5f8  EFLAGS: 00010246
^M^@RAX: 000000000000008d RBX: 00000101fe7ebd38 RCX: 0000010004aa4000
^M^@RDX: 000000000000008d RSI: 0000000000003c28 RDI: 00000101fffa0384
^M^@RBP: 0000000000000000 R08: ffffffff804dc000 R09: 0000000000000100
^M^@R10: ffffffffa008628c R11: ffffffffa008628c R12: 00000101fffa0380
^M^@R13: 000000000000008d R14: ffffffff804ddf08 R15: 0000000000000000
^M^@FS:  0000000000bd38c0(0000) GS:ffffffff804d8100(0000) knlGS:00000000080c81c0
^M^@CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7b51060 CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo ffffffff804dc000, task ffffffff803cd100)
^M^@Stack: 0000000000000046 00000101fffa042c 0000000004aa8012 0000000280133490
^M^@       0000025f025fdf08 0000003e00010000 0000010004aa4000 ffffffff8044d69c
^M^@       00000101fffa0000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff80236aa9>{rtc_interrupt+233} <ffffffff802abcdb>
{net_rx_action+129}
^M^@       <ffffffff8013be10>{__do_softirq+88} <ffffffff8013beb9>{do_softirq+49}
^M^@       <ffffffff801130eb>{do_IRQ+328} <ffffffff8011078f>{ret_from_intr+0}
^M^@        <EOI> <ffffffff8010e749>{default_idle+0} <ffffffff8010e769>
{default_idle+32}
^M^@       <ffffffff8010e7dc>{cpu_idle+26} <ffffffff804df67b>{start_kernel+470}
^M^@       <ffffffff804df1d5>{_sinittext+469}

^M^@Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <ffffffff8044d5f8>
 <0>Kernel panic - not syncing: Oops

Message from syslogd@azalee at Mon Dec  5 10:48:25 2005 ...
azalee kernel: invalid operand:  0000 [1] SMP

<ConMan> Console [azalee] log at 2005-12-05 10:50:00 GMT.

Comment 33 John W. Linville 2005-12-05 17:18:05 UTC

ALan, could you attach the output of running "sysreport" on one of the boxes 
in question?  Thanks!

Comment 34 ALan Jay 2005-12-05 17:27:08 UTC

Created attachment 121857 [details]
Azalee sysreport from 3rd December crash

This is a sysreport I ran on Saturday after a crash.  I noticed the 94 kernel a
little while ago, and although there may not be any changes for me, I am now
running it and will do some more tests and add a sysreport and crash report if
I get another crash later.

Comment 35 ALan Jay 2005-12-05 19:30:11 UTC

In addition, Azalee spontaneously rebooted using the 94 kernel. I also 
noticed that Broadcom released a new version of their driver for Linux last 
week.

Regards
ALan

Comment 36 ALan Jay 2005-12-06 07:41:51 UTC

And another crash:

<ConMan> Console [margote] log at 2005-12-06 00:30:00 GMT.
88

CPU 0: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 390b1f9c765c

CPU 1: Machine Check Exception:                4 Bank 4: b200000000070f0f
TSC 390b1f9c9c36
Kernel panic - not syncing: Machine check
^M^@ NMI Watchdog detected LOCKUP, CPU=0, registers:
^M^@CPU 0
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle 
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_
dev i2c_core sunrpc iptable_filter ip_tables ds yenta_socket pcmcia_core 
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd 
megaraid_
mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 2879, comm: dispatcher Tainted: G   M  2.6.9-24.EL.jwltest.94smp
^M^@RIP: 0010:[<ffffffff8011be25>] <ffffffff8011be25>{__smp_call_function+100}
^M^@RSP: 0000:ffffffff80452638  EFLAGS: 00000097
^M^@RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
^M^@RDX: 0000ffff0000ffff RSI: 0000000000000000 RDI: 0000000000000000
^M^@RBP: 0000000000000000 R08: 0000000000000008 R09: 0000000000000000
^M^@R10: 0000000000000000 R11: 0000000000000002 R12: ffffffff8011bece
^M^@R13: 0000000000000000 R14: 0000390b1f9c7002 R15: ffffffff80319eb9
^M^@FS:  0000000000bcd8c0(0000) GS:ffffffff804d8300(005b) knlGS:00000000080c98c0
^M^@CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
^M^@CR2: 00000000f7ec2000 CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process dispatcher (pid: 2879, threadinfo 00000101fd168000, task 
00000101ffd70030)
^M^@Stack: ffffffff8011bece 0000000000000000 0000000000000000 0000000000000000
^M^@       0000000000000016 0000000000000000 0000000000000900 00000000ffffffff
^M^@       ffffffff803d1840 ffffffff8011bf0b
Call Trace:<ffffffff8011bece>{smp_really_stop_cpu+0} <ffffffff8011bf0b>{smp_send_stop+52}
<ffffffff80137026>{panic+235} <ffffffff801177ec>{print_mce+136}
<ffffffff801178c4>{mce_available+0} <ffffffff80117c17>{do_machine_check+825}
<ffffffff8011134f>{machine_check+127} <ffffffffa0086295>{:tg3:tg3_read32+9}
 <EOE> <IRQ> <ffffffffa008a045>{:tg3:tg3_interrupt_tagged+48}
       <ffffffff80112dee>{handle_IRQ_event+41} <ffffffff80113068>{do_IRQ+197}
       <ffffffff8011078f>{ret_from_intr+0}  <EOI> <ffffffff802231e3>{uuid_strategy+165}
       <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffff80124010>{search_extable+68}
       <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffff801488a9>{search_exception_tables+29}
       <ffffffff80111c5a>{do_trap+220} <ffffffff80111e98>{do_invalid_op+145}
       <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffff80110c61>{error_exit+0}
       <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffffa008961d>{:tg3:tg3_poll+421}
       <ffffffff802abec7>{net_rx_action+129} <ffffffff8013be28>{__do_softirq+88}
       <ffffffff8013bed1>{do_softirq+49} <ffffffffa01b202b>{:ip_queue:ipq_issue_verdict+43}
       <ffffffffa01b28a5>{:ip_queue:ipq_rcv_sk+974} <ffffffff802bee3f>{netlink_data_ready+22}
       <ffffffff802be64e>{netlink_sendskb+113} <ffffffff802bee14>{netlink_sendmsg+689}
       <ffffffff802a295b>{sock_sendmsg+271} <ffffffff80178870>{fget+67}
       <ffffffff80134e2a>{autoremove_wake_function+0} <ffffffff802a42cf>{sys_sendmsg+463}
       <ffffffff802a26e3>{sockfd_lookup+16} <ffffffff802a248f>{move_addr_to_user+60}
       <ffffffff8012232b>{do_gettimeoffset_pm+8} <ffffffff802b8496>{compat_sys_socketcall+345}
       <ffffffff80125575>{ia32_sysret+0}

Code: 39 d8 74 04 f3 90 eb f4 85 ed 74 0c 8b 44 24 14 39 d8 74 04
Kernel panic - not syncing: nmi watchdog

<ConMan> Console [margote] log at 2005-12-06 00:40:00 GMT.

Comment 37 David Miller 2005-12-06 10:21:56 UTC

Enough crash dumps already!!! :-)  All of them have the same signature
and show the same problem.  There is no benefit from posting any more
of these nearly identical dumps, but your tenacity is appreciated :)
Usually we have the opposite problem of not being able to get enough
information.

We'll ask for more dumps in specific situations if we think it will
help diagnose the problem further, thanks.
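
For anyone comparing dumps by hand, here is a minimal Python sketch of how the "same signature" check could be automated. It assumes the frame format seen in the dumps posted to this bug (`<address>{symbol+offset}` with an optional `:module:` prefix); the function names `parse_frames` and `module_signature` are made up for illustration and are not part of any Red Hat tool.

```python
import re

# Matches frames such as <ffffffffa0086295>{:tg3:tg3_read32+9}
# and <ffffffff80137026>{panic+235}; the ":module:" prefix is optional.
# \s* lets a frame match even when the console wrapped the line between
# the address and the symbol.
FRAME_RE = re.compile(r"<[0-9a-f]{16}>\s*\{(?::(\w+):)?(\w+)\+(\d+)\}")


def parse_frames(trace_text):
    """Return (module, symbol, offset) tuples in the order they appear."""
    frames = []
    for module, symbol, offset in FRAME_RE.findall(trace_text):
        # Frames without a module prefix belong to the core kernel image.
        frames.append((module or "kernel", symbol, int(offset)))
    return frames


def module_signature(trace_text, module):
    """Return the ordered, de-duplicated symbols that come from one module."""
    seen = []
    for mod, symbol, _offset in parse_frames(trace_text):
        if mod == module and symbol not in seen:
            seen.append(symbol)
    return seen
```

Two dumps whose `module_signature(text, "tg3")` lists match are probably the same crash, so only one of them needs to be attached.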

Comment 38 ALan Jay 2005-12-07 18:08:26 UTC

Sorry - never sure if any of the slight variations is actually helpful :) (or 
not).

I assume that someone will tell me if you have a new version of the kernel that
might be worth me trying :) (I note that 95 does not change anything.) I won't
post the crash dumps, as they look similar to my untrained eye, but I have
saved them just in case you want to look at them :)

As I said earlier, there is a definite improvement in that the crash does not
cause other machines on the network to fail due to seeing the network being
trashed :) by the crash (so to speak).

Hopefully whatever is still failing will make sense and we can move forward
and get a fix :)

All the best and thanks for all your hard work.

ALan

Comment 39 ALan Jay 2005-12-08 17:20:45 UTC

Just in case you are interested, Kernel 24 - 06 still crashes, but does not
give any error message to the serial console.  It also crashes in a worse way -
not sure how to describe it, but the application we are running monitors the
other machines, and when one crashes it does not see it disappear (not sure if
that makes sense).  This is not what happened with the previous version of the
kernel, 95, and so might be viewed as a step back :)

Comment 40 John W. Linville 2005-12-12 17:19:32 UTC

Have you tried using FC4 on this box?  Someone else suggested that those 
kernels have better support for your HyperTransport chipset.

Comment 41 ALan Jay 2005-12-12 18:55:44 UTC

We did, a number of months ago, before trying Red Hat, as we hoped moving back
to a more supported version would be better :(

We had the same problems as we are having now with the Broadcom chipset.

When we run the machines not using this chipset we have a stable environment 
with both FC4 and RedHat v4.

We are currently running (on the same hardware) one FC4 machine and one Red Hat
machine. Both are pretty stable; we have been doing work on them, so they have
only been up 22 days (FC4) and 14 days (Red Hat), but on both machines we only
use the Intel Pro 10/100 ethernet port and not the Broadcom 10/100/1000 ones.

Both machines in that configuration seem fine and stable, but as soon as we use
the Broadcom ethernet ports to any extent we see the crashes we have reported
to you over the past few days (FYI the last kernel of yours I tried crashed
more frequently and without any error messages - did you remove some debug
code?).

If the bug was in the HyperTransport chipset, wouldn't it affect both the
Broadcom and Intel ethernet controllers?

Comment 42 John W. Linville 2005-12-13 19:52:38 UTC

I suppose it would, although there could be subtle interactions.  Anyway, it 
seemed worth mentioning just in case... 
 
No other bright ideas at the moment...will have to get back to you...

Comment 43 ALan Jay 2005-12-13 20:05:46 UTC

Is there anything else I can do to help?

Happy to try things to give you more information about the problem.

As you saw after the initial bug fix there are still crashes in the same part 
of the code.  

Would it be worthwhile seeing if there is anything in the latest Broadcom
driver, which was released a couple of weeks ago, that might help?

Also, if we were to try the Broadcom driver, which would be the most suitable
kernel to try it on?  After all, we know the bug you fixed will be a problem in
kernels before a certain number.

Once again thanks for your help and if you have any ideas for things I can try 
to provide you with more information please let me know.

Regards
Alan

Comment 44 Chris Lalancette 2006-01-19 17:19:58 UTC

Hi there,
     This is sort of a shot in the dark, but I noticed that you always have
ip_queue loaded, and one of the crashes is calling the ip_queue code.  Have you
tried stressing the machines without any iptables QUEUE rules, and without
loading ip_queue?  I did have problems at one time with ip_queue causing lockups
on kernels.

Chris Lalancette
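
As a rough aid for the check suggested above, here is a minimal Python sketch that reports whether `ip_queue` is loaded and which firewall rules use the QUEUE target. It assumes the usual `/proc/modules` layout (module name in the first column) and `iptables-save` style rule lines; the helper names `loaded_modules` and `queue_rules` are hypothetical and only for illustration.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated column is the module name.
            names.add(fields[0])
    return names


def queue_rules(iptables_save_text):
    """Return the rule lines whose jump target is QUEUE."""
    hits = []
    for line in iptables_save_text.splitlines():
        tokens = line.split()
        # Look for "-j QUEUE" or "--jump QUEUE" as adjacent tokens.
        for flag, target in zip(tokens, tokens[1:]):
            if flag in ("-j", "--jump") and target == "QUEUE":
                hits.append(line.strip())
                break
    return hits
```

Feeding the contents of `/proc/modules` to `loaded_modules` and the output of `iptables-save` to `queue_rules` should show whether a stress run really had the ip_queue path out of the picture.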

Comment 45 ALan Jay 2006-01-20 14:39:19 UTC

Unfortunately iptables is required by the application we are trying to run.

I have, however, just completed some tests after removing the LSI MegaRAID 320
card. Although the Broadcom ports were only connected to a 100Mbit LAN (they
were previously connected to a Gigabit connection), they no longer fell over.

This could mean the bug is an interaction between the LSI MegaRAID 320 and the
Broadcom driver, or it could simply be an issue that is only noticeable at the
higher performance levels required by a Gigabit connection (but I assume you
have tested that).

I asked Broadcom's support about this issue and they have the same motherboard
(Tyan S2882-D) but can't replicate the bug, so maybe that is pointing at some
interaction between the driver and the LSI MegaRAID 320.

Anyway that is all I can add to the mix at the moment.

Comment 46 John W. Linville 2006-02-07 19:45:47 UTC

I have test kernels w/ a tg3 update available here: 
 
   http://people.redhat.com/linville/kernels/rhel4/ 
 
Please give that a try and post the results here...thanks!

Comment 47 John W. Linville 2006-03-14 14:25:09 UTC

Need feedback ASAP if this is going to be in U4...

Comment 48 ALan Jay 2006-03-14 17:36:16 UTC

Sorry the servers I was testing on have been pushed into production.

What I can tell you is that the re-order fix definitely worked and improved
things enormously.

Comment 50 Jason Baron 2006-04-12 13:03:14 UTC

committed in stream U4 build 34.17. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/

Comment 51 ALan Jay 2006-04-12 16:30:08 UTC

One additional thing we have discovered (we think): when used at 100M via a 100M
switch, we have seen issues with applications that use multicast packets to check
the status of a number of machines (all of which have these ethernet adapters).

To explain further, the application we use is capable of using main and backup
ethernet network cards and checks the state of the various network cards (I am
told using multicast packets).

We discovered that although at 100M the cards transmit data fine and without
issue (with the 2.6.9-22.0.1.ELsmp kernel), when used in this particular
environment the application can sometimes think that the back-up route is
responding faster, but when it tries to use the back-up network adapter it
immediately switches back to master (which is not based on the Broadcom
hardware).

Not sure if this is of any help at all, but thought I would report our findings
just in case they provide some assistance.

Comment 54 Red Hat Bugzilla 2006-08-10 21:37:32 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html