Bug 174019
| Summary: | TG3 driver crashes with BCM4704C chipset with heavy traffic | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | ALan Jay <alanj> |
| Component: | kernel | Assignee: | John W. Linville <linville> |
| Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | ||
| Version: | 4.0 | CC: | clalance, davem, jbaron, wansink |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | RHSA-2006-0575 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2006-08-10 21:37:25 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 181409 | ||
| Attachments: | |||
Created attachment 121451 [details]
Azalee crash dump 24th November 2005 10am
This is today's crash dump. I am running one of John Linville's test kernels at
the moment; should I switch back to the default kernel?
Alan, please ensure that you are using the current/latest kernel from here:
http://people.redhat.com/linville/kernels/rhel4/
I would presume that you are doing so, but it is worth asking... :-)
I have been using these and will do so again. I was told by support, however, to go back to the earlier 2.6.9-22.0.1.ELsmp, but I am very happy to return to the latest version of the test kernel if that is of most use to you.
I currently have one machine running the 19 kernel and two running 2.6.9-22.0.1.ELsmp. I will see what happens overnight and report back if there are any more crashes. For some reason I get reports on the serial console much more frequently with the test kernels than I do with the release one.
Created attachment 121476 [details]
Margote crash dump 25-Nov-2005 - Test Kernel 19
Another crash overnight, this time on machine Margote, which had been up 4 days
prior to the crash. I also noticed this in the serial console output:
<ConMan> Console [margote] log at 2005-11-24 18:10:00 GMT.
1249
ip_queue: full at 1024 entries, dropping packet(s).
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@printk: 525 messages suppressed.
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@printk: 577 messages suppressed.
^M^@ip_queue: full at 1024 entries, dropping packet(s).
^M^@
[root@margote 18:15:26 ] ~ # 1250
<ConMan> Console [margote] log at 2005-11-24 18:20:00 GMT.
1251
That was at 6pm yesterday. I don't think it is significant, but I include it for
completeness. Some network maintenance was going on at the time, so it could
have been caused by that.
Our machines are now running the 19 test kernel and I will report back any additional problems.
I know debugging this stuff is hard. What we have found is that when a machine crashes, our application will not restart until the crashed machine is rebooted, as though the crash has caused some disturbance on the second network. Is there a way to tell if the crash is in eth0 or eth1? We are using both networks: eth0 is used for application data access, and eth1 is used for communication between the servers and for a heartbeat connection (this is why we find it odd that our application won't use the eth1 network after a crash until the offending machine has been rebooted).
As I said, I have no idea if this is of any help, but I know that sometimes the smallest thing helps :) I'll report any more errors if we get any.
Regards
ALan
Just had a crash, but not much in the way of output on the serial console:
343
^M^@CPU 0: Machine Check Exception: 4 Bank 4: f200000000070f0f
^M^@TSC df2f64d64b50
^M^@Kernel panic - not syncing: Machine check
^M^@
Another crash report:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at tg3:2864
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: arpt_mangle arptable_filter arp_tables ip_queue md5 ipv6
parport_pc l
p parport autofs4 sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state
ip_conntrack i
ptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random
e100 mii t
g3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod
Pid: 7188, comm: mysqld Not tainted 2.6.9-22.19.EL.jwltest.90smp
RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
RSP: 0000:00000101fff8be78 EFLAGS: 00010246
RAX: 0000000000000174 RBX: 00000100f7a752e0 RCX: 0000010000011000
RDX: 0000000000000206 RSI: 00000101eb60fbb8 RDI: 0000000000000206
RBP: 0000000000000000 R08: 00000101eb60fbb8 R09: 0000002aa8610501
R10: 0000000100000000 R11: ffffffffa008428c R12: 00000100f7f01380
R13: 0000000000000174 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000c2eba0(005b) GS:ffffffff804d4400(0000) knlGS:0000000008c848c0
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002aa8d59000 CR3: 00000000f7fa2000 CR4: 00000000000006e0
Process mysqld (pid: 7188, threadinfo 00000101ece28000, task 00000100f4df9030)
Stack: 00000000fff8bea8 00000100f7f0142c 00000001fe4d6812 0000000100000001
0000024202426100 0000003f00010000 00000100f7ad1000 00000101fff8bf1c
00000100f7f01000 0000000000000202
Call Trace:<IRQ> <ffffffff802aadb3>{net_rx_action+129} <ffffffff8013bc30>
{__do_softirq+8
8}
<ffffffff8013bcd9>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
<ffffffff8011065b>{ret_from_intr+0} <EOI>
Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
<0>Kernel panic - not syncing: Oops
------------------------------------------------------------------------------
And another:
<ConMan> Console [margote] log at 2005-11-27 20:10:00 GMT.
146
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 0
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle
arptable_
filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_dev
i2c_core sunrp
c ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter
ip_tables
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd
megaraid_mb
ox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.19.EL.jwltest.90smp
^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:ffffffff8044ba78 EFLAGS: 00010246
^M^@RAX: 0000000000000153 RBX: 00000101fe7fcfc8 RCX: 0000010000011000
^M^@RDX: 0000000000000206 RSI: 0000000000000042 RDI: 0000000000000206
^M^@RBP: 0000000000000000 R08: 0000000000000042 R09: 0000000000000000
^M^@R10: 0000000000000000 R11: 0000000000000001 R12: 00000101fff96380
^M^@R13: 0000000000000153 R14: 0000000000000000 R15: 0000000000000182
^M^@FS: 0000002aa808c140(0000) GS:ffffffff804d4380(0000) knlGS:00000000080ca240
^M^@CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f49be59c CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo ffffffff804d8000, task ffffffff803cb880)
^M^@Stack: ffffffff8044baa8 00000101fff9642c 00000001ff54a012 000000010000000f
^M^@ 000001820182c030 0000003f00010000 00000100f7b9d000 ffffffff8044bb1c
^M^@ 00000101fff96000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802aadb3>{net_rx_action+129} <ffffffff8013bc30>
{__do_softi
rq+88}
^M^@ <ffffffff8013bcd9>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
^M^@ <ffffffff8011065b>{ret_from_intr+0} <EOI> <ffffffff8010e609>
{default_idle+0}
^M^@ <ffffffff8010e629>{default_idle+32} <ffffffff8010e69c>{cpu_idle+26}
^M^@ <ffffffff804db67b>{start_kernel+470} <ffffffff804db1d5>
{_sinittext+469}
^M^@
^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <ffffffff8044ba78>
^M^@ <0>Kernel panic - not syncing: Oops
^M^@rtc: lost some interrupts at 1024Hz.
^M^@
^GMessage from syslogd@margote at Sun Nov 27 20:13:59 2005 ...^M^@
margote kernel: invalid operand: 0000 [1] SMP ^M^@
<ConMan> Console [margote] log at 2005-11-27 20:20:00 GMT.
ALan again:-
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 1
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle
arptable_
filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_dev
i2c_core sunrp
c ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter
ip_tables
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd
megaraid_mb
ox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.19.EL.jwltest.90smp
^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:00000101fff8be78 EFLAGS: 00010246
^M^@RAX: 0000000000000162 RBX: 00000101fdcc5130 RCX: 0000010100000000
^M^@RDX: 0000000000000206 RSI: 00000101fa2b2c78 RDI: 0000000000000206
^M^@RBP: 0000000000000000 R08: 00000101fa2b2c78 R09: 0000000000000010
^M^@R10: 0000000100000000 R11: 0000000000000002 R12: 00000100f7cf8380
^M^@R13: 0000000000000162 R14: 0000000000000000 R15: 0000000000000000
^M^@FS: 0000002aaab44ec0(0000) GS:ffffffff804d4400(0000) knlGS:000000000852cfc0
^M^@CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7b5d060 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo 00000101fff82000, task 000001010000a030)
^M^@Stack: 00000001fff8bea8 0000000000000001 0000000000000012 0000000000000001
^M^@ ffffffff803f6100 0000000000000000 00000100f5ab2000 00000101fff8bf1c
^M^@ 00000100f7cf8000 ffffffff8013324d
^M^@Call Trace:<IRQ> <ffffffff8013324d>{__wake_up_common+67} <ffffffff802aadb3>
{net_rx_a
ction+129}
^M^@ <ffffffff8013bc30>{__do_softirq+88} <ffffffff8013bcd9>{do_softirq+49}
^M^@ <ffffffff80112fb7>{do_IRQ+328} <ffffffff8011065b>{ret_from_intr+0}
^M^@ <EOI> <ffffffff8010e609>{default_idle+0} <ffffffff8010e629>
{default_idle+32}
^M^@ <ffffffff8010e69c>{cpu_idle+26}
^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
--------------
^M^@ ----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel panic - not syncing: Oops
^M^@ <1>Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [2] SMP
^M^@CPU 0
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4
i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state
ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac
ohci_hcd
hw_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata
sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.19.EL.jwltest.90smp
^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:ffffffff8044ba78 EFLAGS: 00010246
^M^@RAX: 00000000000001c9 RBX: 00000100f7565ad8 RCX: 0000000000000001
^M^@RDX: 00000100f15bc600 RSI: 000000000000004d RDI: 0000000000000246
^M^@RBP: 0000000000000000 R08: 000000000000004d R09: 0000000000000008
^M^@R10: 0000000000000008 R11: 00000100efc631c0 R12: 00000101fff96380
^M^@R13: 00000000000001c9 R14: 0000000000000000 R15: 0000000000000000
^M^@FS: 0000000000bcd8c0(0000) GS:ffffffff804d4380(0000) knlGS:00000000080c81e0
^M^@CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7ec2000 CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo ffffffff804d8000, task ffffffff803cb880)
^M^@Stack: ffffffff8044baa8 00000101fff9642c 0000000004b37012 000000010000000f
^M^@ 000003460346d030 0000003f00010000 00000100f6238000 ffffffff8044bb1c
^M^@ 00000101fff96000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802aadb3>{net_rx_action+129} <ffffffff8013bc30>
{__do_softirq+88}
^M^@ <ffffffff8013bcd9>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
^M^@ <ffffffff8011065b>{ret_from_intr+0} <EOI> <ffffffff8010e609>
{default_idle+0}
^M^@ <ffffffff8010e629>{default_idle+32} <ffffffff8010e69c>{cpu_idle+26}
^M^@ <ffffffff804db67b>{start_kernel+470} <ffffffff804db1d5>
{_sinittext+469}
^M^@
^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <ffffffff8044ba78>
^M^@Badness in do_unblank_screen at drivers/char/vt.c:2876
^M^@Call Trace:<IRQ> <ffffffff802324f6>{do_unblank_screen+61} <ffffffff80123008>
{bust_spinlocks+28}
^M^@ <ffffffff80111874>{oops_end+18} <ffffffff801119a1>{die+54}
^M^@ <ffffffff80111d64>{do_invalid_op+145} <ffffffffa0087529>
{:tg3:tg3_poll+177}
^M^@ <ffffffff80112f79>{do_IRQ+266} <ffffffff8011065b>{ret_from_intr+0}
^M^@ <ffffffff80110b2d>{error_exit+0} <ffffffffa0087529>
{:tg3:tg3_poll+177}
^M^@ <ffffffffa008761d>{:tg3:tg3_poll+421} <ffffffff802aadb3>
{net_rx_action+129}
^M^@ <ffffffff8013bc30>{__do_softirq+88} <ffffffff8013bcd9>{do_softirq+49}
^M^@ <ffffffff80112fb7>{do_IRQ+328} <ffffffff8011065b>{ret_from_intr+0}
^M^@ <EOI> <ffffffff8010e609>{default_idle+0} <fffff
<ConMan> Console [margote] log at 2005-11-28 01:40:00 GMT.
-----------------------------------------------------------------------
Created attachment 121542 [details]
Noon Panic Margote - TG3 using older kernel
I was asked by our software supplier to try an older kernel, and this
crash was reported from it.
And Again :- (this time with your kernel 19)
Red Hat Enterprise Linux ES release 4 (Nahant Update 2)
Kernel 2.6.9-22.19.EL.jwltest.90smp on an x86_64
[root@margote 17:50:12 ] ~ # 2
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at tg3:2864
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4
i2c_dev
i2c_core sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack
iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random
e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.9-22.19.EL.jwltest.90smp
RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177}
RSP: 0000:ffffffff8044ba78 EFLAGS: 00010246
RAX: 00000000000000d5 RBX: 00000100f79443f8 RCX: 0000000000000001
RDX: 00000100f1573500 RSI: 0000000000000411 RDI: 0000000000000246
RBP: 0000000000000000 R08: 0000000000000411 R09: 0000000000000000
R10: 0000000000000000 R11: 00000100f7a41bc0 R12: 00000101ff2fa380
R13: 00000000000000d5 R14: 0000000000000000 R15: 0000000000000000
FS: 000000000187d280(0000) GS:ffffffff804d4380(0000) knlGS:00000000080c81e0
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000002aac233000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff804d8000, task ffffffff803cb880)
Stack: ffffffff8044baa8 00000101ff2fa42c 00000001fe3b2012 000000010000000f
000002f402f4f7f0 0000003f00010000 0000010004a1f000 ffffffff8044bb1c
00000101ff2fa000 0000000000000202
Call Trace:<IRQ> <ffffffff802aadb3>{net_rx_action+129} <ffffffff8013bc30>
{__do_softirq+88}
<ffffffff8013bcd9>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328}
<ffffffff8011065b>{ret_from_intr+0} <EOI> <ffffffff8010e609>
{default_idle+0}
<ffffffff8010e629>{default_idle+32} <ffffffff8010e69c>{cpu_idle+26}
<ffffffff804db67b>{start_kernel+470} <ffffffff804db1d5>{_sinittext+469}
Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <ffffffff8044ba78>
<0>Kernel panic - not syncing: Oops
rtc: lost some interrupts at 1024Hz.
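The "Code:" line in these oopses appears to carry the same information as the "Kernel BUG at tg3:2864" banner. A minimal sketch follows, assuming the inline BUG() record used by x86_64 kernels of this era is laid out as a two-byte ud2 opcode, an eight-byte pointer to the name string, and a two-byte line number, all little-endian. The function and array names are hypothetical and exist only for this illustration. Under that assumption the trailing bytes "30 0b" read as 0x0b30, which is 2864 in decimal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout of the inline BUG() record (an assumption, not taken
 * from this report):
 *   bytes 0-1  : ud2 opcode (0f 0b)
 *   bytes 2-9  : pointer to the file/module name string, little-endian
 *   bytes 10-11: source line number, little-endian
 */
#define BUG_FRAME_LEN 12

/* Returns the line number, or -1 if the bytes are too short or do not start with ud2. */
int bug_frame_line(const uint8_t *code, size_t len)
{
    assert(code != NULL);
    if (len < BUG_FRAME_LEN || code[0] != 0x0f || code[1] != 0x0b)
        return -1;
    return code[10] | (code[11] << 8);
}

/* Returns the embedded name pointer; call only after bug_frame_line() succeeded. */
uint64_t bug_frame_name_ptr(const uint8_t *code)
{
    uint64_t ptr = 0;

    assert(code != NULL);
    for (int i = 0; i < 8; i++)
        ptr |= (uint64_t)code[2 + i] << (8 * i);
    return ptr;
}

/* First 12 bytes of the "Code:" line from the 2.6.9-22.19 oopses above. */
const uint8_t oops_code[BUG_FRAME_LEN] = {
    0x0f, 0x0b, 0x38, 0x28, 0x09, 0xa0, 0xff, 0xff, 0xff, 0xff, 0x30, 0x0b
};
```

Calling bug_frame_line(oops_code, sizeof oops_code) on these bytes gives the line number from the banner, and the decoded pointer lies in the same address range as the tg3 module's RIP, which is consistent with the assumed layout.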
Those oopses seem reasonably consistent -- I probably don't need any more at
the moment... :-)
It looks like we are hitting a BUG() in tg3.c on line 2864:
static void tg3_tx(struct tg3 *tp)
{
	u32 hw_idx = tp->hw_status->idx[0].tx_consumer;
	u32 sw_idx = tp->tx_cons;
	while (sw_idx != hw_idx) {
		struct tx_ring_info *ri = &tp->tx_buffers[sw_idx];
		struct sk_buff *skb = ri->skb;
		int i;
		if (unlikely(skb == NULL))
			BUG();
...<cut>...
Now, what does this mean? I'll have to get back to you...
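To illustrate the invariant that check enforces, here is a small userspace model. It is a sketch only, not the driver's code: the ring size, struct, and function names are hypothetical. The driver walks from its own consumer index up to the index reported by the hardware, and every slot in that range is expected to hold a packet it queued earlier. If the reported index covers a slot with nothing recorded in it, the walk hits the NULL case.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8   /* hypothetical; chosen small for the example */

struct model_ring {
    const void *skb[RING_SIZE]; /* stand-in for tx_buffers[].skb */
    unsigned tx_cons;           /* driver's own consumer index */
};

static const char dummy_packet = 'p'; /* placeholder for a queued packet */

/*
 * Walk from the driver's index to the index reported by hardware,
 * releasing each slot. Returns the number of slots reclaimed, or -1
 * at the first empty slot (the point where the quoted code calls BUG()).
 */
int model_tx_reclaim(struct model_ring *r, unsigned hw_idx)
{
    int reclaimed = 0;

    assert(r != NULL);
    hw_idx %= RING_SIZE;
    while (r->tx_cons != hw_idx) {
        if (r->skb[r->tx_cons] == NULL)
            return -1;  /* index covers a slot the driver has no packet for */
        r->skb[r->tx_cons] = NULL;
        r->tx_cons = (r->tx_cons + 1) % RING_SIZE;
        reclaimed++;
    }
    return reclaimed;
}

/* Queue `queued` packets from slot 0, then reclaim up to `hw_idx`. */
int model_example(unsigned queued, unsigned hw_idx)
{
    struct model_ring r = { .tx_cons = 0 };

    for (unsigned i = 0; i < queued && i < RING_SIZE; i++)
        r.skb[i] = &dummy_packet;
    return model_tx_reclaim(&r, hw_idx);
}
```

With three packets queued, model_example(3, 3) reclaims all three, while model_example(3, 4) returns -1 because the reported index runs one slot past what was queued. The model does not reproduce whatever causes the mismatch on real hardware; it only shows the condition being tested.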
This always means that the PCI chipset is illegally reordering transactions on the bus. Try to identify the chipset in use by this system, and then add it to the "write_reorder_chipsets[]" array. That will fix the bug.
This problem always happens on some x86_64-based platform; it's unfortunate that generating an exhaustive list of prone chipsets is so difficult.
Not sure exactly which chipset you need. The boards are Tyan S2882 and S2882-D:
http://www.tyan.com/products/html/thunderk8spro_spec.html
Chipset
• AMD-8131™ HyperTransport™ PCI-X Tunnel
• AMD-8111™ HyperTransport™ I/O Hub
• Winbond™ W83627HF Super I/O ASIC
• Analog Devices ADM1027 Hardware Monitoring IC
http://www.tyan.com/products/html/thunderk8sdpro_spec.html
Chipset
• AMD-8131™ HyperTransport™ PCI-X Tunnel
• AMD-8111™ HyperTransport™ I/O Hub
• Winbond™ W83627HF Super I/O ASIC
• Analog Devices ADM1027 Hardware Monitoring IC
Does that help?
Regards
ALan
I have test kernels available here:
http://people.redhat.com/linville/kernels/rhel4/
Please give those a try and post the results here. If they don't work for you, then please attach the output of running sysreport as well...thanks!
OK thanks, I will load them now, run some tests later, and let you know how I get on.
On one of the machines when I boot this kernel I get:
ACPI wakeup devices:
PCI1 USB0 USB1 UAR1 UAR2 GOLA GLAN GOLB SMBC AC97 MODM PWRB
Freeing unused kernel memory: 188k freed
Red Hat nash version 4.2.1.6 staSCSI subsystem initialized
rting
Mounted /ACPI: PCI interrupt 0000:03:05.0[A] -> GSI 19 (level, low) -> IRQ 169
Unable to handle kernel paging request
at 0000000000004c40 RIP: Mounting sysfs
Creating /dev
<ffffffffa003d461>{:sata_sil:sil_init_one+583}Starting udev
L
oading scsi_mod.PML4 f7d68067 ko module
LoadiPGD f7d73067 ng sd_mod.ko modPMD 0 ule
Loading lib
ata.ko module
LOops: 0002 [1] oading sata_sil.SMP ko module
CPU 0
Modules linked in: sata_sil libata sd_mod scsi_mod
Pid: 202, comm: insmod Not tainted 2.6.9-23.EL.jwltest.92smp
RIP: 0010:[<ffffffffa003d461>] <ffffffffa003d461>{:sata_sil:sil_init_one+583}
RSP: 0000:00000101ffd63e48 EFLAGS: 00010206
RAX: 0000000000000003 RBX: 0000000000004c00 RCX: 00000010feafe000
RDX: 00000000feafe000 RSI: 0000000000000246 RDI: ffffffff803eca60
RBP: 0000010004910f80 R08: 0000000000000001 R09: 00000101ffd63e14
R10: 0000000000000028 R11: ffffffff802a0b70 R12: 0000000000000000
R13: 0000000000000004 R14: 00000101051a2180 R15: ffffffffa003d5e0
FS: 0000000000000000(0000) GS:ffffffff804d8080(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000004c40 CR3: 0000000000101000 CR4: 00000000000006e0
Process insmod (pid: 202, threadinfo 00000101ffd62000, task 00000101fffa27f0)
Stack: 10000100f7d0ed00 00000000ffffffed ffffffffa003f5a8 00000101051a2180
00000101051a21f0 ffffffffa003f560 000000000056c710 ffffffff801f124c
ffffffffa003f5a8 00000101051a21f0
Call Trace:<ffffffff801f124c>{pci_device_probe+110} <ffffffff80244c01>
{bus_match+57}
<ffffffff80244cff>{driver_attach+68} <ffffffff8024501b>
{bus_add_driver+143}
<ffffffff801f0fbc>{pci_register_driver+119} <ffffffffa004100e>
{:sata_sil:sil_init+14}
<ffffffff8014f21f>{sys_init_module+278} <ffffffff801101c6>
{system_call+126}
Code: 88 43 40 88 43 41 88 43 44 88 43 45 49 83 7f 18 02 75 66 88
RIP <ffffffffa003d461>{:sata_sil:sil_init_one+583} RSP <00000101ffd63e48>
CR2: 0000000000004c40
<0>Kernel panic - not syncing: Oops
Actually I get it on both of them :)
powernow-k8: Found 2 AMD Athlon 64 / Opteron processors (version 1.50.04-rh)
powernow-k8: MP systems not supported by PSB BIOS structure
powernow-k8: init not cpu 0
ACPI: (supports S0 S1 S5)
ACPI wakeup devices:
PCI1 USB0 USB1 UAR1 UAR2 GOLA GLAN GOLB SMBC AC97 MODM PWRB
Freeing unused kernel memory: 188k freed
Red Hat nash version 4.2.1.6 staSCSI subsystem initialized
rting
Mounted /ACPI: PCI interrupt 0000:03:05.0[A] -> GSI 19 (level, low) -> IRQ 169
Unable to handle kernel paging request
at 0000000000004c40 RIP: Mounting sysfs
Creating /dev
<ffffffffa003d461>{:sata_sil:sil_init_one+583}Starting udev
L
oading scsi_mod.PML4 497c067 ko module
LoadiPGD 37e53067 ng sd_mod.ko modPMD 0 ule
Loading lib
ata.ko module
LOops: 0002 [1] oading sata_sil.SMP ko module
CPU 0
Modules linked in: sata_sil libata sd_mod scsi_mod
Pid: 202, comm: insmod Not tainted 2.6.9-23.EL.jwltest.92smp
RIP: 0010:[<ffffffffa003d461>] <ffffffffa003d461>{:sata_sil:sil_init_one+583}
RSP: 0000:00000101ffd57e48 EFLAGS: 00010206
RAX: 0000000000000003 RBX: 0000000000004c00 RCX: 00000010feafe000
RDX: 00000000feafe000 RSI: 0000000000000246 RDI: ffffffff803eca60
RBP: 0000010037e36f80 R08: 0000000000000001 R09: 00000101ffd57e14
R10: 0000000000000028 R11: ffffffff802a0b70 R12: 0000000000000000
R13: 0000000000000004 R14: 00000101051a2180 R15: ffffffffa003d5e0
FS: 0000000000000000(0000) GS:ffffffff804d8080(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000004c40 CR3: 0000000000101000 CR4: 00000000000006e0
Process insmod (pid: 202, threadinfo 00000101ffd56000, task 00000101fffa37f0)
Stack: 10000100049a5c00 00000000ffffffed ffffffffa003f5a8 00000101051a2180
00000101051a21f0 ffffffffa003f560 000000000056c710 ffffffff801f124c
ffffffffa003f5a8 00000101051a21f0
Call Trace:<ffffffff801f124c>{pci_device_probe+110} <ffffffff80244c01>
{bus_match+57}
<ffffffff80244cff>{driver_attach+68} <ffffffff8024501b>
{bus_add_driver+143}
<ffffffff801f0fbc>{pci_register_driver+119} <ffffffffa004100e>
{:sata_sil:sil_init+14}
<ffffffff8014f21f>{sys_init_module+278} <ffffffff801101c6>
{system_call+126}
Code: 88 43 40 88 43 41 88 43 44 88 43 45 49 83 7f 18 02 75 66 88
RIP <ffffffffa003d461>{:sata_sil:sil_init_one+583} RSP <00000101ffd57e48>
CR2: 0000000000004c40
<0>Kernel panic - not syncing: Oops
Looks like there is a SATA-related problem in recent kernels...I'll respin the test code once the base kernels have the fix...probably tomorrow... Thanks for your patience!
OK thanks. I am not too expert in these things, but we don't use the SATA at the moment, so is there a way to not load the driver? Otherwise I will wait till tomorrow or Friday, hopefully, to test.
Thanks
ALan
I have this morning loaded:
Kernel 2.6.9-24.EL.jwltest.93smp on an x86_64
It doesn't have the same issues as the previous kernel: it loads correctly and the machines run, and the issue with the tg3 driver crashing as above seems to have stopped. BUT I am seeing occasional reboots from the machines (without messages to the serial console) when pumping multiple multi-megabyte files through the Broadcom ethernet interfaces.
Just to reiterate: the problems we were seeing with the TG3 driver crashing in this particular manner appear to be fixed, and reboots / crashes do not cause the same side effects that we were seeing on other machines connected to the same network. However, both machines have spontaneously rebooted and we have had one crash (without auto reboot) so far under very heavy load. At no time during any of these incidents was anything output to the serial console, and the machine is sufficiently hung that it does not respond to Alt SysRq t etc.
If there is anything more I can do please let me know.
Regards
ALan
ALan, would you characterize the current situation as an improvement? Are the crashes less frequent and/or less problematic? In other words, is this taking us in the right direction?
John,
Yes, very much improved :) It now only seems to crash when I abuse it :) i.e. I can crash it, but I doubt that in normal use it will be stressed to that level. Even more importantly, the way it crashes no longer affects other machines on the network.
Previously the crash caused the other machines on the network to stop functioning correctly. We are running a cluster application in which 3 machines serve our front-end customer environment. With the previous bug (now apparently fixed, thanks), when one node went down the others went offline. With the new kernel, if one node dies only that one dies. This is survivable (if annoying), as users won't see the change in status because other machines take over the load.
Our supplier (of the cluster software) has suggested removing any rules in iptables and I am now testing it in that configuration.
But so far this is a definite improvement (so thanks), and well done for the quick turnaround. As I concluded in the last note, there are still issues; they are not so serious, but they would be good to fix. I don't know what is going wrong, though, and there is no console output to help you.
Regards
ALan
Created attachment 121758 [details]
TG3 error report - 2 Dec 2005 - Margote - running 24 Kernel
There is still some issue with the TG3, as the above attachment shows: the TG3 fails in a similar way to previously, except that the machine did not crash or reboot. It is still accessible via the serial console, though the ssh sessions all reset.
<ConMan> Console [margote] log at 2005-12-02 15:40:00 GMT.
40
tg3: eth0: transmit timed out, resetting
^M^@DEBUG: PCI status [82b0] TG3PCI state[000030e2]
^M^@DEBUG: MAC_MODE[00e04c08] MAC_STATUS[00400003]
..............................
^M^@DEBUG: NIC RXD_JUMBO(5)[0][c675ddc5:8c4809d6:bf2a5895:e72d1ede]
^M^@DEBUG: NIC RXD_JUMBO(5)[1][915a39ee:afd09df9:b8a6d9be:2532107f]
^M^@tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
^M^@tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
^M^@tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
^M^@tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
^M^@tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
^M^@
[root@margote 15:49:13 ] ~ # 41
<ConMan> Console [margote] log at 2005-12-02 15:50:00 GMT
And then the machine crashed:
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
[root@margote 15:49:13 ] ~ # 41
42
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at tg3:2864
invalid operand: 0000 [1] SMP
CPU 1
Modules linked in: iptable_filter ip_tables w83627hf lm85 i2c_sensor i2c_isa
i2c_amd756 arp
t_mangle arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport
autofs4 i2c_dev
i2c_core sunrpc ds yenta_socket pcmcia_core dm_mirror dm_mod button battery ac
ohci_hcd hw
_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod
scsi_mod
Pid: 21698, comm: adsd Not tainted 2.6.9-24.EL.jwltest.93smp
RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
RSP: 0000:00000100ca7d78b8 EFLAGS: 00010246
RAX: 00000000000001df RBX: 00000101fe18dce8 RCX: 0000010000011000
RDX: 0000000000000206 RSI: 0000000000000042 RDI: 0000000000000206
RBP: 0000000000000000 R08: 0000000000000042 R09: 0000000000000001
R10: 0000000000000000 R11: 00000101fb261a80 R12: 00000101fecb0380
R13: 00000000000001df R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000bcd8c0(0000) GS:ffffffff804d8180(005b) knlGS:000000000852b960
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f7b5d060 CR3: 00000000f7fa2000 CR4: 00000000000006e0
Process adsd (pid: 21698, threadinfo 00000100ca7d6000, task 00000100d70ee7f0)
Stack: 00000000000001df 0000000000000304 00000101fecb0380 ffffffffffffffc1
ffffffffa0086281 0000000000000010 00000100f5c66000 00000100ca7d795c
00000101fecb0000 ffffffffa008b1a8
Call Trace:<ffffffffa0086281>{:tg3:tg3_write32_tx_mbox+30} <ffffffffa008b1a8>
{:tg3:tg3_star
t_xmit+1691}
<ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>{__do_softirq+88}
<ffffffff8013beb9>{do_softirq+49} <ffffffff802ab57b>{dev_queue_xmit+525}
<ffffffff802c7091>{ip_finish_output+356} <ffffffff802c6a40>{dst_output+0}
<ffffffff802c6a56>{dst_output+22} <ffffffff802b46a9>{nf_hook_slow+184}
<ffffffff802c7509>{ip_queue_xmit+1011} <ffffffff801ea152>
{copy_user_generic_c+8}
<ffffffff802d6ba1>{tcp_transmit_skb+2037} <ffffffff802cd6a8>
{tcp_recvmsg+1790}
<ffffffff802a5dd1>{sock_common_recvmsg+48} <ffffffff802a28f4>
{sock_recvmsg+284}
<ffffffff80131759>{recalc_task_prio+337} <ffffffff80134e12>
{autoremove_wake_function
+0}
<ffffffff802a24f7>{sockfd_lookup+16} <ffffffff802a3d27>{sys_recvfrom+182}
<ffffffff80183088>{pipe_writev+726} <ffffffff802b8253>
{compat_sys_socketcall+258}
<ffffffff8012555d>{ia32_sysret+0}
Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <00000100ca7d78b8>
<0>Kernel panic - not syncing: Oops
A couple more crashes with output:
<ConMan> Console [margote] log at 2005-12-02 19:50:00 GMT.
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 1
^M^@Modules linked in: iptable_filter ip_tables w83627hf lm85 i2c_sensor
i2c_isa arpt_mangle i2c_amd756 arptable_filter arp_tables ip_queue md5 ipv6
parport_p
c lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd
megaraid_
mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 4422, comm: adsd Not tainted 2.6.9-24.EL.jwltest.93smp
^M^@RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:00000101fff8be78 EFLAGS: 00010246
^M^@RAX: 0000000000000014 RBX: 000001016c03b1e0 RCX: 00000100f4a6f000
^M^@RDX: 0000000000000014 RSI: 00000101fff8bf1c RDI: 0000010037c5a000
^M^@RBP: 0000000000000000 R08: ffffffffffffffc1 R09: 00000000ffffc3d0
^M^@R10: 00000000ffffc3d0 R11: 00000000f6fbb898 R12: 0000010037c5a380
^M^@R13: 0000000000000014 R14: 00000101b09a5f58 R15: 0000000000000000
^M^@FS: 0000000000c2cd00(0000) GS:ffffffff804d8180(005b) knlGS:000000000852cfc0
^M^@CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
^M^@CR2: 00000000f7ffb000 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process adsd (pid: 4422, threadinfo 00000101b09a4000, task 00000100f5da47f0)
^M^@Stack: 0000000000000001 0000010037c5a42c 000000016bcf1812 0000000100000001
^M^@ 000002ff02ff7980 0000003f00010000 00000100f4a6f000 00000101fff8bf1c
^M^@ 0000010037c5a000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>
{__do_softirq+88}
^M^@ <ffffffff8013beb9>{do_softirq+49} <ffffffff801130eb>{do_IRQ+328}
^M^@ <ffffffff8011078f>{ret_from_intr+0} <EOI>
^M^@Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
^M^@ ----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel panic - not syncing: Oops
^M^@ <1>Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [2] SMP
^M^@CPU 0
^M^@Modules linked in: iptable_filter ip_tables w83627hf lm85 i2c_sensor
i2c_isa arpt_mangle i2c_amd756 arptable_filter arp_tables ip_queue md5 ipv6
parport_p
c lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd
megaraid_
mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-24.EL.jwltest.93smp
^M^@RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:ffffffff8044d5f8 EFLAGS: 00010246
^M^@RAX: 00000000000001d5 RBX: 000001016cdadbf8 RCX: 0000010100000000
^M^@RDX: 0000000000000206 RSI: 0000000000000042 RDI: 0000000000000206
^M^@RBP: 0000000000000000 R08: 0000000000000042 R09: 0000000000000060
^M^@R10: ffffffffa008628c R11: ffffffffa008628c R12: 00000101fff96380
^M^@R13: 00000000000001d5 R14: 0000000000000000 R15: 0000000000000000
^M^@FS: 0000000000b3a800(0000) GS:ffffffff804d8100(0000) knlGS:00000000080c9740
^M^@CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7b5d060 CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo ffffffff804dc000, task ffffffff803cd100)
^M^@Stack: ffffffff8044d628 00000101fff9642c 00000001f4a78012 0000000100000001
^M^@ 0000012a012a27f0 0000003f00010000 00000100f4960000 ffffffff8044d69c
^M^@ 00000101fff96000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>
{__do_softirq+88}
^M^@ <ffffffff8013beb9>{do_softirq+49} <ffffffff801130eb>{do_IRQ+328}
^M^@ <ffffffff8011078f>{ret_from_intr+0} <EOI> <ffffffff8010e749>
{default_idle+0}
^M^@ <ffffffff8010e769>{default_idle+32} <ffffffff8010e7dc>{cpu_idle+26}
^M^@ <ffffffff804df67b>{start_kernel+470} <ffffffff804df1d5>
{_sinittext+469}
^M^@
^M^@Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <ffffffff8044d5f8>
^M^@Badness in do_unblank_screen at drivers/char/vt.c:2876
^M^@Call Trace:<IRQ> <ffffffff80232c8a>{do_unblank_screen+61} <ffffffff801231c4>
{bust_spinlocks+28}
^M^@ <ffffffff801119a8>{oops_end+18} <ffffffff80111ad5>{die+54}
^M^@ <ffffffff80111e98>{do_invalid_op+145} <ffffffffa0089529>
{:tg3:tg3_poll+177}
^M^@ <ffffffffa0086236>{:tg3:_tw32_flush+12} <ffffffffa008628c>
{:tg3:tg3_read32+0}
^M^@ <ffffffffa0086236>{:tg3:_tw32_flush+12} <ffffffffa008628c>
{:tg3:tg3_read32+0}
^M^@ <ffffffffa00865a0>{:tg3:tg3_readphy+141} <ffffffffa0086236>
{:tg3:_tw32_flush+12}
^M^@ <ffffffffa008628c>{:tg3:tg3_read32+0} <ffffffff80110c61>
{error_exit+0}
^M^@ <ffffffffa008628c>{:tg3:tg3_read32+0} <ffffffffa008628c>
{:tg3:tg3_read32+0}
^M^@ <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffffa008961d>
{:tg3:tg3_poll+421}
^M^@ <ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>
{__do_softirq+88}
^M^@ <ffffffff8013beb9>{do_softirq+49} <ffffffff801130eb>{do_IRQ+328}
^M^@ <ffffffff8011078f>{re
<ConMan> Console [margote] log at 2005-12-02 20:00:00 GMT.
<ConMan> Console [margote] disconnected from <lontht1:2037> at 12-02 20:00.
And:
<ConMan> Console [azalee] log at 2005-12-02 21:50:00 GMT.
41
^M^@CPU 0: Machine Check Exception: 4 Bank 4: f200000000070f0f
^M^@TSC 1d868f5d0bc7
^M^@Kernel panic - not syncing: Machine check
^M^@
<ConMan> Console [azalee] log at 2005-12-02 22:00:00 GMT.
Created attachment 121794 [details]
3rd December Crash (Azalee)
As you can see, there is still an issue. The way the machine crashes is less
fatal to other machines on the network, but the machine still crashes. The crash
dump is attached and the header is below.
The machine here is running MySQL and an application from Continuent; the
iptables firewall is running but has no rules in it other than those installed
by Continuent's software. Running sql-bench from a separate machine querying the
server causes this to happen repeatably.
If you need anything else let me know.
Regards
ALan
^M^@ ----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel panic - not syncing: Oops
^M^@ <1>Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [2] SMP
^M^@CPU 0
<ConMan> Console [azalee] log at 2005-12-03 07:50:00 GMT.
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 1
Under low load it is slightly more stable (and again only the crashed machine
went offline, so an improvement), but another crash :(
<ConMan> Console [margote] log at 2005-12-05 01:10:00 GMT.
599
600
<ConMan> Console [margote] log at 2005-12-05 01:20:00 GMT.
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 1
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_
dev i2c_core sunrpc iptable_filter ip_tables ds yenta_socket pcmcia_core
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd
megaraid_
mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-24.EL.jwltest.93smp
^M^@RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:00000101fff8be78 EFLAGS: 00010246
^M^@RAX: 0000000000000184 RBX: 00000101fd2dd460 RCX: 0000010000011000
^M^@RDX: 0000000000000206 RSI: 00000100d7f4aa78 RDI: 0000000000000206
^M^@RBP: 0000000000000000 R08: 00000100d7f4aa78 R09: 0000000000000040
^M^@R10: 0000000100000000 R11: 0000000000000002 R12: 0000010004ab2380
^M^@R13: 0000000000000184 R14: 0000000000000000 R15: 0000000000000000
^M^@FS: 0000000000bcd8c0(0000) GS:ffffffff804d8180(0000) knlGS:00000000080af8c0
^M^@CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7b5d060 CR3: 00000000f7fa2000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo 00000101fff82000, task 000001010000a030)
^M^@Stack: 00000101fff8bea8 0000010004ab242c 00000001ee6aa812 0000000100000002
^M^@ 0000004b004b27f0 0000003f00010000 00000100f7b75000 00000101fff8bf1c
^M^@ 0000010004ab2000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff802abcdb>{net_rx_action+129} <ffffffff8013be10>
{__do_softirq+88}
^M^@ <ffffffff8013beb9>{do_softirq+49} <ffffffff801130eb>{do_IRQ+328}
^M^@ <ffffffff8011078f>{ret_from_intr+0} <EOI> <ffffffff8010e749>
{default_idle+0}
^M^@ <ffffffff8010e769>{default_idle+32} <ffffffff8010e7dc>{cpu_idle+26}
^M^@
^M^@Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <00000101fff8be78>
^M^@ <0>Kernel panic - not syncing: Oops
^M^@
<ConMan> Console [margote] log at 2005-12-05 01:30:00 GMT.
<ConMan> Console [margote] disconnected from <lontht1:2037> at 12-05 01:33.
And the other machine :) Again under relatively light load into a MySQL
database.
<ConMan> Console [azalee] log at 2005-12-05 10:40:00 GMT.
553
554
----------- [cut here ] --------- [please bite here ] ---------
^M^@Kernel BUG at tg3:2864
^M^@invalid operand: 0000 [1] SMP
^M^@CPU 0
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_
dev i2c_core sunrpc iptable_filter ip_tables ds yenta_socket pcmcia_core
dm_mirror dm_mod button battery ac ohci_hcd hw_random shpchp e100 mii tg3 ext3
jbd me
garaid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 0, comm: swapper Not tainted 2.6.9-24.EL.jwltest.93smp
^M^@RIP: 0010:[<ffffffffa0089529>] <ffffffffa0089529>{:tg3:tg3_poll+177}
^M^@RSP: 0000:ffffffff8044d5f8 EFLAGS: 00010246
^M^@RAX: 000000000000008d RBX: 00000101fe7ebd38 RCX: 0000010004aa4000
^M^@RDX: 000000000000008d RSI: 0000000000003c28 RDI: 00000101fffa0384
^M^@RBP: 0000000000000000 R08: ffffffff804dc000 R09: 0000000000000100
^M^@R10: ffffffffa008628c R11: ffffffffa008628c R12: 00000101fffa0380
^M^@R13: 000000000000008d R14: ffffffff804ddf08 R15: 0000000000000000
^M^@FS: 0000000000bd38c0(0000) GS:ffffffff804d8100(0000) knlGS:00000000080c81c0
^M^@CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
^M^@CR2: 00000000f7b51060 CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process swapper (pid: 0, threadinfo ffffffff804dc000, task ffffffff803cd100)
^M^@Stack: 0000000000000046 00000101fffa042c 0000000004aa8012 0000000280133490
^M^@ 0000025f025fdf08 0000003e00010000 0000010004aa4000 ffffffff8044d69c
^M^@ 00000101fffa0000 0000000000000202
^M^@Call Trace:<IRQ> <ffffffff80236aa9>{rtc_interrupt+233} <ffffffff802abcdb>
{net_rx_action+129}
^M^@ <ffffffff8013be10>{__do_softirq+88} <ffffffff8013beb9>{do_softirq+49}
^M^@ <ffffffff801130eb>{do_IRQ+328} <ffffffff8011078f>{ret_from_intr+0}
^M^@ <EOI> <ffffffff8010e749>{default_idle+0} <ffffffff8010e769>
{default_idle+32}
^M^@ <ffffffff8010e7dc>{cpu_idle+26} <ffffffff804df67b>{start_kernel+470}
^M^@ <ffffffff804df1d5>{_sinittext+469}
^M^@Code: 0f 0b 38 48 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98
^M^@RIP <ffffffffa0089529>{:tg3:tg3_poll+177} RSP <ffffffff8044d5f8>
^M^@ <0>Ker^M^@
^GMessage from syslogd@azalee atn Mon Dec 5 10:4e8:25 2005 ...^M^@
azalee kernel: invalid operand: 0000 [1] SMP ^M^@
panic - not syncing: Oops
^M^@
<ConMan> Console [azalee] log at 2005-12-05 10:50:00 GMT.
ALan, could you attach the output of running "sysreport" on one of the boxes in question? Thanks!
Created attachment 121857 [details]
Azalee sysreport from 3rd December crash
This is a sysreport I ran on Saturday after a crash. I noticed the 94 kernel a
little while ago, and although there may not be any changes for me I am now
running it and will do some more tests and add a sysreport and crash report if
I get another crash later.
In addition, Azalee spontaneously rebooted using the 94 kernel, and I also noticed that Broadcom released a new version of their driver for Linux last week.
Regards
ALan
And another crash:
<ConMan> Console [margote] log at 2005-12-06 00:30:00 GMT.
88
^M^@CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f
^M^@TSC 390b1f9c765c
^M^@CPU 1: Machine Check Exception: 4 Bank 4: b200000000070f0f
^M^@TSC 390b1f9c9c36
^M^@Kernel panic - not syncing: Machine check
^M^@ NMI Watchdog detected LOCKUP, CPU=0, registers:
^M^@CPU 0
^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle
arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 i2c_
dev i2c_core sunrpc iptable_filter ip_tables ds yenta_socket pcmcia_core
dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd
megaraid_
mbox megaraid_mm sata_sil libata sd_mod scsi_mod
^M^@Pid: 2879, comm: dispatcher Tainted: G M 2.6.9-24.EL.jwltest.94smp
^M^@RIP: 0010:[<ffffffff8011be25>] <ffffffff8011be25>{__smp_call_function+100}
^M^@RSP: 0000:ffffffff80452638 EFLAGS: 00000097
^M^@RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
^M^@RDX: 0000ffff0000ffff RSI: 0000000000000000 RDI: 0000000000000000
^M^@RBP: 0000000000000000 R08: 0000000000000008 R09: 0000000000000000
^M^@R10: 0000000000000000 R11: 0000000000000002 R12: ffffffff8011bece
^M^@R13: 0000000000000000 R14: 0000390b1f9c7002 R15: ffffffff80319eb9
^M^@FS: 0000000000bcd8c0(0000) GS:ffffffff804d8300(005b) knlGS:00000000080c98c0
^M^@CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
^M^@CR2: 00000000f7ec2000 CR3: 0000000000101000 CR4: 00000000000006e0
^M^@Process dispatcher (pid: 2879, threadinfo 00000101fd168000, task
00000101ffd70030)
^M^@Stack: ffffffff8011bece 0000000000000000 0000000000000000 0000000000000000
^M^@ 0000000000000016 0000000000000000 0000000000000900 00000000ffffffff
^M^@ ffffffff803d1840 ffffffff8011bf0b
^M^@Call Trace:<ffffffff8011bece>{smp_really_stop_cpu+0} <ffffffff8011bf0b>
{smp_send_stop+52}
^M^@<ffffffff80137026>{panic+235} <ffffffff801177ec>{print_mce+136}
^M^@<ffffffff801178c4>{mce_available+0} <ffffffff80117c17>{do_machine_check+825}
^M^@<ffffffff8011134f>{machine_check+127} <ffffffffa0086295>{:tg3:tg3_read32+9}
^M^@ <EOE> <IRQ> <ffffffffa008a045>{:tg3:tg3_interrupt_tagged+48}
^M^@ <ffffffff80112dee>{handle_IRQ_event+41} <ffffffff80113068>
{do_IRQ+197}
^M^@ <ffffffff8011078f>{ret_from_intr+0} <EOI> <ffffffff802231e3>
{uuid_strategy+165}
^M^@ <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffff80124010>
{search_extable+68}
^M^@ <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffff801488a9>
{search_exception_tables+29}
^M^@ <ffffffff80111c5a>{do_trap+220} <ffffffff80111e98>{do_invalid_op+145}
^M^@ <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffff80110c61>
{error_exit+0}
^M^@ <ffffffffa0089529>{:tg3:tg3_poll+177} <ffffffffa008961d>
{:tg3:tg3_poll+421}
^M^@ <ffffffff802abec7>{net_rx_action+129} <ffffffff8013be28>
{__do_softirq+88}
^M^@ <ffffffff8013bed1>{do_softirq+49} <ffffffffa01b202b>
{:ip_queue:ipq_issue_verdict+43}
^M^@ <ffffffffa01b28a5>{:ip_queue:ipq_rcv_sk+974} <ffffffff802bee3f>
{netlink_data_ready+22}
^M^@ <ffffffff802be64e>{netlink_sendskb+113} <ffffffff802bee14>
{netlink_sendmsg+689}
^M^@ <ffffffff802a295b>{sock_sendmsg+271} <ffffffff80178870>{fget+67}
^M^@ <ffffffff80134e2a>{autoremove_wake_function+0} <ffffffff802a42cf>
{sys_sendmsg+463}
^M^@ <ffffffff802a26e3>{sockfd_lookup+16} <ffffffff802a248f>
{move_addr_to_user+60}
^M^@ <ffffffff8012232b>{do_gettimeoffset_pm+8} <ffffffff802b8496>
{compat_sys_socketcall+345}
^M^@ <ffffffff80125575>{ia32_sysret+0}
^M^@Code: 39 d8 74 04 f3 90 eb f4 85 ed 74 0c 8b 44 24 14 39 d8 74 04
^M^@Kernel panic - not syncing: nmi watchdog
^M^@
<ConMan> Console [margote] log at 2005-12-06 00:40:00 GMT.
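The Machine Check Exception lines in these logs report a raw bank status word (for example b200000000070f0f). As a reading aid only, the sketch below splits such a word into the generic x86 machine-check status flags (bits 63 to 57) and the two 16-bit code fields. The function name and the returned dictionary keys are made up for this illustration, and the vendor-specific meaning of the code fields is deliberately not interpreted.

```python
# Generic x86 machine-check bank status flags, highest bit first.
# Only these architectural flag bits are decoded; the meaning of the
# code fields is vendor specific and is left as raw numbers.
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # status word is valid
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MISC register valid
    (58, "ADDRV"),  # ADDR register valid
    (57, "PCC"),    # processor context corrupt
]


def decode_mci_status(status_hex):
    """Split a hex status word as printed in the console log into fields.

    Hypothetical helper for illustration; not part of any kernel tool.
    """
    value = int(status_hex, 16)
    flags = [name for bit, name in MCI_STATUS_FLAGS if (value >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (value >> 16) & 0xFFFF,
        "mca_error_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    for word in ("b200000000070f0f", "f200000000070f0f"):
        print(word, decode_mci_status(word))
```

For b200000000070f0f this reports VAL, UC, EN and PCC set; the f200000000070f0f value seen on azalee additionally has OVER set.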
Enough crash dumps already!!! :-) All of them have the same signature and show the same problem. There is no benefit from posting any more of these nearly identical dumps, but your tenacity is appreciated :) Usually we have the opposite problem of not being able to get enough information. We'll ask for more dumps in specific situations if we think it will help diagnose the problem further, thanks.
Sorry - never sure if any of the slight variations is actually helpful :) (or not). I assume that someone will tell me if you have a new version of the kernel that it might be worth me trying :) (I note that 95 does not change anything.) I won't post the crash dumps, as they look similar to my untrained eye, but I have saved them just in case you want to look at them :) As I said earlier, there is a definite improvement in that the crash does not cause other machines on the network to fail due to seeing the network being trashed :) by the crash (so to speak). Hopefully whatever is still failing will make sense and we can move forward and get a fix :) All the best and thanks for all your hard work. ALan
Just in case you are interested, Kernel 24 - 06 still crashes - but does not give any error message to the serial console. It also crashes in a worse way - not sure how to describe it, but the application we are running monitors the other machines, and when one crashes it does not see it disappear (not sure if that makes sense). This is not what happened with the previous version of the kernel, 95 - and so might be viewed as a step back :)
Have you tried using FC4 on this box? Someone else suggested that those kernels have better support for your HyperTransport chipset.
We did a number of months ago, before trying Red Hat, as we hoped moving back to a more supported version would be better :( We had the same problems as we are having now with the Broadcom chipset. When we run the machines not using this chipset we have a stable environment with both FC4 and Red Hat v4.
We are currently running (on the same hardware) one FC4 machine and one Red Hat machine. Both are pretty stable; we have been doing work on them so they have only been up 22 days (FC4) and 14 days (Red Hat), but on both machines we only use the Intel Pro 10/100 ethernet port and not the Broadcom 10/100/1000 ones. Both machines in that configuration seem fine and stable, but as soon as we use the Broadcom ethernet ports to any extent we see the crashes we have reported to you over the past few days (FYI the last kernel of yours I tried crashed more frequently and without any error messages - did you remove some debug code?). If the bug was in the HyperTransport chipset, wouldn't it affect both the Broadcom and Intel ethernet controllers?
I suppose it would, although there could be subtle interactions. Anyway, it seemed worth mentioning just in case... No other bright ideas at the moment...will have to get back to you...
Is there anything else I can do to help? Happy to try things to give you more information about the problem. As you saw, after the initial bug fix there are still crashes in the same part of the code. Would it be worthwhile seeing if there is anything in the latest Broadcom driver, which was released a couple of weeks ago, that might help? Also, if we were to try the Broadcom driver, which would be the most suitable kernel to try it on? After all, we know the bug you fixed will be a problem in kernels before a certain number. Once again thanks for your help, and if you have any ideas for things I can try to provide you with more information please let me know. Regards Alan
Hi there,
This is sort of a shot in the dark, but I noticed that you always have
ip_queue loaded, and one of the crashes is calling the ip_queue code. Have you
tried stressing the machines without any iptables QUEUE rules, and without
loading ip_queue? I did have problems at one time with ip_queue causing lockups
on kernels.
Chris Lalancette
Unfortunately iptables is required by the application we are trying to run. I have, however, just completed some tests after removing the LSI MegaRAID 320 card - although the Broadcom ports were only connected to a 100Mbit LAN (they were previously connected to a Gigabit connection) they no longer fell over. This could mean the bug is an interaction between the LSI MegaRAID 320 and the Broadcom driver, or it could simply be an issue that is only noticeable at the higher performance levels required by a Gigabit connection (but I assume you have tested that). I asked Broadcom's support about this issue and they have the same motherboard (Tyan S-2882-D) but can't replicate the bug, so maybe that is pointing at some interaction between the driver and the LSI MegaRAID 320. Anyway, that is all I can add to the mix at the moment.
I have test kernels w/ a tg3 update available here: http://people.redhat.com/linville/kernels/rhel4/ Please give that a try and post the results here...thanks!
Need feedback ASAP if this is going to be in U4...
Sorry, the servers I was testing on have been pushed into production. What I can tell you is that the re-order fix definitely worked and improved things enormously.
committed in stream U4 build 34.17. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
One additional thing we have discovered (we think): when used at 100M via a 100M switch we have seen issues with applications that use multicast packets to check the status of a number of machines (all of which have these ethernet adapters). To explain further, the application we use is capable of using main and backup ethernet network cards and checks the state of the various network cards (I am told using multicast packets).
We discovered that although at 100M the cards transmit data fine and without issue (with the 2.6.9-22.0.1.ELsmp kernel), when used in this particular environment the application can sometimes think that the back-up route is responding faster, but when it tries to use the back-up network adapter it immediately switches back to master (which is not based on the Broadcom hardware). Not sure if this is of any help at all, but thought I would report our findings just in case they provide some assistance.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0575.html
Kernel - 2.6.9-22.0.1.ELsmp
Description of problem:
We have 3 machines running Tyan S2882 and S2882-D motherboards; they have the Broadcom BCM4704C gigabit chipset. When using this network interface under heavy load the machine crashes. No output is sent to the serial console when the machine crashes. I hope to get some output out of the console to identify what is crashing and panicking. In tests with a test kernel (which I assume has more debug code) the crash always occurs in the tg3 driver.
Version-Release number of selected component (if applicable):
How reproducible:
Kernel - 2.6.9-22.0.1.ELsmp
Steps to Reproduce:
1. Set up one of the machines (A) against another machine (B) (not one of the affected ones) on a private network.
2. From machine "A", copy a 100Mb file to machine "B". Create a script to do this (via scp) and repeat this process. Place the script in the background and run a number of instances of it.
3. Repeat this at the same time from machine "B", copying a file to machine "A"; again run the script and run multiple instances.
Actual results:
In tests when nothing much else was running on the machine, this required about 6-8 instances of this copy process to cause the machine to crash. When running an application we are trying to implement, this is reduced to just 2 instances. Every time, the machine crashes after a certain level of access. This chipset has 2 ethernet connections; if we perform this on ETH0, when the crash occurs something happens on ETH2 to cause other machines on the ETH2 network to think that the network is no longer available - the application only thinks that the network is available again when the machine is rebooted (I have no idea what causes this side effect).
Expected results:
The machine shouldn't have crashed.
Additional info:
Machines are Tyan S2882 and S2782D motherboards with twin Opteron, 8Gb RAM, LSI320 RAID controller and BCM4704C gigabit ethernet controller.
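The reproduction steps describe looping an scp copy in several background instances on each machine. The following is only a rough sketch of that idea, not the reporter's actual script: the helper names, the file path and the host name are placeholders invented here, and the same script would be started on machine "B" pointing back at "A" to load the link in both directions.

```python
import subprocess
import threading


def scp_command(local_file, remote_host, remote_path):
    """Build the scp command line for one copy (-q suppresses the progress meter)."""
    return ["scp", "-q", local_file, f"{remote_host}:{remote_path}"]


def copy_loop(command, iterations, runner=subprocess.run):
    """Run the same copy repeatedly and count non-zero exit codes."""
    failures = 0
    for _ in range(iterations):
        result = runner(command)
        if result.returncode != 0:
            failures += 1
    return failures


def run_instances(command, instances, iterations, runner=subprocess.run):
    """Start several copy loops in parallel, one thread per instance."""
    results = [0] * instances

    def worker(index):
        results[index] = copy_loop(command, iterations, runner)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(instances)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Example invocation (placeholder path and host, assumed for illustration):
#   cmd = scp_command("/tmp/testfile-100M", "machine-b", "/tmp/")
#   print(run_instances(cmd, instances=8, iterations=100))
```

The runner argument exists only so the loop can be exercised without a network; with the default it calls scp through subprocess.run, which assumes key-based ssh login so that no password prompt blocks the loop.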
An example of the type of panic we have had: ----------- [cut here ] --------- [please bite here ] --------- ^M^@Kernel BUG at tg3:2864 ^M^@invalid operand: 0000 [1] SMP ^M^@CPU 1 ^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle arptable_filter arp_tables ip_queue md 5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conn track iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random shpchp e100 mii tg3 ext3 jbd meg araid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod ^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.18.EL.jwltest.89smp ^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177} ^M^@RSP: 0000:00000101fff8be78 EFLAGS: 00010246 ^M^@RAX: 00000000000001c8 RBX: 0000010037dddac0 RCX: 0000000000000001 ^M^@RDX: 00000101efadd600 RSI: 00000000000005ea RDI: 0000000000000246 ^M^@RBP: 0000000000000000 R08: 00000000000005ea R09: 0000000000000000 ^M^@R10: 0000000000000000 R11: 00000100edc4ca40 R12: 0000010004955380 ^M^@R13: 00000000000001c8 R14: 0000000000000000 R15: 00000000000000bc ^M^@FS: 0000000000f0c580(0000) GS:ffffffff804d4280(0000) knlGS:000000000852cfc0 ^M^@CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b ^M^@CR2: 0000002aab8214c8 CR3: 00000000fbfa2000 CR4: 00000000000006e0 ^M^@Process swapper (pid: 0, threadinfo 00000101fff82000, task 000001010000a030) ^M^@Stack: 00000101fff8bea8 ffffffff80131623 000001010385fa60 000000000000000f ^M^@ 00000100eb2c67f0 0000000000000001 0000010037d96000 00000101fff8bf1c ^M^@ 0000010004955000 000001010385fa60 ^M^@Call Trace:<IRQ> <ffffffff80131623>{activate_task+124} <ffffffff802aac4b> {net_rx_action+129} ^M^@ <ffffffff8013bbe8>{__do_softirq+88} <ffffffff8013bc91>{do_softirq+49} ^M^@ <ffffffff80112fb7>{do_IRQ+328} <ffffffff8011065b>{ret_from_intr+0} ^M^@ <EOI> <ffffffff8010e609>{default_idle+0} <ffffffff8010e629> {default_idle+32} ^M^@ <ffffffff8010e69c>{cpu_idle+26} ^M^@Code: 0f 0b 38 28 09 a0 ff 
ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98 ^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78> ^M^@ <0>Kernel panic - not syncing: Oops --------------------------------- ----------- [cut here ] --------- [please bite here ] --------- ^M^@Kernel BUG at tg3:2864 ^M^@invalid operand: 0000 [1] SMP ^M^@CPU 1 ^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle arptable_filter arp_tables ip_queue md 5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conn track iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random shpchp e100 mii tg3 ext3 jbd meg araid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod ^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.18.EL.jwltest.89smp ^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177} ^M^@RSP: 0000:00000101fff8be78 EFLAGS: 00010246 ^M^@RAX: 00000000000001c8 RBX: 0000010037dddac0 RCX: 0000000000000001 ^M^@RDX: 00000101efadd600 RSI: 00000000000005ea RDI: 0000000000000246 ^M^@RBP: 0000000000000000 R08: 00000000000005ea R09: 0000000000000000 ^M^@R10: 0000000000000000 R11: 00000100edc4ca40 R12: 0000010004955380 ^M^@R13: 00000000000001c8 R14: 0000000000000000 R15: 00000000000000bc ^M^@FS: 0000000000f0c580(0000) GS:ffffffff804d4280(0000) knlGS:000000000852cfc0 ^M^@CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b ^M^@CR2: 0000002aab8214c8 CR3: 00000000fbfa2000 CR4: 00000000000006e0 ^M^@Process swapper (pid: 0, threadinfo 00000101fff82000, task 000001010000a030) ^M^@Stack: 00000101fff8bea8 ffffffff80131623 000001010385fa60 000000000000000f ^M^@ 00000100eb2c67f0 0000000000000001 0000010037d96000 00000101fff8bf1c ^M^@ 0000010004955000 000001010385fa60 ^M^@Call Trace:<IRQ> <ffffffff80131623>{activate_task+124} <ffffffff802aac4b> {net_rx_action+129} ^M^@ <ffffffff8013bbe8>{__do_softirq+88} <ffffffff8013bc91>{do_softirq+49} ^M^@ <ffffffff80112fb7>{do_IRQ+328} <ffffffff8011065b>{ret_from_intr+0} ^M^@ <EOI> 
<ffffffff8010e609>{default_idle+0} <ffffffff8010e629> {default_idle+32} ^M^@ <ffffffff8010e69c>{cpu_idle+26} ^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98 ^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78> ^M^@ <0>Kernel panic - not syncing: Oops --------------------------------------------- ^M^@CPU 1: Machine Check Exception: 4 Bank 4: b200000000070f0f ^M^@TSC 1e3b3d5fa02cc ^M^@CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f ^M^@TSC 1e3b3d5fa1384 ^M^@Kernel panic - not syncing: Machine check ^M^@ NMI Watchdog detected LOCKUP, CPU=1, registers: ^M^@CPU 1 ^M^@Modules linked in: w83627hf lm85 i2c_sensor i2c_isa i2c_amd756 arpt_mangle arptable_filter arp_tables ip_queue md 5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conn track iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii bcm5700(U) ext3 jbd meg araid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod ^M^@Pid: 2660, comm: dispatcher Tainted: G M 2.6.9-22.0.1.ELsmp ^M^@RIP: 0010:[<ffffffff8011bcbe>] <ffffffff8011bcbe>{__smp_call_function+100} ^M^@RSP: 0000:00000100f7fa5cb8 EFLAGS: 00000097 ^M^@RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002 ^M^@RDX: 0000ffff0000ffff RSI: 0000000000000000 RDI: 0000000000000002 ^M^@RBP: 0000000000000000 R08: 0000000000000008 R09: 0000000000000000 ^M^@R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8011bd63 ^M^@R13: 0000000000000000 R14: 0001e3b3d5f9fa9b R15: ffffffff80317b57 ^M^@FS: 0000002a96533320(0000) GS:ffffffff804d3100(005b) knlGS:00000000080c5b40 ^M^@CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b ^M^@CR2: 0000002aaa1e2000 CR3: 00000000f7fa2000 CR4: 00000000000006e0 ^M^@Process dispatcher (pid: 2660, threadinfo 00000101fd7a8000, task 00000101fc89c030) ^M^@Stack: ffffffff8011bd63 0000000000000000 0000000000000000 0000000000000000 ^M^@ 0000000000000012 0000000000000000 0000000000000900 
00000000ffffffff ^M^@ ffffffff803ceba0 ffffffff8011bda0 ^M^@Call Trace:<ffffffff8011bd63>{smp_really_stop_cpu+0} <ffffffff8011bda0> {smp_send_stop+52} ^M^@<ffffffff801368a6>{panic+235} <ffffffff801176b4>{print_mce+136} ^M^@<ffffffff8011778c>{mce_available+0} <ffffffff80117adf>{do_machine_check+825} ^M^@<ffffffff801111db>{machine_check+127} <ffffffff801e9b4b>{__delay+7} ^M^@ <EOE> <ffffffffa008d82c>{:bcm5700:LM_ReadPhy+121} ^M^@ <ffffffffa0086697>{:bcm5700:bcm5700_ioctl+242} <ffffffff8013144f> {activate_task+124} ^M^@ <ffffffff801346f7>{autoremove_wake_function+9} <ffffffff80132eaa> {__wake_up_common+67} ^M^@ <ffffffff80132eff>{__wake_up+54} <ffffffff803006aa>{packet_rcv+873} ^M^@ <ffffffffa00848f8>{:bcm5700:bcm5700_start_xmit+1128} ^M^@ <ffffffff802a95e6>{dev_queue_xmit_nit+240} <ffffffff802b81e8> {qdisc_restart+30} ^M^@ <ffffffff802a9a47>{dev_queue_xmit+525} <ffffffff80300e5a> {packet_sendmsg+522} ^M^@ <ffffffff801313c1>{recalc_task_prio+337} <ffffffff802a0c37> {sock_sendmsg+271} ^M^@ <ffffffff8010eb99>{__switch_to+289} <ffffffff80131c39> {finish_task_switch+55} ^M^@ <ffffffff802a6e83>{datagram_poll+0} <ffffffff803032e8> {thread_return+42} ^M^@ <ffffffff802aab10>{dev_ifsioc+1176} <ffffffff802aaeee>{dev_ioctl+975} ^M^@ <ffffffff80131c39>{finish_task_switch+55} <ffffffff802e8e90> {inet_ioctl+166} ^M^@ <ffffffff802a1599>{sock_ioctl+699} <ffffffff801885b5>{sys_ioctl+853} ^M^@ <ffffffff8019bc44>{compat_sys_ioctl+235} <ffffffff8012515d> {ia32_sysret+0} ^M^@ ^M^@Code: 39 d8 74 02 eb f6 85 ed 74 0a 8b 44 24 14 39 d8 74 02 eb f6 ^M^@Kernel panic - not syncing: nmi watchdog ------------------------------------------------------------------------------- [cut here ] --------- [please bite here ] --------- ^M^@Kernel BUG at tg3:2864 ^M^@invalid operand: 0000 [1] SMP ^M^@CPU 1 ^M^@Modules linked in: arpt_mangle arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables 
dm_mirror dm_mod button batter y ac ohci_hcd hw_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod ^M^@Pid: 0, comm: swapper Not tainted 2.6.9-22.16.EL.jwltest.86smp ^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177} ^M^@RSP: 0000:00000101fff8be78 EFLAGS: 00010246 ^M^@RAX: 0000000000000197 RBX: 00000101fcabd628 RCX: 0000010004b95000 ^M^@RDX: 0000000000000197 RSI: 00000101fff8bf1c RDI: 00000101fe479000 ^M^@RBP: 0000000000000000 R08: 00000101fff82000 R09: 0000000000000082 ^M^@R10: 0000000000000082 R11: 0000000000000002 R12: 00000101fe479380 ^M^@R13: 0000000000000197 R14: 00000101fff83e98 R15: 0000000000000000 ^M^@FS: 0000000000d7f700(0000) GS:ffffffff804d4200(0000) knlGS:0000000008c848c0 ^M^@CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b ^M^@CR2: 00000000f7b65060 CR3: 00000000f7fa2000 CR4: 00000000000006e0 ^M^@Process swapper (pid: 0, threadinfo 00000101fff82000, task 000001010000a030) ^M^@Stack: 00000101fff8bea8 00000101fe47942c 00000001fc821812 000000010000000f ^M^@ 0000022302237030 0000003f00010000 0000010004b95000 00000101fff8bf1c ^M^@ 00000101fe479000 0000000000000202 ^M^@Call Trace:<IRQ> <ffffffff802aabdb>{net_rx_action+129} <ffffffff8013bbd4> {__do_softirq+88} ^M^@ <ffffffff8013bc7d>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328} ^M^@ <ffffffff8011065b>{ret_from_intr+0} <EOI> <ffffffff8010e609> {default_idle+0} ^M^@ <ffffffff8010e629>{default_idle+32} <ffffffff8010e69c>{cpu_idle+26} ^M^@ ^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98 ^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78> ^M^@ <0>Kernel panic - not syncing: Oops ---------------------------------------------------------------------------- ----------- [cut here ] --------- [please bite here ] --------- ^M^@Kernel BUG at tg3:2864 ^M^@invalid operand: 0000 [1] SMP ^M^@CPU 1 ^M^@Modules linked in: arpt_mangle arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket 
pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button batter y ac ohci_hcd hw_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod ^M^@Pid: 2795, comm: mysqld Not tainted 2.6.9-22.17.EL.jwltest.87smp ^M^@RIP: 0010:[<ffffffffa0087529>] <ffffffffa0087529>{:tg3:tg3_poll+177} ^M^@RSP: 0000:00000101fff8be78 EFLAGS: 00010246 ^M^@RAX: 00000000000000b9 RBX: 00000100f7b7c158 RCX: 0000010100000000 ^M^@RDX: 0000000000000206 RSI: 00000100f5f1b978 RDI: 0000000000000206 ^M^@RBP: 0000000000000000 R08: 00000100f5f1b978 R09: 0000000000000720 ^M^@R10: 0000000100000000 R11: ffffffff8011de40 R12: 00000101ff69a380 ^M^@R13: 00000000000000b9 R14: 0000000000000000 R15: 0000000000000000 ^M^@FS: 0000000000b3a800(005b) GS:ffffffff804d4200(0000) knlGS:000000000852cfc0 ^M^@CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b ^M^@CR2: 00000000f7b65060 CR3: 00000000f7fa2000 CR4: 00000000000006e0 ^M^@Process mysqld (pid: 2795, threadinfo 00000100f7102000, task 00000100f7b687f0) ^M^@Stack: 00000101fff8bea8 00000101ff69a42c 00000001e437f012 000000010000000f ^M^@ 000000ee00ee3030 0000003f00010000 00000100f7b0c000 00000101fff8bf1c ^M^@ 00000101ff69a000 0000000000000202 ^M^@Call Trace:<IRQ> <ffffffff802aabfb>{net_rx_action+129} <ffffffff8013bbe8> {__do_softirq+88} ^M^@ <ffffffff8013bc91>{do_softirq+49} <ffffffff80112fb7>{do_IRQ+328} ^M^@ <ffffffff8011065b>{ret_from_intr+0} <EOI> ^M^@Code: 0f 0b 38 28 09 a0 ff ff ff ff 30 0b 49 8b 4c 24 48 8b 95 98 ^M^@RIP <ffffffffa0087529>{:tg3:tg3_poll+177} RSP <00000101fff8be78> ^M^@ <0>Kernel panic - not syncing: Oops ^M^@rtc: lost some interrupts at 1024Hz. 
----------------------------------------------------------------------
CPU 1: Machine Check Exception: 4 Bank 4: b200000000070f0f
TSC 73f188bc502
CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f
TSC 73f188bc649
Kernel panic - not syncing: Machine check
 NMI Watchdog detected LOCKUP, CPU=1, registers:
CPU 1
Modules linked in: arpt_mangle arptable_filter arp_tables ip_queue md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables dm_mirror dm_mod button battery ac ohci_hcd hw_random e100 mii tg3 ext3 jbd megaraid_mbox megaraid_mm sata_sil libata sd_mod scsi_mod
Pid: 11397, comm: ssh Tainted: G M 2.6.9-22.18.EL.jwltest.89smp
RIP: 0010:[<ffffffff8011bceb>] <ffffffff8011bceb>{__smp_call_function+106}
RSP: 0018:00000100f7fa5cb8 EFLAGS: 00000097
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
RDX: 0000ffff0000ffff RSI: 0000000000000000 RDI: 0000000000000002
RBP: 0000000000000000 R08: 0000000000000008 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8011bd8e
R13: 0000000000000000 R14: 0000073f188bbe25 R15: ffffffff803187f1
FS: 0000002a96708040(0000) GS:ffffffff804d4280(0000) knlGS:00000000080c1080
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000002a9642d8c0 CR3: 00000000f7fa2000 CR4: 00000000000006e0
Process ssh (pid: 11397, threadinfo 00000100f6fb6000, task 00000101f9526030)
Stack: ffffffff8011bd8e 0000000000000000 0000000000000000 0000000000000000
 0000000000000016 0000000000000000 0000000000000900 00000000ffffffff
 ffffffff803cfc20 ffffffff8011bdcb
Call Trace:<ffffffff8011bd8e>{smp_really_stop_cpu+0} <ffffffff8011bdcb>{smp_send_stop+52}
<ffffffff80136e0e>{panic+235} <ffffffff801176b8>{print_mce+136}
<ffffffff80117790>{mce_available+0} <ffffffff80117ae3>{do_machine_check+825}
<ffffffff8011121b>{machine_check+127}
<ffffffffa00891b8>{:tg3:tg3_start_xmit+1707}
 <EOE> <ffffffffa00e8ef4>{:ip_conntrack:__ip_conntrack_confirm+448}
 <ffffffff802b8dd8>{qdisc_restart+254} <ffffffff802c5cf3>{ip_finish_output2+0}
 <ffffffff802aa3c2>{dev_queue_xmit+228} <ffffffff802b3619>{nf_hook_slow+184}
 <ffffffff802c607b>{ip_finish_output+478} <ffffffff802c59b0>{dst_output+0}
 <ffffffff802c59c6>{dst_output+22} <ffffffff802b3619>{nf_hook_slow+184}
 <ffffffff802c6479>{ip_queue_xmit+1011} <ffffffff802c1baf>{__ip_route_output_key+1972}
 <ffffffff802c1baf>{__ip_route_output_key+1972} <ffffffff8018f087>{update_atime+147}
 <ffffffff802d5b11>{tcp_transmit_skb+2037} <ffffffff802d7f3f>{tcp_connect+727}
 <ffffffff802dac68>{tcp_v4_connect+2275} <ffffffff802e947a>{inet_stream_connect+170}
 <ffffffff8017836c>{fget+75} <ffffffff802a29ac>{sys_connect+114}
 <ffffffff80176d52>{fd_install+42} <ffffffff802a144f>{sock_map_fd+59}
 <ffffffff80110092>{system_call+126}
Code: eb f4 85 ed 74 0c 8b 44 24 14 39 d8 74 04 f3 90 eb f4 48 83
Kernel panic - not syncing: nmi watchdog

--------------------------------------------------------------------