Bug 246711

Summary:	kernel BUG at drivers/net/tg3.c:3036 on 2.6.21-31.el5rt
Product:	Red Hat Enterprise MRG
Component:	realtime-kernel
Version:	1.0
Hardware:	x86_64
OS:	Linux
Status:	CLOSED NOTABUG
Severity:	medium
Priority:	low
Reporter:	IBM Bug Proxy <bugproxy>
Assignee:	Arnaldo Carvalho de Melo <acme>
CC:	bhu
Doc Type:	Bug Fix
Last Closed:	2008-05-02 22:04:09 UTC

Description IBM Bug Proxy 2007-07-04 10:44:59 UTC

LTC Owner is: jstultz.com
LTC Originator is: dbhilley.com


Problem description:
Stress testing GPFS on 2.6.21-31.el5rt elicited a kernel BUG in the tg3 driver
on an LS20.

c1f2bc2n13 kernel: kernel BUG at drivers/net/tg3.c:3036!
c1f2bc2n13 kernel: invalid opcode: 0000 [1] PREEMPT SMP 

If this is a customer issue, please indicate the impact to the customer:


If this is not an installation problem,
       Describe any custom patches installed.
GPFS is compiled and loaded as a kernel module.

       Provide output from "uname -a", if possible:
Linux c1f2bc2n13 2.6.21-31.el5rt #1 SMP PREEMPT RT Mon Jun 18 16:44:12 EDT 2007
x86_64 x86_64 x86_64 GNU/Linux


Hardware Environment
    Machine type (p650, x235, SF2, etc.): IBM eServer BladeCenter LS20 -[8850Z47]-
    Cpu type (Power4, Power5, IA-64, etc.): x86_64
    Describe any special hardware you think might be relevant to this problem:


Is this reproducible?
    Currently attempting to reproduce.   It occurred last night after 10 or so
hours of GPFS stress testing.  

Did the system produce an OOPS message on the console?
    If so, copy it here:

c1f2bc2n13 kernel: kernel BUG at drivers/net/tg3.c:3036!
c1f2bc2n13 kernel: invalid opcode: 0000 [1] PREEMPT SMP 
c1f2bc2n13 kernel: CPU 3 
c1f2bc2n13 kernel: Modules linked in: mmfs mmfslinux tracedev autofs4 hidp
rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad
ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath
dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport
sg shpchp pcspkr amd_rng i2c_amd756 i2c_core k8temp hwmon tg3 rtc_cmos rtc_core
rtc_lib serio_raw usb_storage qla2xxx scsi_transport_fc mptspi mptscsih mptbase
scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
c1f2bc2n13 kernel: Pid: 46, comm: softirq-net-rx/ Not tainted 2.6.21-31.el5rt #1
c1f2bc2n13 kernel: RIP: 0010:[<ffffffff88233e64>]  [<ffffffff88233e64>]
:tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n13 kernel: RSP: 0018:ffff81011fdcfd50  EFLAGS: 00010202
c1f2bc2n13 kernel: RAX: 00000000000000bf RBX: ffff81007e6dc1e8 RCX: ffff81007dd61000
c1f2bc2n13 kernel: RDX: 00000000000000bf RSI: ffff81011fdcfe24 RDI: ffff81007f6c8900
c1f2bc2n13 kernel: RBP: ffff81011fdcfd60 R08: 0000000000000001 R09: 0000000000000003
c1f2bc2n13 kernel: R10: 0000000000000003 R11: 00000000ffff8101 R12: ffff81007f6c8900
c1f2bc2n13 kernel: R13: 00000000000000bf R14: 0000000000000000 R15: 000000000000ffff
c1f2bc2n13 kernel: FS:  00002adcb3d1af40(0000) GS:ffff81011fd9f6c0(0000)
knlGS:00000000f7fb26c0
c1f2bc2n13 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
c1f2bc2n13 kernel: CR2: 00007fff465a7fc4 CR3: 0000000073cc1000 CR4: 00000000000006e0
c1f2bc2n13 kernel: Process softirq-net-rx/ (pid: 46, threadinfo
ffff81011fdce000, task ffff810037c77040)
c1f2bc2n13 kernel: Stack:  ffff81007e6dc1e8 ffff81007f6c8900 ffff81011fdcfe00
ffffffff8823c462
c1f2bc2n13 kernel:  ffff81011fdcfdb0 ffff81011fdcfe24 ffff81007f6c8000
ffff81007dd61000
c1f2bc2n13 kernel:  000021ccdc6a7b5f 00000000000039bd 0000000300000001
ffff81007f6c8a04
c1f2bc2n13 kernel: Call Trace:
c1f2bc2n13 kernel:  [<ffffffff8823c462>] :tg3:tg3_poll+0x26a/0xa39
c1f2bc2n13 kernel:  [<ffffffff8020c441>] net_rx_action+0xbe/0x1f2
c1f2bc2n13 kernel:  [<ffffffff80295ab0>] ksoftirqd+0x16c/0x271
c1f2bc2n13 kernel:  [<ffffffff80233d76>] kthread+0xf5/0x128
c1f2bc2n13 kernel:  [<ffffffff8025ff68>] child_rip+0xa/0x12
c1f2bc2n13 kernel: 
c1f2bc2n13 kernel: 
c1f2bc2n13 kernel: Code: 0f 0b eb fe 48 8b b7 a0 00 00 00 49 8d 5c 24 08 48 c7
c7 e9 
c1f2bc2n13 kernel: RIP  [<ffffffff88233e64>] :tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n13 kernel:  RSP <ffff81011fdcfd50>
c1f2bc2n13 kernel: GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
c1f2bc2n13 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=9030363:   Reason code
668 Failure Reason Lost membership in cluster ls20.ppd.pok.ibm.com. Unmounting
file systems.
c1f2bc2n13 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=9030363:   
c1f2bc2n13 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=9030364:  
Unrecoverable file system operation error.  Status code 218.   Volume gpfs1    
                                                      
c1f2bc2n13 smartd[3220]: Device: /dev/sdb, Temperature changed -3 Celsius to 33
Celsius since last report 
c1f2bc2n13 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n13 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n13 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n13 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
c1f2bc2n13 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n13 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n13 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n13 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
c1f2bc2n13 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n13 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n13 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n13 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
c1f2bc2n13 kernel: NETDEV WATCHDOG: eth1: transmit timed out
...

Comment 1 Tim Burke 2007-07-19 19:06:36 UTC

Can this be triggered without GPFS?

Comment 2 IBM Bug Proxy 2007-07-25 01:25:35 UTC

----- Additional Comments From jstultz.com (prefers email at johnstul.com)  2007-07-24 21:20 EDT -------
David: Has this issue been reproduced since it was initially seen?

Comment 3 IBM Bug Proxy 2007-07-25 15:20:15 UTC

----- Additional Comments From dbhilley.com  2007-07-25 11:16 EDT -------
I had moved on to other test suites since I filed this bug, but I ran the stress
tests last night after your query and reproduced it again.  I got an odd message
about 45 mins before the panic:

Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: The system may be re-ordering
memory-mapped I/O cycles to the network device, attempting to recover. Please
report the problem to the driver maintainer and include system chipset information.
Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: Link is down.
Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: Flow control is off for TX and off
for RX.
Jul 25 05:15:33 c1f2bc2n14 kernel: stopped custom tracer.
c1f2bc2n14 kernel: ------------[ cut here ]------------
c1f2bc2n14 kernel: kernel BUG at drivers/net/tg3.c:3036!
c1f2bc2n14 kernel: invalid opcode: 0000 [1] PREEMPT SMP 
c1f2bc2n14 kernel: CPU 3 
c1f2bc2n14 kernel: Modules linked in: mmfs mmfslinux tracedev ipv6 autofs4 hidp
rfcomm l2cap bluetooth sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core
ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath dm_mod
video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport sg
i2c_amd756 pcspkr i2c_core amd_rng shpchp k8temp hwmon tg3 rtc_cmos rtc_core
serio_raw rtc_lib usb_storage qla2xxx scsi_transport_fc mptspi mptscsih mptbase
scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
c1f2bc2n14 kernel: Pid: 46, comm: softirq-net-rx/ Not tainted 2.6.21-31.el5rt #1
c1f2bc2n14 kernel: RIP: 0010:[<ffffffff88233e64>]  [<ffffffff88233e64>]
:tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n14 kernel: RSP: 0000:ffff81011fdcfd50  EFLAGS: 00010202
c1f2bc2n14 kernel: RAX: 00000000000000b9 RBX: ffff81011e464158 RCX: ffff81007a2da000
c1f2bc2n14 kernel: RDX: 00000000000000b9 RSI: ffff81011fdcfe24 RDI: ffff81007ff30900
c1f2bc2n14 kernel: RBP: ffff81011fdcfd60 R08: 0000000000000000 R09: 0000000000000003
c1f2bc2n14 kernel: R10: 0000000000000003 R11: 00000000ffff8101 R12: ffff81007ff30900
c1f2bc2n14 kernel: R13: 00000000000000b9 R14: 0000000000000000 R15: 0000000000000000
c1f2bc2n14 kernel: FS:  00007fff384ff940(0000) GS:ffff81011fd9f6c0(0000)
knlGS:00000000f7f626c0
c1f2bc2n14 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
c1f2bc2n14 kernel: CR2: 00002b8a64044000 CR3: 0000000067446000 CR4: 00000000000006e0
c1f2bc2n14 kernel: Process softirq-net-rx/ (pid: 46, threadinfo
ffff81011fdce000, task ffff810037c77040)
c1f2bc2n14 kernel: Stack:  ffff81011e464158 ffff81007ff30900 ffff81011fdcfe00
ffffffff8823c462
c1f2bc2n14 kernel:  0000000121c3bf22 ffff81011fdcfe24 ffff81007ff30000
ffff81007a2da000
c1f2bc2n14 kernel:  000202a68ac18787 0000000000008e74 0000000300000001
ffff81007ff30a04
c1f2bc2n14 kernel: Call Trace:
c1f2bc2n14 kernel:  [<ffffffff8823c462>] :tg3:tg3_poll+0x26a/0xa39
c1f2bc2n14 kernel:  [<ffffffff8020c441>] net_rx_action+0xbe/0x1f2
c1f2bc2n14 kernel:  [<ffffffff80295ab0>] ksoftirqd+0x16c/0x271
c1f2bc2n14 kernel:  [<ffffffff80233d76>] kthread+0xf5/0x128
c1f2bc2n14 kernel:  [<ffffffff8025ff68>] child_rip+0xa/0x12
c1f2bc2n14 kernel: 
c1f2bc2n14 kernel: 
c1f2bc2n14 kernel: Code: 0f 0b eb fe 48 8b b7 a0 00 00 00 49 8d 5c 24 08 48 c7
c7 e9 
c1f2bc2n14 kernel: RIP  [<ffffffff88233e64>] :tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n14 kernel:  RSP <ffff81011fdcfd50>
c1f2bc2n14 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=10949868:   Reason code
668 Failure Reason Lost membership in cluster ls20.ppd.pok.ibm.com. Unmounting
file systems.
c1f2bc2n14 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=10949868:   
c1f2bc2n14 kernel: GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
c1f2bc2n14 smartd[3453]: Device: /dev/sdb, Temperature changed -3 Celsius to 38
Celsius since last report 
c1f2bc2n14 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n14 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n14 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n14 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
c1f2bc2n14 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n14 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n14 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n14 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]

I haven't tried to reproduce without GPFS, but the general workload would
involve continuous network traffic in the form of a lot of small packets (not
bandwidth intensive) plus heavy local disk I/O in the form of many small disk
operations.  It is a metadata stress test.  I can try running it on an NFS mount
and see if I can reproduce it that way.
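
For anyone trying this without GPFS, the workload shape described above (many small packets plus many small file operations) can be approximated with a short script. This is only a rough sketch under assumptions, not the GPFS metadata stress test itself: the function name run_stress, the 64-byte UDP payload, the 512-byte file size and the create/stat/rename/unlink mix are all invented for illustration. To exercise the NIC at all, the target host must be a remote machine reached through the tg3 interface, not loopback.

```python
import os
import socket
import threading
import time

# Hypothetical sizes: the report only says "small packets" and "small disk operations".
PACKET = b"x" * 64
FILE_DATA = b"y" * 512


def run_stress(directory, host, port, seconds):
    """Run a small-packet sender and a small-file churner side by side.

    Returns a dict with the number of datagrams sent and file operations done.
    """
    stop = threading.Event()
    counts = {"packets": 0, "file_ops": 0}

    def network_load():
        # Many tiny UDP datagrams: high packet rate, low bandwidth.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            while not stop.is_set():
                try:
                    sock.sendto(PACKET, (host, port))
                    counts["packets"] += 1
                except OSError:
                    # Transient send errors (e.g. full buffers) are ignored.
                    continue

    def disk_load():
        # Metadata-heavy churn: create, stat, rename, unlink small files.
        i = 0
        while not stop.is_set():
            path = os.path.join(directory, f"f{i % 1000}")
            with open(path, "wb") as handle:
                handle.write(FILE_DATA)
            os.stat(path)
            os.rename(path, path + ".r")
            os.unlink(path + ".r")
            counts["file_ops"] += 4
            i += 1

    workers = [threading.Thread(target=network_load),
               threading.Thread(target=disk_load)]
    for worker in workers:
        worker.start()
    time.sleep(seconds)
    stop.set()
    for worker in workers:
        worker.join()
    return counts
```

A run comparable to the failing ones would be something like run_stress("/mnt/test", "<peer host>", 9000, 10 * 3600), where the directory, peer and port are placeholders. Whether this load is enough to hit the BUG is unknown; both failures so far took many hours under GPFS.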

Comment 4 IBM Bug Proxy 2007-07-25 15:25:35 UTC

----- Additional Comments From nivedita.com (prefers email at niv.com)  2007-07-25 11:19 EDT -------
John, I'll try to repro today with a combination of simple tests (I'm
adding network stress tests to the testsuite). --Nivedita

Comment 5 Luis Claudio R. Goncalves 2007-08-02 21:43:08 UTC

I have been trying to reproduce this bug without GPFS. So far, I have tried 10h+
with 2.6.21-34.el5rt and 10h+ with -31.el5rt, both x86_64. Not a single oops or
complaint in the logs.
One point worth recording: I have been testing this with my tg3 at
100baseTxFD. I will try 1000baseTxFD next.

My NIC:
eth0: Tigon3 [partno(BCM95755) rev a002 PHY(5755)] (PCI Express)
10/100/1000Base-T Ethernet 00:16:41:68:2f:96
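
For long unattended runs, checking the logs can be scripted. The sketch below scans syslog lines for the three messages that appear in the reports above: the tg3 re-ordering warning, the kernel BUG line, and the NETDEV WATCHDOG transmit timeout. The function name scan_log and the event labels are made up for illustration; the patterns are taken only from the log text quoted in this bug and may not match other kernel versions.

```python
import re

# Patterns copied from the messages quoted in this bug report.
REORDER_WARNING = re.compile(r"tg3: (\S+): The system may be re-ordering")
KERNEL_BUG = re.compile(r"kernel BUG at (\S+?):(\d+)!")
TX_TIMEOUT = re.compile(r"NETDEV WATCHDOG: (\S+): transmit timed out")


def scan_log(lines):
    """Return (line_number, event, detail) tuples for matching log lines."""
    events = []
    for number, line in enumerate(lines, start=1):
        match = REORDER_WARNING.search(line)
        if match:
            events.append((number, "reorder-warning", match.group(1)))
            continue
        match = KERNEL_BUG.search(line)
        if match:
            location = f"{match.group(1)}:{match.group(2)}"
            events.append((number, "kernel-bug", location))
            continue
        match = TX_TIMEOUT.search(line)
        if match:
            events.append((number, "tx-timeout", match.group(1)))
    return events
```

Typical use would be to open the system log (for example /var/log/messages, if that is where kernel messages go on the test box) and pass the file object to scan_log. An empty result after a run means none of these messages were logged.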

Comment 6 IBM Bug Proxy 2007-11-21 17:40:33 UTC

------- Comment From sripathi.com 2007-11-21 12:35 EDT-------
Nivedita/John, has this been recreated by anyone other than David? Is this
problem still alive?

Comment 7 Clark Williams 2008-02-04 22:24:35 UTC

Any word on this bug?

Comment 8 IBM Bug Proxy 2008-02-15 21:28:46 UTC

------- Comment From nivedita.com 2008-02-15 16:18 EDT-------
I'll try network stress tests on Monday on 2.6.24-rt1 to see if I can
reproduce this.  I'll try and have some update here early next week.

However, we (LTC RT) have not seen this.

David - have you or your team seen this at all again?

Comment 9 IBM Bug Proxy 2008-03-10 20:24:36 UTC

------- Comment From nivedita.com 2008-03-10 16:21 EDT-------
So I have not been able to reproduce this without GPFS, using
network stress tests alone.

I was going to close this until we could reproduce again, but
I just learnt that we might have some GPFS guys testing again.

So for the moment, I'm leaving this open for 2 weeks more. I
think we'll have a GPFS test on latest bits within that time.
If we don't reproduce then, I'll close this bug by that time.

Comment 10 IBM Bug Proxy 2008-03-25 23:16:33 UTC

------- Comment From nivedita.com 2008-03-25 19:11 EDT-------
Rejecting as IRREPRODUCIBLE for now.  Should this reappear on GPFS,
we'll reopen.

Comment 11 IBM Bug Proxy 2008-05-12 16:16:27 UTC

------- Comment From jon.thomas.com 2008-05-12 12:11 EDT-------
moving to closed