LTC Owner is: jstultz.com
LTC Originator is: dbhilley.com

Problem description:
Stress testing GPFS on 2.6.21-31.el5rt elicited a kernel BUG in the tg3 driver on an LS20.

c1f2bc2n13 kernel: kernel BUG at drivers/net/tg3.c:3036!
c1f2bc2n13 kernel: invalid opcode: 0000 [1] PREEMPT SMP

If this is a customer issue, please indicate the impact to the customer:

If this is not an installation problem, Describe any custom patches installed.
GPFS is compiled and loaded as a kernel module.

Provide output from "uname -a", if possible:
Linux c1f2bc2n13 2.6.21-31.el5rt #1 SMP PREEMPT RT Mon Jun 18 16:44:12 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Hardware Environment
Machine type (p650, x235, SF2, etc.): IBM eServer BladeCenter LS20 -[8850Z47]-
Cpu type (Power4, Power5, IA-64, etc.): x86_64
Describe any special hardware you think might be relevant to this problem:

Is this reproducible?
Currently attempting to reproduce. It occurred last night after 10 or so hours of GPFS stress testing.

Did the system produce an OOPS message on the console? If so, copy it here:
c1f2bc2n13 kernel: kernel BUG at drivers/net/tg3.c:3036!
c1f2bc2n13 kernel: invalid opcode: 0000 [1] PREEMPT SMP
c1f2bc2n13 kernel: CPU 3
c1f2bc2n13 kernel: Modules linked in: mmfs mmfslinux tracedev autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport sg shpchp pcspkr amd_rng i2c_amd756 i2c_core k8temp hwmon tg3 rtc_cmos rtc_core rtc_lib serio_raw usb_storage qla2xxx scsi_transport_fc mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
c1f2bc2n13 kernel: Pid: 46, comm: softirq-net-rx/ Not tainted 2.6.21-31.el5rt #1
c1f2bc2n13 kernel: RIP: 0010:[<ffffffff88233e64>] [<ffffffff88233e64>] :tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n13 kernel: RSP: 0018:ffff81011fdcfd50 EFLAGS: 00010202
c1f2bc2n13 kernel: RAX: 00000000000000bf RBX: ffff81007e6dc1e8 RCX: ffff81007dd61000
c1f2bc2n13 kernel: RDX: 00000000000000bf RSI: ffff81011fdcfe24 RDI: ffff81007f6c8900
c1f2bc2n13 kernel: RBP: ffff81011fdcfd60 R08: 0000000000000001 R09: 0000000000000003
c1f2bc2n13 kernel: R10: 0000000000000003 R11: 00000000ffff8101 R12: ffff81007f6c8900
c1f2bc2n13 kernel: R13: 00000000000000bf R14: 0000000000000000 R15: 000000000000ffff
c1f2bc2n13 kernel: FS: 00002adcb3d1af40(0000) GS:ffff81011fd9f6c0(0000) knlGS:00000000f7fb26c0
c1f2bc2n13 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
c1f2bc2n13 kernel: CR2: 00007fff465a7fc4 CR3: 0000000073cc1000 CR4: 00000000000006e0
c1f2bc2n13 kernel: Process softirq-net-rx/ (pid: 46, threadinfo ffff81011fdce000, task ffff810037c77040)
c1f2bc2n13 kernel: Stack: ffff81007e6dc1e8 ffff81007f6c8900 ffff81011fdcfe00 ffffffff8823c462
c1f2bc2n13 kernel: ffff81011fdcfdb0 ffff81011fdcfe24 ffff81007f6c8000 ffff81007dd61000
c1f2bc2n13 kernel: 000021ccdc6a7b5f 00000000000039bd 0000000300000001 ffff81007f6c8a04
c1f2bc2n13 kernel: Call Trace:
c1f2bc2n13 kernel: [<ffffffff8823c462>] :tg3:tg3_poll+0x26a/0xa39
c1f2bc2n13 kernel: [<ffffffff8020c441>] net_rx_action+0xbe/0x1f2
c1f2bc2n13 kernel: [<ffffffff80295ab0>] ksoftirqd+0x16c/0x271
c1f2bc2n13 kernel: [<ffffffff80233d76>] kthread+0xf5/0x128
c1f2bc2n13 kernel: [<ffffffff8025ff68>] child_rip+0xa/0x12
c1f2bc2n13 kernel:
c1f2bc2n13 kernel:
c1f2bc2n13 kernel: Code: 0f 0b eb fe 48 8b b7 a0 00 00 00 49 8d 5c 24 08 48 c7 c7 e9
c1f2bc2n13 kernel: RIP [<ffffffff88233e64>] :tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n13 kernel: RSP <ffff81011fdcfd50>
c1f2bc2n13 kernel: GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
c1f2bc2n13 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=9030363: Reason code 668 Failure Reason Lost membership in cluster ls20.ppd.pok.ibm.com. Unmounting file systems.
c1f2bc2n13 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=9030363:
c1f2bc2n13 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=9030364: Unrecoverable file system operation error. Status code 218. Volume gpfs1
c1f2bc2n13 smartd[3220]: Device: /dev/sdb, Temperature changed -3 Celsius to 33 Celsius since last report
c1f2bc2n13 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n13 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n13 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n13 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
c1f2bc2n13 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n13 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n13 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n13 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
c1f2bc2n13 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n13 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n13 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n13 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
c1f2bc2n13 kernel: NETDEV WATCHDOG: eth1: transmit timed out
...
Can this be triggered without GPFS?
----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-24 21:20 EDT ------- David: Has this issue been reproduced since it was initially seen?
----- Additional Comments From dbhilley.com 2007-07-25 11:16 EDT -------
I had moved on to other test suites since I filed this bug, but I ran the stress tests last night after your query and reproduced it again. I got an odd message about 45 mins before the panic:

Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: The system may be re-ordering memory-mapped I/O cycles to the network device, attempting to recover. Please report the problem to the driver maintainer and include system chipset information.
Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: Link is down.
Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Jul 25 05:15:33 c1f2bc2n14 kernel: stopped custom tracer.
c1f2bc2n14 kernel: ------------[ cut here ]------------
c1f2bc2n14 kernel: kernel BUG at drivers/net/tg3.c:3036!
c1f2bc2n14 kernel: invalid opcode: 0000 [1] PREEMPT SMP
c1f2bc2n14 kernel: CPU 3
c1f2bc2n14 kernel: Modules linked in: mmfs mmfslinux tracedev ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport sg i2c_amd756 pcspkr i2c_core amd_rng shpchp k8temp hwmon tg3 rtc_cmos rtc_core serio_raw rtc_lib usb_storage qla2xxx scsi_transport_fc mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
c1f2bc2n14 kernel: Pid: 46, comm: softirq-net-rx/ Not tainted 2.6.21-31.el5rt #1
c1f2bc2n14 kernel: RIP: 0010:[<ffffffff88233e64>] [<ffffffff88233e64>] :tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n14 kernel: RSP: 0000:ffff81011fdcfd50 EFLAGS: 00010202
c1f2bc2n14 kernel: RAX: 00000000000000b9 RBX: ffff81011e464158 RCX: ffff81007a2da000
c1f2bc2n14 kernel: RDX: 00000000000000b9 RSI: ffff81011fdcfe24 RDI: ffff81007ff30900
c1f2bc2n14 kernel: RBP: ffff81011fdcfd60 R08: 0000000000000000 R09: 0000000000000003
c1f2bc2n14 kernel: R10: 0000000000000003 R11: 00000000ffff8101 R12: ffff81007ff30900
c1f2bc2n14 kernel: R13: 00000000000000b9 R14: 0000000000000000 R15: 0000000000000000
c1f2bc2n14 kernel: FS: 00007fff384ff940(0000) GS:ffff81011fd9f6c0(0000) knlGS:00000000f7f626c0
c1f2bc2n14 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
c1f2bc2n14 kernel: CR2: 00002b8a64044000 CR3: 0000000067446000 CR4: 00000000000006e0
c1f2bc2n14 kernel: Process softirq-net-rx/ (pid: 46, threadinfo ffff81011fdce000, task ffff810037c77040)
c1f2bc2n14 kernel: Stack: ffff81011e464158 ffff81007ff30900 ffff81011fdcfe00 ffffffff8823c462
c1f2bc2n14 kernel: 0000000121c3bf22 ffff81011fdcfe24 ffff81007ff30000 ffff81007a2da000
c1f2bc2n14 kernel: 000202a68ac18787 0000000000008e74 0000000300000001 ffff81007ff30a04
c1f2bc2n14 kernel: Call Trace:
c1f2bc2n14 kernel: [<ffffffff8823c462>] :tg3:tg3_poll+0x26a/0xa39
c1f2bc2n14 kernel: [<ffffffff8020c441>] net_rx_action+0xbe/0x1f2
c1f2bc2n14 kernel: [<ffffffff80295ab0>] ksoftirqd+0x16c/0x271
c1f2bc2n14 kernel: [<ffffffff80233d76>] kthread+0xf5/0x128
c1f2bc2n14 kernel: [<ffffffff8025ff68>] child_rip+0xa/0x12
c1f2bc2n14 kernel:
c1f2bc2n14 kernel:
c1f2bc2n14 kernel: Code: 0f 0b eb fe 48 8b b7 a0 00 00 00 49 8d 5c 24 08 48 c7 c7 e9
c1f2bc2n14 kernel: RIP [<ffffffff88233e64>] :tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n14 kernel: RSP <ffff81011fdcfd50>
c1f2bc2n14 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=10949868: Reason code 668 Failure Reason Lost membership in cluster ls20.ppd.pok.ibm.com. Unmounting file systems.
c1f2bc2n14 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=10949868:
c1f2bc2n14 kernel: GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
c1f2bc2n14 smartd[3453]: Device: /dev/sdb, Temperature changed -3 Celsius to 38 Celsius since last report
c1f2bc2n14 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n14 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n14 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n14 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
c1f2bc2n14 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n14 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n14 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n14 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]

I haven't tried to reproduce without GPFS, but the general workload would involve continuous network traffic in the form of a lot of small packets (not bandwidth intensive) plus heavy local disk I/O in the form of many small disk operations. It is a metadata stress test. I can try running it on an NFS mount and see if I can reproduce it that way.
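For anyone trying to reproduce this without GPFS, here is a rough sketch of a workload with the same general shape as the one described above: many small packets alongside many small file operations. It is not the GPFS metadata stress test. The function names, the 64-byte UDP payload, and the create/stat/rename/unlink mix are all assumptions chosen for illustration.

```python
import os
import socket
import threading


def small_packet_traffic(host, port, count, payload_size=64):
    """Send `count` small UDP datagrams to (host, port); return how many were sent.

    The payload size is an assumed value meaning "small, not bandwidth intensive".
    """
    payload = b"x" * payload_size
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for _ in range(count):
            sock.sendto(payload, (host, port))
            sent += 1
    return sent


def small_file_churn(directory, count):
    """Create, stat, rename, and unlink `count` tiny files; return the operation count.

    This op mix is a hypothetical stand-in for a metadata-heavy workload.
    """
    ops = 0
    for i in range(count):
        path = os.path.join(directory, f"f{i}")
        with open(path, "wb") as fh:
            fh.write(b"m")
        os.stat(path)
        renamed = path + ".r"
        os.rename(path, renamed)
        os.unlink(renamed)
        ops += 4
    return ops


def run_workload(host, port, directory, packets, files):
    """Run the network and file workloads concurrently and return their counts."""
    results = {}

    def net():
        results["packets"] = small_packet_traffic(host, port, packets)

    def disk():
        results["file_ops"] = small_file_churn(directory, files)

    threads = [threading.Thread(target=net), threading.Thread(target=disk)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

To exercise the tg3 NIC at all, `host` would have to be a remote machine reached through the interface under test (eth1 in the logs above), since loopback traffic never reaches the driver. `directory` would be on the local disk or NFS mount being tested.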
----- Additional Comments From nivedita.com (prefers email at niv.com) 2007-07-25 11:19 EDT ------- John, I'll try to repro today with a combination of simple tests (I'm adding network stress tests to the testsuite today). --Nivedita
I have been trying to reproduce this bug without GPFS. So far, I have tried 10h+ with 2.6.21-34.el5rt and 10h+ with -31.el5rt, both x86_64. Not a single oops or complaint in the logs. The point I would like to record is that I have been testing this with my tg3 in 100baseTxFD. I will try using 1000baseTxFD. My NIC: eth0: Tigon3 [partno(BCM95755) rev a002 PHY(5755)] (PCI Express) 10/100/1000Base-T Ethernet 00:16:41:68:2f:96
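When checking logs after a long run, a small scanner for the three message signatures seen in this report can save some manual grepping. This is a minimal sketch. The patterns are taken from the log excerpts in this bug, while the function and dictionary names are made up for illustration.

```python
import re

# Signatures copied from the log excerpts in this report.
SIGNATURES = {
    "tg3_bug": re.compile(r"kernel BUG at drivers/net/tg3\.c:\d+"),
    "mmio_reorder": re.compile(r"re-ordering memory-mapped I/O cycles"),
    "tx_timeout": re.compile(r"NETDEV WATCHDOG: \S+: transmit timed out"),
}


def scan_log(lines):
    """Count how many lines match each signature; `lines` is any iterable of str."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

It can be fed an open log file, for example `scan_log(open(path))`, where `path` is wherever the kernel messages end up on the test machine (this depends on the syslog configuration). Any nonzero count means the run deserves a closer look.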
------- Comment From sripathi.com 2007-11-21 12:35 EDT------- Nivedita/John, has this been recreated by anyone other than David? Is this problem still alive?
Any word on this bug?
------- Comment From nivedita.com 2008-02-15 16:18 EDT------- I'll try network stress tests on Monday on 2.6.24-rt1 to see if I can reproduce this. I'll try and have some update here early next week. However, we (LTC RT) have not seen this. David - have you or your team seen this at all again?
------- Comment From nivedita.com 2008-03-10 16:21 EDT------- So I have not been able to reproduce this without GPFS, using network stress tests alone. I was going to close this until we could reproduce it again, but I just learnt that we might have some GPFS guys testing again. So for the moment, I'm leaving this open for 2 more weeks. I think we'll have a GPFS test on the latest bits within that time. If we don't reproduce it by then, I'll close this bug.
------- Comment From nivedita.com 2008-03-25 19:11 EDT------- Rejecting as IRREPRODUCIBLE for now. Should this reappear on GPFS, we'll reopen.
------- Comment From jon.thomas.com 2008-05-12 12:11 EDT------- moving to closed