Bug 246711
Summary: | kernel BUG at drivers/net/tg3.c:3036 on 2.6.21-31.el5rt | | |
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | IBM Bug Proxy <bugproxy> |
Component: | realtime-kernel | Assignee: | Arnaldo Carvalho de Melo <acme> |
Status: | CLOSED NOTABUG | Last Closed: | 2008-05-02 22:04:09 UTC |
Severity: | medium | Priority: | low |
Version: | 1.0 | CC: | bhu |
Hardware: | x86_64 | OS: | Linux |
Doc Type: | Bug Fix | | |
Description
IBM Bug Proxy
2007-07-04 10:44:59 UTC
Can this be triggered without GPFS?

----- Additional Comments From jstultz.com (prefers email at johnstul.com) 2007-07-24 21:20 EDT -------
David: Has this issue been reproduced since it was initially seen?

----- Additional Comments From dbhilley.com 2007-07-25 11:16 EDT -------
I had moved on to other test suites since I filed this bug, but I ran the stress tests last night after your query and reproduced it again. I got an odd message about 45 mins before the panic:

Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: The system may be re-ordering memory-mapped I/O cycles to the network device, attempting to recover. Please report the problem to the driver maintainer and include system chipset information.
Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: Link is down.
Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Jul 25 04:30:51 c1f2bc2n14 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Jul 25 05:15:33 c1f2bc2n14 kernel: stopped custom tracer.
c1f2bc2n14 kernel: ------------[ cut here ]------------
c1f2bc2n14 kernel: kernel BUG at drivers/net/tg3.c:3036!
c1f2bc2n14 kernel: invalid opcode: 0000 [1] PREEMPT SMP
c1f2bc2n14 kernel: CPU 3
c1f2bc2n14 kernel: Modules linked in: mmfs mmfslinux tracedev ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath dm_mod video sbs i2c_ec dock button battery asus_acpi ac parport_pc lp parport sg i2c_amd756 pcspkr i2c_core amd_rng shpchp k8temp hwmon tg3 rtc_cmos rtc_core serio_raw rtc_lib usb_storage qla2xxx scsi_transport_fc mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
c1f2bc2n14 kernel: Pid: 46, comm: softirq-net-rx/ Not tainted 2.6.21-31.el5rt #1
c1f2bc2n14 kernel: RIP: 0010:[<ffffffff88233e64>] [<ffffffff88233e64>] :tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n14 kernel: RSP: 0000:ffff81011fdcfd50 EFLAGS: 00010202
c1f2bc2n14 kernel: RAX: 00000000000000b9 RBX: ffff81011e464158 RCX: ffff81007a2da000
c1f2bc2n14 kernel: RDX: 00000000000000b9 RSI: ffff81011fdcfe24 RDI: ffff81007ff30900
c1f2bc2n14 kernel: RBP: ffff81011fdcfd60 R08: 0000000000000000 R09: 0000000000000003
c1f2bc2n14 kernel: R10: 0000000000000003 R11: 00000000ffff8101 R12: ffff81007ff30900
c1f2bc2n14 kernel: R13: 00000000000000b9 R14: 0000000000000000 R15: 0000000000000000
c1f2bc2n14 kernel: FS: 00007fff384ff940(0000) GS:ffff81011fd9f6c0(0000) knlGS:00000000f7f626c0
c1f2bc2n14 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
c1f2bc2n14 kernel: CR2: 00002b8a64044000 CR3: 0000000067446000 CR4: 00000000000006e0
c1f2bc2n14 kernel: Process softirq-net-rx/ (pid: 46, threadinfo ffff81011fdce000, task ffff810037c77040)
c1f2bc2n14 kernel: Stack: ffff81011e464158 ffff81007ff30900 ffff81011fdcfe00 ffffffff8823c462
c1f2bc2n14 kernel: 0000000121c3bf22 ffff81011fdcfe24 ffff81007ff30000 ffff81007a2da000
c1f2bc2n14 kernel: 000202a68ac18787 0000000000008e74 0000000300000001 ffff81007ff30a04
c1f2bc2n14 kernel: Call Trace:
c1f2bc2n14 kernel: [<ffffffff8823c462>] :tg3:tg3_poll+0x26a/0xa39
c1f2bc2n14 kernel: [<ffffffff8020c441>] net_rx_action+0xbe/0x1f2
c1f2bc2n14 kernel: [<ffffffff80295ab0>] ksoftirqd+0x16c/0x271
c1f2bc2n14 kernel: [<ffffffff80233d76>] kthread+0xf5/0x128
c1f2bc2n14 kernel: [<ffffffff8025ff68>] child_rip+0xa/0x12
c1f2bc2n14 kernel:
c1f2bc2n14 kernel:
c1f2bc2n14 kernel: Code: 0f 0b eb fe 48 8b b7 a0 00 00 00 49 8d 5c 24 08 48 c7 c7 e9
c1f2bc2n14 kernel: RIP [<ffffffff88233e64>] :tg3:tg3_tx_recover+0x20/0x5f
c1f2bc2n14 kernel: RSP <ffff81011fdcfd50>
c1f2bc2n14 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=10949868: Reason code 668 Failure Reason Lost membership in cluster ls20.ppd.pok.ibm.com. Unmounting file systems.
c1f2bc2n14 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=10949868:
c1f2bc2n14 kernel: GPFS Deadman Switch timer [0] has expired; IOs in progress: 0
c1f2bc2n14 smartd[3453]: Device: /dev/sdb, Temperature changed -3 Celsius to 38 Celsius since last report
c1f2bc2n14 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n14 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n14 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n14 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
c1f2bc2n14 kernel: NETDEV WATCHDOG: eth1: transmit timed out
c1f2bc2n14 kernel: tg3: eth1: transmit timed out, resetting
c1f2bc2n14 kernel: tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
c1f2bc2n14 kernel: tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]

I haven't tried to reproduce without GPFS, but the general workload would involve continuous network traffic in the form of a lot of small packets (not bandwidth intensive) plus heavy local disk I/O in the form of many small disk operations. It is a metadata stress test. I can try running it on an NFS mount and see if I can reproduce it that way.
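
For anyone attempting a reproduction without GPFS, the workload described in the comment above (many small packets plus many small disk operations) could be approximated along the following lines. This is only a rough sketch under stated assumptions, not the test suite that triggered the bug: the function name `small_io_stress`, the 64-byte payload, and the choice of UDP are all hypothetical. The self-test at the bottom sends to loopback so that it runs anywhere, which does not exercise tg3 at all; the `target` address would need to be a remote host reached through the interface under test.

```python
import os
import socket
import tempfile


def small_io_stress(target, workdir, iterations=1000, payload=b"x" * 64):
    """Send many small UDP packets while doing many small file operations.

    Hypothetical stand-in for a metadata stress test: each iteration sends
    one small datagram and performs a create/write/rename/stat/unlink cycle.
    Returns (packets_sent, file_cycles).
    """
    sent = 0
    cycles = 0
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for i in range(iterations):
            # Small, frequent packets: low bandwidth, high packet rate.
            sock.sendto(payload, target)
            sent += 1

            # Metadata-heavy disk activity: tiny files, many namespace ops.
            path = os.path.join(workdir, "f%d.tmp" % i)
            renamed = path + ".renamed"
            with open(path, "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())
            os.rename(path, renamed)
            os.stat(renamed)
            os.unlink(renamed)
            cycles += 1
    finally:
        sock.close()
    return sent, cycles


if __name__ == "__main__":
    # Loopback receiver so the example runs anywhere (assumption: replace
    # this address with a remote host to push traffic through the real NIC).
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    with tempfile.TemporaryDirectory() as tmp:
        print(small_io_stress(receiver.getsockname(), tmp, iterations=100))
    receiver.close()
```

Running several copies in parallel, with `workdir` on a local disk, would come closer to the sustained load described; whether that is enough to hit the BUG in `tg3_tx_recover` is unknown.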
----- Additional Comments From nivedita.com (prefers email at niv.com) 2007-07-25 11:19 EDT -------
John, I'll try to repro today with a combination of simple tests (I'm sticking network stress tests into the testsuite today). --Nivedita

I have been trying to reproduce this bug without GPFS. So far, I have tried 10h+ with 2.6.21-34.el5rt and 10h+ with -31.el5rt, both x86_64. Not a single oops or complaint in the logs. The point I would like to record is that I have been testing this with my tg3 in 100baseTxFD. I will try using 1000baseTxFD. My NIC:
eth0: Tigon3 [partno(BCM95755) rev a002 PHY(5755)] (PCI Express) 10/100/1000Base-T Ethernet 00:16:41:68:2f:96

------- Comment From sripathi.com 2007-11-21 12:35 EDT-------
Nivedita/John, has this been recreated by anyone other than David? Is this problem still alive? Any word on this bug?

------- Comment From nivedita.com 2008-02-15 16:18 EDT-------
I'll try network stress tests on Monday on 2.6.24-rt1 to see if I can reproduce this. I'll try to have an update here early next week. However, we (LTC RT) have not seen this. David, have you or your team seen this again at all?

------- Comment From nivedita.com 2008-03-10 16:21 EDT-------
So I have not been able to reproduce this without GPFS, doing just network stress tests. I was going to close this until we could reproduce it again, but I just learnt that we might have some GPFS guys testing again. So for the moment, I'm leaving this open for 2 more weeks. I think we'll have a GPFS test on the latest bits within that time. If we don't reproduce it by then, I'll close this bug.

------- Comment From nivedita.com 2008-03-25 19:11 EDT-------
Rejecting as IRREPRODUCIBLE for now. Should this reappear on GPFS, we'll reopen.

------- Comment From jon.thomas.com 2008-05-12 12:11 EDT-------
moving to closed