| Summary: | bnx2: NETDEV WATCHDOG: em1 (bnx2): transmit queue 4 timed out | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Robert Peterson <rpeterso> |
| Component: | kernel | Assignee: | Neil Horman <nhorman> |
| Status: | CLOSED WORKSFORME | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.2 | CC: | chorn, jburke, jeder, jwest |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-09-30 20:29:17 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |

Description
Robert Peterson
2011-07-12 17:47:13 UTC
Created attachment 512502 [details]
patch to swap endianness if the device soft resets to opposite endianness
bob, can you try this patch out please? I've returned your system to you. Thanks!
Good news and bad news: The good news is that I was able to recreate the problem with only three of the four nodes, and without GFS2.

To recreate the problem, I used a DLM test tool given to me by Dave Teigland called dlm_load. The original source is in:
http://fedorapeople.org/gitweb?p=teigland/public_git/dct-stuff.git;a=tree;f=dlm;hb=HEAD

I modified the tool so I could control the sleep parameters to push the ethernet driver harder. The new "-S" parameter makes it sleep every iteration rather than every ten iterations. The new "-s" parameter specifies the usleep time (in microseconds). The default is 200000 (1/5 of a second). I tightened it to 2.

So to recreate the problem, I'm running this command on three nodes simultaneously, with cluster-ssh (cssh):
/home/bob/dlm_load -i 1000000 -q -S -s 2

I had to run this command several times before the failure recreated for me. The failure has always occurred on gfs-a16c-04, so I'm still not ruling out hardware problems.

The bad news is that Neil's patch from comment #2 did not seem to help at all. I still got:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: PowerEdge R815
NETDEV WATCHDOG: em1 (bnx2): transmit queue 1 timed out
Modules linked in: dlm configfs autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log power_meter dcdbas microcode serio_raw ]
Pid: 0, comm: swapper Not tainted 2.6.32-167.el6.bz676626.x86_64 #1
Call Trace:
<IRQ> [<ffffffff81067657>] ? warn_slowpath_common+0x87/0xc0
[<ffffffff81067746>] ? warn_slowpath_fmt+0x46/0x50
[<ffffffff8143cc4d>] ? dev_watchdog+0x26d/0x280
[<ffffffff81079825>] ? internal_add_timer+0xb5/0x110
[<ffffffff8143c9e0>] ? dev_watchdog+0x0/0x280
[<ffffffff8107a457>] ? run_timer_softirq+0x197/0x340
[<ffffffff8109e550>] ? tick_sched_timer+0x0/0xc0
[<ffffffff8102a2fd>] ? lapic_next_event+0x1d/0x30
[<ffffffff8106fc41>] ? __do_softirq+0xc1/0x1d0
[<ffffffff81093200>] ? hrtimer_interrupt+0x140/0x250
[<ffffffff8100c20c>] ? call_softirq+0x1c/0x30
[<ffffffff8100de45>] ? do_softirq+0x65/0xa0
[<ffffffff8106fa25>] ? irq_exit+0x85/0x90
[<ffffffff814e5f80>] ? smp_apic_timer_interrupt+0x70/0x9b
[<ffffffff8100bbd3>] ? apic_timer_interrupt+0x13/0x20
<EOI> [<ffffffff8103656b>] ? native_safe_halt+0xb/0x10
[<ffffffff8101433d>] ? default_idle+0x4d/0xb0
[<ffffffff81009de6>] ? cpu_idle+0xb6/0x110
[<ffffffff814c59fa>] ? rest_init+0x7a/0x80
[<ffffffff81b9df48>] ? start_kernel+0x41d/0x429
[<ffffffff81b9d33a>] ? x86_64_start_reservations+0x125/0x129
[<ffffffff81b9d443>] ? x86_64_start_kernel+0x105/0x114
---[ end trace f821242cfe8f7341 ]---
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct Endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct Endian mode
(Repeated).

Note that "Endian" in the above message is capitalized. I changed that to ensure the patched version was, in fact, running rather than the stock version. It was.

Created attachment 512720 [details]
The modified dlm_load I used to recreate the problem
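
For readers without the attachment, the sleep throttling described above (the "-S" and "-s" options) can be sketched roughly as follows. This is only an illustration in Python, not the actual dlm_load code: the function names are hypothetical, and the assumption that the stock tool pauses after every tenth iteration (iteration count modulo 10) is a guess at how "every ten iterations" is implemented.

```python
import time

# Default pause from the description above: 200000 microseconds (1/5 of a second).
DEFAULT_USLEEP_US = 200_000


def should_sleep(iteration, every_iteration):
    """Return True if the loop should pause after this 1-based iteration.

    Assumption: without -S the tool pauses after every tenth iteration;
    with -S it pauses after every iteration.
    """
    return every_iteration or iteration % 10 == 0


def count_sleeps(iterations, every_iteration):
    """Count how many pauses a run of the given length would take."""
    return sum(
        1 for i in range(1, iterations + 1) if should_sleep(i, every_iteration)
    )


def run_load(iterations, every_iteration=False, usleep_us=DEFAULT_USLEEP_US,
             work=lambda i: None, sleep=time.sleep):
    """Run a throttled loop; `work` stands in for one lock operation (hypothetical).

    Returns the number of pauses taken.
    """
    sleeps = 0
    for i in range(1, iterations + 1):
        work(i)
        if should_sleep(i, every_iteration):
            sleep(usleep_us / 1_000_000)  # microseconds to seconds
            sleeps += 1
    return sleeps
```

Under these assumptions, "-S -s 2" means the loop pauses after every iteration but only for 2 microseconds, so the total time spent sleeping drops sharply compared with the default and the network driver sees a much denser stream of traffic.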
Bob, could you please try something for me? When this problem occurs, can you rmmod the bnx2 module and modprobe it again? I'd like to see if the device can actually be recovered at all without a cold boot. If it can, then I did something wrong with my patch. If it can't, then I'm starting to think this might be a hardware issue that we need to involve Broadcom with.

I'll try to get this going, but for some reason the serial ports on my consoles are all showing good output, but taking bad input. I've been trying to backtrack it and have eliminated the easy things (such as other processes having ttyS1 open). If I can't log in to the serial console, I can't rmmod or insmod the bnx2 driver because my only way in would be the console. I'll keep you posted.

Probably. I've seen that happen before, although I don't ever recall what the solution is. Please let me know when the system is back in working order.

Robert, any luck getting the console on this system fixed?

Have you tried 197? I ran the 197 kernel all weekend for bz 734815 with heavy netperf traffic and encountered no errors at all. /me is starting to wonder if perhaps a BIOS update is needed on some of the systems in question.

I tried to recreate the problem with a newer kernel but ran into other non-related issues. I'll try it again when I get a few spare cycles.

ok, copy that.

With the latest RHEL6 kernel bits, I ran the test successfully a bunch of times in a row without failure, so I'm going to close this bug record. I'm not sure which microcode patch fixed the problem, but it seems fixed. Sorry for the noise and sorry it took so long to resolve. Incidentally, I did my testing on the 2.6.32-203.el6.x86_64 kernel.