Bug 720769

Summary: bnx2: NETDEV WATCHDOG: em1 (bnx2): transmit queue 4 timed out
Product: Red Hat Enterprise Linux 6 Reporter: Robert Peterson <rpeterso>
Component: kernel Assignee: Neil Horman <nhorman>
Status: CLOSED WORKSFORME QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 6.2 CC: chorn, jburke, jeder, jwest
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-09-30 20:29:17 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
patch to swap endianness if the device soft resets to opposite endianess none
The modified dlm_load I used to recreate the problem none

Description Robert Peterson 2011-07-12 17:47:13 UTC
Description of problem:
I was running smb_torture with ctdb on a four-node RHEL6 cluster.
After running it successfully 7 or 8 times, one of the nodes
failed with the errors shown in "Actual results".  I was able to
recreate the problem a second time.

Version-Release number of selected component (if applicable):
2.6.32-167.el6.bz676626.x86_64

How reproducible:
Unknown, but it recreated twice for me already, so I think it's
likely to recreate again.

Steps to Reproduce:
1. On all nodes, start the cluster:
service cman start
service clvmd start

2. From one node, create the two gfs2 file systems:
mkfs.gfs2 -j4 -p lock_dlm -t intec_cluster:ctdb /dev/intec/ctdb -O && mkfs.gfs2 -j4 -p lock_dlm -t intec_cluster:gfs2 /dev/intec/intec1 -O

3. From all nodes, mount the GFS2 file systems:
mount -t gfs2 /dev/intec/ctdb /mnt/ctdb/ && mount -t gfs2 /dev/intec/intec1 /mnt/gfs2a/

4. From all nodes, start ctdb:
service ctdb start
  
5. From the driver system, gfs-i24c-01, start smb_torture:
cd /home/bob/samba/bin/default/source4/torture
./smbtorture //localhost/data  -U testmonkey%password bench.nbench --unclist=/root/unclist.txt --num-progs=32 -t600

6. Repeat step 5 until the failure occurs.
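
The repeat-until-failure loop in steps 5 and 6 could be automated along these lines. This is only a sketch and not something used in the original test: the helper names `kernel_log` and `repeat_until_watchdog` are hypothetical, and it assumes that the failure is recognizable by the "NETDEV WATCHDOG" text appearing in the output of dmesg.

```python
import subprocess

# Text that appears in the kernel log when the transmit queue times out.
WATCHDOG_MARKER = "NETDEV WATCHDOG"


def kernel_log():
    """Return the kernel ring buffer as text (requires permission to run dmesg)."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout


def repeat_until_watchdog(command, max_runs, run=subprocess.call, read_log=kernel_log):
    """Run `command` up to `max_runs` times.

    Returns the 1-based number of the run after which the watchdog message
    was first seen, or None if it never appeared. A stale message left in
    the log from an earlier failure would be reported after the first run.
    """
    for attempt in range(1, max_runs + 1):
        run(command)
        if WATCHDOG_MARKER in read_log():
            return attempt
    return None


# Example call, run from the torture directory named in step 5:
# repeat_until_watchdog(
#     ["./smbtorture", "//localhost/data", "-U", "testmonkey%password",
#      "bench.nbench", "--unclist=/root/unclist.txt", "--num-progs=32", "-t600"],
#     max_runs=20)
```

The `run` and `read_log` parameters exist only so the loop can be exercised without a cluster; by default they call the real command and the real dmesg.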

Actual results:
It may run successfully up to ten times in a row, but then:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: PowerEdge R815
NETDEV WATCHDOG: em1 (bnx2): transmit queue 4 timed out
Modules linked in: gfs2(U) iptable_filter ip_tables dlm configfs autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log power_]
Pid: 0, comm: swapper Not tainted 2.6.32-167.el6.bz676626.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff81067657>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81067746>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff8143cc4d>] ? dev_watchdog+0x26d/0x280
 [<ffffffff8107ac54>] ? mod_timer+0x144/0x220
 [<ffffffff8143c9e0>] ? dev_watchdog+0x0/0x280
 [<ffffffff8107a457>] ? run_timer_softirq+0x197/0x340
 [<ffffffff8109e550>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8102a2fd>] ? lapic_next_event+0x1d/0x30
 [<ffffffff8106fc41>] ? __do_softirq+0xc1/0x1d0
 [<ffffffff81093200>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c20c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de45>] ? do_softirq+0x65/0xa0
 [<ffffffff8106fa25>] ? irq_exit+0x85/0x90
 [<ffffffff814e5f80>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bbd3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8103656b>] ? native_safe_halt+0xb/0x10
 [<ffffffff8101433d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009de6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814c59fa>] ? rest_init+0x7a/0x80
 [<ffffffff81b9df48>] ? start_kernel+0x41d/0x429
 [<ffffffff81b9d33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81b9d443>] ? x86_64_start_kernel+0x105/0x114
---[ end trace fbbba1dfce3a091d ]---
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct endian mode
etc.
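
As a triage aid (not part of the original report): in the dump above, the first three register values look plausible while every later one reads ffffffff. An all-ones value is commonly what a read returns when a device is not responding, so it can be useful to list such registers mechanically. The sketch below assumes the log format shown above; the function name `all_ones_registers` is hypothetical.

```python
import re

# Matches "NAME[xxxxxxxx]" pairs with an 8-digit hex value, e.g. "PBA[ffffffff]".
# Short values such as "intr_sem[0]" are deliberately not matched.
REG_RE = re.compile(r"([A-Za-z_0-9]+)\[([0-9a-fA-F]{8})\]")


def all_ones_registers(log_lines):
    """Return the sorted names of registers dumped as ffffffff."""
    names = set()
    for line in log_lines:
        if "DEBUG:" not in line:
            continue
        # Only inspect the text after "DEBUG:" so the PCI address is ignored.
        payload = line.split("DEBUG:", 1)[1]
        for name, value in REG_RE.findall(payload):
            if value.lower() == "ffffffff":
                names.add(name)
    return sorted(names)


sample = [
    "bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]",
    "bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]",
    "bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]",
    "bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]",
    "bnx2: Chip not in correct endian mode",
]
print(all_ones_registers(sample))
```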

Expected results:
All tests should be successful.

Additional info:
This test was performed in order to test the patch for bug #676626
so you need to have that GFS2 patch in order to run it, otherwise
GFS2 may have serious issues (see that bug for details).

We might be able to recreate it by flooding the card with
other traffic or other dlm traffic.

Comment 2 Neil Horman 2011-07-12 18:53:13 UTC
Created attachment 512502 [details]
patch to swap endianness if the device soft resets to opposite endianess

Bob, can you try this patch out, please? I've returned your system to you.  Thanks!

Comment 3 Robert Peterson 2011-07-13 18:35:28 UTC
Good news and bad news:

The good news is that I was able to recreate the problem with
only three of the four nodes, and without GFS2.  To recreate
the problem, I used a DLM test tool given to me by Dave Teigland
called dlm_load.  The original source is in:

http://fedorapeople.org/gitweb?p=teigland/public_git/dct-stuff.git;a=tree;f=dlm;hb=HEAD

I modified the tool so I could control the sleep parameters to
push the ethernet driver harder.  The new "-S" parameter makes
it sleep every iteration rather than every ten iterations.
The new "-s" parameter specifies the usleep time (in microseconds).
The default is 200000 (1/5 of a second).  I tightened it to 2.
So to recreate the problem, I'm running this command on three
nodes simultaneously, with cluster-ssh (cssh):

/home/bob/dlm_load -i 1000000 -q -S -s 2
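
To give a feel for how much the two new options tighten the pacing, here is a rough sketch of the nominal sleep time over a run. It is not the dlm_load source: the function name `total_sleep_us` is hypothetical, it assumes exactly one sleep per interval, and real usleep() calls take longer than their nominal value.

```python
def total_sleep_us(iterations, sleep_every_iteration=False, usleep_us=200000):
    """Nominal microseconds spent sleeping over `iterations` iterations.

    sleep_every_iteration models the new -S option (sleep every iteration
    instead of every ten); usleep_us models the new -s option.
    """
    interval = 1 if sleep_every_iteration else 10
    return (iterations // interval) * usleep_us


# Same iteration count as the command above (-i 1000000).
default_us = total_sleep_us(1000000)                      # every 10th iteration, 200000 us
tight_us = total_sleep_us(1000000, True, usleep_us=2)     # -S -s 2
print(default_us / 1e6, "s vs", tight_us / 1e6, "s")
```

Under these assumptions the default pacing would sleep for about 20000 seconds in total over a million iterations, against about 2 seconds with -S -s 2, so the modified run generates its traffic far more densely.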

I had to run this command several times before the failure
recreated for me.  The failure has always occurred on
gfs-a16c-04, so I'm still not ruling out hardware problems.

The bad news is that Neil's patch from comment #2 did not
seem to help at all.  I still got:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: PowerEdge R815
NETDEV WATCHDOG: em1 (bnx2): transmit queue 1 timed out
Modules linked in: dlm configfs autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log power_meter dcdbas microcode serio_raw ]
Pid: 0, comm: swapper Not tainted 2.6.32-167.el6.bz676626.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff81067657>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81067746>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff8143cc4d>] ? dev_watchdog+0x26d/0x280
 [<ffffffff81079825>] ? internal_add_timer+0xb5/0x110
 [<ffffffff8143c9e0>] ? dev_watchdog+0x0/0x280
 [<ffffffff8107a457>] ? run_timer_softirq+0x197/0x340
 [<ffffffff8109e550>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8102a2fd>] ? lapic_next_event+0x1d/0x30
 [<ffffffff8106fc41>] ? __do_softirq+0xc1/0x1d0
 [<ffffffff81093200>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c20c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de45>] ? do_softirq+0x65/0xa0
 [<ffffffff8106fa25>] ? irq_exit+0x85/0x90
 [<ffffffff814e5f80>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bbd3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8103656b>] ? native_safe_halt+0xb/0x10
 [<ffffffff8101433d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009de6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814c59fa>] ? rest_init+0x7a/0x80
 [<ffffffff81b9df48>] ? start_kernel+0x41d/0x429
 [<ffffffff81b9d33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81b9d443>] ? x86_64_start_kernel+0x105/0x114
---[ end trace f821242cfe8f7341 ]---
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct Endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct Endian mode
(Repeated).

Note that "Endian" in the above message is capitalized.  I changed
that to ensure the patched version was, in fact, running rather
than the stock version.  It was.
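
One possible reading of why a byte swap would not help here, offered as a sketch and not as a description of the attached patch: swapping only recovers a value whose bytes are in reverse order, whereas an all-ones value is unchanged by swapping. The expected pattern 0x01020304 below is an assumption about the driver's endian check and is not taken from this report; the names `swap32` and `classify` are hypothetical.

```python
import struct

# Assumed diagnostic pattern for a correctly configured chip (not from this report).
EXPECTED = 0x01020304


def swap32(value):
    """Reverse the byte order of a 32-bit unsigned value."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def classify(value):
    """Say whether a diagnostic value is correct, merely byte-swapped, or neither."""
    if value == EXPECTED:
        return "ok"
    if swap32(value) == EXPECTED:
        return "byte-swapped"
    return "unrecoverable by swapping"


print(classify(0x04030201))   # bytes reversed: a swap would fix this
print(classify(0xFFFFFFFF))   # all ones: swapping changes nothing
```

If the endian check read ffffffff like the other registers in the dump, no choice of byte order could satisfy it, which would be consistent with the patch making no difference.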

Comment 4 Robert Peterson 2011-07-13 18:37:10 UTC
Created attachment 512720 [details]
The modified dlm_load I used to recreate the problem

Comment 5 Neil Horman 2011-07-13 18:55:36 UTC
Bob, could you please try something for me?  When this problem occurs, can you rmmod the bnx2 module and modprobe it again?  I'd like to see if the device can actually be recovered at all without a cold boot.  If it can, then I did something wrong with my patch.  If it can't, then I'm starting to think this might be a hardware issue that we need to involve Broadcom with.

Comment 6 Robert Peterson 2011-07-13 19:59:39 UTC
I'll try to get this going, but for some reason the serial ports
on my consoles are all showing good output, but taking bad input.
I've been trying to backtrack it and have eliminated the easy
things (such as other processes having ttyS1 open).  If I can't 
log in to the serial console, I can't rmmod or insmod the bnx2
driver because my only way in would be the console.
I'll keep you posted.

Comment 8 Neil Horman 2011-07-14 13:12:42 UTC
Probably. I've seen that happen before, although I don't recall what the solution was.  Please let me know when the system is back in working order.

Comment 10 Neil Horman 2011-08-29 14:43:06 UTC
Robert, any luck getting the console on this system fixed?

Comment 12 Neil Horman 2011-09-19 15:31:00 UTC
Have you tried 197?

I ran the 197 kernel all weekend for bz 734815 with heavy netperf traffic and encountered no errors at all.

/me is starting to wonder if perhaps a BIOS update is needed on some of the systems in question

Comment 14 Robert Peterson 2011-09-19 16:37:14 UTC
I tried to recreate the problem with a newer kernel but ran
into other, unrelated issues.  I'll try it again when I get
a few spare cycles.

Comment 15 Neil Horman 2011-09-20 10:39:36 UTC
OK, copy that.

Comment 17 Robert Peterson 2011-09-30 20:29:17 UTC
With the latest RHEL6 kernel bits, I ran the test successfully
a bunch of times in a row without failure, so I'm going to
close this bug record.  I'm not sure which microcode patch
fixed the problem, but it seems fixed.  Sorry for the noise
and sorry it took so long to resolve.  Incidentally, I did my
testing on the 2.6.32-203.el6.x86_64 kernel.