
Bug 720769

Summary: bnx2: NETDEV WATCHDOG: em1 (bnx2): transmit queue 4 timed out
Product: Red Hat Enterprise Linux 6
Reporter: Robert Peterson <rpeterso>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED WORKSFORME
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 6.2
CC: chorn, jburke, jeder, jwest
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2011-09-30 20:29:17 UTC

Attachments:

Description	Flags
patch to swap endianness if the device soft resets to opposite endianess	none
The modified dlm_load I used to recreate the problem	none

Description Robert Peterson 2011-07-12 17:47:13 UTC

Description of problem:
I was running smb_torture with ctdb on a four-node RHEL6 cluster.
After running it successfully 7 or 8 times, one of the nodes
failed with the errors shown in "Actual results".  I was able to
recreate the problem a second time.

Version-Release number of selected component (if applicable):
2.6.32-167.el6.bz676626.x86_64

How reproducible:
Unknown, but it recreated twice for me already, so I think it's
likely to recreate again.

Steps to Reproduce:
1. On all nodes, start the cluster:
service cman start
service clvmd start

2. From one node, create the two GFS2 file systems:
mkfs.gfs2 -j4 -p lock_dlm -t intec_cluster:ctdb /dev/intec/ctdb -O && mkfs.gfs2 -j4 -p lock_dlm -t intec_cluster:gfs2 /dev/intec/intec1 -O

3. From all nodes, mount the GFS2 file systems:
mount -t gfs2 /dev/intec/ctdb /mnt/ctdb/ && mount -t gfs2 /dev/intec/intec1 /mnt/gfs2a/

4. From all nodes, start ctdb:
service ctdb start

5. From the driver system, gfs-i24c-01, start smb_torture:
cd /home/bob/samba/bin/default/source4/torture
./smbtorture //localhost/data  -U testmonkey%password bench.nbench --unclist=/root/unclist.txt --num-progs=32 -t600

6. Repeat step 5 until the failure occurs.
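
The repeat-until-failure procedure above can be automated. The following is a minimal sketch, not something taken from this report: run_until_watchdog, run_test, and read_kernel_log are hypothetical names, and the two callables are assumed to be supplied by the tester (for example, one that launches the smbtorture command from step 5 and one that returns the current kernel log text). Only the watchdog message format is taken from the log in this bug.

```python
import re
from typing import Callable, Optional

# Format of the watchdog line as it appears in the kernel log in this report.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def run_until_watchdog(
    run_test: Callable[[], None],
    read_kernel_log: Callable[[], str],
    max_runs: int = 20,
) -> Optional[dict]:
    """Repeat the test until a transmit-queue timeout appears in the log.

    run_test and read_kernel_log are hypothetical hooks supplied by the
    caller. Returns details of the first timeout seen, or None if
    max_runs passes complete without one.
    """
    for attempt in range(1, max_runs + 1):
        run_test()
        match = WATCHDOG_RE.search(read_kernel_log())
        if match:
            return {
                "run": attempt,
                "iface": match.group("iface"),
                "driver": match.group("driver"),
                "queue": int(match.group("queue")),
            }
    return None
```

Passing the hooks in as callables keeps the loop independent of how the test is launched and how the log is collected, so the same loop should work with either the smbtorture reproducer or a different load generator.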

Actual results:
It may run successfully up to ten times in a row, but then:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: PowerEdge R815
NETDEV WATCHDOG: em1 (bnx2): transmit queue 4 timed out
Modules linked in: gfs2(U) iptable_filter ip_tables dlm configfs autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log power_]
Pid: 0, comm: swapper Not tainted 2.6.32-167.el6.bz676626.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff81067657>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81067746>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff8143cc4d>] ? dev_watchdog+0x26d/0x280
 [<ffffffff8107ac54>] ? mod_timer+0x144/0x220
 [<ffffffff8143c9e0>] ? dev_watchdog+0x0/0x280
 [<ffffffff8107a457>] ? run_timer_softirq+0x197/0x340
 [<ffffffff8109e550>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8102a2fd>] ? lapic_next_event+0x1d/0x30
 [<ffffffff8106fc41>] ? __do_softirq+0xc1/0x1d0
 [<ffffffff81093200>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c20c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de45>] ? do_softirq+0x65/0xa0
 [<ffffffff8106fa25>] ? irq_exit+0x85/0x90
 [<ffffffff814e5f80>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bbd3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8103656b>] ? native_safe_halt+0xb/0x10
 [<ffffffff8101433d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009de6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814c59fa>] ? rest_init+0x7a/0x80
 [<ffffffff81b9df48>] ? start_kernel+0x41d/0x429
 [<ffffffff81b9d33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81b9d443>] ? x86_64_start_kernel+0x105/0x114
---[ end trace fbbba1dfce3a091d ]---
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct endian mode
etc.
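
The register dump above can be scanned mechanically to list which registers read back as ffffffff. This is a small illustrative sketch; parse_debug_registers and all_ones_registers are hypothetical helper names, and the NAME[hex] line format is assumed from the log lines quoted in this report only. An all-ones value is often what a read returns when a PCI device has stopped responding, but that interpretation is an assumption here, not a conclusion of this bug.

```python
import re

# Matches "NAME[hexvalue]" pairs in lines such as
# "bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]".
REGISTER_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)\[([0-9a-fA-F]+)\]")


def parse_debug_registers(log_text: str) -> dict:
    """Collect register names and integer values from bnx2 DEBUG lines."""
    registers = {}
    for line in log_text.splitlines():
        if "DEBUG:" not in line:
            continue  # skip stack traces and other kernel messages
        payload = line.split("DEBUG:", 1)[1]
        for name, value in REGISTER_RE.findall(payload):
            registers[name] = int(value, 16)
    return registers


def all_ones_registers(registers: dict) -> list:
    """Return the sorted names of registers that read as 0xffffffff."""
    return sorted(name for name, value in registers.items() if value == 0xFFFFFFFF)
```

Run against the dump in this report, the helper would separate intr_sem, PCI_CMD, PCI_PM, and PCI_MISC_CFG, which hold other values, from the remaining registers, which all read ffffffff.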

Expected results:
All tests should be successful.

Additional info:
This test was performed in order to test the patch for bug #676626
so you need to have that GFS2 patch in order to run it, otherwise
GFS2 may have serious issues (see that bug for details).

We might be able to recreate it by flooding the card with
other traffic or other dlm traffic.

Comment 2 Neil Horman 2011-07-12 18:53:13 UTC

Created attachment 512502 [details]
patch to swap endianness if the device soft resets to opposite endianess

Bob, can you try this patch out, please? I've returned your system to you.  Thanks!

Comment 3 Robert Peterson 2011-07-13 18:35:28 UTC

Good news and bad news:

The good news is that I was able to recreate the problem with
only three of the four nodes, and without GFS2.  To recreate
the problem, I used a DLM test tool given to me by Dave Teigland
called dlm_load.  The original source is in:

http://fedorapeople.org/gitweb?p=teigland/public_git/dct-stuff.git;a=tree;f=dlm;hb=HEAD

I modified the tool so I could control the sleep parameters to
push the ethernet driver harder.  The new "-S" parameter makes
it sleep every iteration rather than every ten iterations.
The new "-s" parameter specifies the usleep time (in microseconds).
The default is 200000 (1/5 of a second); I tightened it to 2.
So to recreate the problem, I'm running this command on three
nodes simultaneously, with cluster-ssh (cssh):

/home/bob/dlm_load -i 1000000 -q -S -s 2
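
To get a rough sense of how much harder the modified settings push the driver, the total time spent sleeping can be estimated. This is a back-of-the-envelope sketch under stated assumptions: that -i is the iteration count (the report does not say), and that the tool sleeps exactly once every N iterations for the given usleep time. total_sleep_seconds is a hypothetical helper, not part of dlm_load.

```python
def total_sleep_seconds(iterations: int, sleep_every: int, usleep_us: int) -> float:
    """Estimate the total time spent in usleep over a whole run.

    Assumes exactly one sleep of usleep_us microseconds after every
    sleep_every iterations; the real tool may behave differently.
    """
    sleeps = iterations // sleep_every
    return sleeps * usleep_us / 1_000_000


# Default cadence: sleep every ten iterations for 200000 microseconds.
default_sleep = total_sleep_seconds(1_000_000, sleep_every=10, usleep_us=200_000)

# Modified cadence (-S -s 2): sleep every iteration for 2 microseconds.
modified_sleep = total_sleep_seconds(1_000_000, sleep_every=1, usleep_us=2)

print(default_sleep, modified_sleep)
```

Under those assumptions, a million iterations would spend about 20000 seconds asleep with the defaults and about 2 seconds with the modified settings, so nearly all of the pacing is removed.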

I had to run this command several times before the failure
recreated for me.  The failure has always occurred on
gfs-a16c-04, so I'm still not ruling out hardware problems.

The bad news is that Neil's patch from comment #2 did not
seem to help at all.  I still got:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: PowerEdge R815
NETDEV WATCHDOG: em1 (bnx2): transmit queue 1 timed out
Modules linked in: dlm configfs autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log power_meter dcdbas microcode serio_raw ]
Pid: 0, comm: swapper Not tainted 2.6.32-167.el6.bz676626.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff81067657>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81067746>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff8143cc4d>] ? dev_watchdog+0x26d/0x280
 [<ffffffff81079825>] ? internal_add_timer+0xb5/0x110
 [<ffffffff8143c9e0>] ? dev_watchdog+0x0/0x280
 [<ffffffff8107a457>] ? run_timer_softirq+0x197/0x340
 [<ffffffff8109e550>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8102a2fd>] ? lapic_next_event+0x1d/0x30
 [<ffffffff8106fc41>] ? __do_softirq+0xc1/0x1d0
 [<ffffffff81093200>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c20c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de45>] ? do_softirq+0x65/0xa0
 [<ffffffff8106fa25>] ? irq_exit+0x85/0x90
 [<ffffffff814e5f80>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bbd3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8103656b>] ? native_safe_halt+0xb/0x10
 [<ffffffff8101433d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009de6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814c59fa>] ? rest_init+0x7a/0x80
 [<ffffffff81b9df48>] ? start_kernel+0x41d/0x429
 [<ffffffff81b9d33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81b9d443>] ? x86_64_start_kernel+0x105/0x114
---[ end trace f821242cfe8f7341 ]---
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct Endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct Endian mode
(Repeated).

Note that "Endian" in the above message is capitalized.  I changed
that to ensure the patched version was, in fact, running rather
than the stock version.  It was.
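
The marker trick described above (changing one character of a log message to prove which build is loaded) can be checked programmatically. A minimal sketch, assuming only the two message spellings quoted in this report; detect_driver_build is a hypothetical helper name.

```python
# Message as printed by the stock driver in the original report.
STOCK_MARKER = "Chip not in correct endian mode"
# Message as deliberately changed in the patched test build.
PATCHED_MARKER = "Chip not in correct Endian mode"


def detect_driver_build(log_text: str) -> str:
    """Classify a kernel log by which spelling of the message it contains."""
    has_stock = STOCK_MARKER in log_text
    has_patched = PATCHED_MARKER in log_text
    if has_patched and not has_stock:
        return "patched"
    if has_stock and not has_patched:
        return "stock"
    if has_stock and has_patched:
        return "mixed"  # e.g. a log spanning a module reload
    return "unknown"
```

The comparison is case-sensitive on purpose, since the capital "E" is the only difference between the two builds' output.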

Comment 4 Robert Peterson 2011-07-13 18:37:10 UTC

Created attachment 512720 [details]
The modified dlm_load I used to recreate the problem

Comment 5 Neil Horman 2011-07-13 18:55:36 UTC

Bob, could you please try something for me?  When this problem occurs, can you rmmod the bnx2 module and modprobe it again?  I'd like to see if the device can actually be recovered at all without a cold boot.  If it can, then I did something wrong with my patch.  If it can't, then I'm starting to think this might be a hardware issue that we need to involve Broadcom with.

Comment 6 Robert Peterson 2011-07-13 19:59:39 UTC

I'll try to get this going, but for some reason the serial ports
on my consoles are all showing good output, but taking bad input.
I've been trying to backtrack it and have eliminated the easy
things (such as other processes having ttyS1 open).  If I can't 
log in to the serial console, I can't rmmod or insmod the bnx2
driver because my only way in would be the console.
I'll keep you posted.

Comment 8 Neil Horman 2011-07-14 13:12:42 UTC

Probably. I've seen that happen before, although I can't recall what the solution was.  Please let me know when the system is back in working order.

Comment 10 Neil Horman 2011-08-29 14:43:06 UTC

Robert, any luck getting the console on this system fixed?

Comment 12 Neil Horman 2011-09-19 15:31:00 UTC

Have you tried 197?

I ran the 197 kernel all weekend for bz 734815 with heavy netperf traffic and encountered no errors at all.

/me is starting to wonder if perhaps a BIOS update is needed on some of the systems in question

Comment 14 Robert Peterson 2011-09-19 16:37:14 UTC

I tried to recreate the problem with a newer kernel but ran
into other non-related issues.  I'll try it again when I get
a few spare cycles.

Comment 15 Neil Horman 2011-09-20 10:39:36 UTC

OK, copy that.

Comment 17 Robert Peterson 2011-09-30 20:29:17 UTC

With the latest RHEL6 kernel bits, I ran the test successfully
a bunch of times in a row without failure, so I'm going to
close this bug record.  I'm not sure which microcode patch
fixed the problem, but it seems fixed.  Sorry for the noise
and sorry it took so long to resolve.  Incidentally, I did my
testing on the 2.6.32-203.el6.x86_64 kernel.