Description of problem:
We have a Dell M1000 blade chassis with M610 blades, each with two BCM5709 gigabit NICs. One is on our production network for production data flow, and the other is connected to our SAN via iSCSI. At varying intervals, one or both of the interfaces simply stops sending traffic and stays unresponsive until restarted.
Version-Release number of selected component (if applicable):
Stock from first kernel in 2.3 to the latest 1.9.3 driver from RHEL 5.4
How reproducible:
Always. Times vary, but it happens without fail.
Steps to Reproduce:
1. Start network flow over production network
2. Start network flow over iscsi network
Actual results:
Network connectivity drops out
Expected results:
Network flow remains constant
Additional info:
The individual interfaces are also connected through different switches, both Cisco 3032s.
I should have clarified that the component in question is the bnx2 driver.
Thanks for contacting Red Hat about this potential bug. Please note that bugzilla is not an official support tool and that if you have support entitlements you should contact Red Hat support to have this bug prioritized with engineering and worked accordingly.
Red Hat Support
Just to clarify, your title says "crashes" but the text indicates this NIC is unresponsive. I assume when this bug happens you can't ping the device in question.
When you get into this situation, can you run the following a few times and report the output?
/sbin/ethtool -S <interface> | grep rx_fw_discards
If rx_fw_discards is increasing, does it appear as though the number of interrupts received for the specific device is not increasing?
If the above is true, is your system configured to use irqbalance? You might want to disable this and see if this improves your situation.
Next, you might want to disable bnx2 from using msi and report the results.
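A minimal sketch of the check described above, assuming a POSIX shell with awk. The helper names `stat_value` and `irq_total` are made up for illustration, and the layout of the `ethtool -S` and /proc/interrupts output is assumed to be the usual "name: value" and "IRQ: per-CPU counts ... device" form. Both helpers read text on stdin, so they can also be run against saved output.

```shell
#!/bin/sh
# Hypothetical helpers; names and output layouts are assumptions.

# Print the value of one counter from `ethtool -S` output read on stdin.
# $1 is the counter name exactly as the driver prints it.
stat_value() {
    awk -v name="$1" -F: '
        { key = $1; gsub(/^[ \t]+|[ \t]+$/, "", key) }
        key == name { val = $2; gsub(/[ \t]/, "", val); print val; exit }
    '
}

# Sum the per-CPU interrupt counts of every /proc/interrupts line (stdin)
# whose last field is the interface name or a per-queue name like eth0-0.
irq_total() {
    awk -v ifc="$1" '
        $NF == ifc || index($NF, ifc "-") == 1 {
            for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) total += $i
        }
        END { print total + 0 }
    '
}

# Example on a live system (interface and counter name are placeholders):
#   d1=$(/sbin/ethtool -S eth0 | stat_value "$counter")
#   i1=$(irq_total eth0 < /proc/interrupts)
#   sleep 10
#   d2=$(/sbin/ethtool -S eth0 | stat_value "$counter")
#   i2=$(irq_total eth0 < /proc/interrupts)
#   echo "discards: $d1 -> $d2, interrupts: $i1 -> $i2"
```

If the discard counter climbs between the two samples while the interrupt total stays flat, that matches the symptom asked about above.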
We have tripped across this as well on the Dell M610 planar eth0 (BCM5709, bnx2) on 3 separate machines w/ different workloads (unlike comment 0, none of them iSCSI, though at least one was a heavy NFS client). Wasn't able to catch the comment 4 ethtool stats on them. Link remains up from both client side (as reported by ethtool) and switch side, though forwarding is dead; a network restart on the server appears to clear the condition.
This has been predominately on 2.6.18-164.el5; one of the cases _might_ have been 2.6.18-128.el5 (documentation lacking), though the majority of our M610's are on -128 and have not experienced this (it certainly feels like a -164 regression).
Unfortunately, we have hammered several machines in a test setup (iperf and working the NFS client hard) trying to get a clean reproducer and it just won't die.
re comment 4 -- irqbalance and MSI-X (with the default 8 queues) are enabled; are you suspicious of something in particular there? (ie. would throwing in more non-bnx2 interrupts have a better chance of tickling this?)
There have been reports of interfaces becoming unresponsive, and it appears to be related to interrupt migration with MSI, hence my suggestion.
I have no data to show that adding non-bnx2 interrupts would help. In fact, it is my belief that adding more bnx2 interrupts (if possible) would help induce this timing-related situation.
Data point: We're seeing this as well on many Dell M610 systems running x86_64 SL5.3, kernels -164.2.1 and -164.6.1.
Using the bnx2 driver from Dell's support page seems to help.
Thank you for the info. The driver provided by Dell does not have MSI-X enabled, so it has not been found to fail, if your bug is the same bug as the one that Dell and I are pursuing.
Thanks for the update, John. In the process of pushing out configs to disable msi-x for now, but is the suspect problem specifically with the BCM5709 that Dell ships in the M610/M710 and the re-based bnx2 in -164? (Just want to scope the config as widely as reasonable without being silly).
I have attempted many times to reproduce this bug in our qa lab on many different drivers, kernels, etc.
The one problem we have run into is that the key factor that brought our machine down to begin with was a massive amount of throughput. We just can't reproduce our production rate in QA no matter how hard we try. Our production NFS server pushes out at a very high capacity due to how our NFS network is configured.
If there is no one else who can reproduce the issue reliably and there is reason to believe there is a fix, then I can attempt to change our production back over to 5.4 and see if we can trip it. The major downside is that we will have to plan this and let our customers know, as the machine impacted is customer facing. (It would have to be in about a week or so.)
Ok, here's some more data:
It turns out we saw this problem with earlier kernels as well - at least -128.7.1 . But all our systems are running -164.6.1 now for obvious reasons, so all observations below are with this kernel only:
The issue is clearly correlated with network throughput. The more traffic, the higher the probability that the NIC will stall. We have not been able to find a reliable reproducer though. But since we're just burning in four enclosures filled with M610s, and two more are in production already, we have enough statistics to be reasonably sure that a workaround works if we had no stalled interfaces for a while.
Sometimes an interface will recover without any action, after minutes or after hours. But often, they don't recover within 12 hours or more.
An interface will recover immediately after an "ethtool -t" .
The problem does not occur when the Dell driver is used. This one uses MSI, but not MSI-X.
The "disable_msi=1" workaround reliably makes this problem vanish for us.
We haven't tried turning off irqbalance instead yet. It is enabled on all systems. NB we're not doing iSCSI, and we habitually disable TSO on all NICs using the bnx2 driver.
We've just started seeing this on Dell R610s (which are new to our infra). I came across this Bugzilla entry when looking for issues with the bnx2 driver. I'm trying the disable_msi=1 workaround but the problem reported above also appears to be our issue as well. Do the Dell-provided drivers represent a newer version of the driver than exists in the 5.4 kernels? We've historically avoided the Dell-provided drivers as non-standard. According to the EL 5.4 release notes, the bnx2 driver was updated. Additionally this workaround as noted in the 5.3 release notes appears to be removed in 5.4. I'm trying to ascertain if I need to take this up with RH Support as an RFE to use the latest bnx2 driver or if it's being handled already for an errata release.
Sorry, this workaround is still listed in the 5.4 notes. I was looking at the Release Notes not the Technical Notes when doing a string search for bnx2.
Just for reference the Technical Notes don't mention this network problem as described in this BZ, just that
"Configuring IRQ SMP affinity has no effect on some devices that use message signalled interrupts (MSI) with no MSI per-vector masking capability. Examples of such devices include Broadcom NetXtreme Ethernet devices that use the bnx2 driver.
If you need to configure IRQ affinity for such a device, disable MSI by creating a file in /etc/modprobe.d/ containing the following line:
options bnx2 disable_msi=1
Alternatively, you can disable MSI completely using the kernel boot parameter pci=nomsi. (BZ#432451)"
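One way to confirm the option took effect after the driver is reloaded is to look at how the interface's interrupts are listed in /proc/interrupts. This is only a sketch: the helper name `irq_mode` is made up, and it assumes MSI and MSI-X vectors carry the string "MSI" in the interrupt type column, which can differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper; the /proc/interrupts layout is an assumption.

# Read /proc/interrupts on stdin and report how interface $1 is wired:
#   msi        at least one of its lines mentions MSI (covers MSI-X too)
#   legacy     lines were found, none mention MSI
#   not-found  no line ends with the interface name
irq_mode() {
    awk -v ifc="$1" '
        $NF == ifc || index($NF, ifc "-") == 1 {
            seen = 1
            if ($0 ~ /MSI/) msi = 1
        }
        END {
            if (!seen)    print "not-found"
            else if (msi) print "msi"
            else          print "legacy"
        }
    '
}

# Example on a live system (interface name is a placeholder):
#   irq_mode eth0 < /proc/interrupts
```

With `options bnx2 disable_msi=1` in place the expectation would be "legacy"; a result of "msi" suggests the module was loaded before the option was read.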
Some comments on the above:
- I am in the process of writing a knowledge base article that documents this issue.
- I believe it is not due to a regression in later RHEL5 bnx2 revisions.
- The Dell provided driver is not newer than RHEL5.4's.
- I have found manually changing IRQ affinity via irqbalance did not help reproduce the problem, but then again, the problem is difficult to reproduce.
- It definitely occurs due to high volume of traffic, perhaps when the traffic causes three MSI-X vectors to execute simultaneously. Still we too have tried to bombard the NIC with traffic and have not reproduced it.
Thanks to all for the info you have provided.
Concerning the increasing rx_fw_discards counter, I made the following observation on several HP DL380 G6 servers:
As soon as the bnx2i module is loaded, the rx_fw_discards counter increases
bnx2 is loaded with disable_msi=1 option
modinfo bnx2: version 1.9.3
modinfo bnx2i: version 2.0.1e
modinfo cnic: version 2.0.1
(all as provided by 2.6.18-164.11.1.el5 #1 x86_64 kernel)
# ethtool -i eth0
firmware-version: 5.2.2 NCSI 2.0.6
(latest firmware from HP)
Broadcom Adapter is:
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
03:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
I'd like to add a "me too" to this particular problem. I've been struggling with it for about a month, and finally came across this bug report.
# ethtool -i eth0
firmware-version: 4.6.4 NCSI 1.0.6
Hardware is: PowerEdge R710
I'm going to try the workaround by disabling msi.
Thanks to Dell's investigation and Broadcom's bnx2 expertise, there is a fix for the "configure the 5709 to use MSI-X and under heavy load, the interrupts can get hung with the bnx2 driver" problem which I assume most of you have been hitting. It will be available in RHEL5.6 and there will be a bnx2 5.5.z-stream soon. Thanks again to all for the above info.
Oh, if anyone tries this z-stream bnx2 driver and finds that the problem reported here to be fixed, I would really appreciate an update to be added to this bz so I can put this one to bed. Of course, if the z-stream bnx2 driver doesn't fix the problem, I would appreciate knowing that too.
The 5.5 Z-stream errata I referred to in the previous comment can be found at http://rhn.redhat.com/errata/RHSA-2010-0398.html
We had several months of stability on kernel 2.6.18-164.2.1.el5 by utilizing the workaround "options bnx2 disable_msi=1" with our Dell R710s and BCM5709 NICs. After upgrading to 2.6.18-194.el5 the problem returned. The workaround is still in place but clearly doesn't seem to make any difference.
We've been having this problem on our Dell R710s with BCM5709. We've only just implemented the workaround in this bugzilla on the 2.6.18-194.el5, and haven't had a problem yet (but that is only 3 days ago and we've had a weekend where load is low). We will be moving to the latest kernel (kernel-2.6.18-194.3.1.el5) late this week with the "fix". Will let you know what happens.
We've been seeing this issue with a couple of ESXi VMs running on Dell 610s. Very infrequent, 3 times over about 6 months on 2 identical MySQL servers. We saw it on 2.6.18-128.el5 the first 2 times on one server; the servers were both updated 2 weeks ago, and this morning we saw it on 2.6.18-194.3.1.el5 on the other server. The VMs are running e1000s, and have irqbalance and acpid running. We're going to shut down acpid for a start and see how that goes.
This could be a dup of Engineeringbz#511368
Right, 511368's errata is the one I was referring to in comments 19 and 20. I believe this bz (520888) is fixed by that errata and the fix will be generally available in 5.6.
Haven't gotten any feedback from anyone about this recently. I am going to assume the fix to Engineeringbz511368 has fixed this as a result. Unless told otherwise, I will be closing this as a duplicate.
since we can't see that private BZ, can we post the advisory that might be resolving this or the update so we can know to verify it's in 5.6?
Yes! A confirmation that the fix is in 5.6 is a great thing!!!!
(We are running several DL380s as KVM hosts and it would be a very, very, very bad thing if these started to crash again ;-)
Given the errata (http://rhn.redhat.com/errata/RHSA-2011-0017.html) fixes this issue, I am closing this bugzilla.
*** This bug has been marked as a duplicate of Engineeringbug 511368 ***
I searched the erratum you gave as the fix reference, but I did not find this Bug 520888 or the duplicate EngineeringBug 511368 on the fix list.
Please enlighten us: which item on the RHSA-2011-0017 list fixes this?
I'm being told that I'm "not authorized" to access EngineeringBug 511368, even after creating a Bugzilla account and logging in; why is this? Please grant me and other interested people permission to view this bug.
(In reply to Eugene Koontz from comment #33)
> I'm being told that I'm "not authorized" to access EngineeringBug 511368, even after
> creating a Bugzilla account and logging in; why is this? Please grant me and
> other interested people permission to view this bug.
quickly and for others, EngineeringBug 511368 was closed with an errata of the RHEL 5.6 kernel package:
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
so if you update to at least RHEL 5.6 you should not see the problem outlined here.
If you must stay on RHEL 5.5 and have Extended Update Support (EUS) you could use the following errata instead:
* in certain circumstances, under heavy load, certain network interface
cards using the bnx2 driver and configured to use MSI-X, could stop
processing interrupts and then network connectivity would cease.
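To check a batch of hosts against a kernel build, a small comparison helper can be used. The sketch below is an illustration only: the function name `release_at_least` is made up, and the threshold 2.6.18-194.3.1.el5 is simply the 5.5 z-stream kernel named in an earlier comment, so substitute whichever build your errata actually lists.

```shell
#!/bin/sh
# Hypothetical helper; the threshold used in the example is an assumption
# taken from an earlier comment, not an authoritative "fixed in" version.

# Succeed when kernel release $1 is at or beyond release $2.
# Fields are split on "." and "-" and compared numerically, left to right;
# non-numeric fields such as "el5" and missing fields count as 0.
release_at_least() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, /[.-]/); nb = split(b, y, /[.-]/)
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (x[i] ~ /^[0-9]+$/) ? x[i] + 0 : 0
            yi = (y[i] ~ /^[0-9]+$/) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Example on a live system:
#   if release_at_least "$(uname -r)" 2.6.18-194.3.1.el5; then
#       echo "kernel is at or past the chosen build"
#   else
#       echo "kernel is older than the chosen build"
#   fi
```

This only compares version strings; it says nothing about whether a given stall is the MSI-X problem described in this report.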
Note the front page of Bugzilla mentions:
"Red Hat Bugzilla is the Red Hat bug-tracking system and is used to submit and review defects that have been found in Red Hat distributions. Red Hat Bugzilla is not an avenue for technical assistance or support, but simply a bug tracking system."
"If you are a Red Hat customer with an active subscription, please visit the Red Hat Customer Portal for assistance with your issue."
Your best bet is to open a case and reference these bugs and errata.