Bug 602402 - bnx2x panic dumps with multiple interfaces enabled
Summary: bnx2x panic dumps with multiple interfaces enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Stanislaw Gruszka
QA Contact: Liang Zheng
URL:
Whiteboard:
Duplicates: 601637
Depends On:
Blocks: 607087 609184
 
Reported: 2010-06-09 19:27 UTC by Tim Wilkinson
Modified: 2018-11-30 20:20 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Upon startup, the bnx2x network driver experienced a panic dump when more than one network interface was configured to start up at boot time. With this update, statistics counter initialization for function IDs greater than "1" has been disabled, with the result that bnx2x no longer panic dumps when more than one interface has the "ONBOOT=yes" directive set.
Clone Of:
Clones: 609184
Environment:
Last Closed: 2011-01-13 21:36:48 UTC
Target Upstream Version:
Embargoed:


Attachments
panic dump snapshot (113.28 KB, image/png)
2010-06-14 14:19 UTC, Tim Wilkinson
panic dum snapshot (119.43 KB, image/png)
2010-06-14 14:19 UTC, Tim Wilkinson
messages.mgmt1 (831.90 KB, application/octet-stream)
2010-06-16 17:44 UTC, Tim Wilkinson
messages.mgmt2 (1.02 MB, application/octet-stream)
2010-06-16 17:45 UTC, Tim Wilkinson
messages.rhelh (340.63 KB, application/octet-stream)
2010-06-16 17:46 UTC, Tim Wilkinson
/var/log/messages for reproduce step above (75.00 KB, application/x-gzip)
2010-06-18 15:04 UTC, Alan Pevec
Disable statistics counters initialization for function ids greater than 1, when initializing bnx2x hw (3.72 KB, patch)
2010-06-21 16:58 UTC, Dmitry Kravkov


Links
System: Red Hat Product Errata
ID: RHSA-2011:0017
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 10:37:42 UTC

Description Tim Wilkinson 2010-06-09 19:27:19 UTC
Description of problem:
----------------------
By default, all 8 of the network interfaces on this HP server blade are enabled at install. For a reason not yet known to us, if the other IFs are enabled I see bnx2x panic dumps. If I add ONBOOT=no to all but one of the ifcfg-ethX network config files and reboot, the panic dumps do not occur (note that just restarting the network does not resolve the problem).

I can then enable (ifup) any one of those disabled IFs and the panic dumps will reappear. 



Version-Release:
---------------
2.6.18-194.3.1.el5



How reproducible:
----------------
Consistent



Steps to Reproduce:
------------------
1. Install RHEL 5.5 on a system with >2 network interfaces and let it reboot



Actual results:
--------------
Server experiences bnx2x fw dump immediately upon starting ...

  http://irish.lab.bos.redhat.com/pub/projects/cloud/images/issues/bnx2x_panic_dump2.png
  http://irish.lab.bos.redhat.com/pub/projects/cloud/images/issues/bnx2x_panic_dump.png
  


Expected results:
----------------
Server boots without bnx2x fw dumps



Additional Info:
---------------
The IFs that are not supposed to enable at boot (eth1-eth7) do not have an ONBOOT line in their network config files. I add "ONBOOT=no" by hand to disable them at boot. 
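The workaround above can be sketched as a small helper that rewrites the text of one network config file. This is only an illustration, not a tested tool: the function name `set_onboot` is made up here, and it assumes the files use plain `KEY=value` lines as ifcfg-ethX files normally do.

```python
def set_onboot(ifcfg_text, enabled):
    """Return the ifcfg content with exactly one ONBOOT line.

    Hypothetical helper (not part of any Red Hat tool): any existing
    ONBOOT= line is dropped and a single new one is appended, so files
    that have no ONBOOT line at all are handled the same way.
    """
    value = "yes" if enabled else "no"
    kept = [line for line in ifcfg_text.splitlines()
            if not line.strip().startswith("ONBOOT=")]
    kept.append("ONBOOT=" + value)
    return "\n".join(kept) + "\n"
```

To apply it, one would read each ifcfg-ethX file, pass `enabled=True` for the single interface that should start at boot and `enabled=False` for the rest, write the result back, and reboot (restarting the network alone was not enough in our testing).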

Also, the undesired IFs have rather odd MAC addresses, e.g.,

  eth0 - 00:17:A4:77:24:08    * good
  eth1 - d8:d3:85:66:f5:61    * causes panic dumps when enabled

Comment 1 Alan Pevec 2010-06-14 12:35:11 UTC
*** Bug 602694 has been marked as a duplicate of this bug. ***

Comment 2 Tim Wilkinson 2010-06-14 13:50:29 UTC
Please note that in our testing, we were able to generate the same bnx2x_panic_dump messages on blades other than those reporting hw errors, but not on all of them. We found at least one blade that did not produce the bnx2x_panic_dump messages when enabling the extra interfaces with similar MAC addresses.

  kernel: [bnx2x_panic_dump:(609(eth3)cqe[1f4]=[0:0:0:0]

Also note that once the panic dump is observed, all networking is disabled, although the server remains responsive. The only workaround found so far is to add "ONBOOT=no" to all but one of the ifcfg-eth* files and reboot; just restarting the network does not resolve the problem.

Comment 3 Eilon Greenstein 2010-06-14 14:05:04 UTC
I cannot access irish.lab.bos.redhat.com – can you please provide the dump?

Thanks,
Eilon

Comment 4 Tim Wilkinson 2010-06-14 14:19:22 UTC
Created attachment 423849 [details]
panic dump snapshot

Comment 5 Tim Wilkinson 2010-06-14 14:19:54 UTC
Created attachment 423850 [details]
panic dum snapshot

Comment 6 Tim Wilkinson 2010-06-14 14:23:28 UTC
These are only snapshots of the error messages as seen on the console. If we reproduce the errors on another blade, please let us know if the dump is on the system or if we can produce any useful debug info for you. Thanks Eilon.

Comment 7 Stanislaw Gruszka 2010-06-14 14:27:21 UTC
Please attach a copy of /var/log/messages containing all the messages from the bnx2x panic.

Comment 8 Eilon Greenstein 2010-06-14 14:28:04 UTC
As Stanislaw wrote, the pictures only show a small snapshot of the dump. They do not show what triggered the dump, which should be further up. Can you attach the kernel log (/var/log/messages)?
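For anyone collecting this from a large log, a rough sketch of pulling out the lines that lead up to the first dump message might look like the following. The function name and the default of 50 lines of context are arbitrary choices for illustration; the marker string is taken from the message quoted in comment 2.

```python
def panic_context(log_lines, before=50, marker="bnx2x_panic_dump"):
    """Return the first line containing `marker` plus up to `before`
    lines preceding it, or an empty list if the marker never appears.

    Illustrative only: the real trigger may be further up than
    `before` lines, in which case the whole log is still needed.
    """
    for index, line in enumerate(log_lines):
        if marker in line:
            start = max(0, index - before)
            return log_lines[start:index + 1]
    return []
```

The input would typically be the lines of /var/log/messages read into a list. Attaching the full file remains the safer option.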

Comment 9 Tim Wilkinson 2010-06-16 17:41:20 UTC
blade5 reported a machine check yesterday afternoon which showed no outward symptoms on any other blades until this morning when others (unfortunately both of our cluster members) began reporting the "NIG timer max" errors and lost their networks. We're not sure why some blades are affected by the net traffic once the machine checks are seen elsewhere in the chassis. Some blades are up and running that show no NIG timer errors at all.

Attached are the messages from blades 1 & 2 (cluster members mgmt1 & mgmt2) and from blade5 (rhelh).

 - blades 1 & 2 (mgmt1,mgmt2) reported NIG timer errors 
 - blade 5 (rhelh) reported the bnx2x panic dumps

Comment 10 Tim Wilkinson 2010-06-16 17:44:01 UTC
Created attachment 424533 [details]
messages.mgmt1

Comment 11 Tim Wilkinson 2010-06-16 17:45:48 UTC
Created attachment 424535 [details]
messages.mgmt2

Comment 12 Tim Wilkinson 2010-06-16 17:46:25 UTC
Created attachment 424536 [details]
messages.rhelh

Comment 13 Alan Pevec 2010-06-18 14:58:58 UTC
This seems to be an interaction with bnx2i:

*Without* bnx2i loaded, starting all bnx2x NICs works; it fails if bnx2i is loaded.

*With* bnx2i loaded, it works if we start only the eth0 NIC, so a workaround for RHEV-H installation would be to add ONBOOT=no for the NICs that are not selected, but RHEV-M could later reconfigure the other NICs and hit the issue.

We'd appreciate it if Broadcom engineers could look at the bnx2i/bnx2x interaction and see what is going wrong.


Details about the setup (Tim please fill me in if I missed something):
• HP blade ProLiant BL460c G6 SKU Number: 507864-B21 (from DMI)
• 8[1] bnx2x NICs, pci id 14e4:1650
02:00.0 - 02:00.7 Ethernet controller: Broadcom Corporation NetXtreme II BCM57711E 10-Gigabit PCIe
dmesg|grep bnx
Broadcom NetXtreme II 5771x 10Gigabit Ethernet Driver bnx2x 1.52.1-6 (2010/02/16)
bnx2x: part number 412F4E-0-0-0
bnx2x: Loading bnx2x-e1h-5.2.13.0.fw

[1] from Tim: "one dual port 10g NIC in the blade chassis that can be used by each blade. Anything more than the 2 engages virtualization to split the IF 4 ways, dividing up the bandwidth among the virt NICs"
Note that it doesn't matter if all NICs were enabled/connected in the blade manager or not (not enabled NICs have MACs starting with d8:d3:85 while enabled start with 00:17:a4).
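As a quick way to tell the two groups apart when listing interfaces, the prefixes mentioned in the note above could be checked as in this sketch. The function name is invented for illustration, and the two prefixes are only what was observed on this chassis, not a general rule for HP blades.

```python
# MAC prefixes as observed on this one chassis; treat them as an
# assumption specific to this setup.
ENABLED_PREFIX = "00:17:a4"
NOT_ENABLED_PREFIX = "d8:d3:85"


def blade_manager_state(mac):
    """Guess whether a NIC is enabled in the blade manager from its MAC."""
    prefix = mac.strip().lower()[:8]
    if prefix == ENABLED_PREFIX:
        return "enabled"
    if prefix == NOT_ENABLED_PREFIX:
        return "not enabled"
    return "unknown"
```

With the addresses from the original description, eth0 (00:17:A4:77:24:08) falls in the first group and eth1 (d8:d3:85:66:f5:61) in the second.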


Reproduce log (RHEV-H boot with "rescue" parameter):
[root@localhost /]# lsmod|grep bnx
bnx2x                 626433  0 
[root@localhost /]# service network start
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth0:  [  OK  ]
Bringing up interface eth1:  [  OK  ]
Bringing up interface eth2:  [  OK  ]
Bringing up interface eth3:  [  OK  ]
Bringing up interface eth4:  [  OK  ]
Bringing up interface eth5:  [  OK  ]
Bringing up interface eth6:  [  OK  ]
Bringing up interface eth7:  [  OK  ]
Bringing up interface breth0:  
Determining IP information for breth0... done.
[  OK  ]
[root@localhost /]# service network stop 
Shutting down interface breth0:  [  OK  ]
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
Shutting down interface eth2:  [  OK  ]
Shutting down interface eth3:  [  OK  ]
Shutting down interface eth4:  [  OK  ]
Shutting down interface eth5:  [  OK  ]
Shutting down interface eth6:  [  OK  ]
Shutting down interface eth7:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
[root@localhost /]# modprobe bnx2i
[root@localhost /]# lsmod|grep bnx
bnx2i                  74593  0 
libiscsi2              77765  1 bnx2i
cnic                   78297  1 bnx2i
scsi_transport_iscsi2    74073  2 bnx2i,libiscsi2
bnx2x                 626433  0 
scsi_mod              197593  9 bnx2i,libiscsi2,scsi_transport_iscsi2,qla2xxx,scsi_transport_fc,cciss,scsi_dh_rdac,scsi_dh,sr_mod
[root@localhost /]# service network start
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth0:  [  OK  ]
Bringing up interface eth1:  [  OK  ]
Bringing up interface eth2:  [  OK  ]
Bringing up interface eth3:  [  OK  ]
Bringing up interface eth4:  SIOCSIFFLAGS: Device or resource busy
[  OK  ]
Bringing up interface eth5:  SIOCSIFFLAGS: Device or resource busy
[  OK  ]
Bringing up interface eth6:  SIOCSIFFLAGS: Device or resource busy
[  OK  ]
Bringing up interface eth7:  SIOCSIFFLAGS: Device or resource busy
[  OK  ]
Bringing up interface breth0:  
Determining IP information for breth0...PING 10.16.143.254 (10.16.143.254) from 10.16.136.7 breth0: 56(84) bytes of data.

--- 10.16.143.254 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2999ms
, pipe 3
 failed.
[FAILED]

*** From this state the only way to recover the NICs is a reboot; removing bnx2i didn't help ***
[root@localhost /]# rmmod bnx2i
[root@localhost /]# service network stop
Shutting down interface eth0:  [  OK  ]
Shutting down interface eth1:  [  OK  ]
Shutting down interface eth2:  [  OK  ]
Shutting down interface eth3:  [  OK  ]
Shutting down loopback interface:  [  OK  ]
[root@localhost /]# service network start
Bringing up loopback interface:  [  OK  ]
Bringing up interface eth0:  SIOCSIFFLAGS: Input/output error
[  OK  ]
Bringing up interface eth1:  SIOCSIFFLAGS: Input/output error
[  OK  ]
Bringing up interface eth2:  SIOCSIFFLAGS: Input/output error
[  OK  ]
Bringing up interface eth3:  [  OK  ]
Bringing up interface eth4:  [  OK  ]
Bringing up interface eth5:  [  OK  ]
Bringing up interface eth6:  [  OK  ]
Bringing up interface eth7:  [  OK  ]
Bringing up interface breth0:  
Determining IP information for breth0...PING 10.16.143.254 (10.16.143.254) from 10.16.136.7 breth0: 56(84) bytes of data.

--- 10.16.143.254 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2999ms
, pipe 3
 failed.
[FAILED]
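Note that in the log above the init script still prints [  OK  ] for interfaces that hit a SIOCSIFFLAGS error, so the status column alone is misleading. A small sketch for picking the failed interfaces out of captured output follows; the function name is made up, and the pattern is based only on the lines shown in this log.

```python
import re

# Matches lines such as:
#   Bringing up interface eth4:  SIOCSIFFLAGS: Device or resource busy
_FAILURE_RE = re.compile(
    r"^Bringing up interface (\S+):\s+SIOCSIFFLAGS: (.+?)\s*$",
    re.MULTILINE,
)


def siocsifflags_failures(output):
    """Map interface name to the SIOCSIFFLAGS error text, if any."""
    return {iface: error for iface, error in _FAILURE_RE.findall(output)}
```

Run against the second "service network start" above, this would report eth4 through eth7 with "Device or resource busy"; against the third, eth0 through eth2 with "Input/output error".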

Comment 14 Alan Pevec 2010-06-18 15:04:26 UTC
Created attachment 425148 [details]
/var/log/messages for reproduce step above

Comment 15 Alan Pevec 2010-06-18 15:34:44 UTC
*** Bug 601637 has been marked as a duplicate of this bug. ***

Comment 16 Alan Pevec 2010-06-18 16:19:37 UTC
Stanislaw, is this enough info for you and Broadcom to investigate?

Comment 18 Stanislaw Gruszka 2010-06-21 07:36:01 UTC
(In reply to comment #16)
> Stanislaw, is this enough info for you and Broadcom to investigate?    

I think so, but perhaps some more verbose debug output will be needed; let's see what the iSCSI guys say ...

Michael, we have a bnx2x firmware crash when the bnx2i module is loaded, see comment 13; the full firmware crash log is attached in comment 14.

Comment 19 Dmitry Kravkov 2010-06-21 11:17:33 UTC
We have reproduced this locally
Working on this issue ...

Dmitry

Comment 20 Steve Reichard 2010-06-21 12:11:08 UTC
BZ 602694 was updated to state that this was reproduced on a RHEL 5.5 node.

The RHEL 5.5 node was installed using satellite, with a package of base, device-mapper-multipath, and ntp.  The system registers with satellite and performs an update.

After the system booted, ONBOOT for interfaces 1-7 was set to 'no' and the system was rebooted.


When this booted the bnx errors were seen.

Booting to the original installed kernel, the problem is not seen.

Original kernel - initrd-2.6.18-194.el5.img
updated kernel - initrd-2.6.18-194.3.1.el5.img

Comment 22 Dmitry Kravkov 2010-06-21 16:58:24 UTC
Created attachment 425695 [details]
Disable statistics counters initialization for function ids greater than 1, when initializing bnx2x hw

Comment 23 Guru Anbalagane 2010-06-21 20:08:45 UTC
Our customer also ran into this bnx2x panic dump when iscsi offload is enabled.

Comment 24 Eilon Greenstein 2010-06-21 20:14:31 UTC
Dmitry was able to root-cause and fix this issue. Please try with his patch from comment 22 and let us know if it solves the problem for you as well.

Thanks,
Eilon

Comment 25 Stanislaw Gruszka 2010-06-22 09:13:45 UTC
(In reply to comment #22)
> Created an attachment (id=425695) [details]
> Disable statistics counters initialization for function ids greater than 1,
> when initializing bnx2x hw    
Thanks for the patch, Dmitry!

Tim, Alan, please test the build with the above patch to confirm it fixes the issue in your environment:
https://brewweb.devel.redhat.com/taskinfo?taskID=2538442

Comment 26 Stanislaw Gruszka 2010-06-22 13:22:42 UTC
Tim, Alan, could you please test the above kernel ASAP?

Comment 27 Tim Wilkinson 2010-06-22 14:20:26 UTC
Booted the rhel-h node to the 194.3.1 kernel to reproduce the issue, which on occasion occurs even if the additional IFs are disabled. Although the issue did not occur at boot, enabling a couple of the IFs reproduced the bnx2x panic dumps.

Booted the 203 kernel with no issues. Enabled all IFs individually, still no issues.

Set all IF config files to start at boot and rebooted. All IFs attempt to start but only the configured bridge obtains an IP, as expected. No sign of bnx2x panics.

Comment 28 Stanislaw Gruszka 2010-06-22 15:04:56 UTC
Ok, just posted to RKML since the issue is urgent for us.

Dmitry, could you be so kind as to post the patch upstream?

It would be nice if Broadcom could QA the patch. I did some tests of network traffic, but did not test iSCSI.

Comment 32 Dmitry Kravkov 2010-06-23 20:11:24 UTC
The patch has been applied upstream; we'll perform additional testing with L2
and iSCSI. Thanks

Comment 34 Douglas Silas 2010-06-28 20:54:37 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Upon startup, the bnx2x network driver experienced a panic dump when more than one network interface was configured to start up at boot time. With this update, statistics counter initialization for function IDs greater than "1" has been disabled, with the result that bnx2x no longer panic dumps when more than one interface has the "ONBOOT=yes" directive set.

Comment 35 Jarod Wilson 2010-06-29 13:36:01 UTC
in kernel-2.6.18-205.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 36 Jarod Wilson 2010-06-29 13:40:10 UTC
Not sure yet what went wrong w/the release script, but that should have been "in kernel-2.6.18-204.el5" (in build 204, not 205).

Comment 43 Kevin O'Brien 2010-10-02 14:37:53 UTC
We were just bitten by this bug with the same hardware running kernel-2.6.18-194.11.4.  Any status update on when this patch is going to be released?

Comment 44 Stanislaw Gruszka 2010-10-04 08:00:45 UTC
The patch for that bug (from comment 22) has been in the kernel since 2.6.18-194.8.1.el5, so your problem is most likely a different bug with similar symptoms. Please open a new bug report for it. BTW: does blacklisting the bnx2i and cnic modules help?
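To check a given kernel against that threshold, a simplified comparison could look like the sketch below. It is not a full RPM version comparison: it assumes a RHEL 5 style release string based on 2.6.18 and compares only the numeric release components; the names `release_tuple` and `has_bnx2x_fix` are invented for illustration.

```python
import re

# First kernel release said to carry the patch: 2.6.18-194.8.1.el5
FIXED_IN = (194, 8, 1)


def release_tuple(kernel_release):
    """Turn '2.6.18-194.11.4.el5' into (194, 11, 4).

    Accepts an optional 'kernel-' prefix and ignores the '.el5' suffix.
    """
    match = re.match(r"(?:kernel-)?2\.6\.18-(\d+(?:\.\d+)*)", kernel_release)
    if match is None:
        raise ValueError("not a 2.6.18 release string: %r" % kernel_release)
    return tuple(int(part) for part in match.group(1).split("."))


def has_bnx2x_fix(kernel_release):
    """True if the release is at or after the one that carries the patch."""
    return release_tuple(kernel_release) >= FIXED_IN
```

By this check, kernel-2.6.18-194.11.4 already contains the patch, while 2.6.18-194.3.1.el5 (the kernel in the original report) does not.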

Comment 45 Kevin O'Brien 2010-10-04 17:33:16 UTC
It turns out the module panic happened right after the new kernel (2.6.18-194.8.1) was installed and was in mid-reboot.  We'll watch for any other panics as we update systems and create a new bug report if we see any panics after the new kernel is running.

Comment 49 errata-xmlrpc 2011-01-13 21:36:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

