Description of problem:
----------------------
By default, all 8 of the network interfaces on this HP server blade are enabled at install. For some reason yet unknown to us, if the other IFs are enabled I see bnx2x panic dumps. If I edit the other ifcfg-ethX files, adding ONBOOT=no to all but one of the network config files, and reboot, the panic dumps do not occur (note that just restarting the network does not resolve the problem). I can then enable (ifup) any one of those disabled IFs and the panic dumps will reappear.

Version-Release:
---------------
2.6.18-194.3.1.el5

How reproducible:
----------------
Consistent

Steps to Reproduce:
------------------
1. Install RHEL 5.5 on a system with >2 network interfaces and let it reboot

Actual results:
--------------
Server experiences bnx2x fw dump immediately upon starting ...
http://irish.lab.bos.redhat.com/pub/projects/cloud/images/issues/bnx2x_panic_dump2.png
http://irish.lab.bos.redhat.com/pub/projects/cloud/images/issues/bnx2x_panic_dump.png

Expected results:
----------------
Server boots without bnx2x fw dumps

Additional Info:
---------------
The IFs that are not supposed to be enabled at boot (eth1-eth7) do not have an ONBOOT line in their network config files. I add "ONBOOT=no" by hand to disable them at boot. Also, the undesired IFs do have rather odd MAC addresses, e.g.:
eth0 - 00:17:A4:77:24:08 * good
eth1 - d8:d3:85:66:f5:61 * causes panic dumps when enabled
*** Bug 602694 has been marked as a duplicate of this bug. ***
Please note that in our testing we were able to generate the same bnx2x_panic_dump messages on blades other than those reporting hw errors, but not on all of them. We found at least one blade that did not produce the bnx2x_panic_dump messages when enabling the extra interfaces with similar MAC addresses.

kernel: [bnx2x_panic_dump:(609(eth3)cqe[1f4]=[0:0:0:0]

Also note that once the messages are observed, the server remains responsive but all networking is disabled. The only workaround found so far is to add "ONBOOT=no" to all but one of the ifcfg-eth* files and reboot; just restarting the network does not resolve the problem.
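The workaround described above can be scripted. This is only a minimal sketch: the function name disable_onboot_except and its two arguments are made up for illustration, and it assumes GNU sed and that the ifcfg-eth* files sit in a single directory (typically /etc/sysconfig/network-scripts on RHEL 5).

```shell
#!/bin/bash
# Hypothetical helper (not part of any Red Hat tool): set ONBOOT=no in every
# ifcfg-eth* file under DIR except the one for the interface named KEEP.
disable_onboot_except() {
    local dir="$1" keep="$2" cfg
    for cfg in "$dir"/ifcfg-eth*; do
        [ -f "$cfg" ] || continue                      # glob matched nothing
        [ "${cfg##*/}" = "ifcfg-$keep" ] && continue   # leave the kept NIC alone
        if grep -q '^ONBOOT=' "$cfg"; then
            # Rewrite an existing ONBOOT line in place (GNU sed -i).
            sed -i 's/^ONBOOT=.*/ONBOOT=no/' "$cfg"
        else
            # The report notes eth1-eth7 had no ONBOOT line, so append one.
            echo 'ONBOOT=no' >> "$cfg"
        fi
    done
}
```

It could be called as "disable_onboot_except /etc/sysconfig/network-scripts eth0", followed by a reboot, since restarting the network alone was reported not to help.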
I cannot access irish.lab.bos.redhat.com – can you please provide the dump? Thanks, Eilon
Created attachment 423849 [details] panic dump snapshot
Created attachment 423850 [details] panic dump snapshot
These are only snapshots of the error messages as seen on the console. If we reproduce the errors on another blade, please let us know whether the dump is stored on the system or whether we can produce any useful debug info for you. Thanks Eilon.
Please attach a copy of /var/log/messages containing all the messages from the bnx2x panic.
As Stanislaw wrote, the pictures only show a small snapshot of the dump. They do not show what triggered the dump, which should be further up. Can you attach the kernel log (/var/log/messages)?
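To capture what led up to the dump rather than only its tail, one option is to pull the lines preceding the first bnx2x_panic_dump entry from the log. A rough sketch using GNU grep; the function name panic_context and the default of 50 context lines are arbitrary choices for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the first bnx2x_panic_dump line found in LOG
# together with up to N lines of context before it (default 50).
panic_context() {
    local log="$1" n="${2:-50}"
    # -m1 stops at the first match; -B prints N preceding lines.
    grep -m1 -B "$n" 'bnx2x_panic_dump' "$log"
}
```

For example, "panic_context /var/log/messages 100 > bnx2x_context.txt" would save the 100 lines leading up to the first dump line for attaching to the bug.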
Blade5 reported a machine check yesterday afternoon, which showed no outward symptoms on any other blades until this morning, when others (unfortunately both of our cluster members) began reporting the "NIG timer max" errors and lost their networks. We're not sure why some blades are affected by the net traffic once the machine checks are seen elsewhere in the chassis. Some blades are up and running that show no NIG timer errors at all.

Attached are the messages from blades 1 & 2 (cluster members mgmt1 & mgmt2) and from blade5 (rhelh).
- blades 1 & 2 (mgmt1, mgmt2) reported NIG timer errors
- blade 5 (rhelh) reported the bnx2x panic dumps
Created attachment 424533 [details] messages.mgmt1
Created attachment 424535 [details] messages.mgmt2
Created attachment 424536 [details] messages.rhelh
This seems to be an interaction with bnx2i:

*Without* bnx2i loaded, starting all bnx2x NICs works, but it fails if bnx2i is loaded.
*With* bnx2i loaded, it works if we start only the eth0 NIC, so a workaround for RHEV-H installation would be to add ONBOOT=no for the NICs that were not selected. Later, however, RHEV-M could reconfigure the other NICs and hit the issue.

We'd appreciate it if Broadcom engineers could look at the bnx2i/bnx2x interaction and see what is going wrong.

Details about the setup (Tim, please fill me in if I missed something):
• HP blade ProLiant BL460c G6 SKU Number: 507864-B21 (from DMI)
• 8[1] bnx2x NICs, pci id 14e4:1650
  02:00.0 - 02:00.7 Ethernet controller: Broadcom Corporation NetXtreme II BCM57711E 10-Gigabit PCIe

dmesg|grep bnx
Broadcom NetXtreme II 5771x 10Gigabit Ethernet Driver bnx2x 1.52.1-6 (2010/02/16)
bnx2x: part number 412F4E-0-0-0
bnx2x: Loading bnx2x-e1h-5.2.13.0.fw

[1] from Tim: "one dual port 10g NIC in the blade chassis that can be used by each blade. Anything more than the 2 engages virtualization to split the IF 4 ways, dividing up the bandwidth among the virt NICs"

Note that it doesn't matter whether all NICs were enabled/connected in the blade manager or not (NICs that are not enabled have MACs starting with d8:d3:85, while enabled ones start with 00:17:a4).

Reproduce log (RHEV-H boot with "rescue" parameter):

[root@localhost /]# lsmod|grep bnx
bnx2x 626433 0
[root@localhost /]# service network start
Bringing up loopback interface: [ OK ]
Bringing up interface eth0: [ OK ]
Bringing up interface eth1: [ OK ]
Bringing up interface eth2: [ OK ]
Bringing up interface eth3: [ OK ]
Bringing up interface eth4: [ OK ]
Bringing up interface eth5: [ OK ]
Bringing up interface eth6: [ OK ]
Bringing up interface eth7: [ OK ]
Bringing up interface breth0: Determining IP information for breth0... done.
[ OK ]
[root@localhost /]# service network stop
Shutting down interface breth0: [ OK ]
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down interface eth2: [ OK ]
Shutting down interface eth3: [ OK ]
Shutting down interface eth4: [ OK ]
Shutting down interface eth5: [ OK ]
Shutting down interface eth6: [ OK ]
Shutting down interface eth7: [ OK ]
Shutting down loopback interface: [ OK ]
[root@localhost /]# modprobe bnx2i
[root@localhost /]# lsmod|grep bnx
bnx2i 74593 0
libiscsi2 77765 1 bnx2i
cnic 78297 1 bnx2i
scsi_transport_iscsi2 74073 2 bnx2i,libiscsi2
bnx2x 626433 0
scsi_mod 197593 9 bnx2i,libiscsi2,scsi_transport_iscsi2,qla2xxx,scsi_transport_fc,cciss,scsi_dh_rdac,scsi_dh,sr_mod
[root@localhost /]# service network start
Bringing up loopback interface: [ OK ]
Bringing up interface eth0: [ OK ]
Bringing up interface eth1: [ OK ]
Bringing up interface eth2: [ OK ]
Bringing up interface eth3: [ OK ]
Bringing up interface eth4: SIOCSIFFLAGS: Device or resource busy [ OK ]
Bringing up interface eth5: SIOCSIFFLAGS: Device or resource busy [ OK ]
Bringing up interface eth6: SIOCSIFFLAGS: Device or resource busy [ OK ]
Bringing up interface eth7: SIOCSIFFLAGS: Device or resource busy [ OK ]
Bringing up interface breth0: Determining IP information for breth0...PING 10.16.143.254 (10.16.143.254) from 10.16.136.7 breth0: 56(84) bytes of data.
--- 10.16.143.254 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2999ms , pipe 3
failed.
[FAILED]

*** From this state the only way to recover the NICs is a reboot; removing bnx2i didn't help ***

[root@localhost /]# rmmod bnx2i
[root@localhost /]# service network stop
Shutting down interface eth0: [ OK ]
Shutting down interface eth1: [ OK ]
Shutting down interface eth2: [ OK ]
Shutting down interface eth3: [ OK ]
Shutting down loopback interface: [ OK ]
[root@localhost /]# service network start
Bringing up loopback interface: [ OK ]
Bringing up interface eth0: SIOCSIFFLAGS: Input/output error [ OK ]
Bringing up interface eth1: SIOCSIFFLAGS: Input/output error [ OK ]
Bringing up interface eth2: SIOCSIFFLAGS: Input/output error [ OK ]
Bringing up interface eth3: [ OK ]
Bringing up interface eth4: [ OK ]
Bringing up interface eth5: [ OK ]
Bringing up interface eth6: [ OK ]
Bringing up interface eth7: [ OK ]
Bringing up interface breth0: Determining IP information for breth0...PING 10.16.143.254 (10.16.143.254) from 10.16.136.7 breth0: 56(84) bytes of data.
--- 10.16.143.254 ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2999ms , pipe 3
failed.
[FAILED]
Created attachment 425148 [details] /var/log/messages for the reproduce steps above
*** Bug 601637 has been marked as a duplicate of this bug. ***
Stanislaw, is this enough info for you and Broadcom to investigate?
(In reply to comment #16)
> Stanislaw, is this enough info for you and Broadcom to investigate?

I think so, but perhaps some more verbose debug output will be needed; let's see what the iSCSI guys say ...

Michael, we have a bnx2x firmware crash when the bnx2i module is loaded, see comment 13; the full firmware crash log is attached in comment 14.
We have reproduced this locally and are working on the issue ...

Dmitry
BZ 602694 was updated to state that this was reproduced on a RHEL 5.5 node. The RHEL 5.5 node was installed using Satellite, with a package selection of base, device-mapper-multipath, and ntp. The system registers with Satellite and performs an update. After the system booted, ONBOOT for interfaces 1-7 was set to 'no' and the system was rebooted. On that boot the bnx errors were seen. When booting the originally installed kernel, the problem is not seen.

Original kernel - initrd-2.6.18-194.el5.img
Updated kernel - initrd-2.6.18-194.3.1.el5.img
Created attachment 425695 [details] Disable statistics counters initialization for function ids greater than 1, when initializing bnx2x hw
Our customer also ran into this bnx2x panic dump when iscsi offload is enabled.
Dmitry was able to root-cause and fix this issue. Please try with his patch from comment 22 and let us know if it solves the problem for you as well. Thanks, Eilon
(In reply to comment #22)
> Created an attachment (id=425695) [details]
> Disable statistics counters initialization for function ids greater than 1,
> when initializing bnx2x hw

Thanks for the patch, Dmitry!

Tim, Alan, please test the build with the above patch to confirm that it fixes the issue in your environment: https://brewweb.devel.redhat.com/taskinfo?taskID=2538442
Tim, Alan, could you please test the above kernel ASAP?
Booted the rhel-h node to the 194.3.1 to reproduce the issue, which on occasion occurs even if the additional IFs are disabled. Although the issue did not occur at boot, enabling a couple of the IFs effectively reproduced the bnx2x panic dumps. Booted the 203 kernel with no issues. Enabled all IFs individually, still no issues. Set all IF config files to start at boot and rebooted. All IFs attempt to start but only the configured bridge obtains an IP, as expected. No sign of bnx2x panics.
OK, just posted to RKML since the issue is urgent for us. Dmitry, could you be so kind as to post the patch upstream? It would be nice if Broadcom could QA the patch. I did some tests of network traffic but did not test iSCSI.
The patch has been applied upstream; we'll perform additional testing with L2 and iSCSI. Thanks
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Upon startup, the bnx2x network driver experienced a panic dump when more than one network interface was configured to start up at boot time. With this update, statistics counter initialization for function IDs greater than "1" has been disabled, with the result that bnx2x no longer panic dumps when more than one interface has the "ONBOOT=yes" directive set.
in kernel-2.6.18-205.el5

You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Not sure yet what went wrong w/the release script, but that should have been "in kernel-2.6.18-204.el5" (in build 204, not 205).
We were just bitten by this bug with the same hardware running kernel-2.6.18-194.11.4. Any status update on when this patch is going to be released?
The patch for that bug (from comment 22) has been in the kernel since 2.6.18-194.8.1.el5, so your problem is most likely a different bug with similar symptoms. Please open a new bug report for it. BTW: does blacklisting the bnx2i and cnic modules help?
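For anyone wanting to try the blacklisting suggestion, the sketch below appends "blacklist" directives to a modprobe configuration file. The helper name blacklist_modules and the file path in the usage line are assumptions for illustration; check where your release reads modprobe configuration. Note also that "blacklist" only stops alias-based autoloading, so an explicit "modprobe bnx2i" (for example from an init script) would still load the module.

```shell
#!/bin/bash
# Hypothetical helper: append "blacklist <module>" lines to CONF for each
# named module, skipping any module that is already listed there.
blacklist_modules() {
    local conf="$1" mod
    shift
    for mod in "$@"; do
        # grep fails (and we append) if the file is missing or lacks the line.
        grep -qx "blacklist $mod" "$conf" 2>/dev/null || echo "blacklist $mod" >> "$conf"
    done
}
```

A possible invocation, with an assumed file name, is "blacklist_modules /etc/modprobe.d/blacklist-bnx2i bnx2i cnic", followed by a reboot to confirm that neither module shows up in lsmod.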
It turns out the module panic happened right after the new kernel (2.6.18-194.8.1) was installed and was in mid-reboot. We'll watch for any other panics as we update systems and create a new bug report if we see any panics after the new kernel is running.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-0017.html