Bug 601637

| Field | Value |
|---|---|
| Summary | bnx2x_hw_stats_update 'NIG timer max' results in disabled IF |
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.5 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | high |
| Priority | urgent |
| Reporter | Tim Wilkinson <twilkins> |
| Assignee | Stanislaw Gruszka <sgruszka> |
| QA Contact | Red Hat Kernel QE team <kernel-qe> |
| CC | apevec, Dmitry.Kravkov, eilong, sreichar, vladz |
| Target Milestone | rc |
| Target Release | --- |
| Doc Type | Bug Fix |
| Doc Text | Bumped priority and severity since this is the current blocker for an important Cloud BU project |
| Story Points | --- |
| Last Closed | 2010-06-18 15:34:44 UTC |
| Type | --- |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Attachments | Logs with bnx2x debug=0x02003f (attachment 422851) |
Description (Tim Wilkinson, 2010-06-08 11:21:33 UTC)
*** Bug 601634 has been marked as a duplicate of this bug. ***

Please try the kernel build below (when it finishes compiling): https://brewweb.devel.redhat.com/taskinfo?taskID=2502010 If it does not help, please provide instructions on how to get access to the blades, thanks.

It does not appear to have made a difference. Cluster nodes are:
mgmt1.cloud.lab.eng.bos.redhat.com (10.16.136.10) [24^gold]
mgmt2.cloud.lab.eng.bos.redhat.com (10.16.136.15) [24^gold]
mgmt2 is accessible and you can then get to mgmt1 via the interconnect, mgmt1-ic.

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Bumped priority and severity since this is the current blocker for an important Cloud BU project.

The attempt to reproduce on the same system by re-installing was not successful; however, another blade that is part of the cluster is seeing the problem. I rebooted and it still shows the problem. The counter does not seem to loop as we saw on the first instance, but it will increment for a bit when I do an if-up on eth0. This system has no second network for you to be able to log into, so use the console in the VNC from previous correspondence. Blade 3 is the one currently with the issue.

Hello Broadcom, we have a strange issue on BL460c G6 blades with BCM 57711E devices. The device is not able to transmit/receive packets and we have lots of "NIG timer max" messages in dmesg. The issue appears at random after system provisioning: sometimes the problem can be triggered after system installation, sometimes not. If it happens, it is consistent across reboots. The problem is observed on devices with a 10Gbit link; the other port in the system, with a 2.5Gbit connection, works fine. This is a RHEL 5.5 regression. The problem did not happen with the 2.6.18-164 kernel; it showed up with 2.6.18-194, which updated bnx2x to 1.52.1-6 (firmware 5.2.13.0). We tried the "bnx2x: Protect code with NOMCP" upstream patch; it does not help.

Created attachment 422851 [details]
Logs with bnx2x debug=0x02003f
Log messages with a debug mask that I think is appropriate. We can provide more detailed logs or other information if needed.
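For context on how logs like the attached ones are typically gathered, here is a minimal sketch assuming the mask referenced above is passed as the bnx2x module's debug parameter; the interface name and output path are illustrative, not from this bug:

```sh
# Reload bnx2x with the debug mask used for the attached logs;
# the meaning of the individual bits depends on the driver version.
modprobe -r bnx2x
modprobe bnx2x debug=0x02003f

# Bring the interface up and collect the driver messages.
ifup eth0
dmesg | grep -E 'bnx2x|NIG timer' > /tmp/bnx2x-debug.log
```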
(In reply to comment #5)
> This system has no second network for you to be able to log into, so use the
> console in the VNC from previous correspondence.
>
> Blade 3 is the one currently with the issue.

On this system, reloading the bnx2x module results in a bnx2x module panic. I have a log with the firmware dump in the file /root/messages.bnx2x_panic. If possible, please copy it and attach it to this bugzilla (since there is no working network, I don't know how to do this). The firmware dump can be useful information for Broadcom. One more thing: after bnx2x panics, the system gets 'NMI for unknown reason' plus Uncorrectable Machine Check Exceptions and hangs. Once we hit the problem, switching back to the old 2.6.18-164 kernel does not help; we cannot set up the interface there either.

In the ethtool statistics the NIG timer is called "rx_constant_pause_events". I suppose the problem here is that IEEE 802.3x pause frames are being sent to us permanently by the switch. This explains why we can sometimes get the device working again: the switch simply stops sending Pause. So we have three potential causes of breakage:
1) In the updated bnx2x we do something wrong that makes the switch permanently send Pause.
2) In bnx2x we do not handle Pause properly and cannot bring the switch port back to normal operation.
3) The switch is broken.
Question for the reporters: were there any changes to the switch (firmware update, re-configuration)?

(In reply to comment #9)
> Question for the reporters: were there any changes to the switch (firmware update,
> re-configuration)?

None whatsoever.

Hi, as always, Stanislaw is right on the spot (in comment 9). The problem reported by bnx2x is an indication that we have packet(s) to send and we cannot transmit even one packet to the network for a long time (1 second). To prevent FW issues with internal queues, we drop the pending packets for 100 ms, so if the network is really dead and we constantly try to transmit, we will see this counter increment every 1.1 seconds. In the past, we saw such issues in blade environments when one of the blades died (BSOD under Windows or PSOD under VMware, and even an internal FW error under Linux in one of the earlier versions of bnx2x) and the other blades in the chassis started reporting that the switch was sending constant pause. Once the "problematic" blade was fixed (restarted, powered down, taken out, etc.), all the other blades relaxed. We never saw cases in which the reporting blade was the one to blame, so I suggest looking at the other blades in the chassis and trying to see whether one of them is no longer responding, with a different error message. Regards, Eilon

(In reply to comment #11)
> In the past, we saw such issues in blade environments when one of the blades
> died (BSOD under Windows or PSOD under VMware, and even an internal FW error under
> Linux in one of the earlier versions of bnx2x)

Ha, blade 3, which we talked about in comment 5 and comment 8, is suspicious. Tim, Steve, please remove it and see if that helps, and if possible please get the messages.bnx2x_panic file from it.

The question is whether a broken blade plaguing the other blades with pause frames is a hardware problem or a firmware/driver problem. It is suspicious that the problems show up after the RHEL update to 5.5; that suggests a firmware/driver malfunction.

Tim opened a new bug report describing the crash on the broken blade in bug 602402. We have only clips from the firmware crash dump there; we are working on providing a full dump. Tim told me that the bnx2x panic described in bug 602402 cannot be reproduced on other blades.
It would be interesting to know whether the broken blade has any differences compared to the others (configuration, hardware, network connections, etc.). FYI: there is also one other blade in the cluster that has Machine Check Exceptions in its logs from last Monday.

(In reply to comment #13)
> The question is whether a broken blade plaguing the other blades with pause frames is
> a hardware problem or a firmware/driver problem. It is suspicious that the problems
> show up after the RHEL update to 5.5; that suggests a firmware/driver malfunction.
>
> Tim opened a new bug report describing the crash on the broken blade in bug 602402.
> We have only clips from the firmware crash dump there; we are working on providing
> a full dump.

To clarify, bug 602402 was opened to report *different* error messages (not the NIG timer max errors) when we enable interfaces that are currently disabled ...

kernel: [bnx2x_panic_dump:(609(eth3)cqe[1f4]=[0:0:0:0] ...

and while there is no crash when we see these errors and the system *is functional*, all networking is disabled. The only workaround when this occurs appears to be to add "ONBOOT=no" to all but one of the network config files and reboot. (Note that just restarting the network will not resolve it.)

------

THIS bug (601637) may be similar (and ultimately may have the same root cause) but has to do with error messages we cannot provoke by any known activity ...

kernel: [bnx2x_hw_stats_update:3972(eth0)]NIG timer max (1)
kernel: [bnx2x_hw_stats_update:3972(eth0)]NIG timer max (2)
kernel: [bnx2x_hw_stats_update:3972(eth0)]NIG timer max (3)

When they occur, they are observed immediately while the system is booting, after it registers its new address record (avahi-daemon) and starts a cluster NFS export service (clurgmgrd). The system is not functional once these messages are observed and must be power cycled.

Because we have observed machine checks due to memory errors on two of the 16 blades, they were powered down and factored out of the environment for the time being. In our first rebuild of the testbed since that time, we have yet to see the "NIG timer max" errors.

> Tim told me that the bnx2x panic described in bug 602402 cannot be reproduced on
> other blades. It would be interesting to know whether the broken blade has any
> differences compared to the others (configuration, hardware, network connections, etc.).

This is incorrect, forgive me if I misled you. The errors seen in bug 602402 [bnx2x_panic_dump messages when enabling the extra interfaces] have been reproducible on *almost* any blade on which we enabled previously disabled interfaces. We did note that at least one blade would not generate the bnx2x_panic_dump messages regardless of how many IFs we enabled.

While bug 602402 sounds like a real issue (unfortunately, I cannot access irish.lab.bos.redhat.com so I cannot add any information about that bug), this issue sounds like an expected side effect. Due to the failure in another blade, the blade reporting the "NIG timer" message simply cannot send any packet to the network; this is a serious network issue (the network is "dead" and sending constant pause messages), so IMHO it is worth drawing the user's attention to it. I think we should focus on bug 602402 since it is the real cause.

I was incorrect regarding the functionality of the server once the error is observed, in that I said the systems were not functional and required a reset. To correct that: the presence of the "NIG timer" message disables the public IF only, and we are able to connect to the system via the cluster interconnect IF.
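For anyone triaging a similar report, the counter and workaround discussed in the comments above can be checked with standard commands. This is an illustrative sketch only, with example interface names; the statistic name comes from comment 9 and the ONBOOT workaround from the bug 602402 discussion:

```sh
# The "NIG timer max" counter appears in the ethtool statistics as
# rx_constant_pause_events (see comment 9).
ethtool -S eth0 | grep rx_constant_pause_events

# Inspect the current flow-control (pause) configuration on the port.
ethtool -a eth0

# Workaround noted for bug 602402: leave all but one interface down at boot
# by setting ONBOOT=no in the other ifcfg files, then reboot.
sed -i 's/^ONBOOT=.*/ONBOOT=no/' /etc/sysconfig/network-scripts/ifcfg-eth1
```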
It is actually the other way around: since the network is not functional, you see the counter error message.

(In reply to comment #17)
> It is actually the other way around: since the network is not functional, you
> see the counter error message.

So this can be closed as a duplicate of bug 602402? I am confident that the issues are the same. If not, we will re-open as needed, but this can be closed as a dup.

*** This bug has been marked as a duplicate of bug 602402 ***