Bug 601637

Summary: bnx2x_hw_stats_update 'NIG timer max' results in disabled IF
Product: Red Hat Enterprise Linux 5
Version: 5.5
Component: kernel
Status: CLOSED DUPLICATE
Severity: high
Priority: urgent
Reporter: Tim Wilkinson <twilkins>
Assignee: Stanislaw Gruszka <sgruszka>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: apevec, Dmitry.Kravkov, eilong, sreichar, vladz
Target Milestone: rc
Hardware: All
OS: Linux
Doc Type: Bug Fix
Doc Text: Bumped priority and severity since this is the current blocker for an important Cloud BU project
Last Closed: 2010-06-18 15:34:44 UTC

Attachments: Logs with bnx2x debug=0x02003f

Description Tim Wilkinson 2010-06-08 11:21:33 UTC
Description of problem:
----------------------
After using an HP bladesystem without problem for several months of repeated automation testing that includes yum updates, we've been blocked by an error message after testing this past weekend (6-Jun). While the message repeats annoyingly every couple of seconds, it also results in the disabling of the public interface.



Version-Release:
---------------
2.6.18-194.el5 #1 SMP Tue Mar 16 21:52:39 EDT 2010 x86_64



How reproducible:
----------------
Consistent



Steps to Reproduce:
------------------
1. Boot server blade
2. Observe error below



Actual results: 
--------------
 ...
Jun  7 16:42:38 mgmt1 avahi-daemon[6193]: Registering new address record for 10.16.136.20 on cloud0.
Jun  7 16:42:39 mgmt1 clurgmgrd[12088]: <notice> Service service:rhev-nfs started
Jun  7 16:50:22 mgmt1 kernel: [bnx2x_hw_stats_update:3972(eth0)]NIG timer max (1)
Jun  7 16:50:23 mgmt1 kernel: [bnx2x_hw_stats_update:3972(eth0)]NIG timer max (2)
 ...
 [error repeats consistently once observed]



Expected results:
----------------
 ...
Jun  7 16:51:23 mgmt2 avahi-daemon[6192]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::5c9b:96ff:fe48:7665.
Jun  7 16:51:23 mgmt2 avahi-daemon[6192]: Registering new address record for fe80::5c9b:96ff:fe48:7665 on vnet0.
Jun  7 16:51:34 mgmt2 kernel: kvm: 10958: cpu0 unimplemented perfctr wrmsr: 0x186 data 0x130079
Jun  7 16:51:34 mgmt2 kernel: kvm: 10958: cpu0 unimplemented perfctr wrmsr: 0xc1 data 0xffd74ea6
...
Jun  7 16:51:35 mgmt2 kernel: kvm: 10958: cpu3 unimplemented perfctr wrmsr: 0x186 data 0x130079
Jun  7 16:51:36 mgmt2 kernel: cloud0: topology change detected, propagating
Jun  7 16:51:36 mgmt2 kernel: cloud0: port 2(vnet0) entering forwarding state
Jun  7 17:30:53 mgmt2 named[5400]: listening on IPv4 interface virbr0, 192.168.122.1#53
Jun  7 17:30:53 mgmt2 named[5400]: binding TCP socket: address in use
Jun  7 17:40:13 mgmt2 init: Trying to re-exec init
 ...
 [normal boot sequence continues]



Additional info:
---------------
The blades are updated to the latest patches each time the sequence is tested. We have not updated the systems since Saturday. Two blades are running RH Cluster Suite with several KVM VM services, an ext2 NFS Export service, and a shared GFS2 volume housing the VM config files.

In our current situation we can log into the node exhibiting the problem via the cluster interconnect from the other cluster member. There is nothing obvious to us in the messages prior to the error's appearance.

The blades are available for access if required.

Comment 1 Tim Wilkinson 2010-06-08 11:29:28 UTC
*** Bug 601634 has been marked as a duplicate of this bug. ***

Comment 2 Stanislaw Gruszka 2010-06-08 13:10:42 UTC
Please try the kernel build below (once it finishes compiling):
https://brewweb.devel.redhat.com/taskinfo?taskID=2502010
If it does not help, please provide instructions on how to get access to the blades, thanks.

Comment 3 Tim Wilkinson 2010-06-08 14:57:38 UTC
It does not appear to have made a difference. Cluster nodes are:

  mgmt1.cloud.lab.eng.bos.redhat.com (10.16.136.10) [24^gold]
  mgmt2.cloud.lab.eng.bos.redhat.com (10.16.136.15) [24^gold]

mgmt2 is accessible and you can then get to mgmt1 via the interconnect, mgmt1-ic.

Comment 4 Steve Reichard 2010-06-09 12:29:07 UTC
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Bumped priority and severity since this is the current blocker for an important Cloud BU project

Comment 5 Steve Reichard 2010-06-09 21:46:38 UTC
The attempt to reproduce on the same system by re-installing was not successful; however, another blade that is part of the cluster is seeing the problem. I rebooted and it still saw the problem. The counter does not seem to loop as we saw in the first instance, but it will increment for a bit when I do an ifup eth0.

This system has no second network for you to log into, so use the console via the VNC session from previous correspondence.

Blade 3 is the one currently with the issue.

Comment 6 Stanislaw Gruszka 2010-06-10 10:21:48 UTC
Hello Broadcom

We have a strange issue on BL460c G6 blades with BCM 57711E devices. The device is not able to transmit/receive packets and we have lots of "NIG timer max" messages in dmesg. The issue happens at random after system provisioning. Sometimes the problem can be triggered after system installation, sometimes not. If it happens, it is consistent across reboots. The problem is observed on devices with a 10Gbit link; the other port in the system, with a 2.5Gbit connection, works fine.

This is a RHEL 5.5 regression. The problem did not happen with the 2.6.18-164 kernel; it showed up with 2.6.18-194, which updated bnx2x to 1.52.1-6 (firmware 5.2.13.0).

We tried the "bnx2x: Protect code with NOMCP" upstream patch; it does not help.

Comment 7 Stanislaw Gruszka 2010-06-10 10:25:39 UTC
Created attachment 422851 [details]
Logs with bnx2x debug=0x02003f

These are log messages with a debug mask I think is appropriate. We can provide more detailed logs or other information if needed.
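
For reference, "debug=0x02003f" above is presumably the driver's debug (msglevel) module parameter; a rough sketch of how such a mask would be applied when loading the driver:

  modprobe bnx2x debug=0x02003f
  # or persistently on RHEL 5, via a line in /etc/modprobe.conf:
  #   options bnx2x debug=0x02003f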

Comment 8 Stanislaw Gruszka 2010-06-10 10:37:34 UTC
(In reply to comment #5)
> This system has no second network for you to log into, so use the console via
> the VNC session from previous correspondence.
> 
> Blade 3 is the one currently with the issue.    

On this system, reloading the bnx2x module results in a bnx2x panic. I have a log with a firmware dump in the file /root/messages.bnx2x_panic. If possible, please copy it and attach it to this bugzilla (since there is no working network, I don't know how to do this). The firmware dump can be useful information for Broadcom.
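
To spell out what I mean by reloading (roughly the usual module cycle, run on the affected blade):

  modprobe -r bnx2x    # unload the driver
  modprobe bnx2x       # load it again; the panic was hit somewhere in this cycle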

One more thing: after bnx2x panics, the system gets 'NMI for unknown reason' + Uncorrectable Machine Check Exceptions and hangs.

Comment 9 Stanislaw Gruszka 2010-06-10 12:20:50 UTC
Once we hit the problem, switching back to the old kernel 2.6.18-164 does not help; we also cannot set up the interface.

In the ethtool statistics the NIG timer is called "rx_constant_pause_events". I suppose the problem here is that IEEE 802.3x pause frames are permanently being sent to us by the switch. This explains why we can sometimes get the device working again: the switch simply stops sending Pause.
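
A quick way to keep an eye on this counter on an affected node (a sketch; eth0 is the interface from the logs above):

  ethtool -S eth0 | grep -i pause    # watch whether rx_constant_pause_events keeps climbing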

So we have 3 potential causes of breakage:
1) In the updated bnx2x we do something wrong that makes the switch permanently send Pause.
2) In bnx2x we do not handle Pause properly and cannot bring the switch port back to normal operation.
3) The switch is broken.

Question for the reporters: were there any changes to the switch (firmware update, re-configuration)?

Comment 10 Tim Wilkinson 2010-06-10 12:48:19 UTC
(In reply to comment #9)
> Question for the reporters: were there any changes to the switch (firmware
> update, re-configuration)?

None whatsoever.

Comment 11 Eilon Greenstein 2010-06-10 13:03:34 UTC
Hi,

As always, Stanislaw is right on the spot (in comment 9). The problem reported by the bnx2x driver is an indication that we have packet(s) to send and we cannot transmit even one packet to the network for a long time (1 second). To prevent FW issues with internal queues, we drop the pending packets for 100ms, so if the network is really dead and we constantly try to transmit, we will see this counter increment every 1.1 seconds.

In the past, we saw such issues in a blade environment when one of the blades died (BSOD under Windows or PSOD under VMware, and even an internal FW error under Linux in one of the earlier versions of bnx2x) and the other blades in the chassis started reporting that the switch was sending constant pause. Once the "problematic" blade was fixed (restarted, powered down, taken out, etc.), all the other blades relaxed. We never saw cases in which the reporting blade was the one to blame, so I suggest looking at the other blades in the chassis and trying to see if one of them is no longer responding, with a different error message.

Regards,
Eilon

Comment 12 Stanislaw Gruszka 2010-06-10 14:27:17 UTC
(In reply to comment #11)
> In the past, we saw such issues in a blade environment when one of the blades
> died (BSOD under Windows or PSOD under VMware, and even an internal FW error
> under Linux in one of the earlier versions of bnx2x)

Ha, blade 3, which we talked about in comment 5 and comment 8, is suspicious. Tim, Steve, please remove it and see if that helps, and if possible please get the messages.bnx2x_panic file from it.

Comment 13 Stanislaw Gruszka 2010-06-14 08:32:14 UTC
The question is whether the broken blade plaguing the other blades with pause frames is a hardware problem or a firmware/driver problem. It is suspicious that the problems show up after the RHEL update to 5.5 - that suggests a firmware/driver malfunction.

Tim opened a new bug report describing the crash on the broken blade, bug 602402. We have only clips from the firmware crash dump there; we are working on providing a full dump.

Tim told me that the bnx2x panic described in bug 602402 cannot be reproduced on other blades. It would be interesting to know whether the broken blade has any differences compared to the others (configuration, hardware, network connections, etc.).

FYI: there is also one other blade in the cluster that has Machine Check Exceptions in its logs from last Monday.

Comment 14 Tim Wilkinson 2010-06-14 13:44:16 UTC
(In reply to comment #13)
> The question is whether the broken blade plaguing the other blades with pause
> frames is a hardware problem or a firmware/driver problem. It is suspicious
> that the problems show up after the RHEL update to 5.5 - that suggests a
> firmware/driver malfunction.
> 
> Tim opened a new bug report describing the crash on the broken blade, bug
> 602402. We have only clips from the firmware crash dump there; we are working
> on providing a full dump.

To clarify, bug 602402 was opened to report *different* error messages (not the NIG timer max errors) when we enable currently disabled interfaces ...

  kernel: [bnx2x_panic_dump:(609(eth3)cqe[1f4]=[0:0:0:0]

... and while there is no crash when we see these errors and the system *is functional*, all networking is disabled. The only workaround when this occurs appears to be to add "ONBOOT=no" to all but one of the network config files and reboot. (Note that just restarting the network will not resolve it.)
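
For reference, that amounts to something like the following in each unused interface's config file (the standard location on these systems being /etc/sysconfig/network-scripts/ifcfg-ethN, with ethN a placeholder for each extra interface):

  DEVICE=eth1      # placeholder name; repeat for every interface except the one kept up
  ONBOOT=no        # do not bring this interface up at boot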

------

THIS bug (601637) may be similar (and ultimately may have the same root cause) but has to do with error messages we cannot produce by any known activity ...

  kernel: [bnx2x_hw_stats_update:3972(eth0)]NIG timer max (1)
  kernel: [bnx2x_hw_stats_update:3972(eth0)]NIG timer max (2)
  kernel: [bnx2x_hw_stats_update:3972(eth0)]NIG timer max (3)

When they occur, they are immediately observed when the system is booting after it registers its new address record (avahi-daemon) and starts a cluster NFS export service (clurgmgrd). The system is not functional once these messages are observed and must be power cycled.

Because we have observed machine checks due to memory errors on two of the 16 blades, they were powered down and factored out of the environment for the time being. In our first rebuild of the testbed since that time, we have yet to see the "NIG timer max" errors.


> 
> Tim told me that the bnx2x panic described in bug 602402 cannot be reproduced
> on other blades. It would be interesting to know whether the broken blade has
> any differences compared to the others (configuration, hardware, network
> connections, etc.).

This is incorrect; forgive me if I misled you.

The errors seen in bug 602402 [bnx2x_panic_dump messages when enabling the extra interfaces] have been reproducible on *almost* any blade on which we enabled previously disabled interfaces. We did note that at least one blade would not generate the bnx2x_panic_dump messages regardless of how many IFs we enabled.

Comment 15 Eilon Greenstein 2010-06-14 14:05:15 UTC
While bug 602402 sounds like a real issue (unfortunately, I cannot access irish.lab.bos.redhat.com so I cannot add any information about that bug), this issue sounds like an expected side effect. Due to the failure in another blade, the blade reporting the "NIG timer" message simply cannot send any packet to the network; this is a serious network issue (the network is "dead" and sending constant pause messages), so IMHO it is worth drawing the user's attention to it.
I think we should focus on bug 602402 since it is the real cause.

Comment 16 Tim Wilkinson 2010-06-14 14:17:58 UTC
I was incorrect regarding the functionality of the server once the error is observed, in that I said the systems were not functional and required a reset. To correct that: the presence of the "NIG timer" message disables the public IF only, and we are able to connect to the system via the cluster interconnect IF.

Comment 17 Eilon Greenstein 2010-06-14 14:24:33 UTC
It is actually the other way around: since the network is not functional, you see the counter error message.

Comment 18 Alan Pevec 2010-06-18 15:01:18 UTC
(In reply to comment #17)
> It is actually the other way around: since the network is not functional, you
> see the counter error message.    

So this can be closed as a duplicate of bug 602402?

Comment 19 Tim Wilkinson 2010-06-18 15:06:30 UTC
I am confident that the issues are the same. If not, we will re-open as needed, but this can be closed as a dup.

Comment 20 Alan Pevec 2010-06-18 15:34:44 UTC

*** This bug has been marked as a duplicate of bug 602402 ***