Bug 800556 - Server becomes unresponsive on network, recovers after several hours
Status: CLOSED DUPLICATE of bug 511368
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: unspecified  Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: John Feeney
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks:
Reported: 2012-03-06 12:47 EST by Lynn
Modified: 2013-01-10 14:51 EST (History)
7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-01-10 14:51:48 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Lynn 2012-03-06 12:47:42 EST
Description of problem:
Server becomes unresponsive twice a week; we have ruled out the application. This has been ongoing since November, and I am opening this on behalf of the Linux/UNIX organization.

Server is available by console only and normally recovers on its own in 3-4 hours (s10014).

To date:
1.      The rx_fw_discards counter (ethtool -S eth0) is 0 and eth0 is currently running in APIC mode, hence we can be certain that we are not being hit by the MSI-X bug.

2.      Network utilization is not too high, only about 25 – 30% of bandwidth utilized during 4:18 – 4:21 am. The issue started occurring at 4:33 – 4:45; during this period network utilization was very low.

3.      This issue apparently affects the network connectivity to all mounted NFS filesystems, in this case, all 3 filers were not accessible (ushovfstd017, ushovfstd010, ushovfsep153).

4.      Three minutes before the NFS issue occurs, the IBM Tivoli DNS monitoring script reported errors with the DNS. The majority of the servers in Houston did not exhibit this error during the time frame, though I have identified 7 other servers that experience similar DNS errors, albeit in different time frames.

5.      There was a spike in TCP/UDP socket utilization (between 4:20 – 4:40am) though the utilization was not high (about 532 sockets used during the incident). 

6.      No network errors or collisions reported on the network interface

7.      There was a spike in NFS traffic between 4:20 – 4:30 am though the utilization was not high (compared to Oracle servers).

8.      Memory utilization increased by 26% between 4:10 – 4:20 am, possibly due to pmdtm jobs running at the time.

9.      Overall CPU utilization increased by 10 – 15% between 4:10 – 05:00 am.

10.     No network interface flapping (NIC cards replaced twice, and routers changed, with no impact on this issue).

11.     S10011 is in the Grid (Active:Active configuration) with this server and does not show any of these issues.
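For reference, check 1 above amounts to pulling the rx_fw_discards counter out of `ethtool -S` output. A minimal shell sketch (the sample stats below are hypothetical stand-ins; on the live server the input would come from `ethtool -S eth0`):

```shell
#!/bin/sh
# Stand-in for `ethtool -S eth0` output (values hypothetical).
stats='     rx_bytes: 123456789
     rx_fw_discards: 0
     rx_crc_errors: 0'

# Pull out the rx_fw_discards counter.
discards=$(printf '%s\n' "$stats" | awk '/rx_fw_discards:/ {print $2}')

# A zero counter means the NIC firmware is not dropping frames, which is
# the symptom that would otherwise point at the MSI-X lockup bug.
if [ "$discards" -eq 0 ]; then
    echo "rx_fw_discards=0: firmware is not dropping frames"
else
    echo "rx_fw_discards=$discards: investigate firmware drops"
fi
```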

Version-Release number of selected component (if applicable):

2.6.18-128.23.2.el5

How reproducible:
Every Thursday and Friday night under modest network load.

Stressing with backups / Informix sync to another server (separately and together) fails to duplicate the issue.

Steps to Reproduce: None are successful
Actual results:


Expected results:


Additional info: We are at the point of changing the IP or rebuilding the server; as this is a production server, this is not desired.

Note: per other bugs read, both servers are at the same OS level (firmware on the NICs differs; s10011 works, s10014 is the server with the issue).

dmidecode | grep Product
        Product Name: IBM 3850 M2 / x3950 M2 -[7233AC1]-

[root@houic-n-s10014 log]# lspci -v | grep Broadcom
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

[root@houic-n-s10011 ~]# lspci -v | grep Broadcom
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 01)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 01)

on 14
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: IBM Unknown device 037c
        Flags: bus master, fast devsel, latency 0, IRQ 193
        Memory at f4000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number 5a-77-c4-fe-ff-5e-21-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel

on 11
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 01)
        Subsystem: IBM Unknown device 037c
        Flags: bus master, fast devsel, latency 0, IRQ 58
        Memory at f2000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable+
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=8
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number ac-c1-35-fe-ff-64-1a-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel
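The meaningful difference between the two listings above is the interrupt capability lines. A small sketch of pulling just the MSI-X enable bit out of that text (the excerpt is pasted from the s10014 listing above; on a live system you would pipe `lspci -vv -s 02:00.0` instead):

```shell
#!/bin/sh
# Excerpt of the s10014 lspci output above.
lspci_out='        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9'

# Grab the MSI-X line; Enable+ means enabled, Enable- means disabled.
msix=$(printf '%s\n' "$lspci_out" | grep 'MSI-X')
case "$msix" in
    *Enable+*) state="enabled"  ;;
    *Enable-*) state="disabled" ;;
    *)         state="unknown"  ;;
esac
echo "MSI-X is $state"
```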
Comment 1 Lynn 2012-03-07 14:03:00 EST
This has been happening since March 2011; it was initially ignored and the server rebooted.

We upgraded the server from 4.X to 5.3 in November 2011 and migrated from SAN storage to NAS. This change did not stop the situation; the middleware team has this same configuration in dev and test environments, which does not produce a hang or require a reboot to bring the server back onto the network quickly.
Comment 2 yuping zhang 2012-03-13 09:17:17 EDT
Hi Lynn,

Thanks for filing this bug.
Is the server running your RHEL 5.3 OS? Do you use virt-manager?
You filed this issue with the virt-manager component, but I didn't find any information about virt-manager.

Thanks
Yuping
Comment 3 Cole Robinson 2012-04-02 16:53:50 EDT
Since there is no mention of virt, I'm assuming this was misfiled. Moving to kernel. But if this is a virt issue, it should probably be filed against qemu-kvm.

Lynn, please provide the info requested in comment #2
Comment 4 Lynn 2012-04-02 17:28:18 EDT
#2: Grid-configured Informatica server(s), set active/active, with an NFS filesystem on the filer going unavailable and console access only.
Comment 6 John Feeney 2012-10-15 14:09:21 EDT
I am pretty sure this was fixed in RHEL5.6 where a fix was added to the bnx2
driver to avoid the situation where the 5709 NIC would lock up. See
errata http://errata.devel.redhat.com/errata/show/9700

That errata fixed a situation, reported by a number of customers, where a 5709
NIC with MSI-X enabled would periodically lock up when a specific set of
events occurred simultaneously. This sounds pretty much like what you have experienced.

I am sorry you had to wait for this info, but I just noticed this bugzilla.
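One quick way to tell whether a given host predates that fix is to compare its kernel against the RHEL 5.6 baseline. A sketch, assuming the RHEL 5.6 kernel carrying the bnx2 fix is 2.6.18-238.el5 (on a live host `running` would come from `uname -r`):

```shell
#!/bin/sh
# Kernel from this report (RHEL 5.3 stream); on a live host: running=$(uname -r)
running="2.6.18-128.23.2.el5"
# Assumed RHEL 5.6 baseline kernel carrying the bnx2 fix.
fixed="2.6.18-238.el5"

# sort -V orders version strings; if the running kernel sorts first and
# differs from the baseline, it predates the fix.
oldest=$(printf '%s\n%s\n' "$running" "$fixed" | sort -V | head -n1)
if [ "$oldest" = "$running" ] && [ "$running" != "$fixed" ]; then
    echo "kernel $running predates the RHEL 5.6 bnx2 fix"
else
    echo "kernel $running includes the RHEL 5.6 bnx2 fix"
fi
```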
Comment 7 John Feeney 2013-01-07 12:40:10 EST
Any update on this?
Comment 8 Lynn 2013-01-07 14:55:56 EST
Yes; a new note: an SAP data dump to this server is impacting read/write disk activity and causing a serious increase in paging, to 35k+/second.

The end user has been asked to purchase memory and/or run only this single DB instance per server to raise performance to a supportable level.

Case can be closed.

Now on 5.7, but this is a product of Informatica usage and consolidation.

Can be closed.
Comment 9 John Feeney 2013-01-10 14:51:48 EST
Per comment #8, I am closing this bugzilla as CLOSED DUPLICATE of bz511368.

*** This bug has been marked as a duplicate of bug 511368 ***
