Bug 800556 - Server becomes unresponsive on network, recovers after several hours
Status: CLOSED DUPLICATE of bug 511368
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: unspecified  Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: John Feeney
QA Contact: Red Hat Kernel QE team
Depends On:
Blocks:
Reported: 2012-03-06 12:47 EST by Lynn
Modified: 2013-01-10 14:51 EST (History)
7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-01-10 14:51:48 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Lynn 2012-03-06 12:47:42 EST
Description of problem:
Server becomes unresponsive twice a week; we have ruled out the application. This has been ongoing since November, and I am opening this on behalf of the Linux/UNIX organization.

Server is available by console only and normally recovers on its own in 3-4 hours (s10014).

To date:
1.      The rx_fw_discards counter (ethtool -S eth0) is 0 and eth0 is currently running in APIC mode, hence we can be certain that we are not being hit by the MSI-X bug.

2.      Network utilization is not too high, only about 25 – 30% of bandwidth utilized during 4:18 – 4:21 am. The issue started occurring at 4:33 – 4:45; during this period network utilization was very low.

3.      This issue apparently affects the network connectivity to all mounted NFS filesystems, in this case, all 3 filers were not accessible (ushovfstd017, ushovfstd010, ushovfsep153).

4.      Three minutes before the NFS issue occurs, the IBM Tivoli DNS monitoring script reported errors with the DNS. The majority of the servers in Houston did not exhibit this error during the time frame, though I have identified 7 other servers that experience similar DNS errors, albeit in different time frames.

5.      There was a spike in TCP/UDP socket utilization (between 4:20 – 4:40am) though the utilization was not high (about 532 sockets used during the incident). 

6.      No network errors or collisions reported on the network interface

7.      There was a spike in NFS traffic between 4:20 – 4:30 am though the utilization was not high (compared to Oracle servers).

8.      Memory utilization increased by 26% between 4:10 – 4:20 am, possibly due to pmdtm jobs running at the time.

9.      Overall CPU utilization increased by 10 – 15% between 4:10 – 05:00 am.

10.     No network interface flapping (NIC cards replaced twice, and routers changed, with no impact on this issue).

11.     S10011 is in the Grid (Active:Active configuration) with this server and does not show any of these issues.
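For reference, check 1 above amounts to pulling the rx_fw_discards counter out of `ethtool -S` output. A minimal shell sketch (the sample stats below are hypothetical stand-ins; on the live server the input would come from `ethtool -S eth0`):

```shell
#!/bin/sh
# Stand-in for `ethtool -S eth0` output (values hypothetical).
stats='     rx_bytes: 123456789
     rx_fw_discards: 0
     rx_crc_errors: 0'

# Pull out the rx_fw_discards counter.
discards=$(printf '%s\n' "$stats" | awk '/rx_fw_discards:/ {print $2}')

# A zero counter means the NIC firmware is not dropping frames, which is
# the symptom that would otherwise point at the MSI-X lockup bug.
if [ "$discards" -eq 0 ]; then
    echo "rx_fw_discards=0: firmware is not dropping frames"
else
    echo "rx_fw_discards=$discards: investigate firmware drops"
fi
```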

Version-Release number of selected component (if applicable):

2.6.18-128.23.2.el5

How reproducible:
Every Thursday and Friday night under modest network load.

Stressing with backups / Informix sync to another server (separately and together) fails to duplicate the issue.

Steps to Reproduce: None are successful
Actual results:


Expected results:


Additional info: We are at the point of changing the IP or rebuilding the server; as this is a production server, this is not desired.

Note: per other bugs read, both servers are at the same OS level (firmware on the NICs differs; s10011 works, s10014 is the server with the issue).

dmidecode | grep Product
        Product Name: IBM 3850 M2 / x3950 M2 -[7233AC1]-

[root@houic-n-s10014 log]# lspci -v | grep Broadcom
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

[root@houic-n-s10011 ~]# lspci -v | grep Broadcom
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 01)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 01)

on 14
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: IBM Unknown device 037c
        Flags: bus master, fast devsel, latency 0, IRQ 193
        Memory at f4000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number 5a-77-c4-fe-ff-5e-21-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel

on 11
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 01)
        Subsystem: IBM Unknown device 037c
        Flags: bus master, fast devsel, latency 0, IRQ 58
        Memory at f2000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable+
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=8
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number ac-c1-35-fe-ff-64-1a-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel
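The meaningful difference between the two listings above is the interrupt capability lines. A small sketch of pulling just the MSI-X enable bit out of that text (the excerpt is pasted from the s10014 listing above; on a live system you would pipe `lspci -vv -s 02:00.0` instead):

```shell
#!/bin/sh
# Excerpt of the s10014 lspci output above.
lspci_out='        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9'

# Grab the MSI-X line; Enable+ means enabled, Enable- means disabled.
msix=$(printf '%s\n' "$lspci_out" | grep 'MSI-X')
case "$msix" in
    *Enable+*) state="enabled"  ;;
    *Enable-*) state="disabled" ;;
    *)         state="unknown"  ;;
esac
echo "MSI-X is $state"
```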
Comment 1 Lynn 2012-03-07 14:03:00 EST
This has been happening since March 2011; it was initially ignored and the server rebooted.

We upgraded the server from 4.X to 5.3 in November 2011 and migrated from SAN storage to NAS. This change did not stop the situation; the middleware team has this same configuration in dev and test environments, which does not produce a hang or require a reboot to bring the server back onto the network quickly.
Comment 2 yuping zhang 2012-03-13 09:17:17 EDT
Hi Lynn,

Thanks for filing this bug.
Is the server running your RHEL 5.3 OS? Do you use virt-manager?
You filed this issue with the virt-manager component, but I didn't find any information about virt-manager.

Thanks
Yuping
Comment 3 Cole Robinson 2012-04-02 16:53:50 EDT
Since there is no mention of virt, I'm assuming this was misfiled. Moving to kernel. But if this is a virt issue, it should probably be filed against qemu-kvm.

Lynn, please provide the info requested in comment #2
Comment 4 Lynn 2012-04-02 17:28:18 EDT
#2: Grid-configured Informatica server(s), set active/active, with an NFS filesystem on the filer going unavailable and console access only.
Comment 6 John Feeney 2012-10-15 14:09:21 EDT
I am pretty sure this was fixed in RHEL5.6 where a fix was added to the bnx2
driver to avoid the situation where the 5709 NIC would lock up. See
errata http://errata.devel.redhat.com/errata/show/9700

That errata fixed a situation, reported by a number of customers, where a 5709
NIC with MSI-X enabled would periodically lock up when a specific set of
events occurred simultaneously. This sounds pretty much like what you have experienced.

I am sorry you had to wait for this info, but I just noticed this bugzilla.
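One quick way to tell whether a given host predates that fix is to compare its kernel against the RHEL 5.6 baseline. A sketch, assuming the RHEL 5.6 kernel carrying the bnx2 fix is 2.6.18-238.el5 (on a live host `running` would come from `uname -r`):

```shell
#!/bin/sh
# Kernel from this report (RHEL 5.3 stream); on a live host: running=$(uname -r)
running="2.6.18-128.23.2.el5"
# Assumed RHEL 5.6 baseline kernel carrying the bnx2 fix.
fixed="2.6.18-238.el5"

# sort -V orders version strings; if the running kernel sorts first and
# differs from the baseline, it predates the fix.
oldest=$(printf '%s\n%s\n' "$running" "$fixed" | sort -V | head -n1)
if [ "$oldest" = "$running" ] && [ "$running" != "$fixed" ]; then
    echo "kernel $running predates the RHEL 5.6 bnx2 fix"
else
    echo "kernel $running includes the RHEL 5.6 bnx2 fix"
fi
```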
Comment 7 John Feeney 2013-01-07 12:40:10 EST
Any update on this?
Comment 8 Lynn 2013-01-07 14:55:56 EST
Yes; a new note: an SAP data dump to this server is impacting read/write disk activity and causing a serious increase in paging, to 35k+/second.

The end user has been asked to purchase memory and/or run only this single DB instance per server to raise performance to a supportable level.

Case can be closed.

Now on 5.7, but this is a product of Informatica usage and consolidation.

Can be closed.
Comment 9 John Feeney 2013-01-10 14:51:48 EST
Per comment #8, I am closing this bugzilla as CLOSED DUPLICATE of bz511368.

*** This bug has been marked as a duplicate of bug 511368 ***
