| Summary: | Server becomes unresponsive on network, recovers after several hours | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Lynn <ljhunt> |
| Component: | kernel | Assignee: | John Feeney <jfeeney> |
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 5.3 | CC: | gkong, jfeeney, jwu, mzhan, rwu, yupzhang, zpeng |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2013-01-10 19:51:48 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
This has been happening since March 2011; initially it was ignored and the server was rebooted. We upgraded the server from 4.X to 5.3 in November 2011 and migrated from SAN storage to NAS. This change did not stop the problem. The middleware team has this same configuration in dev and test environments, which does not produce a hang or require a reboot to bring the server back onto the network quickly.

Hi Lynn, Thanks for filing this bug. Is the server your RHEL 5.3 OS? Do you use virt-manager? You filed this issue against the virt-manager component, but I didn't find any information about virt-manager. Thanks, Yuping

Since there is no mention of virt, I'm assuming this was misfiled. Moving to kernel. But if this is a virt issue, it should probably be filed against qemu-kvm. Lynn, please provide the info requested in comment #2.

#2 A Grid-configured Informatica server(s), set active/active with an NFS filesystem on the NFS, going unavailable with console access only.

I am pretty sure this was fixed in RHEL 5.6, where a fix was added to the bnx2 driver to avoid the situation where the 5709 NIC would lock up. See errata http://errata.devel.redhat.com/errata/show/9700 That errata fixed a situation reported by a number of customers where a 5709 NIC with MSI-X enabled would periodically lock up when a specific set of events occurred simultaneously. This sounds pretty much like what you have experienced. I am sorry you had to wait for this info, but I just noticed this bugzilla.

Any update on this?

Yes, there is a new note of a SAP data dump to this server impacting read/write disk activity and causing a serious increase in paging to 35k+/second. The end user has been requested to purchase memory and/or have only this single db instance per server to increase performance to a suitable level of support. Case can be closed. Now on 5.7, but this is a product of Informatica usage and consolidation.

Can be closed.

Per comment #8, I am closing this bugzilla as CLOSED DUPLICATE of bz511368.

*** This bug has been marked as a duplicate of bug 511368 ***
Description of problem:
Server becomes unresponsive twice a week. We have ruled out the application. This has been ongoing since November, and I am opening this on behalf of the Linux/UNIX organization. The server (s10014) is available by console only and normally recovers on its own in 3-4 hours.

To date:
1. The rx_fw_discards counter (ethtool -S eth0) is 0 and eth0 is currently running in APIC mode, hence we can be certain that we are not being hit by the MSI-X bug.
2. Network utilization is not too high, only about 25 – 30% of bandwidth utilized during 4:18 – 4:21 am. The issue started occurring at 4:33 – 4:45; during this period the network utilization was very low.
3. This issue apparently affects the network connectivity to all mounted NFS filesystems; in this case, all 3 filers were not accessible (ushovfstd017, ushovfstd010, ushovfsep153).
4. Three minutes before the NFS issue occurs, the IBM Tivoli DNS monitoring script reported errors with the DNS. The majority of the servers in Houston did not exhibit this error during the time frame, though I have identified 7 other servers that experience similar DNS errors, albeit in different time frames.
5. There was a spike in TCP/UDP socket utilization (between 4:20 – 4:40 am), though the utilization was not high (about 532 sockets used during the incident).
6. No network errors or collisions were reported on the network interface.
7. There was a spike in NFS traffic between 4:20 – 4:30 am, though the utilization was not high (compared to Oracle servers).
8. Memory utilization increased by 26% between 4:10 – 4:20 am, possibly due to pmdtm jobs running at the time.
9. Overall CPU utilization increased by 10 – 15% between 4:10 – 05:00 am.
10. No network interface flapping (NIC cards replaced twice and routers changed, with no impact on this issue).
11. S10011 is in the Grid (Active:Active configuration) with this server and does not show any of these issues.
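The counter check in item 1 could be repeated around each incident window rather than read once. Below is a minimal Python sketch, assuming the usual `name: value` layout of `ethtool -S` output. The helper names and the sample text are hypothetical and for illustration only; the sample is not captured from s10014.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` style output into a {counter: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so the counter name stays intact.
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            # Header lines such as "NIC statistics:" carry no number.
            continue
    return stats


def counter_increased(before, after, counter="rx_fw_discards"):
    """Return True if the counter grew between two snapshots."""
    return after.get(counter, 0) > before.get(counter, 0)


# Hypothetical snapshots, e.g. saved from `ethtool -S eth0` before and
# after the 4:30 am window. The values are made up for the example.
SAMPLE_BEFORE = """NIC statistics:
     rx_bytes: 1000
     rx_fw_discards: 0
"""
SAMPLE_AFTER = """NIC statistics:
     rx_bytes: 5000
     rx_fw_discards: 0
"""

before = parse_ethtool_stats(SAMPLE_BEFORE)
after = parse_ethtool_stats(SAMPLE_AFTER)
print("rx_fw_discards increased:", counter_increased(before, after))
```

Comparing two snapshots instead of a single reading would show whether the counter moves during the hang itself, which a one-off reading of 0 cannot rule out.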
Version-Release number of selected component (if applicable):
2.6.18-128.23.2.el5

How reproducible:
Every Thursday and Friday night under modest network load. Stressing with backups / Informix sync to another server (separately and together) fails to duplicate the issue.

Steps to Reproduce:
None are successful.

Actual results:

Expected results:

Additional info:
We are at the point of changing the IP or rebuilding the server; as this is a production server, this is not desired.

Note from other bugs read: both servers are at the same OS (the firmware on the NICs is different; 11 works, 14 is the server with the issues).

    dmidecode | grep Product
    Product Name: IBM 3850 M2 / x3950 M2 -[7233AC1]-

    [root@houic-n-s10014 log]# lspci -v | grep Broadcom
    02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

    [root@houic-n-s10011 ~]# lspci -v | grep Broadcom
    02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 01)
    02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 01)

on 14

    02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: IBM Unknown device 037c
        Flags: bus master, fast devsel, latency 0, IRQ 193
        Memory at f4000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number 5a-77-c4-fe-ff-5e-21-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel

on 11

    02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 01)
        Subsystem: IBM Unknown device 037c
        Flags: bus master, fast devsel, latency 0, IRQ 58
        Memory at f2000000 (64-bit, non-prefetchable) [size=32M]
        Capabilities: [48] Power Management version 3
        Capabilities: [50] Vital Product Data
        Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable+
        Capabilities: [a0] MSI-X: Enable- Mask- TabSize=8
        Capabilities: [ac] Express Endpoint IRQ 0
        Capabilities: [100] Device Serial Number ac-c1-35-fe-ff-64-1a-00
        Capabilities: [110] Advanced Error Reporting
        Capabilities: [150] Power Budgeting
        Capabilities: [160] Virtual Channel
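The two `lspci -v` listings differ in their interrupt capability flags, which is easy to miss by eye. The Python sketch below pulls out the `Enable+`/`Enable-` flags for MSI and MSI-X so the two hosts can be compared. The function and pattern names are hypothetical; the capability lines are copied from the listings in this report. The "legacy" label assumes a device with neither capability enabled falls back to conventional line-based interrupts, which would match the APIC mode mentioned for s10014.

```python
import re

# Older lspci prints "Message Signalled Interrupts:"; accepting the shorter
# "MSI:" label as well is an assumption about newer lspci versions.
MSI_RE = re.compile(r"(?:Message Signalled Interrupts|MSI):.*?\bEnable([+-])")
MSIX_RE = re.compile(r"MSI-X:.*?\bEnable([+-])")


def interrupt_mode(lspci_text):
    """Classify a device's interrupt mode from its `lspci -v` capability lines."""
    msix = MSIX_RE.search(lspci_text)
    msi = MSI_RE.search(lspci_text)
    if msix and msix.group(1) == "+":
        return "msix"
    if msi and msi.group(1) == "+":
        return "msi"
    return "legacy"


# Capability lines copied from the listings in this report.
S10014_CAPS = """\
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/4 Enable-
Capabilities: [a0] MSI-X: Enable- Mask- TabSize=9
"""
S10011_CAPS = """\
Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable+
Capabilities: [a0] MSI-X: Enable- Mask- TabSize=8
"""

for host, caps in (("s10014", S10014_CAPS), ("s10011", S10011_CAPS)):
    print(host, interrupt_mode(caps))
```

On these inputs the failing server (s10014) has both MSI and MSI-X disabled, while the working server (s10011) has MSI enabled. MSI-X is off on both hosts.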