Bug 606453

Summary: Cluster not reacting to network failure
Product: Red Hat Enterprise Linux 5
Reporter: Steve Reichard <sreichar>
Component: cman
Assignee: Lon Hohberger <lhh>
Status: CLOSED WONTFIX
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 5.5
CC: cluster-maint, edamato, jkortus, twilkins
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2011-03-18 15:58:33 UTC

Description Steve Reichard 2010-06-21 16:15:28 UTC
Description of problem:

Our cluster has experienced some network failures (BZs 601637, 602402, 602694).

With these failures we see a loss of network connectivity; however, ethtool still shows the link as connected.

Our NFS service did not attempt to relocate.


Version-Release number of selected component (if applicable):



[root@mgmt1 vmconfig]#  uname -a
Linux mgmt1.cloud.lab.eng.bos.redhat.com 2.6.18-194.3.1.el5 #1 SMP Sun May 2 04:17:42 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[root@mgmt1 vmconfig]#  yum list  ricci cman rgmanager modcluster
Loaded plugins: rhnplugin, security
Installed Packages
cman.x86_64                                                                   2.0.115-34.el5_5.1                                                              installed
modcluster.x86_64                                                             0.12.1-2.el5                                                                    installed
rgmanager.x86_64                                                              2.0.52-6.el5                                                                    installed
ricci.x86_64                                                                  0.12.2-12.el5                                                                   installed
[root@mgmt1 vmconfig]# 


How reproducible:

Reproducible each time we see our network failures.


Comment 1 Jaroslav Kortus 2010-06-21 16:23:01 UTC
Can you please attach some more info? Like cluster configuration, log files, cman_tool status, cman_tool services...?

Maybe the network disruption was so severe that quorum was lost, and therefore no relocation took place.

Comment 2 Steve Reichard 2010-06-21 18:01:49 UTC
First, I should have mentioned that I have discussed this a bit with Lon.

When this was first seen, the cluster interconnect remained running; only the public network went down. A clustat therefore showed the cluster as up. However, we had an NFS service with a monitored IP address, so I expected it to relocate to the node with the working public network.

When Lon investigated with me, he mentioned that he thought the monitoring currently only checks the ethtool link status. The link remains up, but pings fail.


The cluster is not currently in this condition, so I cannot provide current files or status dumps.

Comment 3 Lon Hohberger 2010-09-30 20:13:43 UTC
To resolve this, we could check the route on the affected interface and ping the gateway. If there is no gateway, then we can't deterministically report the failure.
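
The check described above could be sketched roughly as follows. This is an illustration only, not rgmanager code: the function names default_gateway and gateway_reachable are hypothetical, it reads the kernel routing table from /proc/net/route, and it assumes a little-endian host such as x86_64 and an iputils-style ping. A return value of None corresponds to the "no gateway, cannot decide" case.

```python
import socket
import struct
import subprocess

RTF_GATEWAY = 0x2  # kernel route flag: destination is reached via a gateway


def default_gateway(route_table, iface):
    """Return iface's default gateway from /proc/net/route text, or None."""
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[0] != iface:
            continue
        dest, gateway, flags = fields[1], fields[2], int(fields[3], 16)
        if dest == "00000000" and flags & RTF_GATEWAY:
            # Assumption: addresses are little-endian hex (true on x86_64).
            return socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
    return None


def gateway_reachable(iface, count=1, deadline=2):
    """True or False from ping; None if iface has no default gateway."""
    with open("/proc/net/route") as f:
        gw = default_gateway(f.read(), iface)
    if gw is None:
        return None  # no gateway on this interface: failure cannot be decided
    cmd = ["ping", "-c", str(count), "-w", str(deadline), "-I", iface, gw]
    rc = subprocess.call(cmd, stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    return rc == 0
```

A caller would treat False as "link is up but the gateway is unreachable" and None as "no opinion", falling back to the existing link check in that case.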

Comment 7 Lon Hohberger 2011-03-18 15:58:33 UTC
We can't do much else automatically from within the cluster software; we tend to rely on things like Ethernet link reporting working. However, on clusters where ethtool reports the link as up but packet delivery is failing, users can use qdiskd with a ping heuristic:

<quorumd ...>
  <heuristic program="ping -c1 -w2 <upstream_router>" tko="5" />
</quorumd>

Alternatively, users may use the watchdog daemon (see the ping option in watchdog.conf(5) man page).
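
As a rough illustration of the watchdog approach, a fragment of /etc/watchdog.conf might look like the following. The address 192.0.2.1 is a placeholder for the upstream router, not a value from this cluster. Note that the watchdog daemon responds to a failed check by rebooting the node rather than relocating a single service, so this is a blunter tool than a qdiskd heuristic.

```
# /etc/watchdog.conf (fragment)
# Placeholder address: replace with the upstream router on the public network.
ping = 192.0.2.1
```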