Red Hat Bugzilla – Bug 606453
Cluster not reacting to network failure
Last modified: 2011-03-18 11:58:33 EDT
Description of problem:
Our cluster has experienced several network failures (BZs 601637, 602402, 602694).
During these failures we lose network connectivity, yet ethtool still shows the link as connected.
Our NFS service did not attempt to relocate.
Version-Release number of selected component (if applicable):
[root@mgmt1 vmconfig]# uname -a
Linux mgmt1.cloud.lab.eng.bos.redhat.com 2.6.18-194.3.1.el5 #1 SMP Sun May 2 04:17:42 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
[root@mgmt1 vmconfig]# yum list ricci cman rgmanager modcluster
Loaded plugins: rhnplugin, security
cman.x86_64 2.0.115-34.el5_5.1 installed
modcluster.x86_64 0.12.1-2.el5 installed
rgmanager.x86_64 2.0.52-6.el5 installed
ricci.x86_64 0.12.2-12.el5 installed
Reproducible each time we see our network failures.
Steps to Reproduce:
Can you please attach some more info, such as the cluster configuration, log files, and the output of cman_tool status and cman_tool services?
Perhaps the network disruption was severe enough that quorum was lost, and therefore no relocation took place.
First, I should have mentioned that I have discussed this a bit with Lon.
When this was first seen, the cluster interconnect remained running; only the public network went down. A clustat would therefore show that the cluster was up, but we had an NFS service with a monitored IP address, so I expected it to relocate to the node with the working public network.
When Lon investigated with me, he mentioned that he thought the monitoring currently checks only the ethtool link status. The link remains up, but pings fail.
I am not currently in this condition, so I cannot provide current files or status dumps.
To resolve this, we could look up the route on the affected interface and ping its gateway. If there is no gateway, then we can't deterministically report the failure.
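
As a rough illustration of the gateway check proposed above, the following shell sketch looks up the default gateway on one interface and pings it. It is not rgmanager code: the function names `gateway_from_routes` and `check_gateway`, and the exit-code convention (0 = reachable, 1 = unreachable, 2 = no gateway, so the result is unknown), are assumptions made for this example. It assumes `ip` from iproute2 and a `ping` that supports `-c`, `-w` and `-I`.

```shell
#!/bin/sh
# Sketch only: function names and exit codes are hypothetical,
# not part of rgmanager or any cluster resource agent.

# Read `ip route` output on stdin and print the gateway of the first
# default route. Prints nothing when there is no default route.
gateway_from_routes() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "via") { print $(i + 1); exit }
    }'
}

# check_gateway IFACE
# Returns 0 if the gateway behind IFACE answers one ping,
#         1 if a gateway exists but does not answer,
#         2 if IFACE has no default gateway (failure cannot be determined).
check_gateway() {
    iface=$1
    gw=$(ip -4 route show dev "$iface" 2>/dev/null | gateway_from_routes)
    [ -n "$gw" ] || return 2
    ping -c1 -w2 -I "$iface" "$gw" >/dev/null 2>&1 || return 1
    return 0
}
```

A caller could run `check_gateway eth0` (the interface name is a placeholder) and treat only exit code 1 as a failure, which keeps the "no gateway" case from triggering a relocation.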
We can't do much else from within the cluster software automatically; we tend to rely on things like ethernet link reporting working. However, on clusters where ethtool reports the link as up but packet delivery is failing, users can add a qdiskd heuristic:
<heuristic program="ping -c1 -w2 <upstream_router>" tko="5" />
Alternatively, users may use the watchdog daemon (see the ping option in watchdog.conf(5) man page).