Red Hat Bugzilla – Bug 459731
Fenced node rejoins the cluster and fails abruptly without reason
Last modified: 2009-04-16 18:30:25 EDT
Created attachment 314735 [details]
Contains /var/log/messages files, cluster configuration & automation script
Description of problem:
In a 3-node cluster with GFS1, the fenced node (fenced because of a heartbeat interface failure) rejoins the cluster and then fails without any apparent reason.
The network failure is simulated by bringing down the network interface on one node (rum). As expected, one of the other two nodes (vodka/beer) fences rum via HP iLO: the node is powered off and then powered back on. During reboot, the fenced node starts cman, clvmd, and gfs and successfully rejoins the other nodes, forming a 3-node cluster again.
Then, without any apparent reason, the node fails, as indicated by rum's /var/log/messages (timestamped around 23:35:35). The logs on vodka and beer show that they have lost the member rum and fence it again.
Version-Release number of selected component (if applicable):
- RHEL 5 update 1 with RHCS and GFS modules
- Kernel 2.6.18-53.el5
- It is not always reproducible, but it occurred three times in a 10 CHO. The script used for the test is attached.
Steps to Reproduce:
A network failure is simulated on a randomly selected node (among the 3 cluster nodes) by bringing down its heartbeat interface. Once the interface is down, the other nodes wait until the node is fenced and rejoins the cluster. Then the above sequence repeats all over again. Each node runs the script (attached), which is started at boot time (via init.d) after cman, clvmd, and gfs are started.
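The attached script is not reproduced in this report, but the loop it describes can be sketched roughly as below. This is only a sketch: the interface name (HB_IF), the command variable (IFDOWN_CMD), and the delay bound (SLEEP_MAX) are assumptions, not taken from the real script.

```shell
#!/bin/bash
# Hedged sketch of the kind of test loop described above; the real
# attached script may differ. HB_IF, IFDOWN_CMD, and SLEEP_MAX are
# illustrative names, not from the actual script.
HB_IF=${HB_IF:-eth1}              # assumed heartbeat interface
IFDOWN_CMD=${IFDOWN_CMD:-ifdown}  # set to "echo ifdown" for a dry run
SLEEP_MAX=${SLEEP_MAX:-60}        # assumed upper bound on the random delay

fail_heartbeat() {
    # Wait a random 1..SLEEP_MAX seconds so the three nodes rarely act
    # at once, then drop the heartbeat interface; the surviving nodes
    # should detect the lost heartbeat and fence this node.
    sleep $(( (RANDOM % SLEEP_MAX) + 1 ))
    $IFDOWN_CMD "$HB_IF"
}
```

Running fail_heartbeat on each node after cman/clvmd/gfs start would give a loop of fence/reboot/rejoin cycles like the one in the report.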
Yes, this is what happens
Aug 17 23:31:50 rum openais: [MAIN ] Node rum.ind.hp.com not joined to cman because it has rejoined an inquorate cluster
If you disconnect a system from the cluster it will be fenced. However it is possible that the outage lasts less time than it takes for the fencing agent to do its work.
If a node rejoins a cluster that has existing state (DLM, GFS for example) then it doesn't know what state has changed while it was 'away', and cannot be merged back into the cluster. So the remaining nodes (if they still have quorum) will remove it from the cluster. This is effectively the same as it being fenced, it just looks slightly messier.
(In reply to comment #1)
> Yes, this is what happens
> Aug 17 23:31:50 rum openais: [MAIN ] Node rum.ind.hp.com not joined to
> cman because it has rejoined an inquorate cluster
> If you disconnect a system from the cluster it will be fenced. However it is
> possible that the outage lasts less time than it takes for the fencing agent to
> do its work.
> If a node rejoins a cluster that has existing state (DLM, GFS for example) then
> it doesn't know what state has changed while it was 'away', and cannot be
> merged back into the cluster. So the remaining nodes (if they still have
> quorum) will remove it from the cluster. This is effectively the same as it
> being fenced, it just looks slightly messier.
From your message I understood that:
1) The node (rum), after being fenced, attempted to join an inquorate cluster (a cluster that has no quorum) and hence was fenced again by the other nodes.
2) Node rum had a temporary network outage (one that lasted less time than it takes the fencing agent to fence a node). When the network was restored and GFS/CLVM state was found to be still active, one of the other nodes fenced it.
Q1. Which of the above is the reason for node (rum) failing just after it rebooted following the fence?
Q2. Shouldn't there be a log statement on vodka or beer like "fenced[xxxx]: fencing rum.ind.hp.com"? However, I did not find any such logs on vodka or beer at the time rum failed, in either case 1) or 2), to indicate rum was being fenced. I only found such logs from when the node was fenced the first time.
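One quick way to check question 2 on the surviving nodes is to grep their logs for fence events. A minimal sketch, assuming the standard /var/log/messages path and the usual fenced daemon message format (neither is taken from the attached logs):

```shell
# Hedged sketch: the log path and the fenced message format are
# assumptions based on typical RHCS logging, not from this report.
find_fence_events() {
    # Print any fenced daemon "fencing ..." lines from the given log
    # file (defaults to /var/log/messages).
    grep -E 'fenced\[[0-9]+\]: fencing' "${1:-/var/log/messages}" 2>/dev/null || true
}
```

If this turns up nothing around the second failure, that would support the idea that the second removal happened via cman membership/quorum handling rather than via the fence daemon.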