Bug 459731 - Fenced node rejoins the cluster and fails abruptly without reason
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.1
Hardware: i686 Linux
Priority: medium, Severity: high
Target Milestone: rc
Assigned To: Christine Caulfield
QA Contact: Cluster QE
Depends On:
Blocks:
Reported: 2008-08-21 13:35 EDT by Rajeev
Modified: 2009-04-16 18:30 EDT
4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-08-22 03:24:05 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
Contains /var/log/messages files, cluster configuration & automation script (77.40 KB, text/plain)
2008-08-21 13:35 EDT, Rajeev

Description Rajeev 2008-08-21 13:35:47 EDT
Created attachment 314735 [details]
Contains /var/log/messages files, cluster configuration & automation script

Description of problem:
In a 3-node cluster running GFS1, a node that was fenced (because of a heartbeat interface failure) rejoins the cluster and then fails without any apparent reason.

The network failure is simulated by bringing down the network interface on one node (rum). As expected, one of the other two nodes (vodka/beer) fences rum via HP iLO, powering the node off and then back on. During reboot, the fenced node starts cman, clvmd, and gfs, and successfully rejoins the other nodes to form a 3-node cluster again.

Then, for no apparent reason, the node fails, as indicated by rum's /var/log/messages (timestamped around 23:35:35). The logs of vodka and beer show that they have lost member rum and fence it again.


Version-Release number of selected component (if applicable):
- RHEL 5 update 1 with RHCS and GFS modules
- Kernel 2.6.18-53.el5

How reproducible:
- It is not always reproducible, but it occurred three times in a 10 CHO. The script used for the test is attached.

Steps to Reproduce:
A network failure is simulated on a randomly selected node (among the 3 cluster nodes) by bringing down the heartbeat interface. Once the interface is down, the other nodes wait until the node is fenced and rejoins the cluster. Then the above sequence repeats all over again. Each node runs the script (attached), which is started at boot time (via init.d) after cman, clvmd, and gfs are started.
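The core of the loop above could be sketched roughly as follows. This is not the attached script: the interface name and the dry-run guard are assumptions added here so the sketch is safe to read and run outside a cluster.

```shell
#!/bin/sh
# Rough sketch of the reproduction step described above; the real
# script is in the attachment. IFACE is an assumed interface name.
IFACE=${IFACE:-eth1}
DRY_RUN=${DRY_RUN:-1}   # keep 1 unless running on a disposable test node

run() {
    # Echo instead of executing when DRY_RUN=1.
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Bring the heartbeat interface down; one of the surviving nodes
# should then fence this node via HP iLO (power off, then power on).
run ifdown "$IFACE"

# After the fenced node reboots, cman, clvmd, and gfs start from
# init.d, the node rejoins, this script starts again at boot, and the
# cycle repeats.
```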
Comment 1 Christine Caulfield 2008-08-22 03:24:05 EDT
Yes, this is what happens

Aug 17 23:31:50 rum openais[2672]: [MAIN ] Node rum.ind.hp.com not joined to cman because it has rejoined an inquorate cluster 

If you disconnect a system from the cluster it will be fenced. However it is possible that the outage lasts less time than it takes for the fencing agent to do its work. 

If a node rejoins a cluster that has existing state (DLM, GFS for example) then it doesn't know what state has changed while it was 'away', and cannot be merged back into the cluster. So the remaining nodes (if they still have quorum) will remove it from the cluster. This is effectively the same as it being fenced, it just looks slightly messier.
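The "if they still have quorum" condition holds here: with one vote per node, quorum is a simple majority, so the two surviving nodes keep quorum and can remove rum. A minimal sketch of the arithmetic (the vote count matches this 3-node cluster; on a live node, `cman_tool status` reports the same figure in its "Quorum:" line):

```shell
# Quorum arithmetic for a cluster with one vote per node:
# quorum = floor(total_votes / 2) + 1
TOTAL_VOTES=3
QUORUM=$(( TOTAL_VOTES / 2 + 1 ))
echo "quorum=$QUORUM"   # 2 of 3 votes, so vodka + beer retain quorum
```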
Comment 2 Rajeev 2008-08-25 14:04:47 EDT
(In reply to comment #1)
> Yes, this is what happens
> Aug 17 23:31:50 rum openais[2672]: [MAIN ] Node rum.ind.hp.com not joined to
> cman because it has rejoined an inquorate cluster 
> If you disconnect a system from the cluster it will be fenced. However it is
> possible that the outage lasts less time than it takes for the fencing agent to
> do its work. 
> If a node rejoins a cluster that has existing state (DLM, GFS for example) then
> it doesn't know what state has changed while it was 'away', and cannot be
> merged back into the cluster. So the remaining nodes (if they still have
> quorum) will remove it from the cluster. This is effectively the same as it
> being fenced, it just looks slightly messier.

From your message I understood that:

1) The node (rum), after being fenced, attempted to join an inquorate cluster (a cluster that has no quorum) and hence was fenced again by the other nodes.

2) Node rum had a temporary network outage (which lasted less time than it takes for the fencing agent to fence a node). When the network was restored and it was found that GFS/CLVM state was still active, one of the other nodes fenced it.

Q1. Which of the above is the reason for node rum failing just after it rebooted following the fence?

Q2. Shouldn't there be a log statement in vodka or beer such as "fenced[xxxx]: fencing rum.ind.hp.com"? However, I did not find any such log in vodka or beer at the time rum failed to indicate that rum was being fenced, in either case 1) or 2). I only found such logs when the node was fenced the first time.
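One way to check for such messages is to grep the surviving nodes' logs for the fenced daemon. The sketch below is self-contained for illustration: the sample log lines are fabricated here (following the message format quoted above), whereas on a real node the grep would target /var/log/messages directly.

```shell
#!/bin/sh
# Illustrative check for fence messages on vodka/beer. The sample log
# content below is made up for demonstration purposes.
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
Aug 17 23:10:02 vodka fenced[2810]: fencing rum.ind.hp.com
Aug 17 23:35:40 vodka openais[2672]: [CMAN ] lost contact with rum.ind.hp.com
EOF

# Only a fence carried out by the fenced daemon produces a fenced[...]
# line; a removal handled inside cman/openais (as in comment 1) leaves
# no second "fencing" message, which matches the observation in Q2.
COUNT=$(grep -c 'fenced\[' "$SAMPLE")
echo "fence messages: $COUNT"
rm -f "$SAMPLE"
```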
