Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 459731

Summary: Fenced node rejoins the cluster and fails abruptly without reason
Product: Red Hat Enterprise Linux 5
Reporter: Rajeev <rajpurush04>
Component: cman
Assignee: Christine Caulfield <ccaulfie>
Status: CLOSED NOTABUG
QA Contact: Cluster QE <mspqa-list>
Severity: high
Docs Contact:
Priority: medium
Version: 5.1
CC: cluster-maint, edamato, rajpurush04, rick.stern
Target Milestone: rc
Target Release: ---
Hardware: i686
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-08-22 07:24:05 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: Contains /var/log/messages files, cluster configuration & automation script
Flags: none

Description Rajeev 2008-08-21 17:35:47 UTC
Created attachment 314735 [details]
Contains /var/log/messages files, cluster configuration & automation script

Description of problem:
In a 3-node cluster running GFS1, the fenced node (fenced because of a heartbeat interface failure) rejoins the cluster and then fails without any apparent reason.

The network failure is simulated by bringing down the network interface on a node (rum). As expected, one of the other two nodes (vodka/beer) fences rum via HP iLO: the node is powered off and then powered back on. During reboot, the fenced node starts cman, clvmd, and gfs and successfully rejoins the other nodes, re-forming a 3-node cluster.

Then, without any apparent reason, the node fails, as indicated by rum's /var/log/messages (timestamped around 23:35:35). The logs for vodka and beer show that they have lost member rum and fence it again.
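The actual cluster configuration is in the attachment and is not reproduced here; a minimal cluster.conf along the following lines (cluster name, method/device names, and the iLO address and credentials are all placeholders, not taken from the report) would produce the HP iLO fencing behavior described above:

```xml
<?xml version="1.0"?>
<cluster name="testcluster" config_version="1">
  <clusternodes>
    <clusternode name="rum.ind.hp.com" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="ilo-rum"/>
        </method>
      </fence>
    </clusternode>
    <!-- vodka and beer would be defined the same way,
         each with its own iLO fence device -->
  </clusternodes>
  <fencedevices>
    <!-- fence_ilo powers the node off and back on via HP iLO -->
    <fencedevice agent="fence_ilo" name="ilo-rum"
                 hostname="ILO-ADDRESS" login="LOGIN" passwd="PASSWORD"/>
  </fencedevices>
</cluster>
```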


Version-Release number of selected component (if applicable):
- RHEL 5 update 1 with RHCS and GFS modules
- Kernel 2.6.18-53.el5

How reproducible:
- It is not always reproducible, but it occurred three times in a 10 CHO run. The script used for the test is attached.

Steps to Reproduce:
A network failure is simulated on a randomly selected node (among the 3 cluster nodes) by bringing down its heartbeat interface. Once the interface is down, the other nodes wait until the failed node is fenced and rejoins the cluster. Then the above sequence repeats all over again. Each node runs the attached script, which is started at boot time (i.e., from init.d) after cman, clvmd, and gfs are started.

Comment 1 Christine Caulfield 2008-08-22 07:24:05 UTC
Yes, this is what happens:

Aug 17 23:31:50 rum openais[2672]: [MAIN ] Node rum.ind.hp.com not joined to cman because it has rejoined an inquorate cluster 

If you disconnect a system from the cluster it will be fenced. However it is possible that the outage lasts less time than it takes for the fencing agent to do its work. 

If a node rejoins a cluster that has existing state (DLM, GFS for example) then it doesn't know what state has changed while it was 'away', and cannot be merged back into the cluster. So the remaining nodes (if they still have quorum) will remove it from the cluster. This is effectively the same as it being fenced, it just looks slightly messier.
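The "rejoined an inquorate cluster" message comes down to simple vote arithmetic: a partition is quorate only when it holds a majority of the expected votes. A minimal sketch of that rule (function names are hypothetical; the real computation lives inside cman/openais):

```python
def quorum(expected_votes: int) -> int:
    """Votes needed for quorum: strictly more than half of expected votes."""
    return expected_votes // 2 + 1

def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """A partition is quorate when it holds at least the quorum of votes."""
    return votes_present >= quorum(expected_votes)

# 3-node cluster, one vote per node: vodka+beer keep quorum after losing rum,
# but rum alone (1 vote) is inquorate -- hence the log message above.
print(is_quorate(2, 3))  # True
print(is_quorate(1, 3))  # False
```

This is why the surviving majority partition (vodka/beer) gets to evict the rejoining node rather than the other way around.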

Comment 2 Rajeev 2008-08-25 18:04:47 UTC
(In reply to comment #1)
> Yes, this is what happens
> Aug 17 23:31:50 rum openais[2672]: [MAIN ] Node rum.ind.hp.com not joined to
> cman because it has rejoined an inquorate cluster 
> If you disconnect a system from the cluster it will be fenced. However it is
> possible that the outage lasts less time than it takes for the fencing agent to
> do its work. 
> If a node rejoins a cluster that has existing state (DLM, GFS for example) then
> it doesn't know what state has changed while it was 'away', and cannot be
> merged back into the cluster. So the remaining nodes (if they still have
> quorum) will remove it from the cluster. This is effectively the same as it
> being fenced, it just looks slightly messier.

From your message I understood that:

1) The node (rum), after being fenced, attempted to join an inquorate cluster (a cluster that has no quorum) and hence was fenced again by the other nodes.

2) Node rum had a temporary network outage (which lasted less than the time it takes for the fencing agent to fence a node). When the network was restored and it was found that GFS/CLVM state was still active, one of the other nodes fenced it.

Q1. Which of the above is the reason for node (rum) failing just after it rebooted, having been fenced?

Q2. Shouldn't there be a log statement in vodka or beer like "fenced[xxxx]: fencing rum.ind.hp.com"? However, I did not find any such logs in vodka or beer at the time rum failed to indicate that rum was being fenced, in either case 1) or 2). I only found such logs when the node was fenced the first time.