Bug 144309

Summary: cman_tool leave remove: not adjusting quorum for continued operation
Product: [Retired] Red Hat Cluster Suite
Component: cman
Version: 4
Hardware: All
OS: Linux
Status: CLOSED NEXTRELEASE
Severity: medium
Priority: medium
Reporter: Derek Anderson <danderso>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Doc Type: Bug Fix
Last Closed: 2005-02-11 18:11:11 UTC

Description Derek Anderson 2005-01-05 20:44:43 UTC
Description of problem:
From the cman_tool manpage under the leave section:
"If this node is to be down for an extended period of time and you
need to keep  the  cluster  running, add the remove option, and the
remaining nodes will recalculate quorum such that activity can continue."

Test is to have a three node cluster up and quorate with no other
services running, then to run 'cman_tool leave remove' on two of them;
expect that the remaining node does not block activity.

Nodes are link-10,link-11,link-12.

### First run 'cman_tool leave remove' on link-10.  
link-10 kernel: CMAN: we are leaving the cluster. Removed
link-11 kernel: <no messages>
link-12 kernel: CMAN: Node link-10 is leaving the cluster, Removed

### Now run 'cman_tool leave remove' on link-11.
link-11 kernel: CMAN: we are leaving the cluster. Removed
link-12 kernel: CMAN: Node link-11 is leaving the cluster, Removed
link-12 kernel: CMAN: quorum lost, blocking activity

### Double check status of remaining node, link-12.
[root@link-12 root]# cat /proc/cluster/status
Protocol version: 4.0.1
Config version: 1
Cluster name: MILTON
Cluster ID: 4812
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 3
Total_votes: 1
Quorum: 2  Activity blocked
Active subsystems: 0
Node addresses: 192.168.44.162
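
The numbers in this status dump are consistent with a simple majority rule. Below is a minimal sketch of that arithmetic. It assumes quorum = floor(expected_votes / 2) + 1, which matches the output above but is not taken from the cman source, and the function names are made up for illustration.

```python
def quorum_for(expected_votes: int) -> int:
    """Majority threshold: more than half of the expected votes (assumed rule)."""
    return expected_votes // 2 + 1


def is_quorate(total_votes: int, expected_votes: int) -> bool:
    """Activity may continue only if the visible votes reach the threshold."""
    return total_votes >= quorum_for(expected_votes)


# Observed on link-12: Expected_votes stayed at 3, so Quorum stayed at 2
# and a single vote is not enough.
print(quorum_for(3), is_quorate(1, 3))   # 2 False

# What the manpage describes for 'leave remove': expected votes shrink as
# nodes are removed, so the last node would need only its own vote.
print(quorum_for(1), is_quorate(1, 1))   # 1 True
```

With Expected_votes still at 3 the lone node can never reach the threshold of 2, which is exactly the "Activity blocked" state reported here.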

Version-Release number of selected component (if applicable):
6.1 RPMS built Wed 15 Dec 2004 01:13:08 PM CST

How reproducible:
Yes.

Steps to Reproduce:
1. 3 node quorate cman cluster
2. remove 2 of the nodes with 'cman_tool leave remove'

Actual results:
Activity blocked on the remaining node due to loss of quorum

Expected results:
Activity not blocked.

Additional info:

Comment 1 Christine Caulfield 2005-01-06 16:41:19 UTC
This was fixed in a checkin on the 16th December - and works for me
with current CVS.

Comment 2 Derek Anderson 2005-01-11 19:54:55 UTC
I am still seeing this with the RPMs built yesterday, Monday January 10.

Comment 3 Christine Caulfield 2005-01-13 14:14:20 UTC
Missed a corner case, sorry.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.47; previous revision: 1.46
done


Comment 4 Derek Anderson 2005-02-10 17:06:33 UTC
Still doesn't appear to be working.  These messages are from link-10;
I ran 'cman_tool leave remove' on link-12 and then link-11 (the order
shown in the log).  15 seconds later activity is blocked.

CMAN: removing node link-12 from the cluster : Removed
Feb 10 11:03:16 link-10 kernel: CMAN: Node link-12 is leaving the cluster, Removed
Feb 10 11:03:16 link-10 kernel: CMAN: removing node link-12 from the cluster : Removed
CMAN: removing node link-11 from the cluster : Removed
Feb 10 11:03:41 link-10 kernel: CMAN: Node link-11 is leaving the cluster, Removed
Feb 10 11:03:41 link-10 kernel: CMAN: removing node link-11 from the cluster : Removed
CMAN: quorum lost, blocking activity
Feb 10 11:03:56 link-10 kernel: CMAN: quorum lost, blocking activity
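
The 15-second gap can be read straight off the syslog timestamps. Here is a small sketch that measures the delay between the last "is leaving the cluster" message and the first "quorum lost" message. The helper names are hypothetical, the log text is copied from the lines above with wrapped lines joined, and the year is taken from the comment date because syslog lines do not carry one.

```python
from datetime import datetime

LOG = """\
Feb 10 11:03:16 link-10 kernel: CMAN: Node link-12 is leaving the cluster, Removed
Feb 10 11:03:41 link-10 kernel: CMAN: Node link-11 is leaving the cluster, Removed
Feb 10 11:03:56 link-10 kernel: CMAN: quorum lost, blocking activity
"""


def parse_time(line: str) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' syslog stamp (year assumed to be 2005)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime("2005 " + stamp, "%Y %b %d %H:%M:%S")


def delay_after_last_leave(log: str) -> float:
    """Seconds from the last leave message to the first quorum-lost message."""
    lines = log.splitlines()
    leaves = [l for l in lines if "is leaving the cluster" in l]
    lost = [l for l in lines if "quorum lost" in l]
    return (parse_time(lost[0]) - parse_time(leaves[-1])).total_seconds()


print(delay_after_last_leave(LOG))  # 15.0
```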

Node  Votes Exp Sts  Name
   1    1    3   M   link-10
   2    1    3   X   link-11
   3    1    3   X   link-12
Protocol version: 5.0.1
Config version: 2
Cluster name: MILTON
Cluster ID: 4812
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 3
Total_votes: 1
Quorum: 2  Activity blocked
Active subsystems: 9
Node addresses: 192.168.44.160
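
For repeated test runs, the status text can be checked mechanically instead of by eye. The following is a rough sketch that parses the "Key: value" lines shown above and reports whether activity is blocked. The function names are made up for illustration. It also assumes a quorate node simply omits the "Activity blocked" suffix, which this report does not show.

```python
STATUS = """\
Protocol version: 5.0.1
Config version: 2
Cluster name: MILTON
Cluster ID: 4812
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 3
Total_votes: 1
Quorum: 2  Activity blocked
Active subsystems: 9
Node addresses: 192.168.44.160
"""


def parse_status(text: str) -> dict:
    """Split each 'Key: value' line at the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(text: str) -> dict:
    """Pull out the vote counts and the blocked flag from a status dump."""
    fields = parse_status(text)
    quorum_field = fields["Quorum"]          # e.g. "2  Activity blocked"
    return {
        "expected_votes": int(fields["Expected_votes"]),
        "total_votes": int(fields["Total_votes"]),
        "quorum": int(quorum_field.split()[0]),
        # Assumption: the suffix is absent when the node is quorate.
        "blocked": "Activity blocked" in quorum_field,
    }


print(summarize(STATUS))
```

On a live node the same text would come from reading /proc/cluster/status, as in the original description.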

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1]

DLM Lock Space:  "data1"                             3   4 run       -
[1]

DLM Lock Space:  "data2"                             5   6 run       -
[1]

GFS Mount Group: "data1"                             4   5 run       -
[1]

GFS Mount Group: "data2"                             6   7 run       -
[1]

Comment 5 Christine Caulfield 2005-02-10 17:28:52 UTC
You've got 2 GFS filesystems mounted, so it shouldn't even start
shutdown on that node. Did you get a failure message from cman_tool leave?

Comment 6 Derek Anderson 2005-02-10 17:43:35 UTC
<puzzled> ? 
 
Yes, I have 2 filesystems mounted on link-10.  I'm not trying to 
shut down cman on link-10, however.  The expectation is that since 
the other two nodes left with "remove" the last node would remain 
quorate and not go into "Activity Blocked" mode. 

Comment 7 Christine Caulfield 2005-02-11 10:20:35 UTC
Apologies, I read that message just before I left and thought it
referred to a different bz.

Looks like a bit more patience would have helped me when testing the
previous fix too. It seems that the transition timer was being set for
a single-node transition when it shouldn't have been. So after the node
had settled down nicely and become self-quorate, the timer kicked in
15 seconds later and spoiled it all.

This checkin should fix it, and it will also get rid of the duplicate
leave message.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.60; previous revision: 1.59
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.44.2.9; previous revision: 1.44.2.8
done


Comment 8 Derek Anderson 2005-02-11 18:11:11 UTC
Fix verified in cman-kernel-2.6.9-18.0.