Bug 146327

Summary:	cman_tool leave simultaneously on all nodes causes the "last" one to hang
Product:	[Retired] Red Hat Cluster Suite	Reporter:	Corey Marthaler <cmarthal>
Component:	cman	Assignee:	Christine Caulfield <ccaulfie>
Status:	CLOSED NEXTRELEASE	QA Contact:	Cluster QE <mspqa-list>
Severity:	medium	Docs Contact:
Priority:	low
Version:	4	CC:	cluster-maint
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2005-03-14 22:34:39 UTC	Type:	---

Description Corey Marthaler 2005-01-26 23:24:53 UTC

Description of problem:
Every node is in the cman cluster:

[root@morph-02 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    5   M   morph-01
   2    1    5   M   morph-03
   3    1    5   M   morph-05
   4    1    5   M   morph-04
   5    1    5   M   morph-02

[root@morph-02 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Cluster-Member
Nodes: 5
Expected_votes: 5
Total_votes: 5
Quorum: 3
Active subsystems: 0
Node addresses: 192.168.44.62

I then do a 'cman_tool leave' on all nodes at the same time and the
cmd on the "last" node hangs.

All nodes but morph-02 are no longer in the cluster:
[root@morph-01 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster
[root@morph-01 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name


[root@morph-03 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster
[root@morph-03 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name


[root@morph-04 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster
[root@morph-04 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name


[root@morph-05 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Not-in-Cluster
[root@morph-05 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name


But morph-02 has a different view:

[root@morph-02 root]# cat /proc/cluster/status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Membership state: Transition-Master
Nodes: 4
Expected_votes: 5
Total_votes: 4
Quorum: 3
Active subsystems: 0
Node addresses: 192.168.44.62

[root@morph-02 root]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    5   X   morph-01
   2    1    5   M   morph-03
   3    1    5   M   morph-05
   4    1    5   M   morph-04
   5    1    5   M   morph-02

...and a still hung cman_tool leave cmd.

All the other nodes spit out the following messages:
Jan 26 17:24:38 morph-01 ccsd[3813]: Unable to connect to cluster infrastructure after 990 seconds.


Version-Release number of selected component (if applicable):
CMAN <CVS> (built Jan 25 2005 15:37:28) installed


How reproducible:
Always

Comment 1 Christine Caulfield 2005-01-27 10:40:56 UTC

How do you manage to do it "at the same time"? 
Every time I try it, most of the nodes won't leave because they are
already doing a transition to remove the first node.

In theory (ahem) this should time out once the last node notices that
the rest have gone away.

Comment 2 Corey Marthaler 2005-01-27 15:39:45 UTC

I open sessions to all nodes and then use the "Send Input to All
Sessions" ability from this window manager under the "View" tab.

I waited quite a while, so I'm not too sure it would time out eventually.
It looked pretty hung, but I could wait and actually see if you wanted
me to.
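
For anyone without a terminal that can broadcast input, one way to approximate "at the same time" is to hold a worker per node at a barrier and release them together. This is only a sketch: `run_together` and the placeholder action are hypothetical, and a real run would have to replace the placeholder with something that actually executes 'cman_tool leave' on each node (for example over ssh, which is an assumption and not what was used in this report).

```python
import threading

def run_together(hosts, action):
    """Call action(host) for every host, releasing all calls at once."""
    barrier = threading.Barrier(len(hosts))
    results = {}

    def worker(host):
        barrier.wait()                 # block until every worker is ready
        results[host] = action(host)   # each worker writes its own key

    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Node names taken from the report above.
hosts = ["morph-01", "morph-02", "morph-03", "morph-04", "morph-05"]

# Placeholder action (hypothetical): it only builds a string and does not
# contact any node.
results = run_together(hosts, lambda host: f"{host}: cman_tool leave")
```

The barrier only lines up the start of the calls on the machine issuing them; network latency to each node still adds some skew.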

Comment 3 Christine Caulfield 2005-01-28 11:24:10 UTC

"Window Manager"? "View Tab"? What are these things of which you
speak? Is that anything like a screen session?

The nearest I can get is screen's 
  :at bench# stuff 'cman_tool leave'\012
which still isn't quick enough to catch the others out.

If you've waited more than a couple of minutes and it's not timed out
then I suspect it's not going to. The worst case is
TRANSITION_RESTARTS*TRANSITION_TIMER (10x15 seconds, 2.5 minutes). So
it looks like the transition timer probably isn't firing.
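
The worst-case figure is plain arithmetic. A minimal sketch, assuming the values quoted in this comment (10 restarts, 15 seconds per timer); the real definitions live in the cman kernel source and are not reproduced here, and the meaning given to each constant in the comments is an assumption.

```python
# Values as quoted in the comment above, not copied from the cman source.
TRANSITION_RESTARTS = 10   # assumed: how many times a transition may restart
TRANSITION_TIMER = 15      # assumed: seconds before a stalled transition restarts

worst_case_seconds = TRANSITION_RESTARTS * TRANSITION_TIMER
worst_case_minutes = worst_case_seconds / 60

print(f"worst case: {worst_case_seconds} s ({worst_case_minutes} min)")
```

A leave that is still hung well past this bound points at the timer not firing rather than at a slow timeout.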

Comment 4 Christine Caulfield 2005-01-28 15:02:47 UTC

Ok, I've managed to reproduce this with a slightly hacked up cnxman.c
(rip the transition check out of the ioctl code).

I need to run some more tests over the weekend. The last node will
still take a couple of minutes to die but it's such an odd
circumstance that I'm not going to lose any sleep over it.

What is really needed here is something like VMS's CLUSTER_SHUTDOWN
option, but that will have to wait.

Comment 5 Christine Caulfield 2005-01-31 10:46:47 UTC

The heartbeat thread didn't take any notice of the "quit_threads" flag,
relying instead on its friends to shut it down. This was not reliable
when we were the last node out of a cluster.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.57; previous revision: 1.56
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.44.2.7; previous revision: 1.44.2.6
done
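
The pattern behind this fix can be illustrated outside the kernel. The following is a userspace sketch in Python, not the code from membership.c: only the name `quit_threads` comes from the comment above, and `heartbeat`, `beats`, and `run_and_stop` are hypothetical. The point is that the periodic loop tests the quit flag itself on every pass, so it exits even when no other thread is left to shut it down.

```python
import threading

quit_threads = threading.Event()   # name borrowed from the comment above
beats = []                         # stands in for heartbeat messages sent

def heartbeat(interval=0.01):
    """Record a beat every `interval` seconds until quit_threads is set."""
    # The loop checks the flag on every pass instead of depending on
    # another thread to wake it or tear it down.
    while not quit_threads.is_set():
        beats.append("hello")
        # wait() acts as an interruptible sleep: it returns as soon as
        # the flag is set, or after `interval` seconds otherwise.
        quit_threads.wait(interval)

def run_and_stop():
    """Start the heartbeat, request shutdown, report whether it exited."""
    worker = threading.Thread(target=heartbeat)
    worker.start()
    quit_threads.set()
    worker.join(timeout=1.0)
    return not worker.is_alive()

stopped = run_and_stop()
```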

Comment 6 Corey Marthaler 2005-02-01 22:05:33 UTC

Still seeing this, although not as often.

Comment 7 Christine Caulfield 2005-02-03 10:27:20 UTC

Take 2. There were places where threads could have been blocked
waiting for things that were just never going to happen.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.48; previous revision: 1.47
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.59; previous revision: 1.58
done

RHEL4 branch:
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.6; previous revision: 1.42.2.5
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v  <--  membership.c
new revision: 1.44.2.8; previous revision: 1.44.2.7
done
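
A wait that can never be satisfied is usually avoided by bounding it and rechecking a shutdown condition between attempts. The sketch below shows that idea in userspace Python; it is an assumption about the general shape of such a fix, not the change made to cnxman.c or membership.c, and `wait_for_reply`, `reply`, and `quit_flag` are hypothetical names.

```python
import threading

def wait_for_reply(reply, quit_flag, poll=0.01):
    """Wait for `reply`, but give up as soon as `quit_flag` is set.

    Returns True if the reply arrived, False if shutdown was requested.
    """
    while not reply.is_set():
        if quit_flag.is_set():
            return False          # nobody is left to answer; stop waiting
        reply.wait(poll)          # bounded wait instead of blocking forever
    return True

# Example: the peer that would have replied is gone and shutdown has
# been requested, so the waiter returns instead of hanging.
reply = threading.Event()
quit_flag = threading.Event()
quit_flag.set()
gave_up = not wait_for_reply(reply, quit_flag)
```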

Comment 8 Corey Marthaler 2005-03-14 22:34:39 UTC

Fix verified.