Bug 146327
| Summary: | cman_tool leave simultaneously on all nodes causes the "last" one to hang | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
| Component: | cman | Assignee: | Christine Caulfield <ccaulfie> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 4 | CC: | cluster-maint |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2005-03-14 22:34:39 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Marthaler 2005-01-26 23:24:53 UTC
How do you manage to do it "at the same time"? Every time I try it, most of the nodes won't leave because they are already doing a transition to remove the first node. In theory (ahem) this should time out once the last node notices that the rest have gone away.

I open sessions to all nodes and then use the "Send Input to All Sessions" ability from this window manager under the "View" tab. I waited quite a while, so I'm not too sure it would time out eventually. It looked pretty hung, but I could wait and actually see, if you wanted me to.

"Window Manager"? "View Tab"? What are these things of which you speak? Is that anything like a screen session? The nearest I can get is screen's

:at bench# stuff 'cman_tool leave'\012

which still isn't quick enough to catch the others out. If you've waited more than a couple of minutes and it's not timed out, then I suspect it's not going to. The worst case is TRANSITION_RESTARTS * TRANSITION_TIMER (10 x 15 seconds, i.e. 2.5 minutes). So it looks like the transition timer probably isn't firing.

OK, I've managed to reproduce this with a slightly hacked-up cnxman.c (rip the transition check out of the ioctl code). I need to run some more tests over the weekend. The last node will still take a couple of minutes to die, but it's such an odd circumstance that I'm not going to lose any sleep over it. What is really needed here is something like VMS's CLUSTER_SHUTDOWN option, but that will have to wait.

The heartbeat thread didn't take any notice of the "quit_threads" flag, relying instead on its friends to shut it down. That was not reliable when we were the last node out of a cluster.

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
new revision: 1.57; previous revision: 1.56
done

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
new revision: 1.44.2.7; previous revision: 1.44.2.6
done

Still seeing this, although not as often.

Take 2: there were places where threads could have been blocked waiting for things that were just never going to happen.

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v <-- cnxman.c
new revision: 1.48; previous revision: 1.47
done

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
new revision: 1.59; previous revision: 1.58
done

RHEL4 branch:

Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/cnxman.c,v <-- cnxman.c
new revision: 1.42.2.6; previous revision: 1.42.2.5
done

Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/membership.c,v <-- membership.c
new revision: 1.44.2.8; previous revision: 1.44.2.7
done

Fix verified.
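For illustration, here is a minimal userspace analogue of the first fix described above: the heartbeat loop checks the quit_threads flag itself and sleeps with a timed wait, so it wakes promptly when the flag is raised instead of relying on another thread to shut it down. The real change was inside cman-kernel's membership.c; the names here (heartbeat_thread, quit_cond, the 5-second interval) are assumptions for the sketch, not the actual cman code.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  quit_cond = PTHREAD_COND_INITIALIZER;
static int quit_threads;                 /* set on the "cman_tool leave" path */

static void *heartbeat_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!quit_threads) {          /* the fix: poll the flag ourselves */
                printf("heartbeat\n");   /* placeholder for the real work */

                /* Sleep one interval, but wake immediately if the quit
                 * flag is raised while we sleep. */
                struct timespec deadline;
                clock_gettime(CLOCK_REALTIME, &deadline);
                deadline.tv_sec += 5;    /* assumed heartbeat interval */
                pthread_cond_timedwait(&quit_cond, &lock, &deadline);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t hb;
        pthread_create(&hb, NULL, heartbeat_thread, NULL);

        sleep(2);                        /* simulate the node leaving */
        pthread_mutex_lock(&lock);
        quit_threads = 1;                /* raise the flag ...          */
        pthread_cond_broadcast(&quit_cond); /* ... and wake the sleeper */
        pthread_mutex_unlock(&lock);

        pthread_join(hb, NULL);          /* joins promptly, not after 5 s */
        return 0;
}
```

Without the timed wait and the in-loop flag check, the last node out of the cluster has nobody left to stop its heartbeat thread, which is the hang the report describes.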
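The "take 2" fix addresses waits that could never be satisfied on the last node: any thread blocking on an event must also watch the quit flag. A sketch of that pattern follows, with the same caveat that these names (wait_for_event, event_arrived) are illustrative, not the cman-kernel source.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int event_arrived;
static int quit_threads;

/* Before: while (!event_arrived) pthread_cond_wait(...) could block
 * forever once the rest of the cluster is gone.  After: include
 * quit_threads in the wait condition so shutdown always unblocks us. */
static int wait_for_event(void)
{
        pthread_mutex_lock(&lock);
        while (!event_arrived && !quit_threads)
                pthread_cond_wait(&cond, &lock);
        int ok = event_arrived;          /* 0 means we bailed out to quit */
        pthread_mutex_unlock(&lock);
        return ok;
}
```

A caller then treats a zero return as "shutting down" and unwinds, rather than retrying a wait that was just never going to complete.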