Red Hat Bugzilla – Bug 315631
conga doesn't handle the cluster restart operation properly
Last modified: 2009-04-16 18:59:30 EDT
Description of problem:
Problems occur when attempting the 'restart cluster' operation. From what I've
seen, different things will happen depending on what state the cluster is in
before the restart is executed.
Scenario 1: Start with the cluster stopped
Then, after attempting the 'restart' with conga, the cluster will usually start
properly on all nodes, but sometimes it will fail to start the service on one of
the nodes in the cluster.
Scenario 2: Start with the cluster started but no clvmd or rgmanager
Then, after attempting the 'restart' with conga, the cluster will most likely end
up with all nodes in the stopped state, with no start operation apparently
attempted, but sometimes it will start on a subset of the cluster.
Scenario 3: Start with the cluster started and all services running
Then, after attempting the 'restart' with conga, the cluster will either end up
completely stopped without a start being attempted, end up in a hung loop due to
timing issues, or the restart will appear to work on just a subset of the nodes.
I've tried these commands manually countless times and never seen issues, that is:

for i in rgmanager clvmd cman; do
    service $i stop
done
for i in cman clvmd rgmanager; do
    service $i start
done
Version-Release number of selected component (if applicable):
Problems occur because some nodes may be starting while others are still in the
process of stopping. According to sdake, this ought to work (it definitely
doesn't consistently work presently), but we can easily work around it in conga.
I don't see a complete fix. Nodes aren't reliably restarting and rejoining the
cluster after a cluster restart; it seems like I'm hitting scenarios 2 and 3 above.
I've had one case where, after the restart, one node was up and another was
neither started nor in the cluster. I've had another case which seems stalled at
"Please be patient - this cluster's configuration is being modified.", with a
page reload roughly every 5 seconds.
Is there other behavior I should expect to see, or perhaps a clearer case to test?
The desired behavior is that all nodes are stopped before any node is restarted.
Sometimes problems in the cluster stack prevent nodes from stopping (and/or
starting) cleanly, and there's nothing conga can do to fix this.
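That desired ordering can be sketched as a dry run that just prints the plan (the node names and the ssh transport are hypothetical stand-ins; conga actually drives this through its own agents):

```shell
#!/bin/sh
# Dry-run sketch of the desired restart ordering (hypothetical node names).
# Phase 1 stops the whole stack on every node; phase 2 starts nothing until
# phase 1 has covered all nodes.
NODES="node1 node2 node3"

print_plan() {
    # Phase 1: stop on every node, top of the stack first.
    for node in $NODES; do
        for svc in rgmanager clvmd cman; do
            echo "ssh $node service $svc stop"
        done
    done
    # Phase 2: only once every node is down, start in reverse order.
    for node in $NODES; do
        for svc in cman clvmd rgmanager; do
            echo "ssh $node service $svc start"
        done
    done
}

print_plan
```

The point of the two separate loops is that no start command is ever emitted before the last stop command, which is exactly the interleaving the buggy behavior above suggests is being violated.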
Can you check to see whether there are init scripts that are hung on the nodes
that aren't in the state you expect them to be in? In my experience, more often
than not, the culprit tends to be clvmd.
No hung init scripts that I can see. Currently, it looks like one of the nodes is
dropping out of the cluster entirely when restarted. Without a restart, it
seemed to be fine and the cluster held quorum.
Double-checking a few more times... but I've definitely noted exactly the trend
described in the first comment, across many restarts.
Do you get different results if you stop the cluster using conga, then start it
once that has finished?
-(~:$)-> service cman stop
Stopping virtual machine fencing host... done
Stopping fencing... done
Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
-(root@dell-pesc430-01:pts/0)-(0 jobs)-(98:23)-(Thu May 15:16:27:29)-
-(~:$)-> service cman status
fence_xvmd is stopped
-(root@dell-pesc430-01:pts/0)-(0 jobs)-(99:24)-(Thu May 15:16:27:38)-
-(~:$)-> service cman start
Enabling workaround for Xend bridged networking... done
Loading modules... done
Mounting configfs... done
Starting ccsd... done
Starting cman... done
Starting daemons... done
Starting fencing... done
Starting virtual machine fencing host... done
[ OK ]
After that manual restart, Conga seems to mostly work (basic smoke test). But,
luci - cluster - nodes - properties shows rgmanager as not running
-(~:$)-> service rgmanager status
clurgmgrd dead but pid file exists
The last command previously executed for rgmanager was only a status check, and
it indicated normal running:
-(~:$)-> service rgmanager status
clurgmgrd (pid 4795 4794) is running...
Conga reports no problems if the restart is performed from the shell, and the
node remains in the cluster rather than an indefinite hang in the UI.
I'll debug the cman script separately, unless it's related to comment 0.
Changing to NEEDINFO, please let me know what other info is useful. I'm
rebuilding another cluster to see if it's reproducible with a fresh install.
By "Conga seems to mostly work", I also mean that the node in question is still
participating in the cluster.
If the init scripts fail to either start or stop the cluster (whichever you're
doing), conga isn't smart enough to figure out what's going on, and restarting
the cluster won't work. re: above, if the cman script won't stop and gives that
error (about leaving the cluster), you're probably SOL no matter whether you're
using conga to manage it or whether you're doing it manually, and manual
intervention is likely your best shot to get the cluster back into a sensible
state (might need to do a cman_tool kill -n <node>). The best we can do in conga is
to issue the correct commands (stop everywhere, then start everywhere). If
something fails along the way, it's a bug in some other component.
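A minimal sketch of that stop-everywhere-then-start-everywhere policy, with a guard against the half-restarted state described above (run_on and the node names are hypothetical stand-ins for however commands actually reach a node):

```shell
#!/bin/sh
# Sketch only: run_on stands in for the real transport (ssh, ricci, ...).
NODES="node1 node2 node3"

run_on() { echo "[$1] $2"; }

restart_cluster() {
    failed=0
    # Stop phase: top of the stack first, on every node.
    for node in $NODES; do
        for svc in rgmanager clvmd cman; do
            run_on "$node" "service $svc stop" || failed=1
        done
    done
    # If any stop failed, starting now would recreate the mixed
    # stopping/starting state from this bug; bail out instead.
    if [ "$failed" -ne 0 ]; then
        echo "stop phase failed; not starting (manual intervention needed)" >&2
        return 1
    fi
    # Start phase: reverse order, only once everything is down.
    for node in $NODES; do
        for svc in cman clvmd rgmanager; do
            run_on "$node" "service $svc start" || return 1
        done
    done
}
```

The design choice is simply that a failed stop aborts the whole restart rather than letting some nodes race ahead into the start phase; as noted above, recovering from a stuck stop is then a manual job, not something conga can paper over.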
That completely clears it up for me. Changing state to verified.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.