Description of problem:

Problems occur when attempting the 'restart cluster' operation. From what I've seen, different things happen depending on what state the cluster is in before the restart is executed.

Scenario 1: Start with the cluster stopped.
After attempting the restart with conga, the cluster will usually start properly on all nodes, but sometimes it will fail to start the service on one of the nodes in the cluster.

Scenario 2: Start with the cluster started but without clvmd or rgmanager running.
After attempting the restart with conga, the cluster will most likely end up with all nodes in the stopped state, appearing not to have attempted the start at all; sometimes it will start on only a subset of the cluster.

Scenario 3: Start with the cluster started and all services running.
After attempting the restart with conga, the cluster will either end up completely stopped without a start attempted, end up in a hung loop due to timing issues, or the restart will appear to work on only a subset of the nodes.

I've run these commands manually countless times and never seen issues, that is:

for i in rgmanager clvmd cman; do
    service $i stop
done

for i in cman clvmd rgmanager; do
    service $i start
done

Version-Release number of selected component (if applicable):
2.6.18-48.el5
cman-2.0.73-1.el5
luci-0.10.0-6.el5
Problems occur because some nodes may be starting while others are still in the process of stopping. According to sdake, this ought to work (at present it clearly doesn't, at least not consistently), but we can easily work around it in conga.
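The workaround amounts to a two-phase restart: issue every stop first, and only issue starts once all nodes have finished stopping. A minimal sketch of that ordering follows; the node list is hypothetical and run_on is a stand-in that just echoes the command it would dispatch (in conga this would go through ricci, not echo):

```shell
#!/bin/sh
# Two-phase cluster restart sketch. NODES and run_on are illustrative
# stand-ins: run_on only echoes the command it would send to a node.
NODES="node1 node2 node3"
run_on() { node=$1; shift; echo "$node: $*"; }

restart_cluster() {
  # Phase 1: stop the stack everywhere, in reverse dependency order,
  # before any node is allowed to start again.
  for n in $NODES; do
    for svc in rgmanager clvmd cman; do
      run_on "$n" service "$svc" stop
    done
  done
  # Phase 2: only after every node reports stopped, start in forward order.
  for n in $NODES; do
    for svc in cman clvmd rgmanager; do
      run_on "$n" service "$svc" start
    done
  done
}

restart_cluster
```

The point is the barrier between the two loops: no start command is issued anywhere until every stop has been issued everywhere.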
I don't see a complete fix. Nodes aren't reliably restarting and rejoining the cluster after a cluster restart. It seems I'm mostly hitting scenarios 2 and 3 above. I've had one case where, after the restart, one node was up and another had not started and was not in the cluster. I've had another case that seems to be stalled at "Please be patient - this cluster's configuration is being modified.", with a page reload roughly every 5 seconds. Is there other behavior I should expect to see, or a clearer test case that would show improvement?
The desired behavior is that all nodes are stopped before any node is restarted. Sometimes problems in the cluster stack prevent nodes from stopping (and/or starting) cleanly, and there's nothing conga can do to fix this. Can you check to see whether there are init scripts that are hung on the nodes that aren't in the state you expect them to be in? In my experience, more often than not, the culprit tends to be clvmd.
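One quick way to look for a wedged init script is to list any long-lived cluster-stack processes. A sketch, with the process names assumed from the usual RHEL5 cluster stack:

```shell
# List cluster-stack processes with their elapsed time. A "service ... stop"
# that has been sitting for minutes, or a leftover clvmd, is the usual sign
# of a hung init script. The awk filter exits 0 whether or not it matches.
hung_cluster_procs() {
  ps -eo pid,etime,args | awk '/clvmd|cman|rgmanager|fenced/ && !/awk/'
}

hung_cluster_procs
```

On a healthy (or fully stopped) node this prints nothing; anything it does print is worth checking against how long ago the stop was issued.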
No hung init scripts that I can see. Currently, it looks like one of the nodes is dropping out of the cluster entirely when restarted. Without a restart, it seemed to be fine and the cluster held quorum. I'm double-checking a few more times, but I've definitely observed exactly the trend described in the first comment, across many restarts.
Do you get different results if you stop the cluster using conga, then start it once that has finished?
-(~:$)-> service cman stop
Stopping cluster:
   Stopping virtual machine fencing host... done
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]
-(root@dell-pesc430-01:pts/0)-(0 jobs)-(98:23)-(Thu May 15:16:27:29)-
-(~:$)-> service cman status
fence_xvmd is stopped
-(root@dell-pesc430-01:pts/0)-(0 jobs)-(99:24)-(Thu May 15:16:27:38)-
-(~:$)-> service cman start
Starting cluster:
   Enabling workaround for Xend bridged networking... done
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
   Starting virtual machine fencing host... done
                                                           [  OK  ]

After that manual restart, Conga seems to mostly work (basic smoke test). However, luci -> cluster -> nodes -> properties shows rgmanager as not running:

-(~:$)-> service rgmanager status
clurgmgrd dead but pid file exists

The previous command executed for rgmanager was only a status check, and it indicated normal operation:

-(~:$)-> service rgmanager status
clurgmgrd (pid 4795 4794) is running...

Conga reports no problems if the restart is performed from the shell, and the node remains in the cluster rather than hanging indefinitely in the UI. I'll debug the cman script separately, unless it's related to comment 0.

Changing to NEEDINFO; please let me know what other info would be useful. I'm rebuilding another cluster to see if this is reproducible with a fresh install.
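"clurgmgrd dead but pid file exists" means the daemon exited without removing its pid file. A small check for that condition; the /var/run path below is the conventional location and is assumed here:

```shell
# Succeeds if the pid file exists but its pid is no longer running,
# i.e. the daemon died without cleaning up after itself.
is_stale_pidfile() {
  pidfile=$1
  [ -f "$pidfile" ] && ! kill -0 "$(cat "$pidfile")" 2>/dev/null
}

# Assumed pid file path for clurgmgrd on RHEL5.
if is_stale_pidfile /var/run/clurgmgrd.pid; then
  echo "stale pid file: rgmanager died uncleanly"
fi
```

A stale pid file by itself only confirms the unclean death; the interesting question is what killed clurgmgrd between the two status checks.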
by "Conga seems to mostly work", I also mean that the node in question is still participating in the cluster
If the init scripts fail to either start or stop the cluster (whichever you're doing), conga isn't smart enough to figure out what's going on, and restarting the cluster won't work.

Re: the above, if the cman script won't stop and gives that error about leaving the cluster, you're probably SOL whether you're using conga to manage it or doing it manually, and manual intervention is likely your best shot at getting the cluster back into a sensible state (you might need to do a cman_tool kill -n <node>).

The best we can do in conga is to issue the correct commands (stop everywhere, then start everywhere). If something fails along the way, it's a bug in some other component.
That completely clears it up for me. Changing state to verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0407.html