Description of problem:

Problems occur when attempting the 'restart cluster' operation. From what I've seen, different things happen depending on what state the cluster is in before the restart is executed.

Scenario 1: Start with the cluster stopped.
After attempting the restart with conga, the cluster will usually start properly on all nodes, but sometimes it will fail to start the service on one of the nodes in the cluster.

Scenario 2: Start with the cluster started but without clvmd or rgmanager running.
After attempting the restart with conga, the cluster will most likely end up with all nodes in the stopped state, appearing not to have attempted the start at all; sometimes it will start on only a subset of the cluster.

Scenario 3: Start with the cluster started and all services running.
After attempting the restart with conga, the cluster will either end up completely stopped without a start attempted, end up in a hung loop due to timing issues, or the restart will appear to work on only a subset of the nodes.

I've run these commands manually countless times and never seen issues, that is:

for i in rgmanager clvmd cman; do
    service $i stop
done

for i in cman clvmd rgmanager; do
    service $i start
done

Version-Release number of selected component (if applicable):
2.6.18-48.el5
cman-2.0.73-1.el5
luci-0.10.0-6.el5
Problems occur because some nodes may be starting while others are still in the process of stopping. According to sdake, this ought to work (at present it clearly doesn't, at least not consistently), but we can easily work around it in conga.
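The workaround amounts to a two-phase restart: issue every stop first, and only issue starts once all nodes have finished stopping. A minimal sketch of that ordering follows; the node list is hypothetical and run_on is a stand-in that just echoes the command it would dispatch (in conga this would go through ricci, not echo):

```shell
#!/bin/sh
# Two-phase cluster restart sketch. NODES and run_on are illustrative
# stand-ins: run_on only echoes the command it would send to a node.
NODES="node1 node2 node3"
run_on() { node=$1; shift; echo "$node: $*"; }

restart_cluster() {
  # Phase 1: stop the stack everywhere, in reverse dependency order,
  # before any node is allowed to start again.
  for n in $NODES; do
    for svc in rgmanager clvmd cman; do
      run_on "$n" service "$svc" stop
    done
  done
  # Phase 2: only after every node reports stopped, start in forward order.
  for n in $NODES; do
    for svc in cman clvmd rgmanager; do
      run_on "$n" service "$svc" start
    done
  done
}

restart_cluster
```

The point is the barrier between the two loops: no start command is issued anywhere until every stop has been issued everywhere.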
I don't see a complete fix. Nodes aren't reliably restarting and rejoining the cluster after a cluster restart. It seems I'm mostly hitting scenarios 2 and 3 above. I've had one case where, after the restart, one node was up and another had not started and was not in the cluster. I've had another case that seems to be stalled at "Please be patient - this cluster's configuration is being modified.", with a page reload roughly every 5 seconds. Is there other behavior I should expect to see, or a clearer test case that would show improvement?
The desired behavior is that all nodes are stopped before any node is restarted. Sometimes problems in the cluster stack prevent nodes from stopping (and/or starting) cleanly, and there's nothing conga can do to fix this. Can you check to see whether there are init scripts that are hung on the nodes that aren't in the state you expect them to be in? In my experience, more often than not, the culprit tends to be clvmd.
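One quick way to look for a wedged init script is to list any long-lived cluster-stack processes. A sketch, with the process names assumed from the usual RHEL5 cluster stack:

```shell
# List cluster-stack processes with their elapsed time. A "service ... stop"
# that has been sitting for minutes, or a leftover clvmd, is the usual sign
# of a hung init script. The awk filter exits 0 whether or not it matches.
hung_cluster_procs() {
  ps -eo pid,etime,args | awk '/clvmd|cman|rgmanager|fenced/ && !/awk/'
}

hung_cluster_procs
```

On a healthy (or fully stopped) node this prints nothing; anything it does print is worth checking against how long ago the stop was issued.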
No hung init scripts that I can see. Currently, it looks like one of the nodes is dropping out of the cluster entirely when restarted. Without a restart, it seemed to be fine and the cluster held quorum. I'm double-checking a few more times, but I've definitely observed exactly the trend described in the first comment, across many restarts.
Do you get different results if you stop the cluster using conga, then start it once that has finished?
-(~:$)-> service cman stop
Stopping cluster:
   Stopping virtual machine fencing host... done
   Stopping fencing... done
   Stopping cman... failed
/usr/sbin/cman_tool: Error leaving cluster: Device or resource busy
                                                           [FAILED]
-(root@dell-pesc430-01:pts/0)-(0 jobs)-(98:23)-(Thu May 15:16:27:29)-
-(~:$)-> service cman status
fence_xvmd is stopped
-(root@dell-pesc430-01:pts/0)-(0 jobs)-(99:24)-(Thu May 15:16:27:38)-
-(~:$)-> service cman start
Starting cluster:
   Enabling workaround for Xend bridged networking... done
   Loading modules... done
   Mounting configfs... done
   Starting ccsd... done
   Starting cman... done
   Starting daemons... done
   Starting fencing... done
   Starting virtual machine fencing host... done
                                                           [  OK  ]

After that manual restart, Conga seems to mostly work (basic smoke test). However, luci -> cluster -> nodes -> properties shows rgmanager as not running:

-(~:$)-> service rgmanager status
clurgmgrd dead but pid file exists

The previous command executed for rgmanager was only a status check, and it indicated normal operation:

-(~:$)-> service rgmanager status
clurgmgrd (pid 4795 4794) is running...

Conga reports no problems if the restart is performed from the shell, and the node remains in the cluster rather than hanging indefinitely in the UI. I'll debug the cman script separately, unless it's related to comment 0.

Changing to NEEDINFO; please let me know what other info would be useful. I'm rebuilding another cluster to see if this is reproducible with a fresh install.
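"clurgmgrd dead but pid file exists" means the daemon exited without removing its pid file. A small check for that condition; the /var/run path below is the conventional location and is assumed here:

```shell
# Succeeds if the pid file exists but its pid is no longer running,
# i.e. the daemon died without cleaning up after itself.
is_stale_pidfile() {
  pidfile=$1
  [ -f "$pidfile" ] && ! kill -0 "$(cat "$pidfile")" 2>/dev/null
}

# Assumed pid file path for clurgmgrd on RHEL5.
if is_stale_pidfile /var/run/clurgmgrd.pid; then
  echo "stale pid file: rgmanager died uncleanly"
fi
```

A stale pid file by itself only confirms the unclean death; the interesting question is what killed clurgmgrd between the two status checks.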
by "Conga seems to mostly work", I also mean that the node in question is still participating in the cluster
If the init scripts fail to either start or stop the cluster (whichever you're doing), conga isn't smart enough to figure out what's going on, and restarting the cluster won't work.

Re: the above, if the cman script won't stop and gives that error about leaving the cluster, you're probably SOL whether you're using conga to manage it or doing it manually, and manual intervention is likely your best shot at getting the cluster back into a sensible state (you might need to do a cman_tool kill -n <node>).

The best we can do in conga is to issue the correct commands (stop everywhere, then start everywhere). If something fails along the way, it's a bug in some other component.
That completely clears it up for me. Changing state to verified.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0407.html