175099 – Unable to bring up an eight node cluster

Summary: Unable to bring up an eight node cluster

Keywords:
Status:	CLOSED DUPLICATE of bug 177340
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	rgmanager
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-12-06 16:54 UTC by Henry Harris
Modified:	2009-04-16 20:18 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-02-02 18:02:43 UTC
Embargoed:

Attachments
Fix from 171153 (pass 1) - increase msg_open timeout for clustat (556 bytes, patch) 2005-12-07 02:15 UTC, Lon Hohberger	no flags
8 node cluster.conf (8.52 KB, text/plain) 2005-12-10 00:45 UTC, Henry Harris	no flags

Description Henry Harris 2005-12-06 16:54:30 UTC

Description of problem: With a cluster configured for eight nodes, I was not
able to get clustat on all eight nodes to show eight nodes in the cluster and
all of the services.  Four nodes (nodes 2-5) showed nodes 1-6 in the cluster
with all of the services.  On two of the nodes (node 1 and node 6), clustat
hung.  On the last two nodes (node 7 and node 8), clustat showed eight nodes in
the cluster but no services, although rgmanager was running.


Version-Release number of selected component (if applicable):
rgmanager-1.9.38-0

How reproducible:
Every time

Steps to Reproduce:
1. Configure eight node cluster
2. Bring up all eight nodes
3. Run clustat on all eight nodes
  
Actual results:
Clustat either hangs or shows inconsistent data from node to node

Expected results:
Clustat should show all nodes and all services when run from any node
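
The expected result can be phrased as a check over the per-node views. The following is only a sketch: it does not run or parse clustat, and the function name `compare_views`, the node names, and the `svc1`..`svc11` service names are placeholders (the count of 11 services is taken from a later comment in this report). The views are filled in by hand from the description above.

```python
def compare_views(views, expected_nodes, expected_services):
    """Return {node: [problem, ...]} for nodes with an incomplete view.

    `views` maps a node name to None (clustat hung) or to a
    (members_seen, services_seen) pair of sets.
    """
    problems = {}
    for node in sorted(expected_nodes):
        view = views.get(node)
        if view is None:
            problems[node] = ["no output (hung)"]
            continue
        members, services = view
        issues = []
        missing_nodes = expected_nodes - members
        missing_services = expected_services - services
        if missing_nodes:
            issues.append("missing nodes: " + ", ".join(sorted(missing_nodes)))
        if missing_services:
            issues.append("missing %d service(s)" % len(missing_services))
        if issues:
            problems[node] = issues
    return problems


# Views as described in this report (names are placeholders).
nodes = {"node%d" % i for i in range(1, 9)}
services = {"svc%d" % i for i in range(1, 12)}
first_six = {"node%d" % i for i in range(1, 7)}

views = {"node1": None, "node6": None}            # clustat hung
for i in range(2, 6):
    views["node%d" % i] = (first_six, services)   # saw only nodes 1-6
for i in (7, 8):
    views["node%d" % i] = (nodes, set())          # saw 8 nodes, no services

report = compare_views(views, nodes, services)
```

In a healthy cluster `report` would be empty; with the views above, every one of the eight nodes is flagged for one reason or another.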

Additional info:

Comment 1 Henry Harris 2005-12-06 21:13:51 UTC

Could this bug be related to bug #171153?  We have multiple applications that 
run on every node in the cluster that all run clustat at various times.  Can 
we get the fix you just referred to for bug #171153?

Comment 2 Lon Hohberger 2005-12-06 21:48:18 UTC

I increased the connect timeout in clustat.  When rgmanager's busy, 2 seconds is
not enough to report consistent data; so yes, it's likely they are somewhat
related (if not the same).

If this does not solve it, I will need to add a message queueing / handling
thread to rgmanager.  This is not difficult at all, but I was trying to avoid it
for simplicity.  That way, accept() will happen immediately, even if it takes a
long time to get the appropriate cluster locks to get service data...
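
To illustrate the two ideas in this comment, here is a rough Python sketch; it is not rgmanager code (rgmanager is written in C) and the names `serve`, `query`, and `build_reply` are made up for the example. The accept loop only queues connections, a separate worker thread does the slow part (standing in for taking cluster locks), and the client takes its timeout as a parameter rather than hard-coding 2 seconds.

```python
import queue
import socket
import threading
import time


def serve(listener, build_reply, stop):
    """Accept connections right away; build replies on a worker thread."""
    pending = queue.Queue()

    def worker():
        while True:
            conn = pending.get()
            if conn is None:              # sentinel: shut down
                return
            with conn:
                # The slow part runs here, off the accept path.
                conn.sendall(build_reply())

    thread = threading.Thread(target=worker)
    thread.start()
    listener.settimeout(0.1)              # wake up to check `stop`
    while not stop.is_set():
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue
        conn.setblocking(True)
        pending.put(conn)                 # hand off without waiting
    pending.put(None)
    thread.join()


def query(address, timeout):
    """Client side: connect and read one short reply within `timeout`."""
    with socket.create_connection(address, timeout=timeout) as sock:
        return sock.recv(1024)
```

With this split, a slow `build_reply` delays only the reply, not the accept loop, so the remaining knob on the client side is how long `query` is willing to wait.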

Comment 3 Henry Harris 2005-12-06 23:01:12 UTC

Can you please provide the source code fix?  I looked at the source code but 
was not able to find a timeout for clu_connect.

Comment 4 Lon Hohberger 2005-12-07 02:15:21 UTC

Created attachment 121956
Fix from 171153 (pass 1) - increase msg_open timeout for clustat

Comment 5 Lon Hohberger 2005-12-08 19:31:40 UTC


*** This bug has been marked as a duplicate of 175033 ***

Comment 6 Henry Harris 2005-12-10 00:45:26 UTC

Created attachment 122091
8 node cluster.conf

Comment 7 Henry Harris 2005-12-10 00:47:41 UTC

I loaded both magma-plugins-1.0.3-0.3bz175033 and rgmanager-1.9.43-0 (see bug 
#175033) on an eight node cluster.  We are still having problems getting 
clustat to run on an eight node cluster with 11 services.  Sometimes clustat 
hangs, sometimes the IP services do not all start, etc.  The cluster.conf is 
attached in comment #6.  The /proc/cluster/services file is 606 bytes.

Comment 8 Henry Harris 2005-12-10 00:48:42 UTC

Reopening per previous comment.

Comment 9 Corey Marthaler 2005-12-13 18:45:17 UTC

Just a note that the tested way of adding an additional node to a cluster that
is not a two-node cluster is to:
* add the new node with the GUI on the quorate cluster
* propagate that new file to the quorate cluster using the GUI
* copy that new file by hand to the new node
* run 'service ccsd start' and then 'service cman/lock_gulmd start' on the new node
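
The last step above could be scripted roughly as follows. This is a hedged sketch: it assumes 'service cman/lock_gulmd start' means starting whichever of the two lock managers the cluster uses, the helper names `new_node_commands` and `start_new_node` are invented for the example, and it defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumption: exactly one of these is started, depending on the cluster.
LOCK_MANAGERS = ("cman", "lock_gulmd")


def new_node_commands(lock_manager):
    """Commands to run on the new node, in order."""
    if lock_manager not in LOCK_MANAGERS:
        raise ValueError("unknown lock manager: %r" % (lock_manager,))
    return [
        ["service", "ccsd", "start"],
        ["service", lock_manager, "start"],
    ]


def start_new_node(lock_manager, dry_run=True):
    """Print the commands, or run them and stop at the first failure."""
    commands = new_node_commands(lock_manager)
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands
```

Calling `start_new_node("cman")` prints the two commands without running anything; pass `dry_run=False` on the new node itself, after the updated cluster.conf has been copied over.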

Comment 10 Henry Harris 2006-01-05 22:39:55 UTC

Rgmanager version 1.9.43 and dlm version 1.0.0-5 have helped with this problem.
With the same eight node cluster configuration as in comment #6 above, I have
found that if I bring up 5 of the 8 nodes and allow the cluster to gain quorum
and stabilize, I can then boot up the remaining three nodes without any problem.
If all 8 nodes are brought up at the same time, clustat will hang on one or
more nodes (usually 3).  If I reboot the nodes that have hung on clustat, then
they come up and clustat runs as expected.  I will use this as a workaround,
but would be interested in your comments or a fix if possible.
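
For reference, the "5 of 8" figure matches a simple-majority quorum. The sketch below assumes one vote per node and a quorum of more than half the expected votes; the actual vote counts come from cluster.conf, and the function names and node names are only illustrative.

```python
def quorum(expected_votes):
    """Simple-majority quorum: more than half of the expected votes."""
    return expected_votes // 2 + 1


def staged_startup(nodes, votes_per_node=1):
    """Split nodes into a first wave that reaches quorum and the rest."""
    needed = quorum(len(nodes) * votes_per_node)
    first = -(-needed // votes_per_node)      # ceiling division
    return nodes[:first], nodes[first:]


nodes = ["node%d" % i for i in range(1, 9)]
first_wave, second_wave = staged_startup(nodes)
print(len(first_wave), len(second_wave))      # prints "5 3"
```

The first wave is booted and left to gain quorum and settle; only then is the second wave started.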

Comment 11 Lon Hohberger 2006-02-02 18:02:43 UTC

This is actually related to the rgmanager performance problem in #177340 as well
as the previous problems addressed in U2.

*** This bug has been marked as a duplicate of 177340 ***
