Red Hat Bugzilla – Bug 175099
Unable to bring up an eight node cluster
Last modified: 2009-04-16 16:18:58 EDT
Description of problem: With a cluster configured for eight nodes, I was not
able to get clustat on all eight nodes to show eight nodes in the cluster and
all of the services. Four nodes (nodes 2-5) showed nodes 1-6 in the cluster
with all of the services. On two of the nodes (node 1 & node 6), clustat
hung. On the last two nodes (node 7 & node 8), clustat showed eight nodes in
the cluster but no services, although rgmanager was running.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Configure eight node cluster
2. Bring up all eight nodes
3. Run clustat on all eight nodes
Actual results: Clustat either hangs or shows inconsistent data from node to node
Expected results: Clustat should show all nodes and all services when run from any node
Could this bug be related to bug #171153? We have multiple applications that
run on every node in the cluster that all run clustat at various times. Can
we get the fix you just referred to for bug #171153?
I increased the connect timeout in clustat. When rgmanager's busy, 2 seconds is
not enough to report consistent data; so yes, it's likely they are somewhat
related (if not the same).
If this does not solve it, I will need to add a message queueing / handling
thread to rgmanager. This is not difficult at all, but I was trying to avoid it
for simplicity. That way, accept() will happen immediately, even if it takes a
long time to get the appropriate cluster locks to get service data...
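The queueing idea described above can be sketched as follows. This is only an illustration of the pattern, written in Python rather than rgmanager's C, and it is not rgmanager's actual code: one thread does nothing but accept() and enqueue connections, while a second thread does the slow work. The names slow_status_lookup, acceptor, worker and start_server are hypothetical, and the 0.2 second sleep merely stands in for waiting on cluster locks.

```python
import queue
import socket
import threading
import time


def slow_status_lookup():
    """Hypothetical stand-in for gathering service data under cluster locks."""
    time.sleep(0.2)
    return b"service data\n"


def acceptor(listener, pending):
    """Accept connections as soon as they arrive and queue them for later."""
    while True:
        try:
            conn, _addr = listener.accept()
        except OSError:
            return  # listener was closed
        pending.put(conn)


def worker(pending):
    """Answer queued requests one at a time; the slow work happens only here."""
    while True:
        conn = pending.get()
        if conn is None:
            return  # shutdown sentinel
        with conn:
            conn.sendall(slow_status_lookup())


def start_server():
    """Start both threads; return (listener, pending, threads)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(16)
    pending = queue.Queue()
    threads = [
        threading.Thread(target=acceptor, args=(listener, pending), daemon=True),
        threading.Thread(target=worker, args=(pending,), daemon=True),
    ]
    for t in threads:
        t.start()
    return listener, pending, threads
```

With this split, a client that connects while the worker is busy is still accepted right away and simply waits in the queue for its reply, so a short connect timeout on the client side is less likely to expire.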
Can you please provide the source code fix? I looked at the source code but
was not able to find a timeout for clu_connect.
Created attachment 121956
Fix from 171153 (pass 1) - increase msg_open timeout for clustat
*** This bug has been marked as a duplicate of 175033 ***
Created attachment 122091
8 node cluster.conf
I loaded both magma-plugins-1.0.3-0.3bz175033 and rgmanager-1.9.43-0 (see bug
#175033) on an eight node cluster. We are still having problems getting
clustat to run on an eight node cluster with 11 services. Sometimes clustat
hangs, sometimes the IP services do not all start, etc. The cluster.conf is
attached in comment #6. The /proc/cluster/services file is 606 bytes.
Reopening per previous comment.
Just a note that the tested way of adding an additional node to a cluster that
is not a two-node cluster is to:
* add the new node with the GUI on the quorate cluster
* propagate that new file to the quorate cluster using the GUI
* copy that new file by hand to the new node
* run 'service ccsd start' and then either 'service cman start' or 'service lock_gulmd start' (whichever lock manager the cluster uses) on the new node
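The last step of the list above, as a minimal sketch. It only builds the command strings in the order given and does not run anything. The function name new_node_start_commands and its lock_manager parameter are hypothetical, and the sketch assumes the final step means starting either cman or lock_gulmd, not both.

```python
def new_node_start_commands(lock_manager):
    """Return the service commands to run on the new node, in order.

    lock_manager is "cman" or "lock_gulmd" (assumption: the step above
    names these as alternatives, depending on the cluster's lock manager).
    """
    if lock_manager not in ("cman", "lock_gulmd"):
        raise ValueError("lock_manager must be 'cman' or 'lock_gulmd'")
    # Order follows the list above: ccsd first, then the cluster manager.
    return ["service ccsd start", "service %s start" % lock_manager]


for command in new_node_start_commands("cman"):
    print(command)
```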
Rgmanager version 1.9.43 and dlm version 1.0.0-5 have helped with this problem.
With the same eight node cluster configuration as in comment #6 above, I have
found that if I bring up 5 of the 8 nodes and allow the cluster to gain quorum
and stabilize, I can then boot up the remaining three nodes without any problem.
If all 8 nodes are brought up at the same time, clustat will hang on one or
more nodes (usually 3). If I reboot the nodes that have hung on clustat, then
they come up and clustat runs as expected. I will use this as a workaround,
but would be interested in your comments or a fix if possible.
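The staged start-up used as a workaround above can be sketched like this. It assumes one vote per node and a simple-majority quorum of floor(n / 2) + 1; the node names and the helper names quorum_votes and staged_boot_plan are hypothetical. The sketch only computes the two batches and does not boot anything.

```python
def quorum_votes(expected_votes):
    """Votes needed for quorum, assuming a simple majority: floor(n / 2) + 1."""
    return expected_votes // 2 + 1


def staged_boot_plan(nodes):
    """Split nodes into a first batch just large enough for quorum, and the rest."""
    first = quorum_votes(len(nodes))
    return nodes[:first], nodes[first:]


nodes = ["node%d" % i for i in range(1, 9)]  # hypothetical node names
first_batch, second_batch = staged_boot_plan(nodes)
print("boot first, then wait for quorum:", first_batch)
print("boot afterwards:", second_batch)
```

For eight nodes this gives a first batch of five and a second batch of three, which matches the split described in the comment above.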
This is actually related to the rgmanager performance problem in bug #177340 as
well as the previous problems addressed in U2.
*** This bug has been marked as a duplicate of 177340 ***