Description of problem:
With a cluster configured for eight nodes, I was not able to get clustat on all eight nodes to show eight nodes in the cluster and all of the services. Four nodes (nodes 2-5) showed nodes 1-6 in the cluster with all of the services. On two of the nodes (node 1 & node 6), clustat hung. On the last two nodes (node 7 & node 8), clustat showed eight nodes in the cluster but no services, although rgmanager was running.

Version-Release number of selected component (if applicable):
rgmanager-1.9.38-0

How reproducible:
Every time

Steps to Reproduce:
1. Configure an eight-node cluster
2. Bring up all eight nodes
3. Run clustat on all eight nodes

Actual results:
Clustat either hangs or shows inconsistent data from node to node

Expected results:
Clustat should show all nodes and all services when run from any node

Additional info:
Could this bug be related to bug #171153? We have multiple applications that run on every node in the cluster that all run clustat at various times. Can we get the fix you just referred to for bug #171153?
I increased the connect timeout in clustat. When rgmanager's busy, 2 seconds is not enough to report consistent data; so yes, it's likely they are somewhat related (if not the same). If this does not solve it, I will need to add a message queueing / handling thread to rgmanager. This is not difficult at all, but I was trying to avoid it for simplicity. That way, accept() will happen immediately, even if it takes a long time to get the appropriate cluster locks to get service data...
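For readers following along, the queueing idea described above could look roughly like the sketch below. It is written in Python for brevity and is not rgmanager code: `serve`, `handle`, and the single worker thread are all hypothetical names chosen for illustration. The accept loop only enqueues connections, so a slow status lookup (for example, one waiting on cluster locks) never delays the next accept().

```python
import queue
import socket
import threading
import time


def serve(listener, handle, stop):
    """Accept connections right away and hand each one to a worker thread.

    listener: a bound, listening socket (hypothetical stand-in for the daemon's socket)
    handle:   callable returning the reply bytes; may be slow
    stop:     threading.Event used to end the accept loop
    """
    pending = queue.Queue()

    def worker():
        while True:
            conn = pending.get()
            if conn is None:            # sentinel: no more work
                break
            with conn:
                conn.sendall(handle())  # slow part runs off the accept loop

    t = threading.Thread(target=worker, daemon=True)
    t.start()

    listener.settimeout(0.1)            # wake up regularly to check `stop`
    while not stop.is_set():
        try:
            conn, _addr = listener.accept()
        except socket.timeout:
            continue
        pending.put(conn)               # enqueue and go straight back to accept()

    pending.put(None)
    t.join()
```

With this layout a client still waits for its own reply, but it is no longer stuck behind the accept() call itself; whether that is enough for clustat depends on how long the lock acquisition takes relative to the client timeout.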
Can you please provide the source code fix? I looked at the source code but was not able to find a timeout for clu_connect.
Created attachment 121956: Fix from 171153 (pass 1) - increase msg_open timeout for clustat
*** This bug has been marked as a duplicate of 175033 ***
Created attachment 122091: 8 node cluster.conf
I loaded both magma-plugins-1.0.3-0.3bz175033 and rgmanager-1.9.43-0 (see bug #175033) on an eight node cluster. We are still having problems getting clustat to run on an eight node cluster with 11 services. Sometimes clustat hangs, sometimes the IP services do not all start, etc. The cluster.conf is attached in comment #6. The /proc/cluster/services file is 606 bytes.
Reopening per previous comment.
Just a note that the tested way of adding an additional node to a non-two-node cluster is to:
* add the new node with the GUI on the quorate cluster
* propagate that new file to the quorate cluster using the GUI
* copy that new file by hand to the new node
* run 'service ccsd start' and then 'service cman/lock_gulmd start' on the new node
Rgmanager version 1.9.43 and dlm version 1.0.0-5 have helped with this problem. With the same eight-node cluster configuration as in comment #6 above, I have found that if I bring up 5 of the 8 nodes and allow the cluster to gain quorum and stabilize, I can then boot the remaining three nodes without any problem. If all 8 nodes are brought up at the same time, clustat will hang on one or more nodes (usually 3). If I reboot the nodes that have hung on clustat, they come up and clustat runs as expected. I will use this as a workaround, but would be interested in your comments or a fix if possible.
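The 5-of-8 split in this workaround matches a simple majority quorum. As a rough illustration only, assuming one vote per node and a strict-majority quorum rule (neither is confirmed by the attached cluster.conf), the staged boot can be worked out as below; `quorum_votes` and `staged_boot` are hypothetical helper names.

```python
def quorum_votes(expected_votes):
    """Smallest vote count that is a strict majority (assumed quorum rule)."""
    return expected_votes // 2 + 1


def staged_boot(nodes):
    """Split nodes into a first wave big enough for quorum and a second wave."""
    need = quorum_votes(len(nodes))
    return nodes[:need], nodes[need:]


nodes = [f"node{i}" for i in range(1, 9)]   # assumed names: node1 .. node8
first, rest = staged_boot(nodes)
print(len(first), len(rest))  # 5 3
```

The first wave is booted and left to gain quorum and settle; the second wave is started only afterwards.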
This is actually related to the rgmanager performance problem in #177340 as well as the previous problems addressed in U2.
*** This bug has been marked as a duplicate of 177340 ***