Red Hat Bugzilla – Bug 471290
qpidd should monitor cman quorum status.
Last modified: 2009-10-06 13:53:23 EDT
Description of problem:
Clustered qpidd daemon needs to monitor cluster quorum status using libcman and shut down if the node loses contact with the quorum. cman fencing can take several seconds to power down a node so there's a window where messages could be lost if qpidd doesn't monitor this and shut down immediately.
Start a cluster using cman to configure quorum, verify that qpidd shuts down immediately if one node is disconnected from the cluster.
Implemented but not properly tested. Needs testing in a full cman cluster.
If --cluster-cman is enabled but cman cannot be contacted for some reason, the broker seg-faults rather than exiting cleanly.
(03:35:33 PM) Alan Conway: astitcher_wfh: if the cluster plugin throws during initialization, I'm seeing a core dump in ~PollerHandlerPrivate trying to delete a locked mutex.
(03:36:39 PM) astitcher_wfh: aconway: interesting
(03:36:41 PM) Alan Conway: astitcher_wfh: ~PollerHandle() calls PHDeletionManager.markForDeletion(impl) with impl->lock() held. That seems incorrect if markFor may delete the object.
(03:41:51 PM) Alan Conway: astitcher_wfh: maybe I'm abusing the poller model, this is happening in a PollableQueue that is constructed then deleted because of an exception. I get the crash whether or not I start() the queue
#1 above has been fixed. Still need to test in a cman cluster.
Set up a cluster of 3 nodes. The config file can be set up with system-config-cluster.
Check the cluster suite docs for more details - in particular this seems to be picky about network setup. All we need for this test is to get cman running on 3 nodes, we don't need any other cluster services.
Try the following scenarios:
- start cman on node 1 only.
- start qpidd on node 1 with info logging. You should see a "waiting for cluster" log message
- start a qpidd client. The client should hang
- start cman on node 2, qpidd on node1 should start processing messages, the client should complete normally
- start qpidd on node 2, should start normally, clients should work normally.
- stop cman on node 2
- run clients against qpidd on each node. qpidd should shut down as soon as the clients connect, the clients should not be able to send or receive any messages.
Got a cman setup up and running and let the whole cluster settle with 3 nodes. Tried four different test scenarios to see if I had understood things correctly.
The expected_votes was set to 5, since 3/5 is a majority of a healthy cluster.
** Test scenario 1:
Shut down the cman daemons on two of the nodes, waited, then started qpidd with the following command line:
qpidd -t --no-module-dir --load-module=/usr/lib64/qpid/daemon/cluster.so --cluster-cman --cluster-name=mrgtest
The daemon did not log the expected message ("waiting for cluster"). Started perftest against this node, and it managed to send and receive messages.
At this point, only one node was running cman and this node was also running qpidd.
** Test scenario 2:
Started cman on the two other nodes (so cman is now running on all 3 nodes). Waited for things to settle and restarted qpidd with the same command line as above. No "waiting for cluster" message was seen, only "first in cluster".
Tried to run perftest against the broker on this node; perftest did not hang, as expected.
** Test scenario 3:
Killed one qpidd, to see if the others followed. None followed, and I had two qpidd still alive.
** Test scenario 4:
Restarted the killed qpidd, then shut down cman on that node. The two running qpidd processes received and logged the following message line:
2008-dec-05 15:41:31 info 18.104.22.168:5623(READY) member update: 22.214.171.124:7017(member) 126.96.36.199:5623(member) 188.8.131.52:6397(member)
As I understand the bug description, this is not the expected behaviour. For scenarios 3 and 4, I would expect all running qpidd processes to shut down. This did not happen.
For scenarios 1 and 2, I would expect the perftest program to "hang" until all three nodes were available. This did not happen.
Created attachment 325905 [details]
cluster configuration used for test scenario 1 and 2
Created attachment 325906 [details]
cluster configuration used for test scenario 3 and 4
Just some version info: the RPMs were tagged with SVN rev. 722891, the latest available on Fri, Dec 5.
Retested this bug, this time on a 2 node cluster. One of the nodes got a vote value of 2, and the second one got a vote value of 1. The cluster is set up to require 3 votes to be healthy, which means that if the node with 2 votes disappears, the remaining node cannot reach quorum on its own.
I ran these test scenarios, using perftest on the client side. I waited a few minutes between each action and tried perftest twice (with some time in between) after a status change.
** 1. Smoke-test - Starting 2 nodes
No problem, both nodes receive messages. qpid-tool says clusterSize=2.
** 2. Stopping qpidd (CTRL-C) on 2-votes-node, leaving 2 nodes running cman
Messages were passed to the surviving qpidd. qpid-tool says clusterSize=1.
** 3. Stopped both qpidd and cman (in this order) on 2-votes-node, leaving only 1 node running
Messages were passed to the surviving qpidd. qpid-tool says clusterSize=1.
** 4. Started both nodes with qpidd and cman. Stopped cman on 2-votes-node.
qpidd shut down on the node where cman was stopped. The other node was running both cman and qpidd, and messages were passed to this broker as before.
I still believe this bug is not fixed.
I've verified the following 2 scenarios on a 4 node cluster, this is the expected behaviour:
Scenario 1: wait on start-up
- start cman on one node
- simultaneously start qpidd on the same node with info+ logging.
> Should see "waiting for quorum" messages from qpidd.
> Client attempting to connect gets "connection refused"
- start cman on 2 other nodes
> once quorum reached should see "Ready" message
> run a client to verify qpidd is working.
Scenario 2: shutdown on loss of quorum
- stop cman on one of the nodes
- run a client against qpidd
> qpidd should shut down with a "no quorum" message
> client should be disconnected with "connection closed"
I believe this issue can be tested and closed, see comment above.
As of revision 801740 qpidd responds immediately to loss of quorum, it does not require an active client.
*** This bug has been marked as a duplicate of bug 501537 ***