Bug 471290

Summary: qpidd should monitor cman quorum status.
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: beta
Hardware: All
OS: Linux
Reporter: Alan Conway <aconway>
Assignee: Alan Conway <aconway>
QA Contact: MRG Quality Engineering <mrgqe-bugs>
CC: davids
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Target Milestone: 1.2
Doc Type: Bug Fix
Last Closed: 2009-09-11 12:52:04 UTC
Attachments:
- cluster configuration used for test scenario 1 and 2
- cluster configuration used for test scenario 3 and 4

Description Alan Conway 2008-11-12 21:04:14 UTC
Description of problem:

The clustered qpidd daemon needs to monitor cluster quorum status using libcman and shut down if the node loses quorum. cman fencing can take several seconds to power down a node, so there is a window in which messages could be lost if qpidd does not monitor quorum itself and shut down immediately.
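
For reference, the check involved is roughly the following, assuming the standard libcman C API from cman-devel (cman_init, cman_is_quorate, cman_finish); this is only an illustrative sketch, not the actual cluster plugin code:

    #include <libcman.h>
    #include <cstdio>

    // Returns true if this node currently belongs to a quorate cluster.
    // The broker would refuse to serve clients (or shut down) while this is false.
    bool haveQuorum() {
        cman_handle_t h = cman_init(0);          // connect to the local cman daemon
        if (!h) {                                // cman unreachable: fail cleanly, don't crash
            std::perror("cman_init");
            return false;
        }
        bool quorate = cman_is_quorate(h) > 0;   // 1 = quorate, 0 = inquorate
        cman_finish(h);                          // release the handle
        return quorate;
    }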

Testing info:

Start a cluster using cman to configure quorum, then verify that qpidd shuts down immediately if one node is disconnected from the cluster.

Comment 1 Alan Conway 2008-11-18 21:01:15 UTC
Implemented but not properly tested. Needs testing in a full cman cluster. 

If --cluster-cman is enabled but cman cannot be contacted for some reason, the broker seg-faults rather than exiting cleanly.


(03:35:33 PM) Alan Conway: astitcher_wfh: if the cluster plugin throws during initialization, I'm seeing a core dump in ~PollerHandlerPrivate trying to delete a locked mutex. 
(03:36:39 PM) astitcher_wfh: aconway: interesting
(03:36:41 PM) Alan Conway: astitcher_wfh: ~PollerHandle() calls PHDeletionManager.markForDeletion(impl) with impl->lock() held. That seems incorrect if markForDeletion may delete the object.
(03:41:51 PM) Alan Conway: astitcher_wfh: maybe I'm abusing the poller model; this is happening in a PollableQueue that is constructed and then deleted because of an exception. I get the crash whether or not I start() the queue.
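
The crash mode described above boils down to destroying a mutex that is still locked, which is undefined behaviour. A generic illustration of the failure (not qpid's actual PollerHandle/PHDeletionManager code):

    #include <mutex>

    struct Impl {
        std::mutex lock;
        // ~Impl() runs ~mutex(); it must never be reached while 'lock' is held.
    };

    void deleteWhileLocked() {
        Impl* impl = new Impl;
        impl->lock.lock();   // the destruction path still holds the lock...
        delete impl;         // ...so ~mutex() runs on a locked mutex: undefined behaviour, often a crash
    }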

Comment 2 Alan Conway 2008-11-26 22:46:21 UTC
The crash described in comment 1 above has been fixed. Still need to test in a cman cluster.

To test: 

Set up a cluster of 3 nodes. The config file can be created with system-config-cluster.
Check the Cluster Suite docs for more details; in particular, the setup seems to be picky about network configuration. All we need for this test is to get cman running on 3 nodes; we don't need any other cluster services.

Try the following scenarios:

- start cman on node 1 only. 
- start qpidd on node 1 with info logging. You should see a "waiting for cluster" log message.
- start a qpidd client. The client should hang.
- start cman on node 2; qpidd on node 1 should start processing messages and the client should complete normally.
- start qpidd on node 2; it should start normally and clients should work normally.
- stop cman on node 2.
- run clients against qpidd on each node. qpidd should shut down as soon as the clients connect; the clients should not be able to send or receive any messages.

Comment 4 David Sommerseth 2008-12-05 20:47:24 UTC
Got a cman setup up and running. Let the whole cluster settle with 3 nodes, then tried four different test scenarios to see if I had understood things correctly.

The expected_votes was set to 5, since 3/5 is a majority of a healthy cluster.
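
For reference, assuming the usual cman rule that quorum is a strict majority of the expected votes:

    quorum = floor(expected_votes / 2) + 1 = floor(5 / 2) + 1 = 3 votes

so a single node with 1 vote should not be quorate, while three nodes (3 votes) together should be.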

** Test scenario 1:
Shut down the cman daemons on two of the nodes. Waited. Started qpidd with the following command line:

qpidd -t --no-module-dir --load-module=/usr/lib64/qpid/daemon/cluster.so --cluster-cman --cluster-name=mrgtest

The daemon did not log the expected "waiting for cluster" message. Started perftest against this node, and it managed to send and receive messages.

At this point, only one node was running cman and this node was also running qpidd.


** Test scenario 2:
Started cman on the two other nodes (so a total of 3 qpidd processes were running). Waited for things to settle and restarted qpidd with the same command line as above. No "waiting for cluster" message was seen, only "first in cluster".

Tried to run perftest against the broker on this node, and perftest did not hang, as was expected.

** Test scenario 3:
Killed one qpidd, to see if the others followed.  None followed, and I had two qpidd still alive.

** Test scenario 4:
Restarted the killed qpidd. Shut down cman on the node. Those two qpidd processes received and logged the following message line:

2008-dec-05 15:41:31 info 2.0.0.0:5623(READY) member update: 1.0.0.0:7017(member) 2.0.0.0:5623(member) 3.0.0.0:6397(member) 


As I understand the bug description, this is not what we would expect. For scenarios 3 and 4, I understood it to mean that all running qpidd processes would shut down. This did not happen.

For scenarios 1 and 2, I understood it to mean that the perftest program would "hang" until all three nodes were available. This did not happen.

Comment 5 David Sommerseth 2008-12-05 20:48:25 UTC
Created attachment 325905 [details]
cluster configuration used for test scenario 1 and 2

Comment 6 David Sommerseth 2008-12-05 20:49:38 UTC
Created attachment 325906 [details]
cluster configuration used for test scenario 3 and 4

Comment 7 David Sommerseth 2008-12-08 17:36:28 UTC
Just some version info: the RPMs were tagged with SVN rev. 722891, the latest available on Fri, Dec 5.

Comment 8 David Sommerseth 2008-12-10 17:06:15 UTC
Retested this bug, this time on a 2-node cluster. One of the nodes got a vote value of 2, and the second one got a vote value of 1. The cluster is set up to require 3 votes to be healthy, which means that if the node with the 2 votes disappears, the cluster is out of balance and the remaining node no longer has quorum.
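
With these votes the quorum works out to (again assuming the usual cman strict-majority rule):

    quorum = floor(expected_votes / 2) + 1 = floor(3 / 2) + 1 = 2 votes

so the 2-vote node on its own remains quorate, while the 1-vote node on its own does not, and its qpidd would be expected to shut down.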

I ran these test scenarios, using perftest on the client side.  I waited a few minutes between each action and tried perftest twice (with some time in between) after a status change.

** 1.  Smoke-test - Starting 2 nodes
   No problem, both nodes receive messages. qpid-tool says clusterSize=2.

** 2.  Stopping qpidd (CTRL-C) on 2-votes-node, leaving 2 nodes running cman
   Messages were passed to the qpidd that was still alive. qpid-tool says clusterSize=1.

** 3.  Stopped both qpidd and cman (in this order) on 2-votes-node, 
       leaving only 1 node running
   Messages were passed to the qpidd that was still alive. qpid-tool says clusterSize=1.

** 4.  Started both nodes with qpidd and cman.  Stopped cman on 2-votes-node.
   qpidd shut down on the node where cman was stopped. The other node was
   running both cman and qpidd, and messages were passed to this broker as before.

I still believe this bug is not fixed.

Comment 9 Alan Conway 2009-02-13 20:09:14 UTC
I've verified the following two scenarios on a 4-node cluster; this is the expected behaviour:

Scenario 1: wait on start-up 
- start cman on one node
- simultaneously start qpidd on the same node with info+ logging. 
 > Should see "waiting for quorum" messages from qpidd.
 > Client attempting to connect gets "connection refused"
- start cman on 2 other nodes
 > once quorum reached should see "Ready" message
 > run a client to verify qpidd is working.

Scenario 2: shutdown on loss of quorum

 - stop cman on one of the nodes
 - run a client against qpidd
  > qpidd should shut down with a "no quorum" message
  > client should be disconnected with "connection closed"

Comment 10 Alan Conway 2009-05-21 12:13:20 UTC
I believe this issue can be tested and closed; see the comment above.

Comment 11 Alan Conway 2009-09-08 21:32:08 UTC
As of revision 801740, qpidd responds immediately to loss of quorum; it does not require an active client.
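
Presumably the cluster code now watches cman's notification file descriptor, so a quorum change is seen as soon as cman reports it rather than only when a client connection triggers a check. A rough sketch of that event-driven approach, assuming the standard libcman API (cman_start_notification, cman_get_fd, cman_dispatch) and a hypothetical shutdownBroker() hook; this is not the actual qpidd code:

    #include <libcman.h>
    #include <poll.h>
    #include <cstdio>

    extern void shutdownBroker();   // hypothetical hook into the broker's orderly shutdown

    // Invoked from cman_dispatch() whenever cman reports a cluster state change.
    static void onClusterEvent(cman_handle_t h, void* /*priv*/, int reason, int /*arg*/) {
        if (reason == CMAN_REASON_STATECHANGE && cman_is_quorate(h) == 0) {
            std::fprintf(stderr, "cluster quorum lost, shutting down\n");
            shutdownBroker();
        }
    }

    void watchQuorum(cman_handle_t h) {
        cman_start_notification(h, onClusterEvent);        // register for callbacks
        struct pollfd pfd = { cman_get_fd(h), POLLIN, 0 };
        for (;;) {                                         // qpidd would hand this fd to its Poller instead
            if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                cman_dispatch(h, CMAN_DISPATCH_ALL);       // delivers onClusterEvent()
        }
    }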

Comment 12 Alan Conway 2009-09-11 12:52:04 UTC

*** This bug has been marked as a duplicate of bug 501537 ***