This was inspired by JIRA-2992.

When a user sets up complex messaging systems involving clusters (e.g. a federation of two clusters), it is easy to write scripts that cause "race conditions" between different parts of the messaging system. These conditions are easy to mistake for software bugs -- when in fact they are unavoidable consequences of the fact that clusters take a finite amount of time to propagate information across their membership.

Here is my suggested text for an addition to the chapter on High Availability. I have already published this to the messaging group, and incorporated suggested changes.

--------- suggested text ----------------------------------------------

Multi-program race conditions can arise in multi-broker Qpid messaging systems. Occasional unexpected behaviors may occur, caused by the fact that information takes a finite amount of time to propagate across the brokers in a cluster. Here are some example scenarios, together with suggestions for avoiding surprises.

General Considerations
======================

a) There is some latency in syncing brokers in a cluster.
b) Adding or removing brokers, federated links, etc. therefore takes time to reach a consistent cluster state.
c) The timing of controlled broker shutdowns in such cases is critical. Allow time for syncing before starting shutdown procedures on any broker in the cluster.
d) For examples of specific use cases and how to mitigate the risk of inconsistent or lost state, see below.

1. Persistent cluster, and "no clean store"
=============================================

A cluster that uses the message store for persistence will have one copy of the store for each cluster broker. When the cluster is shut down, by accident or design, it needs to determine which copy of the store should be used when the cluster is restarted.

All of the new brokers need to start up with identical stores, and they need to use the store that was owned by the Last Man Standing from the previous instance of the cluster.

But when a cluster is being shut down, it takes a finite amount of time for the last remaining broker to notice that it is indeed the Last Man Standing, and to mark its store as the "clean" one.

It is possible for a test script to kill all of the cluster's brokers so quickly that the last survivor does not have sufficient time to mark its store. In that case, when you attempt to restart the cluster, you will see a "no clean store" error message, and the cluster will refuse to start. You will then have to manually mark one of the stores as clean, a procedure which is documented elsewhere.

Best Practice
-----------------------

To shut down the cluster safely, use

    qpid-cluster --all-stop

It will perform a coordinated shutdown that will leave all stores clean.

2. Federation-of-Clusters Topology Change
==============================================

Suppose that you have two clusters, A and B, each of two brokers, 1 and 2, and you want to federate the two clusters. To federate them, you will add a route whose source broker is B1 and whose destination broker is A1, using this command:

    qpid-route -s route add \
        127.0.0.1:${PORT_A_1} \
        127.0.0.1:${PORT_B_1} \
        ${EXCHANGE} \
        ${KEY} \
        -d

The "-d" makes the route durable, so that it will be restored if cluster B is shut down and then restarted.

But note that this topology change has so far been communicated only to the #1 brokers in both clusters. Information about the change will take a small amount of time to propagate to the #2 brokers in both clusters. The amount of time required will vary, depending on system load.

And now suppose that your script decides to kill broker B1 first -- before it has been able to communicate the topology change to B2. This means that broker B2 will now be the Last Man Standing in cluster B -- and its store contains no knowledge of your route! When you restart cluster B, the route will not be restored.

Best Practice
-----------------------

Before shutting down the brokers in cluster B, use these commands:

    qpid-config exchanges --bindings --broker-addr=127.0.0.1:${PORT_B_1}
    qpid-config exchanges --bindings --broker-addr=127.0.0.1:${PORT_B_2}

Use the output to confirm that your route is known to both brokers of the cluster before shutting down either.

3. Newbie Broker Update
==============================================

When a new broker is added to a cluster, it gets updated to the current cluster state with this process:

1. The newbie broadcasts an update request.
2. Veteran brokers make update offers to it.
3. It chooses one.
4. The chosen veteran sends the newbie all state-update information.

This process will take a variable amount of time, depending on cluster load. If it is interrupted by killing the veteran broker before the update is complete, the newbie will also exit. If there were only two brokers in your cluster, you no longer have a cluster!

Best Practice
-----------------------

1. When a client tries to connect to a clustered broker that is not yet updated, it will block until the broker is ready. You can use this behavior to determine when the newbie update has completed: when your client is able to connect, the newbie update is complete.
2. If the log level on the newbie broker is set to debug or greater, the newbie broker will output a line to its log that contains the string "update completed".
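The log-based check in the last Best Practice item could be scripted roughly as follows. This is only a sketch: the function name wait_for_update, the log file path passed to it, and the default 60-second timeout are illustrative assumptions, not part of Qpid. It relies on the broker logging at debug level so that the "update completed" line is actually written.

```shell
#!/bin/sh
# Sketch only: poll a broker log until the "update completed" string appears.
# wait_for_update, the log path, and the timeout are illustrative assumptions.
wait_for_update() {
    log_file=$1          # path to the newbie broker's log (hypothetical)
    timeout=${2:-60}     # seconds to wait before giving up
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # grep -q exits 0 as soon as the string is found in the log
        if grep -q "update completed" "$log_file" 2>/dev/null; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # Timed out: the update was not observed, so the veteran should not be stopped
    return 1
}
```

A script would call this before stopping any veteran broker and treat a non-zero exit status as "not safe to proceed yet".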
In the above comment, s/JIRA-2992/JIRA QPID-2992
Bumped back in for 2.2.3.
Content incorporated here: http://documentation-devel.engineering.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2/html-single/Messaging_Installation_and_Configuration_Guide/index.html#sect-Avoiding_Race_Conditions_in_Clusters
Generally I like the change. I propose a cosmetic change to the hostnames.

alpha] qpid-route case

    qpid-route -s route add \
        127.0.0.1:${PORT_A_1} \
        127.0.0.1:${PORT_B_1} \
        ${EXCHANGE} \
        ${KEY} \
        -d

should become:

    qpid-route -s route add \
        ${HOST_A_1}:${PORT_A_1} \
        ${HOST_B_1}:${PORT_B_1} \
        ${EXCHANGE} \
        ${KEY} \
        -d

beta] qpid-config case

    qpid-config exchanges --bindings --broker-addr=127.0.0.1:${PORT_B_1}
    qpid-config exchanges --bindings --broker-addr=127.0.0.1:${PORT_B_2}

should become:

    qpid-config exchanges --bindings --broker-addr=${HOST_B_1}:${PORT_B_1}
    qpid-config exchanges --bindings --broker-addr=${HOST_B_2}:${PORT_B_2}

-> ASSIGNED
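With host and port parameterized this way, the "confirm on both brokers" check could be wrapped in a small helper. What follows is a rough sketch under stated assumptions: route_known_everywhere and the QPID_CONFIG override variable are hypothetical names, and matching the key with grep is a simple heuristic on the command's text output, not a documented interface.

```shell
#!/bin/sh
# Sketch only: confirm that every listed broker reports a binding containing
# the given key before any broker is shut down.
# route_known_everywhere and QPID_CONFIG are illustrative names (assumptions).
QPID_CONFIG=${QPID_CONFIG:-qpid-config}

route_known_everywhere() {
    key=$1
    shift
    for addr in "$@"; do
        # Same command as in the Best Practice text, run against one broker at a time.
        # grep -F treats the key as a fixed string, not a regular expression.
        if ! "$QPID_CONFIG" exchanges --bindings --broker-addr="$addr" \
                | grep -q -F -- "$key"; then
            echo "binding '$key' not yet visible on $addr" >&2
            return 1
        fi
    done
    return 0
}
```

For example, route_known_everywhere "${KEY}" "${HOST_B_1}:${PORT_B_1}" "${HOST_B_2}:${PORT_B_2}" returns zero only when both brokers of cluster B list the key, which a shutdown script could use as its gate.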
I've updated the addresses to use the generic pattern: http://documentation-devel.engineering.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2/html-single/Messaging_Installation_and_Configuration_Guide/index.html#sect-Avoiding_Race_Conditions_in_Clusters
I'm happy with the added paragraph now. Thanks for your cooperation. -> VERIFIED
MRG Messaging 2.2.3 docs have been released as of 14 November 2012; the docs are now available at https://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_MRG/