Created attachment 385771 [details]
Description of how to reproduce bug

A cluster can behave badly after recovering all nodes from persistent stores. In particular, the following error message is seen on one or more nodes:

2010-01-20 14:09:15 error Execution exception: invalid-argument: anonymous.f06a1d50-05ae-401f-bfa7-286758fab447: confirmed < (5+0) but only sent < (4+0) (qpid/SessionState.cpp:151)
2010-01-20 14:09:15 critical cluster(10.16.16.49:15910 READY/error) local error 699 did not occur on member 10.16.16.49:15961: invalid-argument: anonymous.f06a1d50-05ae-401f-bfa7-286758fab447: confirmed < (5+0) but only sent < (4+0) (qpid/SessionState.cpp:151)
2010-01-20 14:09:15 critical Error delivering frames: local error did not occur on all cluster members : invalid-argument: anonymous.f06a1d50-05ae-401f-bfa7-286758fab447: confirmed < (5+0) but only sent < (4+0) (qpid/SessionState.cpp:151) (qpid/cluster/ErrorCheck.cpp:89)
2010-01-20 14:09:15 notice cluster(10.16.16.49:15910 LEFT/error) leaving cluster clusterX

Revisions:
Qpid: 900860
Store: 3809

Method for reproduction is described in the attached file.
While testing on r905680, I see that the behavior of this bug has changed a little. Now, when the three nodes are restarted after the shutdown, the second and third nodes fail during cluster catch-up with:

error Exchange already created: nfl.scores (MessageStoreImpl.cpp:529)
The issue here is a broker with a clean store trying to join a cluster that is already running. The decision to push or recover the store is made early (in Cluster::Cluster) based only on whether the store is clean (recover) or dirty (push). This is before initMapCompleted, when the broker learns the disposition of the other cluster members and their stores. So if a broker with a clean store joins a running cluster, it has already recovered from its store by the time it discovers that there are already active cluster members; the current code attempts an update in this case, which fails as above.

At the least there should be a better error message saying "clean store can't join cluster, delete my store", but what we really want in this scenario is for the broker to ditch its store and join. We can probably break the initialStatus negotiation into two phases:

1. In the ctor, wait until we have status from all _currently_ running brokers and make the push/recover decision based on that.
2. After the ctor, continue to full completion, i.e. wait for the N brokers given by --cluster-size=N.

There is also a need for doc/release notes to explain the use of --cluster-size for persistent clusters, and perhaps we should enforce --cluster-size > 1 for persistent brokers; a sketch of the intended usage follows below.
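To make the point about --cluster-size concrete, here is a rough sketch (not taken from the attached reproduction) of how a three-node persistent cluster might be started so that every broker waits for the full initial membership before deciding whether to recover from its own store or request an update. The cluster name and data directories are made up for illustration; option spellings should be checked against qpidd --help for the version in use.

# Run one broker per node (three nodes shown here).
# --cluster-size=3 makes each broker wait for initial status from all 3
# members before choosing between store recovery and a cluster update.
qpidd --cluster-name=clusterX --cluster-size=3 --data-dir=/var/lib/qpidd/nodeA --daemon
qpidd --cluster-name=clusterX --cluster-size=3 --data-dir=/var/lib/qpidd/nodeB --daemon
qpidd --cluster-name=clusterX --cluster-size=3 --data-dir=/var/lib/qpidd/nodeC --daemon

With this arrangement a broker restarted with a clean store does not recover unilaterally; it learns from the other members' initial status whether it should take an update instead.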
Alan, where did we end up on this one?
I haven't done anything on this beyond what it says in comment 2. Why is it marked modified?
The fix for this is to drop the --cluster-size option. The only function it serves is to allow multiple clean brokers to recover from store rather than having the first broker recover and the rest get an update. This isn't an important enough optimization to justify the extra configuration complexity.
Comment 5 is not correct; we do need --cluster-size. Comment 2 has the right solution.
The binding to <unknown> seems to be independent of the persistent-cluster start-up issue; it has been assigned to bug 572221.
Fixed in r922412
Bug reproduced, verified on RHEL 5.5 - i386/x86_64:

# rpm -qa | grep -E '(ais|qpid)'
qpid-cpp-client-0.7.935473-1.el5
qpid-cpp-server-xml-0.7.935473-1.el5
qpid-tools-0.7.934605-2.el5
openais-0.80.6-16.el5_5.1
qpid-cpp-server-ssl-0.7.935473-1.el5
qpid-cpp-server-cluster-0.7.935473-1.el5
qpid-cpp-client-ssl-0.7.935473-1.el5
qpid-java-common-0.7.934605-1.el5
qpid-java-client-0.7.934605-1.el5
qpid-cpp-server-store-0.7.935473-1.el5
qpid-cpp-server-0.7.935473-1.el5
python-qpid-0.7.938298-1.el5

--> VERIFIED

Opened a new bug for clarification of the "cluster-size" option:
https://bugzilla.redhat.com/show_bug.cgi?id=592995