Description of problem:
--cluster-size N: wait for at least N initial members before recovering from the store and listening for client connections. This behavior is problematic and may hold up start scripts: if qpidd is run with the -d (--daemon) and --cluster-size N options, it will not return until N members have joined.

How reproducible: always

Steps to Reproduce:
1. Start 3 brokers with: qpidd -d --cluster-size 3 --cluster-name foo

Actual results: The first 2 qpidd invocations do not return until the 3rd broker starts.

Expected results: All qpidd invocations return immediately.

Additional info:
We don't need to hold up broker startup until after the cluster has exchanged persistence parameters. Each node can assume it can go ahead with the appropriate action based on its store state:
- empty: do nothing
- clean: recover from store
- dirty: push current store

That means we can let broker init complete, and qpidd -d return, before cluster init is complete. We stall clients until cluster init is complete, so if any inconsistencies are discovered during cluster init we can still shut down before touching the clean DBs.

Also need to verify/fix that we don't overwrite a previously pushed DB and that we don't push empty DBs.
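The per-node decision described above can be sketched as follows. This is a minimal illustrative model, not the actual qpidd implementation; the names StoreState and initial_store_action are hypothetical.

```python
from enum import Enum

class StoreState(Enum):
    """Possible states of a node's persistent store at startup."""
    EMPTY = "empty"
    CLEAN = "clean"
    DIRTY = "dirty"

def initial_store_action(state: StoreState) -> str:
    """Decide what a joining node does with its store, per the scheme
    in the report: the node need not wait for other members first."""
    if state is StoreState.EMPTY:
        return "do nothing"
    if state is StoreState.CLEAN:
        return "recover from store"
    # Dirty store: this node was the last active one, so its data wins.
    return "push current store"
```

Because each node can act on its own store state, broker init (and the qpidd -d fork) can complete before cluster init does.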
Fixed on my private branch, waiting for 0.6 release to commit to trunk.
fixed in r896536
The issue seems to be resolved, but there is one oddity that needs to be confirmed. I can see that brokers run on different hosts no longer wait for each other when the broker's -d and --cluster-size=N options are given. The strange behavior appears when the cluster width is lower than N: in that case clients hang, tested with qpid-cluster and perftest. This behavior might be acceptable, but it needs to be understood and documented. Alan, could you possibly review my observation, please?

Additional info:
When multiple brokers are run with -d --cluster-size=N and the width of the cluster is lower than N, no broker accepts a client's request for a connection/session. For instance, perftest hangs in qpid::client::StateManager::waitFor():

Thread 2 (Thread 0x41ff7940 (LWP 2971)):
#0 0x00000033414d4018 in epoll_wait () from /lib64/libc.so.6
#1 0x00002b6a90b1f5af in qpid::sys::Poller::wait ()
#2 0x00002b6a90b1ffd2 in qpid::sys::Poller::run ()
#3 0x00002b6a90b161ca in ?? () from /usr/lib64/libqpidcommon.so.2
#4 0x0000003341c06617 in start_thread () from /lib64/libpthread.so.0
#5 0x00000033414d3c2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b6a910a4ca0 (LWP 2970)):
#0 0x0000003341c0ad09 in pthread_cond_wait@@GLIBC_2.3.2 ()
#1 0x00002b6a90753321 in qpid::client::StateManager::waitFor ()
#2 0x00002b6a907084f1 in qpid::client::ConnectionHandler::waitForOpen ()
#3 0x00002b6a9071711b in qpid::client::ConnectionImpl::open ()
#4 0x00002b6a907070cb in qpid::client::Connection::open ()
#5 0x0000000000414887 in ?? ()
#6 0x000000000040ca3e in __cxa_pure_virtual ()
#7 0x000000334141d994 in __libc_start_main () from /lib64/libc.so.6
#8 0x000000000040aac9 in __cxa_pure_virtual ()
#9 0x00007fffb7c372d8 in ?? ()
#10 0x0000000000000000 in ?? ()

If --cluster-size is not given, no such behavior is seen.

Package set: qpid-cpp-*-0.7.935473-1.el5

Could you possibly review and comment, please?
That is the expected behaviour. Brokers in an incomplete cluster put their clients on hold until the cluster is complete. I have added a documentation BZ to explain this.
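The behaviour confirmed above (brokers in an incomplete cluster hold client connections until the cluster reaches the requested size) can be modelled with a condition-variable gate, much like the pthread_cond_wait seen in the client backtrace. This is a hypothetical sketch; ClusterGate and its methods are illustrative names, not qpid APIs.

```python
import threading

class ClusterGate:
    """Model of the observed behaviour: client connection attempts
    block until `size` cluster members have joined."""

    def __init__(self, size: int):
        self.size = size
        self.members = 0
        self.cond = threading.Condition()

    def member_joined(self) -> None:
        """Record a new member; wake waiting clients once complete."""
        with self.cond:
            self.members += 1
            if self.members >= self.size:
                self.cond.notify_all()

    def wait_for_complete(self, timeout=None) -> bool:
        """Block like qpid::client::StateManager::waitFor in the
        backtrace; return False if the timeout expires first."""
        with self.cond:
            return self.cond.wait_for(
                lambda: self.members >= self.size, timeout)
```

With size=3 and only two members joined, wait_for_complete times out, which matches perftest hanging against an incomplete cluster.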
The issue has been fixed, verified on RHEL 5.5 i386 / x86_64 with package set qpid-cpp-*-0.7.935473-1.el5. -> VERIFIED Documentation of the behavior is tracked as bug 592358.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, when multiple brokers were run with the "-d --cluster-size=N" parameters specified, they would hang until N members had joined the cluster. With this update, brokers no longer wait for N members to join the cluster before completing startup. Instead, brokers block client connections until the cluster is complete.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html