Description of problem:
--cluster-size N: wait for at least N initial members before recovering from the store and listening for client connections. This behavior is problematic and may hold up start scripts: if qpidd is run with the -d (--daemon) and --cluster-size N options, it will not return until N members have joined.

How reproducible: always

Steps to Reproduce:
1. Start 3 brokers with: qpidd -d --cluster-size 3 --cluster-name foo

Actual results: The first 2 qpidd invocations do not return until the 3rd broker starts.

Expected results: All qpidd invocations return immediately.

Additional info:
We don't need to hold up broker startup until after the cluster has exchanged persistence parameters. Each node can assume it can go ahead with the appropriate action based on its store state:
- empty: do nothing
- clean: recover from store
- dirty: push current store

That means we can let broker init complete, and qpidd -d return, before cluster init is complete. We stall clients until cluster init is complete, so if any inconsistencies are discovered during cluster init we can still shut down before touching the clean DBs.

Also need to verify/fix that we don't overwrite a previously pushed DB and that we don't push empty DBs.
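The per-node decision described above can be sketched as follows. This is a minimal illustrative model, not the actual qpidd implementation; the names StoreState and initial_store_action are hypothetical.

```python
from enum import Enum

class StoreState(Enum):
    """Possible states of a node's persistent store at startup."""
    EMPTY = "empty"
    CLEAN = "clean"
    DIRTY = "dirty"

def initial_store_action(state: StoreState) -> str:
    """Decide what a joining node does with its store, per the scheme
    in the report: the node need not wait for other members first."""
    if state is StoreState.EMPTY:
        return "do nothing"
    if state is StoreState.CLEAN:
        return "recover from store"
    # Dirty store: this node was the last active one, so its data wins.
    return "push current store"
```

Because each node can act on its own store state, broker init (and the qpidd -d fork) can complete before cluster init does.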
Fixed on my private branch, waiting for 0.6 release to commit to trunk.
fixed in r896536
The issue seems to be resolved, but there is one oddity that needs to be confirmed. I can see that brokers run on different hosts no longer wait for each other when the broker's -d and --cluster-size=N options are given. The strange behavior appears when the cluster width is lower than N: in that case clients hang, tested with qpid-cluster and perftest. This behavior might be acceptable, but it needs to be understood and documented. Alan, could you possibly review my observation, please?

Additional info:
When multiple brokers are run with -d --cluster-size=N and the width of the cluster is lower than N, no broker accepts a client's request for a connection/session. For instance, perftest hangs in qpid::client::StateManager::waitFor():

Thread 2 (Thread 0x41ff7940 (LWP 2971)):
#0 0x00000033414d4018 in epoll_wait () from /lib64/libc.so.6
#1 0x00002b6a90b1f5af in qpid::sys::Poller::wait ()
#2 0x00002b6a90b1ffd2 in qpid::sys::Poller::run ()
#3 0x00002b6a90b161ca in ?? () from /usr/lib64/libqpidcommon.so.2
#4 0x0000003341c06617 in start_thread () from /lib64/libpthread.so.0
#5 0x00000033414d3c2d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b6a910a4ca0 (LWP 2970)):
#0 0x0000003341c0ad09 in pthread_cond_wait@@GLIBC_2.3.2 ()
#1 0x00002b6a90753321 in qpid::client::StateManager::waitFor ()
#2 0x00002b6a907084f1 in qpid::client::ConnectionHandler::waitForOpen ()
#3 0x00002b6a9071711b in qpid::client::ConnectionImpl::open ()
#4 0x00002b6a907070cb in qpid::client::Connection::open ()
#5 0x0000000000414887 in ?? ()
#6 0x000000000040ca3e in __cxa_pure_virtual ()
#7 0x000000334141d994 in __libc_start_main () from /lib64/libc.so.6
#8 0x000000000040aac9 in __cxa_pure_virtual ()
#9 0x00007fffb7c372d8 in ?? ()
#10 0x0000000000000000 in ?? ()

If --cluster-size is not given, no such behavior is seen.

Package set: qpid-cpp-*-0.7.935473-1.el5

Could you possibly review and comment, please?
That is the expected behaviour. Brokers in an incomplete cluster put their clients on hold until the cluster is complete. I have added a documentation BZ to explain this.
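The behaviour confirmed above (brokers in an incomplete cluster hold client connections until the cluster reaches the requested size) can be modelled with a condition-variable gate, much like the pthread_cond_wait seen in the client backtrace. This is a hypothetical sketch; ClusterGate and its methods are illustrative names, not qpid APIs.

```python
import threading

class ClusterGate:
    """Model of the observed behaviour: client connection attempts
    block until `size` cluster members have joined."""

    def __init__(self, size: int):
        self.size = size
        self.members = 0
        self.cond = threading.Condition()

    def member_joined(self) -> None:
        """Record a new member; wake waiting clients once complete."""
        with self.cond:
            self.members += 1
            if self.members >= self.size:
                self.cond.notify_all()

    def wait_for_complete(self, timeout=None) -> bool:
        """Block like qpid::client::StateManager::waitFor in the
        backtrace; return False if the timeout expires first."""
        with self.cond:
            return self.cond.wait_for(
                lambda: self.members >= self.size, timeout)
```

With size=3 and only two members joined, wait_for_complete times out, which matches perftest hanging against an incomplete cluster.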
The issue has been fixed, verified on RHEL 5.5 i386 / x86_64 with package set qpid-cpp-*-0.7.935473-1.el5. -> VERIFIED Documentation of the behavior is tracked as bug 592358.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Previously, when multiple brokers were run with the "-d --cluster-size=N" parameters specified, they would hang until N members had joined the cluster. With this update, brokers no longer wait for N members to join the cluster before completing startup. Instead, brokers block client connections until the cluster is complete.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html