557832 – Broker boot sequence doesn't synchronize when clustered.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	Development
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Ted Ross
QA Contact:	Jeff Needle
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	501015

Reported:	2010-01-22 17:17 UTC by jrd
Modified:	2013-02-27 04:26 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-12-07 17:42:12 UTC
Target Upstream Version:
Embargoed:

Attachments
Proposed fix for hot-wiring peer state of clustered broker (7.46 KB, application/octet-stream) 2010-01-22 19:15 UTC, jrd	no flags

Description jrd 2010-01-22 17:17:02 UTC

Description of problem:

I discovered this while debugging my fix for the object-number discrepancy described in 501015.  If you start broker 1 with a data-dir, and broker 2 with no data-dir or a different one, the two brokers can (almost always) end up with different boot sequences.  That means if you cluster these two together, when the updatee gets updates, he'll use different OIDs than the master.
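
To make the failure mode concrete, here is a toy model of it in Python. It is not qpidd code: the `ToyBroker` and `ObjectId` names, the idea that a broker with a data-dir resumes from a persisted counter, and the two-field OID layout are all simplifying assumptions made for illustration only.

```python
from typing import NamedTuple, Optional


class ObjectId(NamedTuple):
    # Simplified, hypothetical OID: real management object IDs carry more fields.
    boot_sequence: int
    object_number: int


class ToyBroker:
    """Toy model of a broker that stamps its boot sequence into every OID."""

    def __init__(self, persisted_boot_sequence: Optional[int] = None):
        # Assumption: a broker with a data-dir resumes from a persisted
        # counter, while a broker without one starts counting from scratch.
        previous = persisted_boot_sequence if persisted_boot_sequence is not None else 0
        self.boot_sequence = previous + 1
        self._next_object_number = 1

    def allocate(self) -> ObjectId:
        """Hand out the next OID for a newly created management object."""
        oid = ObjectId(self.boot_sequence, self._next_object_number)
        self._next_object_number += 1
        return oid


master = ToyBroker(persisted_boot_sequence=4)   # started with a data-dir
updatee = ToyBroker()                           # started without one

master_oid = master.allocate()
updatee_oid = updatee.allocate()
print(master_oid == updatee_oid)  # prints False: same object, different OIDs
```

In this model both brokers create "the same" first object, yet the OIDs differ because the boot-sequence component differs.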

Version-Release number of selected component (if applicable):

trunk

How reproducible:

always

Steps to Reproduce:
1.  bring up two brokers, one with data dir and one without
2.  run Ted's verify_cluster_objects script
  
Actual results:

Errors spew out

Expected results:

Silence

Additional info:

I don't know whether it's expected to be legal to run brokers this way, but it's easy enough to do it inadvertently, and nothing checks.  We should either make them really sync up or flame out if they aren't able to.
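
The two options above (sync up, or flame out) can be sketched as a single check. This is a hedged illustration in Python, not the attached patch: `reconcile_boot_sequence`, `BootSequenceMismatch`, and the `allow_sync` switch are hypothetical names, and the real broker would have to do this as part of the cluster update.

```python
class BootSequenceMismatch(Exception):
    """Raised when a joining broker cannot agree with the cluster (hypothetical)."""


def reconcile_boot_sequence(local: int, cluster: int, allow_sync: bool) -> int:
    """Return the boot sequence the joining broker should use.

    If the values already agree, nothing changes. Otherwise either adopt
    the cluster's value (sync up) or refuse to join (flame out).
    """
    if local == cluster:
        return local
    if allow_sync:
        # Option 1: really sync up by adopting the cluster's boot sequence.
        return cluster
    # Option 2: fail loudly instead of silently diverging.
    raise BootSequenceMismatch(
        f"local boot sequence {local} != cluster boot sequence {cluster}"
    )
```

Either branch avoids the silent divergence described in this report; which one is appropriate depends on whether mixed data-dir setups are meant to be supported.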

Comment 1 jrd 2010-01-22 17:18:30 UTC

Assigning to myself and tying to 501015

Comment 2 jrd 2010-01-22 19:15:06 UTC

Created attachment 386225
Proposed fix for hot-wiring peer state of clustered broker

This appears to fix the bugs.  It's likely that there are cleaner ways to manage some of the C++-isms.

Comment 3 jrd 2010-01-22 20:01:05 UTC

Added to jira.  See https://issues.apache.org/jira/browse/QPID-2357

Comment 4 Alan Conway 2010-01-25 15:39:19 UTC

What does "hotwire" refer to? The naming doesn't ring a bell for me.

QPID_LOG: don't put __FILE__/__LINE__ in the message, the logging system will
add them if the user configures --log-source=yes.

UpdateClient::update: move all the hotwire code into the updateHotwire
function to avoid clutter in update(). It's only 3 lines now but
likely to expand in future.

UpdateClient::update - why are you setting the updater's numbers here?
That looks wrong to me. In general giving an update does not modify
the updater's state, just sends it to the updatee.

Automate the test: add an automated test to cluster_tests.py that sets
up some state and does an update that fails before your patch and
passes after. This will be a very valuable test to have; we can extend it as
we fix other object-id issues going forward.
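
As a rough idea of the assertion such a test could make, here is a sketch in Python of comparing two brokers' objects by name. It does not use the cluster_tests.py framework or any QMF API: `find_oid_mismatches` is a hypothetical helper, and the name-to-OID dictionaries are assumed to have been collected from each broker by whatever means the test harness provides.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Mismatch = Tuple[str, Optional[Hashable], Optional[Hashable]]


def find_oid_mismatches(
    objects_a: Dict[str, Hashable],
    objects_b: Dict[str, Hashable],
) -> List[Mismatch]:
    """Compare two name -> OID maps and list every disagreement.

    An object present on only one broker is reported with None for the
    side that lacks it.
    """
    mismatches: List[Mismatch] = []
    for name in sorted(set(objects_a) | set(objects_b)):
        oid_a = objects_a.get(name)
        oid_b = objects_b.get(name)
        if oid_a != oid_b:
            mismatches.append((name, oid_a, oid_b))
    return mismatches
```

A test would set up some state, start a second broker so that it receives an update, and then assert that the returned list is empty.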

Comment 6 Alan Conway 2010-03-31 12:26:13 UTC

Fixed in r904268

Comment 9 Jan Sarenik 2010-04-14 11:48:48 UTC

On Fedora Rawhide I compile current trunk qpid, run

 # qpidd --cluster-name=ahoj --data-dir=dir1 --auth=no
 # qpidd -p0 --cluster-name=ahoj --data-dir=dir2 --auth=no
 # python cpp/src/tests/verify_cluster_objects

There is no output as long as I am not running any QMF-mangling
tools (e.g. qpid-tool or qpid-config) while the verify test
is running.

I will verify on RHEL and current packages.

Comment 10 Jan Sarenik 2010-04-14 12:53:39 UTC

[root@rhel5x ~]# python verify_cluster_objects --verbose 1
Connecting to the cluster...
    Broker connected at: 10.34.31.200:5672
Loading management data from nodes...
Verifying objects based on object name...
Success

Comment 11 Jan Sarenik 2010-04-14 13:13:04 UTC

[root@rhel5 ~]# python verify_cluster_objects --verbose 1
Connecting to the cluster...
    Broker connected at: 10.34.31.200:5672
    Broker connected at: 10.34.31.200:43258
Loading management data from nodes...
Verifying objects based on object name...
Success

Comment 13 Jan Sarenik 2010-04-14 13:39:50 UTC

This is how I test (store module has to be disabled for --no-data-dir to work)
on RHEL5:

 # qpidd --cluster-name=ahoj --auth=no --data-dir=data1
 # qpidd --cluster-name=ahoj --auth=no --no-data-dir -p0
 # qpid-tool
 qpid: schema
   ...
 qpid: show org.apache.qpid.cluster:cluster
   ...
 qpid: quit
 Exiting...
 # qpid-config
   ...
 # qpid-config add queue ahoj
 # python verify_cluster_objects --verbose 1

And this is what I am getting:
[root@rhel5x ~]# python verify_cluster_objects --verbose 1
Connecting to the cluster...
    Broker connected at: 10.34.31.200:5672
    Broker connected at: 10.34.31.200:58104
Loading management data from nodes...
Verifying objects based on object name...
Success

VERIFIED on both i386 and x86_64 RHEL5 (RHN-updated)

Comment 14 jrd 2011-06-29 17:06:20 UTC

Not sure why this was assigned to me.  Ted?
