Bug 557832

Summary:

Broker boot sequence doesn't synchronize when clustered.

Product:

Red Hat Enterprise MRG

Reporter:

jrd <jrd>

Component:

qpid-cpp

Assignee:

Ted Ross <tross>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Jeff Needle <jneedle>

Severity:

medium

Priority:

low

Version:

Development

CC:

aconway

Hardware:

All

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2012-12-07 17:42:12 UTC

Bug Blocks:

501015

Attachments:

Description	Flags
Proposed fix for hot-wiring peer state of clustered broker	none

Description jrd 2010-01-22 17:17:02 UTC

Description of problem:

I discovered this while debugging my fix for the object-number discrepancy described in 501015.  If you start broker 1 with a data-dir, and broker 2 with no data-dir or a different one, the two brokers (almost always) end up with different boot sequences.  That means if you cluster these two together, when the updatee gets updates, he'll use different OIDs than the master.
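
To make the mismatch concrete, here is a minimal sketch. It assumes, for illustration only, that an object ID is simply the pair of the broker's boot sequence and a per-broker object number. The ObjectId tuple, the make_oid helper and the sequence values are hypothetical and are not qpid-cpp API.

```python
from collections import namedtuple

# Hypothetical, simplified model of a management object ID (assumption:
# the ID embeds the boot sequence of the broker that created it).
ObjectId = namedtuple("ObjectId", ["boot_sequence", "object_number"])


def make_oid(boot_sequence, object_number):
    """Build an ID for an object created by a broker with this boot sequence."""
    return ObjectId(boot_sequence, object_number)


# Assumed values: broker 1 kept state in its data-dir and has restarted
# before, while broker 2 has no data-dir and starts from a fresh sequence.
master_oid = make_oid(boot_sequence=3, object_number=17)
updatee_oid = make_oid(boot_sequence=1, object_number=17)

# Same object number, but the IDs still differ because the boot sequences do.
ids_match = master_oid == updatee_oid
```

Under this model the object numbers agree and the IDs still do not, which is the kind of discrepancy the verify script would report.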

Version-Release number of selected component (if applicable):

trunk

How reproducible:

always

Steps to Reproduce:
1.  Bring up two brokers, one with a data dir and one without.
2.  Run Ted's verify_cluster_objects script.
  
Actual results:

Errors spew out

Expected results:

Silence

Additional info:

I don't know whether it's expected to be legal to run brokers this way, but it's easy enough to do it inadvertently, and nothing checks.  We should either make them really sync up or flame out if they aren't able to.
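
A rough sketch of the "flame out" option, assuming each broker can report a mapping from object name to object ID. The function names and the sample data are hypothetical; this is not how qpidd or verify_cluster_objects is implemented.

```python
def find_id_mismatches(master_ids, updatee_ids):
    """Return the sorted names whose IDs differ or that exist on one broker only."""
    names = set(master_ids) | set(updatee_ids)
    return sorted(n for n in names if master_ids.get(n) != updatee_ids.get(n))


def require_synced(master_ids, updatee_ids):
    """Fail loudly instead of letting the brokers run with diverging IDs."""
    mismatches = find_id_mismatches(master_ids, updatee_ids)
    if mismatches:
        raise RuntimeError("object IDs out of sync for: " + ", ".join(mismatches))


# Hypothetical data: IDs are (boot sequence, object number) pairs.
master = {"queue-a": (3, 1), "exchange-x": (3, 2)}
updatee = {"queue-a": (1, 1), "exchange-x": (3, 2)}

mismatched_names = find_id_mismatches(master, updatee)
```

Calling require_synced(master, updatee) on this data raises, which corresponds to refusing to continue rather than silently diverging.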

Comment 1 jrd 2010-01-22 17:18:30 UTC

Assigning to myself and tying to 501015

Comment 2 jrd 2010-01-22 19:15:06 UTC

Created attachment 386225 [details]
Proposed fix for hot-wiring peer state of clustered broker

This appears to fix the bugs.  It's likely that there are cleaner ways to manage some of the C++-isms.

Comment 3 jrd 2010-01-22 20:01:05 UTC

Added to jira.  See https://issues.apache.org/jira/browse/QPID-2357

Comment 4 Alan Conway 2010-01-25 15:39:19 UTC

What does "hotwire" refer to? The naming doesn't ring a bell for me.

QPID_LOG: don't put __FILE__/__LINE__ in the message; the logging system will
add them if the user configures --log-source=yes.

UpdateClient::update: move all the hotwire code into the updateHotwire
function to avoid clutter in update(). It's only 3 lines now but
likely to expand in future.

UpdateClient::update - why are you setting the updater's numbers here?
That looks wrong to me. In general giving an update does not modify
the updater's state, just sends it to the updatee.

Automate the test: add an automated test to cluster_tests.py that sets
up some state and does an update that fails before your patch and
passes after. This will be a very valuable test to have; we can extend it as
we fix other object-id issues going forward.
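
As a sketch of the shape such a test could take, the following uses a hypothetical in-memory ModelBroker instead of real brokers or the cluster_tests.py helpers, so every name in it is an assumption. It checks the two points raised above: giving an update leaves the updater's state untouched, and after the update both brokers assign the same IDs.

```python
class ModelBroker:
    """Hypothetical stand-in for one broker's management ID state."""

    def __init__(self, boot_sequence):
        self.boot_sequence = boot_sequence
        self.next_object_number = 1
        self.objects = {}  # object name -> (boot sequence, object number)

    def add_object(self, name):
        self.objects[name] = (self.boot_sequence, self.next_object_number)
        self.next_object_number += 1

    def snapshot(self):
        return (self.boot_sequence, self.next_object_number, dict(self.objects))

    def give_update(self, updatee):
        # Only sends state; the updater itself is not modified.
        updatee.receive_update(self.snapshot())

    def receive_update(self, state):
        self.boot_sequence, self.next_object_number, objects = state
        self.objects = dict(objects)


def run_update_scenario():
    updater = ModelBroker(boot_sequence=3)
    updater.add_object("queue-a")
    before = updater.snapshot()

    updatee = ModelBroker(boot_sequence=1)
    updater.give_update(updatee)
    updater_unchanged = updater.snapshot() == before

    # The same operation applied on both members should yield the same ID.
    for broker in (updater, updatee):
        broker.add_object("queue-b")
    ids_agree = updater.objects == updatee.objects
    return updater_unchanged, ids_agree


updater_unchanged, ids_agree = run_update_scenario()
```

A real test would start two clustered brokers and compare the IDs they report; the model only pins down which properties are worth asserting.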

Comment 6 Alan Conway 2010-03-31 12:26:13 UTC

Fixed in r904268

Comment 9 Jan Sarenik 2010-04-14 11:48:48 UTC

On Fedora Rawhide I compiled current trunk qpid and ran:

 # qpidd --cluster-name=ahoj --data-dir=dir1 --auth=no
 # qpidd -p0 --cluster-name=ahoj --data-dir=dir2 --auth=no
 # python cpp/src/tests/verify_cluster_objects

There is no output as long as I am not running any QMF-mangling
tools (e.g. qpid-tool or qpid-config) while the verify test
is running.

I will verify on RHEL and current packages.

Comment 10 Jan Sarenik 2010-04-14 12:53:39 UTC

[root@rhel5x ~]# python verify_cluster_objects --verbose 1
Connecting to the cluster...
    Broker connected at: 10.34.31.200:5672
Loading management data from nodes...
Verifying objects based on object name...
Success

Comment 11 Jan Sarenik 2010-04-14 13:13:04 UTC

[root@rhel5 ~]# python verify_cluster_objects --verbose 1
Connecting to the cluster...
    Broker connected at: 10.34.31.200:5672
    Broker connected at: 10.34.31.200:43258
Loading management data from nodes...
Verifying objects based on object name...
Success

Comment 13 Jan Sarenik 2010-04-14 13:39:50 UTC

This is how I test (store module has to be disabled for --no-data-dir to work)
on RHEL5:

 # qpidd --cluster-name=ahoj --auth=no --data-dir=data1
 # qpidd --cluster-name=ahoj --auth=no --no-data-dir -p0
 # qpid-tool
 qpid: schema
   ...
 qpid: show org.apache.qpid.cluster:cluster
   ...
 qpid: quit
 Exiting...
 # qpid-config
   ...
 # qpid-config add queue ahoj
 # python verify_cluster_objects --verbose 1

And this is what I am getting:
[root@rhel5x ~]# python verify_cluster_objects --verbose 1
Connecting to the cluster...
    Broker connected at: 10.34.31.200:5672
    Broker connected at: 10.34.31.200:58104
Loading management data from nodes...
Verifying objects based on object name...
Success

VERIFIED on both i386 and x86_64 RHEL5 (RHN-updated)

Comment 14 jrd 2011-06-29 17:06:20 UTC

Not sure why this was assigned to me.  Ted?