Description of problem:
I discovered this while debugging my fix for the object-number discrepancy described in 501015. If you start broker 1 with a data-dir and broker 2 with no data-dir (or a different one), the two brokers can (and almost always do) end up with different boot sequences. That means that if you cluster the two together, the updatee will use different OIDs than the master when it receives updates.

Version-Release number of selected component (if applicable):
trunk

How reproducible:
always

Steps to Reproduce:
1. bring up two brokers, one with data dir and one without
2. run Ted's verify_cluster_objects script
3.

Actual results:
Errors spew out

Expected results:
Silence

Additional info:
I don't know whether it's expected to be legal to run brokers this way, but it's easy enough to do it inadvertently, and nothing checks. We should either make them really sync up or flame out if they aren't able to.
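To make the failure mode concrete, here is a minimal sketch of why a boot-sequence mismatch alone is enough to make OIDs differ. It assumes, purely for illustration, that an OID is the pair (boot sequence, object number) and that the boot sequence is a counter kept in the data-dir and bumped at each start. ToyBroker and its fields are hypothetical names, not the broker's actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyBroker:
    """Hypothetical stand-in for a broker; not the real qpidd classes."""
    persisted_boot_seq: Optional[int]  # None models "no data-dir"
    next_object_num: int = 1

    def __post_init__(self):
        # Assumption: with a data-dir the stored counter is bumped by one on
        # startup; without a data-dir the broker starts counting from 1.
        if self.persisted_boot_seq is None:
            self.boot_seq = 1
        else:
            self.boot_seq = self.persisted_boot_seq + 1

    def new_oid(self):
        # Simplified OID: (boot sequence, object number).
        oid = (self.boot_seq, self.next_object_num)
        self.next_object_num += 1
        return oid


with_data_dir = ToyBroker(persisted_boot_seq=4)        # has been restarted before
without_data_dir = ToyBroker(persisted_boot_seq=None)  # nothing persisted

oid_a = with_data_dir.new_oid()
oid_b = without_data_dir.new_oid()
print(oid_a, oid_b)
```

In this model the object numbers agree but the OIDs do not, which matches the symptom reported here: the two brokers refer to the same object under different identifiers.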
Assigning to myself and tying to 501015
Created attachment 386225
Proposed fix for hot-wiring peer state of clustered broker

This appears to fix the bugs. It's likely that there are cleaner ways to manage some of the C++-isms.
Added to jira. See https://issues.apache.org/jira/browse/QPID-2357
What does "hotwire" refer to? The naming doesn't ring a bell for me.

QPID_LOG: don't put __FILE__/__LINE__ in the message; the logging system will add them if the user configures --log-source=yes.

UpdateClient::update: move all the hotwire code into the updateHotwire function to avoid clutter in update(). It's only 3 lines now but is likely to expand in future.

UpdateClient::update: why are you setting the updater's numbers here? That looks wrong to me. In general, giving an update does not modify the updater's state; it just sends that state to the updatee.

Automate the test: add an automated test to cluster_tests.py that sets up some state and does an update, and that fails before your patch and passes after. This will be a very valuable test to have; we can extend it as we fix other object-id issues going forward.
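The two review points about update semantics and test automation can be illustrated with a small self-contained sketch. It is a toy model, not the cluster code and not the cluster_tests.py framework: ToyBroker, give_update and check_update_syncs_updatee_only are hypothetical names. It only shows the shape of a test in which the updatee adopts the updater's numbering while the updater's own state is left untouched.

```python
import copy


class ToyBroker:
    """Hypothetical broker model holding only the numbering state."""

    def __init__(self, boot_seq, next_object_num):
        self.boot_seq = boot_seq
        self.next_object_num = next_object_num

    def state(self):
        return {"boot_seq": self.boot_seq,
                "next_object_num": self.next_object_num}

    def new_oid(self):
        oid = (self.boot_seq, self.next_object_num)
        self.next_object_num += 1
        return oid


def give_update(updater, updatee):
    # Read a snapshot of the updater's state and apply it to the updatee.
    # The updater itself is never written to.
    snapshot = copy.deepcopy(updater.state())
    updatee.boot_seq = snapshot["boot_seq"]
    updatee.next_object_num = snapshot["next_object_num"]


def check_update_syncs_updatee_only():
    updater = ToyBroker(boot_seq=5, next_object_num=10)
    updatee = ToyBroker(boot_seq=1, next_object_num=1)
    before = updater.state()

    give_update(updater, updatee)

    # Giving an update must not change the updater.
    assert updater.state() == before
    # After the update both brokers hand out the same next OID.
    assert updatee.new_oid() == updater.new_oid()
    return True
```

A real test would start two brokers with different data-dir settings and compare their management objects; the assertions would stay the same in spirit.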
Fixed in r904268
On Fedora Rawhide I compiled current trunk qpid and ran:

# qpidd --cluster-name=ahoj --data-dir=dir1 --auth=no
# qpidd -p0 --cluster-name=ahoj --data-dir=dir2 --auth=no
# python cpp/src/tests/verify_cluster_objects

There is no output, provided I do not run any QMF-mangling tools (e.g. qpid-tool or qpid-config) in the meantime. I will verify on RHEL with the current packages.
[root@rhel5x ~]# python verify_cluster_objects --verbose 1
Connecting to the cluster...
Broker connected at: 10.34.31.200:5672
Loading management data from nodes...
Verifying objects based on object name...
Success
[root@rhel5 ~]# python verify_cluster_objects --verbose 1
Connecting to the cluster...
Broker connected at: 10.34.31.200:5672
Broker connected at: 10.34.31.200:43258
Loading management data from nodes...
Verifying objects based on object name...
Success
This is how I test on RHEL5 (the store module has to be disabled for --no-data-dir to work):

# qpidd --cluster-name=ahoj --auth=no --data-dir=data1
# qpidd --cluster-name=ahoj --auth=no --no-data-dir -p0
# qpid-tool
qpid: schema
...
qpid: show org.apache.qpid.cluster:cluster
...
qpid: quit
Exiting...
# qpid-config
...
# qpid-config add queue ahoj
# python verify_cluster_objects --verbose 1

And this is what I am getting:

[root@rhel5x ~]# python verify_cluster_objects --verbose 1
Connecting to the cluster...
Broker connected at: 10.34.31.200:5672
Broker connected at: 10.34.31.200:58104
Loading management data from nodes...
Verifying objects based on object name...
Success

VERIFIED on both i386 and x86_64 RHEL5 (RHN-updated)
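The check exercised above can be pictured as a comparison of per-broker maps from object name to OID. The sketch below is an assumption about the general idea only: it is not the verify_cluster_objects script, and find_oid_mismatches, the broker labels and the sample data are all hypothetical.

```python
def find_oid_mismatches(objects_by_broker):
    """Return {object_name: {broker: oid}} for names whose OIDs disagree.

    objects_by_broker maps a broker label to a {object_name: oid} dict.
    An object missing on one broker counts as a mismatch (its OID is None).
    """
    names = set()
    for objs in objects_by_broker.values():
        names.update(objs)

    mismatches = {}
    for name in sorted(names):
        seen = {broker: objs.get(name)
                for broker, objs in objects_by_broker.items()}
        if len(set(seen.values())) > 1:
            mismatches[name] = seen
    return mismatches


# Made-up sample data: OIDs modelled as (boot sequence, object number).
sample = {
    "broker-a": {"queue:ahoj": (5, 1), "exchange:amq.direct": (5, 2)},
    "broker-b": {"queue:ahoj": (5, 1), "exchange:amq.direct": (1, 2)},
}
result = find_oid_mismatches(sample)
print(result)
```

An empty result corresponds to the "Success" line in the transcripts; any entry would correspond to the errors described in the original report.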
Not sure why this was assigned to me. Ted?