We have a hack in place that suppresses exceptions when the session receives completions for transfers not yet sent (which is the usual manifestation of the unpredictability). I.e. we have in essence disabled consistency checking for management sessions. This solved immediate problems but would quickly stop working if sessions/connections could be used for management and other things (as will be more likely with QMFv2 where using management becomes quite straightforward).
The problem with management updates in a timer is independent of the object ID problem. I created a separate bug for it: Bug 557138.
I attached a proposed patch, which fixes both this bug and Bug 557832, to the latter bz.
Alan, I believe this one got resolved, correct? Please bounce it back to me if not. Thanks...
We still don't have consistent object IDs in QMFv1 so we still have to disable some management commands as per the description. With QMFv2 we should be OK, but we need to test it.
How can I test this, gentlemen?
Run a 4 node cluster with --mgmt-sub-interval=1 to get frequent management updates. Run perftest, qpid-config -b, qpid-queue-stats and sesame in loops. Kill & restart one of the brokers a few times while this is all running. Let the clients run for an hour & verify no failures.
Stopping and restarting one or more of the nodes while the test described in comment 8 is running is also useful (tests the join/update protocol).
Correction: the option should be --mgmt-pub-interval=1, not --mgmt-sub-interval=1.
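The restart part of the test described above could be scripted roughly as follows. This is only a sketch under stated assumptions, not the procedure anyone on this bug actually used: the host names, passwordless ssh access to each node, and the function names `restart_command` and `soak` are all hypothetical. It also assumes the clients (perftest, qpid-config -b, qpid-queue-stats, sesame) have already been started as long-running processes, for example each inside its own wrapper loop, against brokers started with --mgmt-pub-interval=1.

```python
import random
import subprocess
import time

# Hypothetical host names for the four cluster nodes.
NODES = ["node1", "node2", "node3", "node4"]


def restart_command(node):
    """Build the command that restarts the broker on one node.

    Assumes passwordless ssh to the node; "service qpidd restart"
    is the restart command quoted in this thread.
    """
    return ["ssh", node, "service", "qpidd", "restart"]


def soak(clients, nodes, duration, interval,
         run=subprocess.call, sleep=time.sleep, rng=random):
    """Restart a randomly chosen broker every `interval` seconds.

    `clients` maps a name to an object with a poll() method that returns
    None while the process is alive (e.g. subprocess.Popen).
    Returns the sorted names of clients that exited during the run.
    """
    dead = set()
    elapsed = 0
    while elapsed < duration:
        run(restart_command(rng.choice(nodes)))
        sleep(interval)
        elapsed += interval
        # A client that has exited counts as a failure for this test.
        for name, proc in clients.items():
            if proc.poll() is not None:
                dead.add(name)
    return sorted(dead)
```

Calling `soak(clients, NODES, duration=3600, interval=300)` would correspond to a one-hour run with a restart every five minutes; an empty result means every client survived. The `run`, `sleep` and `rng` parameters exist only so the loop can be exercised without a real cluster.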
qpid-queue-stats has been running since I started it, and a number of other AMQP and QMF clients have been run against this four-node cluster, which consists of 2 RHEL5 i386 nodes and 2 RHEL5 x86_64 nodes. Every 5 minutes the broker on a randomly chosen node is restarted via "service qpidd restart". qpid-queue-stats has run without problems so far. I am setting this bug to the VERIFIED state. Tested with qpid-cpp-server-cluster-0.7.946106-2.el5.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: The management component is now capable of working in a cluster.
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html