condor-7.1.4-0.3.el4
qmf-0.3.709187-4.el4
qpidc-0.3.709187-4.el4

qpid-tool on north-15 shows all RHEL5 daemons, e.g. Masters, and none from RHEL4 machines. The plugins on the RHEL4 machines are loading successfully.
It appears that this is a qpid problem. The call to failover.reset at the end of ConnectionImpl::open() never returns. Because this executes in a separate thread (for the grid qmf plugin), it does not cause the condor_master to hang, but the master is never published. To reproduce, run "condor_master -t -f" on north-04, which should try to connect to a broker on north-15.
Mick observed a crash in a similar place in an example program; I am reproducing his email below.

---

I found a little something interesting about willb's hang, which I want to record here for posterity. It doesn't look like a race -- and in my case I do not see a hang -- but I do see Interesting Behavior in the same place Will is seeing a hang.

I reproduced by running a simple client (declare_queues) on RHEL4, talking to a broker on RHEL5. The FailoverListener ctor exits early because this is true:

session.exchangeQuery(arg::name=AMQ_FAILOVER).getNotFound()

That looks reasonable, since the RHEL5 broker is non-clustered -- but I bet that's where Will is seeing it hang rather than return early as expected.

That's all I've got so far....
r720973 | tross | 2008-11-26 14:48:44 -0600 (Wed, 26 Nov 2008) | 7 lines

Bug fixes for QMF:
ManagementAgentImpl - don't send messages if broker is not connected.
ManagementBroker - agents could be assigned the same agentBank
                 - don't send console-attached for attached agents
                 - handle multiple qmf messages in an AMQP body
schema.py - Don't use the FieldTable copy-constructor, use .clear()
------------------------------------------------------------------------
r720972 | tross | 2008-11-26 14:43:14 -0600 (Wed, 26 Nov 2008) | 12 lines

Added a copy constructor and assignment operator to FieldTable.

This was done to solve a library problem with the RHEL4 distribution. The compiler generated the assignment operator in an application using the C++ qpid client libraries. This generated function (referenced by a weak symbol) appeared to be causing problems in the heart of the library (handling of the ConnectionStartBody) with regard to the handling of field tables.

The failure mechanism is not fully understood, but this seemingly innocuous change solves the problem.
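For readers unfamiliar with the technique r720972 describes, here is a minimal sketch of a class whose copy constructor and assignment operator are declared by hand and defined out of line, so that client code calls into the library instead of having the compiler generate its own copies. The class name Table and all of its members are hypothetical stand-ins for illustration; this is not the qpid FieldTable source, and the single-file layout only approximates the header/.cpp split.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a FieldTable-like class (NOT the qpid class).
// --- this part would live in the public header ---
class Table {
public:
    Table();
    Table(const Table& other);             // declared only; no inline body
    Table& operator=(const Table& other);  // declared only; no inline body
    void set(const std::string& key, const std::string& value);
    std::string get(const std::string& key) const;
    std::size_t size() const;
private:
    std::map<std::string, std::string> values;
};

// --- this part would live in a .cpp file compiled into the library ---
Table::Table() {}

Table::Table(const Table& other) : values(other.values) {}

Table& Table::operator=(const Table& other) {
    if (this != &other) {
        values = other.values;  // plain member-wise copy, written explicitly
    }
    return *this;
}

void Table::set(const std::string& key, const std::string& value) {
    values[key] = value;
}

std::string Table::get(const std::string& key) const {
    std::map<std::string, std::string>::const_iterator it = values.find(key);
    return it == values.end() ? std::string() : it->second;
}

std::size_t Table::size() const { return values.size(); }
```

Because the two special members have exactly one definition, inside the library, an application that copies or assigns a Table links against that definition rather than emitting a weak, compiler-generated one of its own.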
condor 7.2.0-0.6 will require qpidc and qmf >= 720973.
This was not resolved in 720973. It turns out that using ft.clear() did not solve the problem, which is somehow related to weak symbols and/or function definitions in header files. The known workarounds are to define FieldTable::clear in a .cpp file, or to make qmf-gen generate a separate block, e.g. {}, around each use of ft.
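The two workarounds can be sketched roughly as follows. ArgTable and buildTwoMessages are hypothetical names invented for this illustration; the sketch shows neither the real FieldTable nor actual qmf-gen output, only the general shape of "clear() defined out of line" and "one braced block per use of ft".

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a FieldTable-like class (NOT the qpid class).
class ArgTable {
public:
    void set(const std::string& key, const std::string& value);
    void clear();  // workaround 1: declared in the header, body elsewhere
    std::size_t size() const;
private:
    std::map<std::string, std::string> values;
};

// These definitions would live in a .cpp file rather than in the header.
void ArgTable::set(const std::string& key, const std::string& value) {
    values[key] = value;
}

void ArgTable::clear() { values.clear(); }

std::size_t ArgTable::size() const { return values.size(); }

// Workaround 2: give each use of ft its own block, so every block gets a
// freshly constructed table and no clear() or assignment is needed between
// uses. Returns the number of entries each table held, for checking.
std::vector<std::size_t> buildTwoMessages() {
    std::vector<std::size_t> sizes;
    {
        ArgTable ft;
        ft.set("name", "first");
        sizes.push_back(ft.size());
    }  // first ft destroyed here
    {
        ArgTable ft;
        ft.set("name", "second");
        ft.set("extra", "x");
        sizes.push_back(ft.size());
    }  // second ft destroyed here
    return sizes;
}
```

The scoped-block form avoids reusing one table object across messages entirely, which is why it sidesteps the problem without touching the library.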
This appears to be resolved in 7.2.0-0.8 with the addition of -I/usr/local/qpid-boost for RHEL4 builds
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html