Bug 470320

Summary: RHEL4 daemons not getting published
Product: Red Hat Enterprise MRG
Reporter: Matthew Farrellee <matt>
Component: grid
Assignee: Matthew Farrellee <matt>
Status: CLOSED ERRATA
QA Contact: Kim van der Riet <kim.vdriet>
Severity: urgent
Priority: urgent
Version: 1.0
CC: tross
Target Milestone: 1.1
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-02-04 16:03:52 UTC

Description Matthew Farrellee 2008-11-06 17:40:26 UTC
condor-7.1.4-0.3.el4
qmf-0.3.709187-4.el4
qpidc-0.3.709187-4.el4

qpid-tool on north-15 shows all RHEL5 daemons, e.g. Masters, and none from RHEL4 machines.

The plugins on the RHEL4 machines are loading successfully.

Comment 1 Will Benton 2008-11-18 15:00:23 UTC
It appears that this is a qpid problem.  The call to failover.reset at the end of ConnectionImpl::open() never returns; because this executes in a separate thread (for the grid qmf plugin), it does not cause the condor_master to hang, but the master will not be published in any case.

To reproduce, run "condor_master -t -f" on north-04, which should try to connect to a broker on north-15.
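
The effect described above (a call that blocks in a worker thread, so the daemon keeps running but never publishes) can be sketched as follows. This is only an illustration under assumptions: PluginSim, its members, and masterRunsButIsNotPublished are hypothetical names, the blocking call is modelled with a future that is released only so the demo can exit, and none of this is the actual plugin or qpid code.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for the plugin's connection thread. It blocks in
// "open" until released and would publish only after open returns.
struct PluginSim {
    std::promise<void> release;           // fulfilled only so the demo can exit
    std::atomic<bool> published{false};
    std::thread worker;

    void start() {
        std::shared_future<void> gate = release.get_future().share();
        worker = std::thread([this, gate] {
            gate.wait();                  // models the call that does not return
            published = true;             // not reached while the call is blocked
        });
    }

    void stop() {
        release.set_value();              // unblock the worker
        worker.join();
    }
};

// Returns true if the calling ("master") thread stayed responsive while the
// worker was blocked and nothing had been published yet.
bool masterRunsButIsNotPublished() {
    PluginSim plugin;
    plugin.start();
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    bool notPublished = !plugin.published.load();  // this thread still runs
    plugin.stop();
    return notPublished;
}
```

The calling thread is never blocked, which matches the observation that condor_master itself does not hang.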

Comment 2 Will Benton 2008-11-18 15:02:21 UTC
Mick observed a crash in a similar place in an example program; I am reproducing his email below.

---

I found a little something interesting about willb's hang, which I want
to record here for posterity.

It doesn't look like a race -- and in my case I do not see a hang -- but
I do see Interesting Behavior in the same place Will is seeing a hang.

I reproduced by running a simple client (declare_queues) on RHEL4, and
talking to a broker on RHEL5.  The FailoverListener ctor exits early
because this is true:

    session.exchangeQuery(arg::name=AMQ_FAILOVER).getNotFound()

That looks reasonable, since the RHEL5 broker is non-clustered -- but I
bet that's where Will is seeing it hang rather than return early as
expected.

That's all I've got so far....

Comment 3 Matthew Farrellee 2008-11-26 21:46:23 UTC
r720973 | tross | 2008-11-26 14:48:44 -0600 (Wed, 26 Nov 2008) | 7 lines

Bug fixes for QMF:
  ManagementAgentImpl - don't send messages if broker is not connected.
  ManagementBroker - agents could be assigned the same agentBank
                   - don't send console-attached for attached agents
                   - handle multiple qmf messages in an AMQP body
  schema.py - Don't use the FieldTable copy-constructor, use .clear()

------------------------------------------------------------------------
r720972 | tross | 2008-11-26 14:43:14 -0600 (Wed, 26 Nov 2008) | 12 lines

Added a copy constructor and assignment operator to FieldTable.
This was done to solve a library problem with the RHEL4 distribution.

The compiler generated the assignment operator in an application using
the C++ qpid client libraries.  This generated function (referenced by
a weak symbol) appeared to be causing problems in the heart of the
library (handling of the ConnectionStartBody) with regard to the
handling of field tables.

The failure mechanism is not fully understood, but this seemingly
innocuous change solves the problem.
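
As a rough illustration of the kind of change r720972 describes, the sketch below declares the copy constructor and assignment operator in the class and defines them outside the class body. Table is a hypothetical stand-in, not qpid's FieldTable, and the sketch does not reproduce the RHEL4 failure. The general point is that an implicitly generated assignment operator is an inline function that each application may emit for itself, whereas an out-of-line definition can live in one library source file.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a field table; the real qpid class differs.
class Table {
public:
    Table() {}
    Table(const Table& other);             // declared here, defined out of line
    Table& operator=(const Table& other);  // likewise
    void set(const std::string& k, const std::string& v) { values[k] = v; }
    std::size_t size() const { return values.size(); }
private:
    std::map<std::string, std::string> values;
};

// In a library these two definitions would sit in a .cpp file, so that
// applications call the library's copy instead of generating their own.
Table::Table(const Table& other) : values(other.values) {}

Table& Table::operator=(const Table& other) {
    if (this != &other) values = other.values;
    return *this;
}

// Exercises both the copy constructor and the assignment operator.
std::size_t copiedSize() {
    Table a;
    a.set("k", "v");
    Table b(a);   // copy constructor
    Table c;
    c = b;        // assignment operator
    return c.size();
}
```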

Comment 4 Matthew Farrellee 2008-11-26 22:13:00 UTC
condor 7.2.0-0.6 will require qpidc and qmf >= 720973.

Comment 6 Matthew Farrellee 2008-12-02 21:09:18 UTC
This was not resolved in 720973. It turns out that ft.clear() did not solve the problem, which is somehow related to weak symbols and/or function definitions in header files.

The known workarounds are to define FieldTable::clear in a .cpp file, or to make qmf-gen generate separate blocks, e.g. {}, around each use of ft.
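
The second workaround might look roughly like the sketch below. The code that qmf-gen actually emits is not shown in this report, so everything here is an assumption: Table is a std::map stand-in for FieldTable, and encodeTwoMethods and the key names are made up for illustration. Each block gets its own ft, so no clear() call is needed between uses.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for FieldTable.
typedef std::map<std::string, std::string> Table;

// Builds two tables, each inside its own block, instead of reusing one
// table and calling clear() in between.
std::vector<Table> encodeTwoMethods() {
    std::vector<Table> out;
    {   // first block: this ft exists only inside these braces
        Table ft;
        ft["timeout"] = "30";
        out.push_back(ft);
    }
    {   // second block: a fresh ft, with nothing carried over
        Table ft;
        ft["reason"] = "shutdown";
        out.push_back(ft);
    }
    return out;
}
```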

Comment 7 Matthew Farrellee 2008-12-08 19:14:33 UTC
This appears to be resolved in 7.2.0-0.8 with the addition of -I/usr/local/qpid-boost for RHEL4 builds.

Comment 9 errata-xmlrpc 2009-02-04 16:03:52 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0036.html