We've seen instances of the condor_master crashing when the QMF agent is unable to contact its broker. Repro was:
* Cause one of the clustered qpidd hosts to run out of memory.
* This causes qpidd to hang connections.
* The master process crashes with a backtrace in an uncaught C++ exception in the master's QMF plugin.
Any chance we can get a corefile and/or a full stacktrace showing the particular C++ exception that occurs? Thanks, -K
Are there any places where the QMF agent could be throwing exceptions when it can no longer contact the broker?
Yes Matt - there are two exceptions that can occur in the Connection Thread spawned by the ManagementAgentImpl class as a result of lost broker connections (See qpid/cpp/src/qpid/agent/ManagementAgentImpl.cpp). Turns out, I did fix a shared pointer issue that -could- be the cause of this crash. See https://bugzilla.redhat.com/show_bug.cgi?id=489557 Without a stacktrace/corefile for this particular issue, I cannot be totally sure. But it seems likely that this bug would be solved by BZ489557. Can you retest with the new fix, or supply a core/trace? thanks,
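To illustrate the failure mode discussed above: in standard C++, an exception that escapes a thread's entry function calls std::terminate and takes the whole host process down, which is consistent with the reported condor_master crash. The following is only a minimal sketch of the defensive pattern, not the actual ManagementAgentImpl code. The names ConnectFn, ThreadResult and runConnectionThread are hypothetical, and std::thread stands in for whatever thread wrapper the agent really uses.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for one broker connection attempt (not a QMF API).
using ConnectFn = std::function<void()>;

// Summary of what happened on the worker thread.
struct ThreadResult {
    int attempts = 0;
    int failures = 0;
    std::string lastError;
};

// Runs up to maxAttempts connection attempts on a worker thread.
// Every exception is caught inside the thread body, so a lost broker
// connection is recorded instead of escaping the thread (which would
// call std::terminate and crash the host process).
ThreadResult runConnectionThread(const ConnectFn& connect, int maxAttempts) {
    ThreadResult result;
    std::thread worker([&] {
        for (int i = 0; i < maxAttempts; ++i) {
            ++result.attempts;
            try {
                connect();
                return;  // connected, stop retrying
            } catch (const std::exception& e) {
                ++result.failures;
                result.lastError = e.what();
            } catch (...) {
                ++result.failures;
                result.lastError = "unknown exception";
            }
        }
    });
    worker.join();  // result is only read after the worker has finished
    return result;
}
```

Calling runConnectionThread with a callable that always throws returns normally with failures equal to maxAttempts, rather than aborting the process. A real agent would presumably add a back-off between attempts and log each failure.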
We could not get a core for this.
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Corrected occurrences of the condor_master crashing when the QMF agent loses its broker connection (521138)
I'm failing to reproduce this bug. Any suggestions on how to reproduce it more rapidly? Also, the fact that no core file backtrace was pasted makes it very difficult to verify. So far, brute-force stress testing is being used to restart condor daemons, using the reproducer at https://bugzilla.redhat.com/attachment.cgi?id=369018
Never did get a reproducer or a backtrace of this bug. The underlying code was changed quite a bit to address https://bugzilla.redhat.com/show_bug.cgi?id=489557, and these *could* be the same bug. Without backtrace or repro, we cannot be sure.
After quite an effort, I'm still failing to reproduce this. Let me set NEEDINFO for this BZ for jthomas. Here are the items I'd like to get clarified:
- more precisely how to reproduce
- what version of packages were used (was it really 1.1 set or later?)
- brokers in cluster were on one machine or multiple ones?
- broker cluster width
- condor configuration (personal condor or remote configuration?)
- if remote configuration what was pool layout?
- name of the exception thrown from QMF agent
- core backtrace from condor_master crash (if possible)
I'm unable to go forward with verification until the major part of the above points is known, because it would be just guessing...
forwarded req to support
Status: I'm still failing to reproduce (after a few days of testing). No more info gathered until now, raising NEEDINFO again!
I can't do a relnote on this if it hasn't been fixed. Setting relnote to - for now. Please reset to ? if relnote required. LKB
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. Diffed Contents:
@@ -1 +1,9 @@
+Management bug fix
+
+C: The QMF agent can not contact the broker
+C: condor_master crashes
+F: NOT FIXED
+R:
+
+
 Corrected occurrences of the condor_master crashing when the QMF agent loses its broker connection (521138)
The issue was not reproduced. The original Issue Tracker was closed by the initiator following the request for more information. Moreover, the possibly linked problems (bug 489557, bug 578137, bug 534073, bug 596210, bug 625450) are all fixed. -> CLOSED/INSUFFICIENT_DATA