Bug 521138 - QMF agent causes condor_master to crash
Summary: QMF agent causes condor_master to crash
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-qmf
Version: 1.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.3
Target Release: ---
Assignee: Ken Giusti
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-09-03 20:08 UTC by Jon Thomas
Modified: 2018-10-27 15:48 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Management bug fix
C: The QMF agent can not contact the broker
C: condor_master crashes
F: NOT FIXED
R:
Corrected occurrences of the condor_master crashing when the QMF agent loses its broker connection (521138)
Clone Of:
Environment:
Last Closed: 2010-09-27 15:05:50 UTC
Target Upstream Version:
Embargoed:



Description Jon Thomas 2009-09-03 20:08:20 UTC
We've seen instances of the condor_master crashing when the QMF agent is unable to contact its broker.

repro was:

* Cause one of the clustered qpidd hosts to run out of memory.
* This causes qpidd to hang connections.
* The master process crashes with a backtrace in an uncaught C++ exception in the master's QMF plugin.
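
To illustrate the failure mode described in the last step: in C++, an exception that escapes a thread's entry function calls std::terminate, which takes down the whole host process, not just the plugin. The sketch below is a generic illustration only, not the actual QMF agent or condor_master code. The names connectToBroker, GuardedResult, and runGuarded are hypothetical and exist purely for this example. It shows one common way to contain such an exception at the thread boundary.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for a plugin's broker connection attempt.
// It throws to simulate a broker connection that has hung or dropped.
inline void connectToBroker(bool brokerReachable) {
    if (!brokerReachable) {
        throw std::runtime_error("broker connection lost");
    }
}

// Outcome of one guarded run of a thread body (hypothetical helper type).
struct GuardedResult {
    bool failed = false;
    std::string what;
};

// Runs `body` on a new thread and catches everything at the thread boundary.
// Without the try/catch, an exception escaping the thread function would
// call std::terminate and abort the entire host process.
inline GuardedResult runGuarded(const std::function<void()>& body) {
    assert(body && "body must be callable");
    GuardedResult result;
    std::thread worker([&result, &body] {
        try {
            body();
        } catch (const std::exception& e) {
            result.failed = true;
            result.what = e.what();
        } catch (...) {
            result.failed = true;
            result.what = "unknown exception";
        }
    });
    worker.join();  // joining makes the worker's writes to `result` visible here
    return result;
}
```

Calling runGuarded([] { connectToBroker(false); }) returns a result with failed set and the exception message recorded, so the host can log the error and retry instead of aborting. Whether the crash in this report follows exactly this pattern cannot be confirmed without a backtrace.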

Comment 2 Ken Giusti 2009-10-06 14:35:34 UTC
Any chance we can get a corefile and/or a full stacktrace showing the particular C++ exception that occurs? Thanks,
-K

Comment 3 Matthew Farrellee 2009-10-12 11:21:25 UTC
Are there any places where the QMF agent could be throwing exceptions when it can no longer contact the broker?

Comment 4 Ken Giusti 2009-10-15 15:54:51 UTC
Yes, Matt. There are two exceptions that can occur in the Connection Thread spawned by the ManagementAgentImpl class as a result of lost broker connections (see qpid/cpp/src/qpid/agent/ManagementAgentImpl.cpp). As it turns out, I did fix a shared pointer issue that -could- be the cause of this crash.

See https://bugzilla.redhat.com/show_bug.cgi?id=489557

Without a stacktrace/corefile for this particular issue, I cannot be totally sure.  But it seems likely that this bug would be solved by BZ489557.

Can you retest with the new fix, or supply a core/trace?

thanks,

Comment 5 Matthew Farrellee 2009-10-15 18:37:05 UTC
We could not get a core for this.

Comment 7 Irina Boverman 2009-10-28 17:26:42 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Corrected occurrences of the condor_master crashing when the QMF agent loses its broker connection (521138)

Comment 8 Frantisek Reznicek 2009-11-11 13:12:29 UTC
I'm failing to reproduce this bug. Any suggestions on how to reproduce it more quickly? Also, the fact that no core file backtrace was pasted makes it very difficult to verify.

So far I have used brute-force stress testing that restarts the condor daemons, using the reproducer at https://bugzilla.redhat.com/attachment.cgi?id=369018

Comment 9 Ken Giusti 2009-11-11 13:56:13 UTC
Never did get a reproducer or a backtrace of this bug.  The underlying code was changed quite a bit to address https://bugzilla.redhat.com/show_bug.cgi?id=489557, and these *could* be the same bug.

Without backtrace or repro, we cannot be sure.

Comment 10 Frantisek Reznicek 2009-11-11 15:17:05 UTC
After considerable effort, I still cannot reproduce this.

Let me set NEEDINFO for this BZ for jthomas

Here are the items I'd like to get clarified:
- more precise steps to reproduce
- which package versions were used (was it really the 1.1 set, or later?)
- were the brokers in the cluster on one machine or on several?
  - broker cluster width
- condor configuration (personal condor or remote configuration?)
  - if remote configuration, what was the pool layout?
- name of the exception thrown from the QMF agent
- core backtrace from the condor_master crash (if possible)

I'm unable to go forward with verification until most of the above points are known, because it would be just guessing...

Comment 11 Jon Thomas 2009-11-11 15:32:57 UTC
forwarded req to support

Comment 12 Frantisek Reznicek 2009-11-18 13:19:18 UTC
Status: I'm still failing to reproduce (after a few days of testing)

No more info has been gathered so far, so I'm raising NEEDINFO again!

Comment 14 Lana Brindley 2009-11-19 04:37:35 UTC
I can't do a relnote on this if it hasn't been fixed. Setting relnote to - for now. Please reset to ? if relnote required.

LKB

Comment 15 Lana Brindley 2009-11-19 04:37:35 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,9 @@
+Management bug fix
+
+C: The QMF agent can not contact the broker
+C: condor_master crashes
+F: NOT FIXED
+R:
+
+
 Corrected occurrences of the condor_master crashing when the QMF agent loses its broker connection (521138)

Comment 17 Frantisek Reznicek 2010-09-27 15:05:50 UTC
The issue was not reproduced. The original Issue Tracker was closed by the initiator following the request for more information.

Moreover, possibly linked problems such as bug 489557, bug 578137, bug 534073, bug 596210, and bug 625450 are all fixed.


-> CLOSED/INSUFFICIENT_DATA

