Bug 488998

Summary: carod cannot handle broker restart
Product: Red Hat Enterprise MRG
Reporter: Matthew Farrellee <matt>
Component: grid
Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA
QA Contact: Martin Kudlej <mkudlej>
Severity: medium
Docs Contact:
Priority: medium
Version: 1.1
CC: iboverma, lans.carstensen, lbrindle, mkudlej, tao
Target Milestone: 1.2   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Grid bug fix

C: MRG Messaging Broker restarted while low-latency is running on a grid execute node
C: The low-latency daemon (carod) would stop processing jobs and crash
F: Fixed the daemon to check for disconnections and to attempt to reconnect
R: The daemon no longer crashes and will resume processing jobs once the broker is running again

If the MRG Messaging Broker was restarted while low-latency was running on a grid execute node, the low-latency daemon (carod) would stop processing jobs and crash. The daemon now checks for disconnections and attempts to reconnect. This prevents the daemon from crashing and will resume processing jobs once the broker is running again.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-12-03 09:16:06 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 522467    
Bug Blocks: 527551    

Description Matthew Farrellee 2009-03-06 17:11:19 UTC
condor-low-latency-1.0-10.el5
condor-job-hooks-common-1.0-5.el5

Exception from carod when qpidd is restarted...

Exception in thread Thread-363:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/sbin/carod", line 418, in handle_reply_fetch
    send_AMQP_msg(broker_connection, saved_work.AMQP_msg, msg_props)
  File "/usr/sbin/carod", line 264, in send_AMQP_msg
    connection.message_transfer(destination=reply_to['exchange'], message=Message(msg_properties, delivery_props, data))
  File "/usr/lib/python2.4/site-packages/qpid/generator.py", line 25, in <lambda>
    method = lambda self, *args, **kwargs: self.invoke(inst, args, kwargs)
  File "/usr/lib/python2.4/site-packages/qpid/session.py", line 143, in invoke
    return self.do_invoke(type, args, kwargs)
  File "/usr/lib/python2.4/site-packages/qpid/session.py", line 152, in do_invoke
    raise SessionDetached()
SessionDetached

Comment 1 Matthew Farrellee 2009-03-06 19:00:33 UTC
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/sbin/carod", line 239, in lease_monitor
    item.unlock(False)
  File "/usr/sbin/carod", line 56, in unlock
    self.__access_lock__.release()
  File "/usr/lib64/python2.4/threading.py", line 113, in release
    assert self.__owner is me, "release() of un-acquire()d lock"
AssertionError: release() of un-acquire()d lock
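
The assertion above fires when a thread releases a lock it does not hold. A minimal sketch of one way to guard against that, written for Python 3 (where the same mistake raises RuntimeError instead of failing an assert). LeasedItem and its methods are hypothetical stand-ins, not the actual carod classes:

```python
import threading


class LeasedItem:
    """Hypothetical stand-in for a work item protected by a lock."""

    def __init__(self):
        self._access_lock = threading.RLock()

    def lock(self, blocking=True):
        """Try to take the lock; return True if this thread now holds it."""
        return self._access_lock.acquire(blocking)

    def unlock(self):
        """Release the lock.

        Returns False instead of raising when the calling thread does not
        hold the lock, so a monitor thread is not killed by a stray release.
        """
        try:
            self._access_lock.release()
            return True
        except RuntimeError:
            # Python 3 raises RuntimeError for release() of an un-acquired lock.
            return False


def with_item_locked(item, action):
    """Run action(item) only if the lock was obtained, then release it."""
    if not item.lock(blocking=False):
        return False
    try:
        action(item)
        return True
    finally:
        item.unlock()
```

Pairing every release with a successful acquire, as with_item_locked does, is usually preferable to swallowing the error; the tolerant unlock() is only a safety net.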

Comment 2 Robert Rati 2009-08-17 19:45:54 UTC
Fixed in:
condor-low-latency-1.0-18

Comment 3 Martin Kudlej 2009-09-24 11:35:07 UTC
Tested on RHEL 5.4 (condor-7.4.0-0.5) and RHEL 4.8 (condor-7.4.0-0.4), i386 and x86_64, with condor-low-latency-1.0-19. It works --> VERIFIED

Comment 4 Irina Boverman 2009-10-28 17:05:26 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
Carod is no longer crashing when broker is restarted (488998)

Comment 5 Lana Brindley 2009-11-05 02:06:43 UTC
Release note updated.

Diffed Contents:
@@ -1 +1,8 @@
-Carod is no longer crashing when broker is restarted (488998)
+Grid bug fix
+
+C: MRG Messaging Broker restarted
+C: Carod would experience and exception and crash
+F: 
+R: Carod no longer crashes.
+
+NEED FURTHER INFORMATION FOR RELNOTE.

Comment 6 Robert Rati 2009-11-24 13:48:41 UTC
C: MRG Messaging Broker restarted while low-latency is running on a grid execute node
C: The low-latency daemon (carod) would stop processing jobs and crash
F: Fixed the daemon to check for disconnections and to attempt to reconnect
R: The daemon no longer crashes and will resume processing jobs once the broker is running again
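
The fix described above (detect a disconnection, then try to reconnect) can be sketched roughly as follows. This is an illustration only, not the code shipped in condor-low-latency: BrokerUnavailable, send_with_reconnect, and the connect/send callables are hypothetical names, and the sketch targets Python 3 rather than the Python 2.4 seen in the tracebacks.

```python
import time


class BrokerUnavailable(Exception):
    """Hypothetical error standing in for a lost broker session."""


def send_with_reconnect(connect, send, message, retries=5, delay=0.1,
                        sleep=time.sleep):
    """Send a message, reconnecting when the broker session is lost.

    connect -- callable returning a fresh connection (assumed interface)
    send    -- callable(connection, message) that raises BrokerUnavailable
               when the connection is gone (assumed interface)
    Returns the connection that carried the message.
    """
    connection = None
    last_error = None
    for attempt in range(retries):
        try:
            if connection is None:
                connection = connect()
            send(connection, message)
            return connection
        except BrokerUnavailable as exc:
            # Drop the dead connection and retry instead of letting the
            # exception terminate the calling thread.
            last_error = exc
            connection = None
            if attempt < retries - 1:
                sleep(delay * (2 ** attempt))  # simple exponential backoff
    raise BrokerUnavailable("gave up after %d attempts" % retries) from last_error
```

The retry count and backoff values are arbitrary; a long-running daemon might instead retry indefinitely and keep the unsent message queued until the broker returns.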

Comment 7 Lana Brindley 2009-11-26 20:29:21 UTC
Release note updated.

Diffed Contents:
@@ -1,8 +1,8 @@
 Grid bug fix
 
-C: MRG Messaging Broker restarted
-C: Carod would experience and exception and crash
-F: 
-R: Carod no longer crashes.
+C: MRG Messaging Broker restarted while low-latency is running on a grid execute node
+C: The low-latency daemon (carod) would stop processing jobs and crash
+F: Fixed the daemon to check for disconnections and to attempt to reconnect
+R: The daemon no longer crashes and will resume processing jobs once the broker is running again 
 
-NEED FURTHER INFORMATION FOR RELNOTE.
+If the MRG Messaging Broker was restarted while low-latency was running on a grid execute node, the low-latency daemon (carod) would stop processing jobs and crash. The daemon now checks for disconnections and attempts to reconnect. This prevents the daemon from crashing and will resume processing jobs once the broker is running again.

Comment 8 errata-xmlrpc 2009-12-03 09:16:06 UTC
An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1633.html