Bug 489880

Summary: execute directory missing under carod, handle_get_work
Product: Red Hat Enterprise MRG
Reporter: Matthew Farrellee <matt>
Component: grid
Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA
QA Contact: Martin Kudlej <mkudlej>
Severity: medium
Priority: medium
Version: 1.1
CC: jsarenik, mkudlej
Target Milestone: 1.1.1
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-04-21 16:19:09 UTC

Description Matthew Farrellee 2009-03-12 13:26:54 UTC
condor-low-latency-1.0-11.el5
condor-7.2.2-0.7.el5

handle_get_work: Checking if slot 1 is known
Exception in thread Thread-1444:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/sbin/carod", line 575, in handle_exit
    os.chdir(work_cwd)
OSError: [Errno 2] No such file or directory: '/var/lib/condor/execute/dir_2219' 

This error was seen once while processing a few hundred jobs and has not been reproduced over a thousand jobs.

Comment 1 Robert Rati 2009-03-13 18:41:09 UTC
The error was actually in handle_exit, not handle_get_work. Added additional exception handling and reworked how the chdir is handled. handle_exit now checks whether work_cwd exists; if it does not, it handles the case gracefully and releases the message so that it can be consumed by another run/machine.

Ran 20k+ messages through the system and never saw this error.

Fixed in:
condor-low-latency-1.0-12
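
For illustration, the existence check described in this comment could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual carod code: `enter_work_dir` and the `release_message` callback are hypothetical names, and how carod really releases a message is not shown in this report. The `except OSError` branch covers the window in which the execute directory is removed between the check and the chdir.

```python
import os


def enter_work_dir(work_cwd, release_message):
    """Change into work_cwd if it exists (hypothetical helper).

    Returns True on success. If the directory is missing, calls
    release_message() so the work can be picked up elsewhere and
    returns False instead of raising.
    """
    # Check up front, as described in the comment above.
    if not os.path.isdir(work_cwd):
        release_message()
        return False
    try:
        os.chdir(work_cwd)
    except OSError:
        # The directory vanished between the check and the chdir.
        release_message()
        return False
    return True
```

A caller in an exit handler would then skip the rest of its processing when the function returns False, rather than letting the OSError escape the thread.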

Comment 3 Martin Kudlej 2009-04-07 07:29:13 UTC
I have already tested this with 40k messages in BZ489874 without any exceptions in the logs. See https://bugzilla.redhat.com/show_bug.cgi?id=489874#c4

-->VERIFIED

Comment 5 errata-xmlrpc 2009-04-21 16:19:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html