Bug 489880 - execute directory missing under carod, handle_get_work
Summary: execute directory missing under carod, handle_get_work
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: grid
Version: 1.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: 1.1.1
Target Release: ---
Assignee: Robert Rati
QA Contact: Martin Kudlej
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-03-12 13:26 UTC by Matthew Farrellee
Modified: 2009-04-21 16:19 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-04-21 16:19:09 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2009:0434 | 0 | normal | SHIPPED_LIVE | Red Hat Enterprise MRG Messaging and Grid Version 1.1.1 | 2009-04-21 16:15:50 UTC

Description Matthew Farrellee 2009-03-12 13:26:54 UTC
condor-low-latency-1.0-11.el5
condor-7.2.2-0.7.el5

handle_get_work: Checking if slot 1 is known
Exception in thread Thread-1444:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/sbin/carod", line 575, in handle_exit
    os.chdir(work_cwd)
OSError: [Errno 2] No such file or directory: '/var/lib/condor/execute/dir_2219' 

This error was seen once while processing a few hundred jobs and has not been reproduced in over a thousand jobs.

Comment 1 Robert Rati 2009-03-13 18:41:09 UTC
The error was actually in handle_exit, not handle_get_work. Added additional exception handling and reworked how the chdir is handled. There is now a check for the existence of work_cwd; if the directory does not exist, carod handles the case gracefully and releases the message so it can be consumed by another run/machine.

Ran 20k+ messages through the system and never saw this error.

Fixed in:
condor-low-latency-1.0-12
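
For readers following along, the approach described in Comment 1 (check that work_cwd exists, and release the message if it does not) might look roughly like the sketch below. This is not the actual carod patch: the function name enter_work_dir and the release_message callback are hypothetical, and the code is written for current Python rather than the Python 2.4 shown in the traceback.

```python
import os


def enter_work_dir(work_cwd, release_message):
    """Change into work_cwd, or release the message if the directory is gone.

    Both names are hypothetical stand-ins, not carod's real API.
    Returns True if the chdir succeeded, False if the message was released.
    """
    # Pre-check: the execute directory may already have been removed.
    if not os.path.isdir(work_cwd):
        release_message()
        return False
    try:
        # The directory can still vanish between the check and the chdir,
        # so the OSError from the original traceback is handled as well.
        os.chdir(work_cwd)
    except OSError:
        release_message()
        return False
    return True
```

The existence check alone leaves a small window in which the directory can be removed before the chdir runs, which is why the sketch also catches OSError instead of relying on the check by itself.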

Comment 3 Martin Kudlej 2009-04-07 07:29:13 UTC
I have already tested this with 40k messages in BZ489874 without any exceptions in the logs. See https://bugzilla.redhat.com/show_bug.cgi?id=489874#c4

-->VERIFIED

Comment 5 errata-xmlrpc 2009-04-21 16:19:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html

