condor-low-latency-1.0-11.el5
condor-7.2.2-0.7.el5

handle_get_work: Checking if slot 1 is known
Exception in thread Thread-1444:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/sbin/carod", line 575, in handle_exit
    os.chdir(work_cwd)
OSError: [Errno 2] No such file or directory: '/var/lib/condor/execute/dir_2219'

When processing a few hundred jobs this error was seen once, and has not been reproduced over a thousand jobs.
The error was actually in handle_exit, not handle_get_work. Added exception handling and reworked how the chdir is done. carod now checks that work_cwd exists; if it does not, the case is handled gracefully and the message is released so that it can be consumed by another run/machine. Ran 20k+ messages through the system and never saw this error.

Fixed in: condor-low-latency-1.0-12
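For illustration only, the check described above might look roughly like the sketch below. This is not the actual carod code: enter_work_dir and release_message are hypothetical names, and the real handle_exit does more than this. The second guard covers the case where the directory disappears between the existence check and the chdir.

```python
import os
import tempfile


def enter_work_dir(work_cwd, release_message):
    """Change into work_cwd if it exists; otherwise release the message.

    release_message is a hypothetical callback standing in for whatever
    carod does to hand the message back to the queue.
    Returns True if the chdir succeeded, False if the message was released.
    """
    if not os.path.isdir(work_cwd):
        # Execute directory is already gone: give the message back
        # instead of letting the thread die with an OSError.
        release_message()
        return False
    try:
        os.chdir(work_cwd)
    except OSError:
        # The directory vanished between the check and the chdir.
        release_message()
        return False
    return True


def demo():
    """Exercise both paths: an existing directory, then a removed one."""
    released = []
    original_cwd = os.getcwd()
    work_cwd = tempfile.mkdtemp()
    first = enter_work_dir(work_cwd, lambda: released.append(work_cwd))
    os.chdir(original_cwd)  # step back out so the directory can be removed
    os.rmdir(work_cwd)
    second = enter_work_dir(work_cwd, lambda: released.append(work_cwd))
    return first, second, len(released)
```

With the directory present the call returns True and nothing is released; once the directory has been removed the call returns False and the callback fires exactly once.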
I have already tested this with 40k messages in BZ489874 with no exceptions in the logs. See https://bugzilla.redhat.com/show_bug.cgi?id=489874#c4

--> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html