Bug 718265 - low-latency not expiring work
Summary: low-latency not expiring work
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-low-latency
Version: 1.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: 2.0.1
Target Release: ---
Assignee: Robert Rati
QA Contact: Lubos Trilety
URL:
Whiteboard:
Depends On: 723971
Blocks: 723887
Reported: 2011-07-01 15:42 UTC by Robert Rati
Modified: 2011-09-07 16:43 UTC

Fixed In Version: condor-low-latency-1.2-1
Doc Type: Bug Fix
Doc Text:
C: When a job causes the condor_starter to exit quickly, such as a job whose program the starter is unable to execute, the low-latency daemon does not expire the low-latency job.
C: The low-latency daemon does not allow the slot running the job that should have been expired to do any more work until the daemon is restarted.
F: Fixed issues with message expiration.
R: Messages are expired as expected.
Clone Of:
Environment:
Last Closed: 2011-09-07 16:43:29 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1249 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Grid 2.0 security, bug fix and enhancement update 2011-09-07 16:40:45 UTC

Description Robert Rati 2011-07-01 15:42:18 UTC
Description of problem:
Running the file_no_perms.py test results in the starter excepting without calling the exit hook.  Carod's lease-checking thread does not expire a job that is no longer being updated, so carod won't allow that thread to process more work.  The only workaround is to restart carod.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Run carod with debug logging
2. Run file_no_perms.py
3. Watch CaroLog and see that the slot is continually thought to be doing work.

  
Actual results:


Expected results:


Additional info:

Comment 1 Robert Rati 2011-07-01 19:26:04 UTC
There were 2 issues with message expiration:
1) Messages were never expired because the check for whether a slot was in use reset the access time
2) Once a message was expired, it could not be removed from the work queue, which kept a slot "busy" that was actually empty

Fixed on BZ718265-no-expiration
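
For illustration only, here is a minimal Python sketch of the two behaviours described above. It is not carod's actual code: the class LeasedWorkQueue, its methods, and the lease_seconds parameter are all hypothetical names chosen for this example. It assumes a work queue keyed by slot name where each entry carries a last-access time. The in-use check is read-only, so it cannot keep a stale lease alive, and the expiry pass removes the entry so the slot is freed.

```python
import threading
import time


class LeasedWorkQueue:
    """Hypothetical slot -> message work queue with lease expiry.

    Illustrative only; not taken from the carod source.
    """

    def __init__(self, lease_seconds, clock=time.monotonic):
        self._lease_seconds = lease_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}  # slot name -> (message, last access time)

    def add(self, slot, message):
        """Assign a message to a slot and start its lease."""
        with self._lock:
            self._entries[slot] = (message, self._clock())

    def touch(self, slot):
        """Renew the lease; only a real job update should call this."""
        with self._lock:
            if slot in self._entries:
                message, _ = self._entries[slot]
                self._entries[slot] = (message, self._clock())

    def is_busy(self, slot):
        """Read-only check: must not reset the access time (issue 1)."""
        with self._lock:
            return slot in self._entries

    def expire_stale(self):
        """Remove and return entries whose lease has run out (issue 2)."""
        now = self._clock()
        expired = []
        with self._lock:
            # Iterate over a copy so entries can be deleted safely.
            for slot, (message, last_access) in list(self._entries.items()):
                if now - last_access >= self._lease_seconds:
                    del self._entries[slot]
                    expired.append((slot, message))
        return expired
```

In this sketch a lease-checking thread would call expire_stale() periodically. Because is_busy() never updates the timestamp, polling a slot's state cannot postpone its expiry.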

Comment 2 Robert Rati 2011-07-01 20:39:31 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: When a job causes the condor_starter to exit quickly, such as a job whose program the starter is unable to execute, the low-latency daemon does not expire the low-latency job.
C: The low-latency daemon does not allow the slot running the job that should have been expired to do any more work until the daemon is restarted.
F: Fixed issues with message expiration
R: Messages are expired as expected

Comment 3 Martin Kudlej 2011-07-21 10:33:45 UTC
Tested on RHEL5.6/6.1 x x86_64/i386 with condor-low-latency-1.1-3 and it doesn't work.

Comment 5 Lubos Trilety 2011-08-02 11:29:15 UTC
Job expired after some time and slot was released.
(There is an issue with exit hook, see bug 726761.)

Tested with:
condor-low-latency-1.2-2

Tested on:
RHEL6 x86_64,i386
RHEL5 x86_64,i386

>>> VERIFIED

Comment 6 errata-xmlrpc 2011-09-07 16:43:29 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html

