Bug 718265

Summary: low-latency not expiring work
Product: Red Hat Enterprise MRG Reporter: Robert Rati <rrati>
Component: condor-low-latency Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA QA Contact: Lubos Trilety <ltrilety>
Severity: high Docs Contact:
Priority: medium    
Version: 1.3 CC: jneedle, ltrilety, matt, mkudlej
Target Milestone: 2.0.1   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: condor-low-latency-1.2-1 Doc Type: Bug Fix
Doc Text:
C: When a job causes the condor_starter to exit quickly, such as a job whose program the starter is unable to execute, the low-latency daemon does not expire the low-latency job.
C: The low-latency daemon does not allow the slot that ran the job to do any more work until the daemon has been restarted.
F: Fixed issues with message expiration.
R: Messages are expired as expected.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-09-07 16:43:29 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 723971    
Bug Blocks: 723887    

Description Robert Rati 2011-07-01 15:42:18 UTC
Description of problem:
Running the file_no_perms.py test results in the starter excepting without calling the exit hook.  Carod's lease-checking thread isn't expiring a job that isn't being updated, so carod won't allow that thread to process more work.  The only workaround is to restart carod.

Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Run carod with debug logging
2. Run file_no_perms.py
3. Watch CaroLog and see that the slot is continually thought to be doing work.

  
Actual results:


Expected results:


Additional info:

Comment 1 Robert Rati 2011-07-01 19:26:04 UTC
There were 2 issues with message expiration:
1) The messages were never being expired because the check for a slot being in use was resetting the access time.
2) Once a message was expired, it could not be removed from the work queue, which kept a slot "busy" that was actually empty.

Fixed on BZ718265-no-expiration
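
The two issues above can be illustrated with a minimal sketch. This is not the carod source: the class WorkQueue, its methods, the slot name, and the 30-second lease are all hypothetical names and values chosen for illustration. It only shows the general pattern the fix implies, assuming a per-slot access time: the in-use check must be read-only, and expiring a message must also remove it from the queue.

```python
import threading
import time


class WorkQueue:
    """Hypothetical stand-in for carod's work tracking; not the real carod code."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self._lease = lease_seconds
        self._clock = clock  # injectable so the example can use a fake clock
        self._lock = threading.Lock()
        self._work = {}  # slot name -> (message, last access time)

    def add(self, slot, message):
        with self._lock:
            self._work[slot] = (message, self._clock())

    def touch(self, slot):
        """Renew the lease; only a real update from the job should call this."""
        with self._lock:
            if slot in self._work:
                message, _ = self._work[slot]
                self._work[slot] = (message, self._clock())

    def in_use(self, slot):
        """Read-only check. Issue 1 arose when a check like this reset the access time."""
        with self._lock:
            return slot in self._work

    def expire(self):
        """Remove and return stale work. Issue 2: removal must actually succeed."""
        now = self._clock()
        with self._lock:
            stale = [s for s, (_, t) in self._work.items() if now - t >= self._lease]
            return {s: self._work.pop(s)[0] for s in stale}


# Usage with a fake clock (all values are illustrative).
now = [0.0]
queue = WorkQueue(lease_seconds=30, clock=lambda: now[0])
queue.add("slot1", "job-msg")
now[0] = 20.0
busy = queue.in_use("slot1")  # polling the slot does not renew its lease
now[0] = 31.0
expired = queue.expire()      # the lease is measured from time 0, so it has run out
```

Had in_use() refreshed the timestamp at time 20, the lease would never appear to run out while the slot was being polled, which matches the first symptom described above. Because expire() pops the entry, the slot is reported as free afterwards.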

Comment 2 Robert Rati 2011-07-01 20:39:31 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: When a job causes the condor_starter to exit quickly, such as a job whose program the starter is unable to execute, the low-latency daemon does not expire the low-latency job.
C: The low-latency daemon does not allow the slot that ran the job to do any more work until the daemon has been restarted.
F: Fixed issues with message expiration
R: Messages are expired as expected

Comment 3 Martin Kudlej 2011-07-21 10:33:45 UTC
Tested on RHEL 5.6/6.1, x86_64/i386, with condor-low-latency-1.1-3 and it doesn't work.

Comment 5 Lubos Trilety 2011-08-02 11:29:15 UTC
The job expired after some time and the slot was released.
(There is an issue with exit hook, see bug 726761.)

Tested with:
condor-low-latency-1.2-2

Tested on:
RHEL6 x86_64,i386
RHEL5 x86_64,i386

>>> VERIFIED

Comment 6 errata-xmlrpc 2011-09-07 16:43:29 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1249.html