Bug 620455

Summary: condor_rm - could not remove all jobs
Product: Red Hat Enterprise MRG Reporter: Lubos Trilety <ltrilety>
Component: condor Assignee: Robert Rati <rrati>
Status: CLOSED ERRATA QA Contact: Tomas Rusnak <trusnak>
Severity: low Docs Contact:
Priority: low    
Version: 1.0 CC: iboverma, ltoscano, matt, trusnak
Target Milestone: 2.0   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: condor-7.5.6-0.1 Doc Type: Bug Fix
Doc Text:
C: Executing 'condor_rm -all' when there are no jobs in the queue
C: An error message is printed
F: The condor_rm tool better understands when there are no jobs in the queue
R: The condor_rm command now returns a different message (no jobs in queue) and a successful return code
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-06-23 15:41:38 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 693778    

Description Lubos Trilety 2010-08-02 15:05:25 UTC
Description of problem:
When there are no jobs in condor and the command 'condor_rm -all' is run, it ends with the error message 'Could not remove all jobs.' and returns 1 as its exit value.

Version-Release number of selected component (if applicable):
condor-7.4.4-0.4

How reproducible:
100%

Steps to Reproduce:
1. remove all jobs from condor
2. run 'condor_rm -all'
  
Actual results:
condor_rm prints error message 'Could not remove all jobs.' and returns 1

Expected results:
condor_rm ends successfully without an error, or at least prints a more precise error message.
The current message is misleading: it suggests that there are still some jobs in condor which cannot be removed.

Additional info:

Comment 1 Matthew Farrellee 2010-08-02 15:23:33 UTC
<ltrilety> I found only this line 'actOnJobs: didn't do any work, aborting' in SchedLog

Comment 2 Matthew Farrellee 2010-08-02 15:29:07 UTC
With SCHEDD_DEBUG = D_FULLDEBUG -

11:27:07am $ condor_q                              
-- Submitter: localhost.localdomain : <127.0.0.1:53683> : localhost.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
0 jobs; 0 idle, 0 running, 0 held

11:27:10am $ condor_rm -a
Could not remove all jobs.

11:27:13am $ condor_rm -const TRUE
Couldn't find/remove all jobs matching constraint (TRUE)

11:27:17am $ condor_rm -const FALSE
Couldn't find/remove all jobs matching constraint (FALSE)

11:27:19am $ condor_rm -const 1!=1 
Couldn't find/remove all jobs matching constraint (1!=1)

11:27:25am $ condor_rm -const 1==1
Couldn't find/remove all jobs matching constraint (1==1)

11:27:28am $ grep actOnJobs /var/log/condor/SchedLog 
08/02 11:27:13 actOnJobs: didn't do any work, aborting
08/02 11:27:17 actOnJobs: didn't do any work, aborting
08/02 11:27:19 actOnJobs: didn't do any work, aborting
08/02 11:27:25 actOnJobs: didn't do any work, aborting
08/02 11:27:28 actOnJobs: didn't do any work, aborting

Comment 3 Matthew Farrellee 2010-08-02 15:43:10 UTC
schedd.cpp -

...
		// Set a single attribute which says if the action succeeded
		// on at least one job or if it was a total failure
	response_ad->Assign( ATTR_ACTION_RESULT, num_matches ? 1:0 );
...
	if( num_matches == 0 ) {
			// We didn't do anything, so we want to bail out now...
		dprintf( D_FULLDEBUG, 
				 "actOnJobs: didn't do any work, aborting\n" );
		if( needs_transaction ) {
			AbortTransaction();
		}
		unsetQSock();
		return FALSE;
	}
...

Comment 4 Matthew Farrellee 2010-08-02 15:45:35 UTC
rm.cpp -

 -all is implemented with the constraint: ClusterId >= 0

...
		int result = FALSE;
		if( !ad->LookupInteger(ATTR_ACTION_RESULT, result) || !result ) {
			had_error = true;
			rval = false;
		}
...
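
Taken together, the schedd.cpp and rm.cpp excerpts explain the symptom: an empty queue gives num_matches == 0, the schedd reports that as a result of 0, and rm treats 0 as an error. A minimal sketch of that interaction follows. It is a simplified model, not the Condor source: the Ad type, both function names, and the attribute string behind kActionResult are assumptions made for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a ClassAd: integer attributes keyed by name.
using Ad = std::map<std::string, int>;

// Assumed attribute name; the real value of ATTR_ACTION_RESULT is not shown in this report.
const std::string kActionResult = "ActionResult";

// Models the schedd side: the result is 1 only if at least one job matched.
Ad scheddReply(int num_matches) {
    Ad ad;
    ad[kActionResult] = num_matches ? 1 : 0;
    return ad;
}

// Models the rm side: a missing or zero result is treated as an error.
bool rmReportsSuccess(const Ad& ad) {
    Ad::const_iterator it = ad.find(kActionResult);
    return it != ad.end() && it->second != 0;
}
```

In this model, rmReportsSuccess(scheddReply(0)) is false, so an empty queue cannot be told apart from a real failure to remove jobs.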

Comment 5 Matthew Farrellee 2010-08-02 15:53:14 UTC
The schedd is not returning enough information for rm to respond to the user appropriately. It is currently the case that an ATTR_ACTION_RESULT = 0 really just means that no jobs were modified, and rm could rely on that fact.

A proper solution is to enhance the information the schedd sends to rm with the number of changed jobs. A downside to this is a wire protocol change, meaning a new rm will need backward compatibility to deal with an older schedd, and the user-friendly nature of rm will be dictated by the version of the schedd it is interacting with.
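
A rough sketch of what the backward-compatible handling might look like on the rm side, assuming the schedd were extended to send a match count. The "JobsMatched" attribute, the Outcome enum, and both functions are hypothetical and exist only for this illustration. The "ActionResult" string is likewise an assumed stand-in for ATTR_ACTION_RESULT.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the reply ClassAd: integer attributes keyed by name.
using Ad = std::map<std::string, int>;

enum class Outcome { Removed, NothingMatched, Unknown };

// Mirrors the shape of a LookupInteger call: false when the attribute is absent.
bool lookupInteger(const Ad& ad, const std::string& name, int& value) {
    Ad::const_iterator it = ad.find(name);
    if (it == ad.end()) {
        return false;
    }
    value = it->second;
    return true;
}

Outcome classifyReply(const Ad& reply) {
    int result = 0;
    if (!lookupInteger(reply, "ActionResult", result)) {
        return Outcome::Unknown;          // malformed reply
    }
    if (result != 0) {
        return Outcome::Removed;          // at least one job was acted on
    }
    int matched = 0;
    // "JobsMatched" is a hypothetical attribute that only a newer schedd would send.
    if (lookupInteger(reply, "JobsMatched", matched) && matched == 0) {
        return Outcome::NothingMatched;   // safe to report "no jobs" and succeed
    }
    return Outcome::Unknown;              // older schedd: fall back to the old error
}
```

The last branch is the compatibility cost described above: against an older schedd the count is absent, so rm can only fall back to the existing message.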

Comment 6 Matthew Farrellee 2011-02-01 19:41:36 UTC
Actually, rm.cpp has doWorkByConstraint, which has the option to provide more useful information. It even has a comment from 2002-03-29 (3930f2d2) stating,

// For now, just return true if the constraint worked on at least
// one job, false if not.  Someday, we can fix up the tool to take
// advantage of all the slick info the schedd gives us back about this
// request.

Comment 7 Robert Rati 2011-02-21 18:34:20 UTC
The condor_rm command now returns a different message (no jobs in queue) and a success exit code, rather than an error message and exit code 1, when run against a schedd with no jobs.

It should be noted that the schedd updates its internal statistics on number of jobs run every 10 seconds or so, and it is possible to receive the old error message during this time.  
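
To illustrate the timing caveat, here is a small model of a job count that is only refreshed on an interval. This is an assumption about the mechanism, not the actual schedd code: the CachedJobCount struct and messageFor function are hypothetical, and the two message strings are taken from the output quoted in this report.

```cpp
#include <string>

// Hypothetical model of a job counter that is refreshed on a timer
// instead of on every queue change.
struct CachedJobCount {
    int cached;        // last value that was published
    long lastRefresh;  // time of the last refresh, in seconds
    long interval;     // refresh period, in seconds

    // Returns the published count, refreshing it only if the interval has elapsed.
    int read(long now, int actualJobs) {
        if (now - lastRefresh >= interval) {
            cached = actualJobs;
            lastRefresh = now;
        }
        return cached;
    }
};

// Picks the message based on the (possibly stale) published count.
std::string messageFor(int publishedJobs) {
    return publishedJobs == 0 ? "There are no jobs in the queue"
                              : "Could not remove all jobs.";
}
```

With a 10 second interval, a read 5 seconds after the last job left the queue still sees the stale count and gets the old message. A read at 10 seconds or later gets the new one.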

Fixed on branch V7_5-BZ620455-rm_all-result-cleanup

Comment 8 Robert Rati 2011-03-15 17:18:39 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: Executing 'condor_rm -all' when there are no jobs in the queue
C: An error message is printed
F: The condor_rm tool better understands when there are no jobs in the queue
R: The condor_rm command now returns a different message (no jobs in queue) and a successful return code

Comment 10 Tomas Rusnak 2011-05-04 11:21:32 UTC
Reproduced on RHEL5,x86_64:

$CondorVersion: 7.4.5 Feb  4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

# condor_rm -all
Could not remove all jobs.
# echo $?
1

Retested over current version on all supported platforms x86,x86_64/RHEL5,RHEL6:

condor-7.6.1-0.4

# condor_rm -all
condor_rm:0:There are no jobs in the queue
# echo $?
0

Running 'condor_rm -all' against a queue with no submitted jobs now returns no error, prints a clearer informational message, and exits with return code 0.

>>> VERIFIED

Comment 11 errata-xmlrpc 2011-06-23 15:41:38 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html