Bug 620455 - condor_rm - could not remove all jobs
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 1.0
Hardware: All Linux
Priority: low
Severity: low
Target Milestone: 2.0
Assigned To: Robert Rati
QA Contact: Tomas Rusnak
Depends On:
Blocks: 693778
Reported: 2010-08-02 11:05 EDT by Lubos Trilety
Modified: 2011-06-23 11:41 EDT (History)
Fixed In Version: condor-7.5.6-0.1
Doc Type: Bug Fix
Doc Text:
C: Executing 'condor_rm -all' when there are no jobs in the queue
C: An error message is printed
F: The condor_rm tool better understands when there are no jobs in the queue
R: The condor_rm command now returns a different message (no jobs in queue) and a successful return code
Last Closed: 2011-06-23 11:41:38 EDT

Description Lubos Trilety 2010-08-02 11:05:25 EDT
Description of problem:
When the queue contains no jobs and the command 'condor_rm -all' is run, it prints the error message 'Could not remove all jobs.' and exits with status 1.

Version-Release number of selected component (if applicable):
condor-7.4.4-0.4

How reproducible:
100%

Steps to Reproduce:
1. Remove all jobs from the condor queue
2. Run 'condor_rm -all'
  
Actual results:
condor_rm prints the error message 'Could not remove all jobs.' and returns 1

Expected results:
condor_rm exits successfully without an error, or at least prints a more precise error message.
The current message is misleading: it suggests that there are still jobs in condor that cannot be removed.

Additional info:
Comment 1 Matthew Farrellee 2010-08-02 11:23:33 EDT
<ltrilety> I found only this line 'actOnJobs: didn't do any work, aborting' in SchedLog
Comment 2 Matthew Farrellee 2010-08-02 11:29:07 EDT
With SCHEDD_DEBUG = D_FULLDEBUG -

11:27:07am $ condor_q                              
-- Submitter: localhost.localdomain : <127.0.0.1:53683> : localhost.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
0 jobs; 0 idle, 0 running, 0 held

11:27:10am $ condor_rm -a
Could not remove all jobs.

11:27:13am $ condor_rm -const TRUE
Couldn't find/remove all jobs matching constraint (TRUE)

11:27:17am $ condor_rm -const FALSE
Couldn't find/remove all jobs matching constraint (FALSE)

11:27:19am $ condor_rm -const 1!=1 
Couldn't find/remove all jobs matching constraint (1!=1)

11:27:25am $ condor_rm -const 1==1
Couldn't find/remove all jobs matching constraint (1==1)

11:27:28am $ grep actOnJobs /var/log/condor/SchedLog 
08/02 11:27:13 actOnJobs: didn't do any work, aborting
08/02 11:27:17 actOnJobs: didn't do any work, aborting
08/02 11:27:19 actOnJobs: didn't do any work, aborting
08/02 11:27:25 actOnJobs: didn't do any work, aborting
08/02 11:27:28 actOnJobs: didn't do any work, aborting
Comment 3 Matthew Farrellee 2010-08-02 11:43:10 EDT
schedd.cpp -

...
		// Set a single attribute which says if the action succeeded
		// on at least one job or if it was a total failure
	response_ad->Assign( ATTR_ACTION_RESULT, num_matches ? 1:0 );
...
	if( num_matches == 0 ) {
			// We didn't do anything, so we want to bail out now...
		dprintf( D_FULLDEBUG, 
				 "actOnJobs: didn't do any work, aborting\n" );
		if( needs_transaction ) {
			AbortTransaction();
		}
		unsetQSock();
		return FALSE;
	}
...
Comment 4 Matthew Farrellee 2010-08-02 11:45:35 EDT
rm.cpp -

 -all is implemented with the constraint: ClusterId >= 0

...
		int result = FALSE;
		if( !ad->LookupInteger(ATTR_ACTION_RESULT, result) || !result ) {
			had_error = true;
			rval = false;
		}
...
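
Taken together, the two excerpts explain the misleading message. The following is only a minimal self-contained model of that interaction, not the actual Condor code; `scheddActionResult` and `rmReportsError` are hypothetical names standing in for the logic quoted from schedd.cpp and rm.cpp:

```cpp
// Hypothetical stand-in for the schedd side: the reply carries a single
// flag that is 1 if at least one job matched and 0 otherwise.
int scheddActionResult(int num_matches) {
    return num_matches ? 1 : 0;
}

// Hypothetical stand-in for the rm side: a missing attribute or a zero
// result is treated as an error, so "nothing matched" and "the action
// failed" are indistinguishable to the tool.
bool rmReportsError(bool has_result, int result) {
    return !has_result || !result;
}
```

With an empty queue no job can match `ClusterId >= 0`, so `rmReportsError(true, scheddActionResult(0))` is true and the tool reports a failure even though nothing went wrong.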
Comment 5 Matthew Farrellee 2010-08-02 11:53:14 EDT
The schedd is not returning enough information for rm to respond to the user appropriately. Currently, ATTR_ACTION_RESULT = 0 really just means that no jobs were modified, and rm could rely on that fact.

A proper solution is to enhance the information the schedd sends to rm with the number of changed jobs. A downside to this is a wire protocol change, meaning a new rm will need backward compatibility to deal with an older schedd, and the user-friendly nature of rm will be dictated by the version of the schedd it is interacting with.
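
The backward-compatibility concern can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposal for the real wire format: the reply is modeled as a plain name-to-integer map, and the attribute names "result" and "num_changed" as well as `interpretReply` are hypothetical.

```cpp
#include <map>
#include <string>

// Simplified stand-in for a reply ad: attribute name -> integer value.
using ReplyAd = std::map<std::string, int>;

enum class RmOutcome { Removed, NothingToRemove, Failed };

// Assumption: an older schedd sends only "result", while a newer one
// also sends "num_changed" (both names are made up for this sketch).
RmOutcome interpretReply(const ReplyAd& ad) {
    auto count = ad.find("num_changed");
    if (count != ad.end()) {
        // Newer schedd: the count tells an empty match apart from real work.
        return count->second > 0 ? RmOutcome::Removed
                                 : RmOutcome::NothingToRemove;
    }
    // Older schedd: fall back to the single flag, which cannot tell
    // "no jobs matched" apart from a failure.
    auto result = ad.find("result");
    if (result != ad.end() && result->second != 0) {
        return RmOutcome::Removed;
    }
    return RmOutcome::Failed;
}
```

In this model the same rm binary gives the friendlier answer only when the schedd supplies the extra attribute, which is the version dependence described above.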
Comment 6 Matthew Farrellee 2011-02-01 14:41:36 EST
Actually, rm.cpp has doWorkByConstraint, which has the option to provide more useful information. It even has a comment from 2002-03-29 (3930f2d2) stating,

// For now, just return true if the constraint worked on at least
// one job, false if not.  Someday, we can fix up the tool to take
// advantage of all the slick info the schedd gives us back about this
// request.
Comment 7 Robert Rati 2011-02-21 13:34:20 EST
When run against a schedd with no jobs, the condor_rm command now prints a different message (no jobs in queue) and returns success, rather than printing an error message and returning 1.

It should be noted that the schedd updates its internal statistics on number of jobs run every 10 seconds or so, and it is possible to receive the old error message during this time.  

Fixed on branch V7_5-BZ620455-rm_all-result-cleanup
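
A rough sketch of the behavior described in this comment, assuming the tool combines the action result with a job count reported by the schedd. That mechanism is an assumption inferred from the note about statistics; `RmReport`, `reportRemoveAll`, and both parameters are hypothetical, and only the two message strings come from this report.

```cpp
#include <string>

struct RmReport {
    std::string message;  // text shown to the user (success text omitted here)
    int exit_code;        // process exit status
};

// action_succeeded: whether the schedd acted on at least one job.
// reported_job_count: the schedd's job count, which may lag behind the
// real queue contents until the statistics are refreshed.
RmReport reportRemoveAll(bool action_succeeded, int reported_job_count) {
    if (action_succeeded) {
        return {"", 0};
    }
    if (reported_job_count == 0) {
        // Nothing to remove is not an error.
        return {"There are no jobs in the queue", 0};
    }
    // Either a real failure or a stale, non-zero job count.
    return {"Could not remove all jobs.", 1};
}
```

Under this model, a queue that was emptied a few seconds earlier may still be reported with a non-zero count, which would produce the old error message until the next statistics update.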
Comment 8 Robert Rati 2011-03-15 13:18:39 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: Executing 'condor_rm -all' when there are no jobs in the queue
C: An error message is printed
F: The condor_rm tool better understands when there are no jobs in the queue
R: The condor_rm command now returns a different message (no jobs in queue) and a successful return code
Comment 10 Tomas Rusnak 2011-05-04 07:21:32 EDT
Reproduced on RHEL5,x86_64:

$CondorVersion: 7.4.5 Feb  4 2011 BuildID: RH-7.4.5-0.8.el5 PRE-RELEASE $
$CondorPlatform: X86_64-LINUX_RHEL5 $

# condor_rm -all
Could not remove all jobs.
# echo $?
1

Retested with the current version on all supported platforms (x86 and x86_64, RHEL5 and RHEL6):

condor-7.6.1-0.4

# condor_rm -all
condor_rm:0:There are no jobs in the queue
# echo $?
0

Running 'condor_rm -all' against a queue with no submitted jobs now returns no error, prints a more informative message, and exits with return code 0.

>>> VERIFIED
Comment 11 errata-xmlrpc 2011-06-23 11:41:38 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0889.html
