Bug 810519 - Wrong deltacloud hold jobs are not removed
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-deltacloud-gahp
Version: Development
Hardware: x86_64 Linux
Priority: high
Severity: high
Target Milestone: 2.2
Target Release: ---
Assigned To: Timothy St. Clair
Luigi Toscano
done
:
Depends On: 812349
Blocks: 803895 828434
Reported: 2012-04-06 10:17 EDT by Luigi Toscano
Modified: 2012-09-19 14:03 EDT

See Also:
Fixed In Version: condor-7.6.5-0.15
Doc Type: Bug Fix
Doc Text:
Cause: A job with invalid input parameters, which was therefore never run, was placed on hold and then removed.
Consequence: The job transitioned back to the hold state and condor_rm failed.
Fix: The state machine was fixed so that a failed instance_find results in a clean exit condition rather than a hold transition.
Result: condor_rm behaves correctly when a job has been placed on hold due to invalid input parameters.
Last Closed: 2012-09-19 13:43:26 EDT
Type: Bug

Attachments
Condor and deltacloud-core logs. (19.75 KB, application/x-gzip)
2012-04-16 05:43 EDT, Daniel Horák
Condor and deltacloud-core logs. (34.98 KB, application/x-gzip)
2012-04-17 03:30 EDT, Daniel Horák
Description Luigi Toscano 2012-04-06 10:17:49 EDT
Comment 1 Luigi Toscano 2012-04-06 10:25:14 EDT
(please ignore the description)

Description of problem:
A deltacloud job which is held because of an error is correctly moved to the H state. When it is removed with condor_rm it moves to the X state, but after a while the job moves back to the H state.

Version-Release number of selected component (if applicable):
condor-7.6.7-0.9
condor-deltacloud-gahp-7.6.7-0.9

How reproducible:
Configure a job which requests a nonexistent image and follow the above-mentioned steps.
Comment 2 Timothy St. Clair 2012-04-10 10:43:23 EDT
Please attach logs.
Comment 3 Timothy St. Clair 2012-04-10 10:45:19 EDT
Also, please include the hold message from the second hold.
Comment 4 Daniel Horák 2012-04-16 05:40:46 EDT
Created attachment 577675
Condor and deltacloud-core logs.

# rpm -qa | grep condor
  condor-7.6.7-0.10.el6.x86_64
  condor-classads-7.6.7-0.10.el6.x86_64
  condor-deltacloud-gahp-7.6.7-0.10.el6.x86_64
# rpm -qa | grep deltacloud
  deltacloud-core-rhevm-0.5.0-5.el6.noarch
  deltacloud-core-0.5.0-5.el6.noarch
  libdeltacloud-0.9-1.el6.x86_64
  condor-deltacloud-gahp-7.6.7-0.10.el6.x86_64

$ cat deltacloud.job 
  universe = grid
  grid_resource = deltacloud http://HOSTNAME:3002/api
  executable = rhevm_job
  deltacloud_username = USER
  deltacloud_password_file = rhevm_user_pwd

  deltacloud_image_id = abecedee-c3a9-4dff-8ea5-1e355c1abece
  deltacloud_realm_id = efghijkl-cff6-11e0-9267-5254001abece

  log = job_deltacloud_basic.log
  notification = NEVER
  queue

$ condor_submit deltacloud.job 
  Submitting job(s).
  1 job(s) submitted to cluster 1.

$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
  ---
  001.000:  Request has not yet been considered by the matchmaker.

$ condor_q
  -- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test            4/16 11:03   0+00:00:00 H  0   0.0  rhevm_job         

  1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
  ---
  001.000:  Request is held.

  Hold reason: Create_Instance_Failure: Resource not found

$ condor_rm 1.0
  Job 1.0 marked for removal

$ condor_q
  -- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test            4/16 11:03   0+00:00:00 X  0   0.0  rhevm_job         

  1 jobs; 0 completed, 1 removed, 0 idle, 0 running, 0 held, 0 suspended
$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
  ---
  001.000:  Request is removed.

$ condor_q
  -- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
     1.0   test            4/16 11:03   0+00:00:00 H  0   0.0  rhevm_job         

  1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
  ---
  001.000:  Request is held.

  Hold reason: Create_Instance_Failure: Resource not found

$ condor_rm 1.0
  Job 1.0 marked for removal

$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
  ---
  001.000:  Request is removed.

$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
  ---
  001.000:  Request is held.

  Hold reason: Create_Instance_Failure: Resource not found
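
The X-to-H flip shown in the transcript above could be detected automatically by comparing successive `condor_q` snapshots. The following is a minimal sketch, assuming the plain `condor_q` column layout shown in this comment; the helper names `job_states` and `returned_to_hold` are hypothetical and not part of Condor.

```python
import re

# Matches a job row of plain `condor_q` output as shown in this report, e.g.
#   "     1.0   test            4/16 11:03   0+00:00:00 H  0   0.0  rhevm_job"
# Assumption: the ST column is a single uppercase letter.
JOB_ROW = re.compile(
    r"^\s*(?P<job_id>\d+\.\d+)\s+(?P<owner>\S+)\s+"
    r"(?P<date>\d+/\d+)\s+(?P<time>\d+:\d+)\s+"
    r"(?P<run_time>\d+\+\d+:\d+:\d+)\s+(?P<status>[A-Z])\s"
)


def job_states(condor_q_output):
    """Return {job_id: status letter} for every job row in the output."""
    states = {}
    for line in condor_q_output.splitlines():
        match = JOB_ROW.match(line)
        if match:
            states[match.group("job_id")] = match.group("status")
    return states


def returned_to_hold(snapshots, job_id):
    """True if the job was seen as removed (X) and later as held (H)."""
    seen_removed = False
    for output in snapshots:
        status = job_states(output).get(job_id)
        if status == "X":
            seen_removed = True
        elif status == "H" and seen_removed:
            return True
    return False
```

Feeding the three `condor_q` listings above (H, then X, then H again) to `returned_to_hold` for job 1.0 would flag the behavior reported in this bug.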
Comment 5 Daniel Horák 2012-04-16 05:43:26 EDT
Created attachment 577676
Condor and deltacloud-core logs.
Comment 7 Daniel Horák 2012-04-17 03:30:06 EDT
Created attachment 577928
Condor and deltacloud-core logs.

I'm sorry that I forgot to attach the D_FULLDEBUG logs.
With condor_rm -forcex the job is removed and never returns to the held state.

$ condor_submit deltacloud.job 
  Submitting job(s).
  1 job(s) submitted to cluster 1.

$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
  ---
  001.000:  Request has not yet been considered by the matchmaker.

$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
  ---
  001.000:  Request is held.

  Hold reason: Create_Instance_Failure: Resource not found

$ condor_rm 1.0
  Job 1.0 marked for removal

$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
  ---
  001.000:  Request is removed.

$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
  ---
  001.000:  Request is held.

  Hold reason: Create_Instance_Failure: Resource not found

$ condor_rm 1.0
  Job 1.0 marked for removal

$ condor_q -bet
  -- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
  ---
  001.000:  Request is removed.

$ condor_rm -forcex 1.0
  Job 1.0 removed locally (remote state unknown)

$ condor_q
  -- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
   ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

  0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

$ condor_history 
   ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD            
     1.0   test            4/17 08:37   0+00:00:00 X   ???        rhevm_job
Comment 8 Timothy St. Clair 2012-04-19 22:19:40 EDT
Fixed upstream
Comment 9 Timothy St. Clair 2012-04-25 09:14:17 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: A job with invalid input parameters, which was therefore never run, was placed on hold and then removed.
Consequence: The job transitioned back to the hold state and condor_rm failed.
Fix: The state machine was fixed so that a failed instance_find results in a clean exit condition rather than a hold transition.
Result: condor_rm behaves correctly when a job has been placed on hold due to invalid input parameters.
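
The transition change described in this note can be illustrated with a small model. This is only a sketch of the idea, not the actual gridmanager code: the `State` enum, the `remove_job` function, and the `fixed` flag are hypothetical names introduced for illustration.

```python
from enum import Enum


class State(Enum):
    HELD = "H"       # job is on hold (e.g. Create_Instance_Failure)
    REMOVED = "X"    # condor_rm has marked the job for removal
    GONE = "gone"    # job has left the queue


def remove_job(instance_exists, fixed):
    """Return the states a held job passes through after condor_rm.

    instance_exists: whether looking up the remote instance succeeds.
    fixed: False models the behavior reported in this bug,
           True models the behavior after the fix.
    """
    trace = [State.HELD, State.REMOVED]
    if instance_exists:
        # A real instance was found; once it is cleaned up the job can leave.
        trace.append(State.GONE)
    elif fixed:
        # Failed lookup means nothing to clean up: clean exit.
        trace.append(State.GONE)
    else:
        # Pre-fix: the failed lookup was treated as an error and re-held the job.
        trace.append(State.HELD)
    return trace
```

For a job whose image never existed, `remove_job(False, fixed=False)` ends in `State.HELD`, matching the reported loop, while `remove_job(False, fixed=True)` ends in `State.GONE`.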
Comment 11 Luigi Toscano 2012-07-17 14:59:42 EDT
Jobs in the hold state (because a wrong image name was specified) are immediately removed from the queue when condor_rm is invoked on them.

Verified on RHEL6.3/x86_64
condor-7.6.5-0.18
condor-classads-7.6.5-0.18
condor-deltacloud-gahp-7.6.5-0.18
Comment 14 errata-xmlrpc 2012-09-19 13:43:26 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1278.html