Description of problem:
A deltacloud job that is held because of an error is correctly moved to the H state. When it is removed with condor_rm, it moves to the X state, but after a while the job moves back to the H state.

Version-Release number of selected component (if applicable):
condor-7.6.7-0.9
condor-deltacloud-gahp-7.6.7-0.9

How reproducible:
Configure a job that requests a nonexistent image and follow the steps described above.
Please attach logs.
Also, please include the hold message from the second hold.
Created attachment 577675 [details]
Condor and deltacloud-core logs.

# rpm -qa | grep condor
condor-7.6.7-0.10.el6.x86_64
condor-classads-7.6.7-0.10.el6.x86_64
condor-deltacloud-gahp-7.6.7-0.10.el6.x86_64

# rpm -qa | grep deltacloud
deltacloud-core-rhevm-0.5.0-5.el6.noarch
deltacloud-core-0.5.0-5.el6.noarch
libdeltacloud-0.9-1.el6.x86_64
condor-deltacloud-gahp-7.6.7-0.10.el6.x86_64

$ cat deltacloud.job
universe = grid
grid_resource = deltacloud http://HOSTNAME:3002/api
executable = rhevm_job
deltacloud_username = USER
deltacloud_password_file = rhevm_user_pwd
deltacloud_image_id = abecedee-c3a9-4dff-8ea5-1e355c1abece
deltacloud_realm_id = efghijkl-cff6-11e0-9267-5254001abece
log = job_deltacloud_basic.log
notification = NEVER
queue

$ condor_submit deltacloud.job
Submitting job(s).
1 job(s) submitted to cluster 1.

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
---
001.000: Request has not yet been considered by the matchmaker.

$ condor_q
-- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   test            4/16 11:03   0+00:00:00 H  0   0.0  rhevm_job
1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
---
001.000: Request is held.
Hold reason: Create_Instance_Failure: Resource not found

$ condor_rm 1.0
Job 1.0 marked for removal

$ condor_q
-- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   test            4/16 11:03   0+00:00:00 X  0   0.0  rhevm_job
1 jobs; 0 completed, 1 removed, 0 idle, 0 running, 0 held, 0 suspended

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
---
001.000: Request is removed.

$ condor_q
-- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   test            4/16 11:03   0+00:00:00 H  0   0.0  rhevm_job
1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
---
001.000: Request is held.
Hold reason: Create_Instance_Failure: Resource not found

$ condor_rm 1.0
Job 1.0 marked for removal

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
---
001.000: Request is removed.

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:32824> : HOSTNAME
---
001.000: Request is held.
Hold reason: Create_Instance_Failure: Resource not found
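The X-to-H flip shown in the transcript above can be checked mechanically from saved condor_q output. The following is only a rough sketch: the function names job_states and reverted_to_held are made up for illustration, and the row pattern assumes the column layout shown in this report (ID, OWNER, SUBMITTED date and time, RUN_TIME, ST), which other Condor versions or options may change.

```python
import re

# Matches a condor_q job row as shown in this report, e.g.
#   1.0   test   4/16 11:03   0+00:00:00 H  0   0.0  rhevm_job
# Assumption: columns are ID, OWNER, date, time, RUN_TIME, ST in that order.
ROW = re.compile(
    r"^\s*(?P<id>\d+\.\d+)\s+\S+\s+\S+\s+\S+\s+\S+\s+(?P<st>[A-Z])\s"
)


def job_states(condor_q_output):
    """Return {job_id: status_letter} for every job row in the output."""
    states = {}
    for line in condor_q_output.splitlines():
        match = ROW.match(line)
        if match:
            states[match.group("id")] = match.group("st")
    return states


def reverted_to_held(snapshots, job_id):
    """True if job_id was seen as X (removed) and later as H (held).

    snapshots is a list of condor_q outputs, oldest first.
    """
    seen_removed = False
    for output in snapshots:
        status = job_states(output).get(job_id)
        if status == "X":
            seen_removed = True
        elif status == "H" and seen_removed:
            return True
    return False
```

Feeding it the three condor_q outputs from the transcript (held, removed, held again) for job 1.0 flags the regression; stopping after the removed snapshot does not.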
Created attachment 577676 [details] Condor and deltacloud-core logs.
Created attachment 577928 [details]
Condor and deltacloud-core logs.

I'm sorry that I forgot to attach the D_FULLDEBUG logs. With condor_rm -forcex, the job is removed and never returns to the held state.

$ condor_submit deltacloud.job
Submitting job(s).
1 job(s) submitted to cluster 1.

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
---
001.000: Request has not yet been considered by the matchmaker.

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
---
001.000: Request is held.
Hold reason: Create_Instance_Failure: Resource not found

$ condor_rm 1.0
Job 1.0 marked for removal

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
---
001.000: Request is removed.

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
---
001.000: Request is held.
Hold reason: Create_Instance_Failure: Resource not found

$ condor_rm 1.0
Job 1.0 marked for removal

$ condor_q -bet
-- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
---
001.000: Request is removed.

$ condor_rm -forcex 1.0
Job 1.0 removed locally (remote state unknown)

$ condor_q
-- Submitter: HOSTNAME : <IP:46862> : HOSTNAME
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

$ condor_history
 ID      OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD
   1.0   test            4/17 08:37   0+00:00:00 X   ???        rhevm_job
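As a stopgap on affected builds, the observation above suggests falling back to condor_rm -forcex only once a job has been seen to come back from X to H. A minimal sketch of that decision follows; removal_command is a hypothetical helper, and the status history is assumed to be collected by the caller. Note that -forcex removes the job locally and leaves the remote state unknown, as the transcript's own message says, so it is not a substitute for the fix.

```python
def removal_command(job_id, status_history):
    """Build the condor_rm command line for job_id.

    status_history is the list of status letters observed for the job,
    oldest first (for example ["H", "X", "H"]). The -forcex flag is added
    only if a removed job (X) was later seen held again (H).
    """
    command = ["condor_rm"]
    # "XH" in the joined history means X was immediately followed by H.
    if "XH" in "".join(status_history):
        command.append("-forcex")
    command.append(job_id)
    return command
```

The returned list can be passed to subprocess.run by the caller; it is kept separate here so the decision can be checked without a running schedd.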
Fixed upstream
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: Try to hold and remove a job which had invalid input parameters and was never run.
C: The job would transition back to hold and condor_rm would fail.
F: Fix the state machine so that a failed instance_find results in a clean exit condition instead of a hold transition.
R: condor_rm behaves correctly when a job has been placed on hold due to invalid input parameters.
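The behavior change described in the technical note can be pictured with a toy model. This is not the gridmanager code: JobState, on_instance_find_failure, and remove_held_job are invented names, and the model assumes that removing a held, never-started job triggers a lookup of the remote instance (instance_find) that fails because no instance was ever created.

```python
from enum import Enum


class JobState(Enum):
    HELD = "H"
    REMOVED = "X"
    EXITED = "left queue"


def on_instance_find_failure(fixed):
    """State reached when the remote instance lookup fails.

    fixed=False models the reported behavior (transition to hold);
    fixed=True models the behavior described in the note (clean exit).
    """
    return JobState.EXITED if fixed else JobState.HELD


def remove_held_job(fixed):
    """Trace the states of a held, never-started job after condor_rm."""
    trace = [JobState.HELD, JobState.REMOVED]
    # Assumption: cleanup looks up the remote instance, which does not exist.
    trace.append(on_instance_find_failure(fixed))
    return trace
```

With fixed=False the trace ends in HELD, matching the H, X, H sequence reported in this bug; with fixed=True it ends with the job leaving the queue.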
Jobs in the hold state (because a wrong image name was specified) are immediately removed from the queue when condor_rm is invoked on them.

Verified on RHEL6.3/x86_64
condor-7.6.5-0.18
condor-classads-7.6.5-0.18
condor-deltacloud-gahp-7.6.5-0.18
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-1278.html