Bug 773680

Summary: Released job doesn't start
Product: Red Hat Enterprise MRG
Reporter: Stanislav Graf <sgraf>
Component: condor-qmf
Assignee: Pete MacKinnon <pmackinn>
Status: CLOSED ERRATA
QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: Development
CC: iboverma, jneedle, ltoscano, matt, pmackinn, tstclair
Target Milestone: 2.1.1
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: condor-7.6.5-0.12
Doc Type: Bug Fix
Doc Text:
Cause: A job is held using the Aviary or QMF job control API.
Consequence: condor_q and the Aviary and QMF API calls that check job status indicate that the job remains marked as IDLE after release, despite being restarted by the scheduler.
Fix: The condor_schedd code that represents an internal API for use by the Aviary and QMF implementations was updated to ensure that the held job's state is correctly adjusted.
Result: Once a job is held using the Aviary or QMF API, condor_q and the Aviary and QMF API calls that check job status indicate the correct job transition of HELD->IDLE->RUNNING after release.
Last Closed: 2012-02-06 18:19:14 UTC
Bug Blocks: 739658, 765607    

Description Stanislav Graf 2012-01-12 15:08:16 UTC
Description of problem:
I hit this while verifying Bug 739658, comment 15.
In MRG 2.0.x (current RHN), hold/release of a job worked.
In cumin I could do:
submit (running) hold (held) release (running)

After a clean install of MRG 2.1.x with the same configuration applied, the same job management sequence in cumin gives:
submit (running) hold (held) release (idle)
The job never gets back to running.

From the command line, the same sequence works:
[root@rhel6x ~]# condor_q | grep 5.0
   5.0   cumin           1/12 16:00   0+00:00:19 R  0   0.0  sleep 3600        
ecode=0
[root@rhel6x ~]# condor_hold 5.0
Job 5.0 held
ecode=0
[root@rhel6x ~]# condor_q | grep 5.0
   5.0   cumin           1/12 16:00   0+00:00:33 H  0   4.2  sleep 3600        
ecode=0
[root@rhel6x ~]# condor_release 5.0
Job 5.0 released
ecode=0
[root@rhel6x ~]# condor_q | grep 5.0
   5.0   cumin           1/12 16:00   0+00:01:24 R  0   4.2  sleep 3600        
ecode=0
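
For scripted checks, the status column can be read from a condor_q line instead of eyeballing grep output. The following is only a minimal sketch: the helper name job_status is hypothetical, and the column order (ID, OWNER, date, time, RUN_TIME, ST, ...) is assumed from the transcript above rather than from any condor_q specification.

```python
def job_status(condor_q_output, job_id):
    """Return the ST column for job_id, or None if the job is not listed."""
    for line in condor_q_output.splitlines():
        fields = line.split()
        # Assumed layout, taken from the transcript in this report:
        # ID OWNER DATE TIME RUN_TIME ST PRI SIZE CMD...
        if len(fields) >= 6 and fields[0] == job_id:
            return fields[5]
    return None


# Sample line copied from the transcript (job 5.0 while held).
sample = "   5.0   cumin           1/12 16:00   0+00:00:33 H  0   4.2  sleep 3600"
print(job_status(sample, "5.0"))
```

Matching the ID field exactly also avoids a pitfall of `grep 5.0`, where the unescaped dot matches any character and the pattern can hit other job IDs such as 15.0.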

This looks like a fault in the QMF plugin, because condor doesn't release the slot after the job is held, while everything works from the command line. However, a change was made in cumin at the same time that makes these operations asynchronous.

Version-Release number of selected component (if applicable):
RHEL5 i386/x86_64
cumin-0.1.5184-1.el5.noarch
condor-qmf-7.6.5-0.11.el5.x86_64
python-qpid-qmf-0.10-11.el5.x86_64
qpid-qmf-0.10-11.el5.x86_64
qpid-qmf-devel-0.10-11.el5.x86_64
ruby-qpid-qmf-0.10-11.el5.x86_64

RHEL6 i386/x86_64
cumin-0.1.5184-1.el6.noarch
condor-qmf-7.6.5-0.11.el6.x86_64
python-qpid-qmf-0.12-6.el6.x86_64
qpid-qmf-0.12-6.el6.x86_64
ruby-qpid-qmf-0.12-6.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install cumin+condor+qmf
2. Submit a job
3. Hold the job and wait until it is held
4. Release the job and wait to see whether it starts
  
Actual results:
submit (running) hold (held) release (idle) 
-> never gets to running again

Expected results:
submit (running) hold (held) release (running)

Additional info:

Comment 1 Pete MacKinnon 2012-01-13 14:34:37 UTC
The released job does actually run, but its job state has not been correctly set back to RUNNING for some reason.

The root of the problem is that some changes have affected Scheduler::holdJobRaw, resulting in a regression. The corrective action is for the QMF & Aviary layers to make an additional call to scheduler.enqueueActOnJobMyself(id,JA_HOLD_JOBS,true) after the holdJob transaction.

Comment 6 Stanislav Graf 2012-01-19 13:08:32 UTC
VERIFY QMF part:
# rpm -q condor-aviary
package condor-aviary is not installed

RHEL5 i386
condor-qmf-7.6.5-0.12.el5.i386
cumin-0.1.5184-1.el5.noarch

RHEL5 x86_64
condor-qmf-7.6.5-0.12.el5.x86_64
cumin-0.1.5184-1.el5.noarch

RHEL6 i386
condor-qmf-7.6.5-0.12.el6.i686
cumin-0.1.5184-1.el6.noarch

RHEL6 x86_64
condor-qmf-7.6.5-0.12.el6.x86_64
cumin-0.1.5184-1.el6.noarch


---CUMIN PART---
-Grid::Submission::Submit job (aaa, /bin/sleep 3600, true, /tmp)
-Grid::Submission::aaa
-Wait until job status is "Running"
-verify status also with condor_q - R
-Select job and click on "Hold"
-Wait until job status is "Held"
-verify status also with condor_q - H
-click on "Release"
-Wait until job status is "Running"
-verify status also with condor_q - R
---CONDOR PART---
-condor_hold
-Wait until job status is "Held" in cumin
-verify status also with condor_q - H
-condor_release
-Wait until job status is "Running" in cumin
-verify status also with condor_q - R
---CUMIN PART---
-click on "Remove"
-You should be now in Grid::Submission
-Wait until job disappears
-verify also with condor_q
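
The "Wait until job status is ..." steps above could be automated with a small polling loop. This is a sketch under assumptions, not part of the verification that was actually run: wait_for_status and its parameters are hypothetical names, and get_status stands for any callable that returns the current status letter (for example, one that runs condor_q and extracts the ST column).

```python
import time


def wait_for_status(get_status, wanted, timeout=60.0, interval=1.0):
    """Poll get_status() until it returns `wanted`.

    Returns True if the status was seen, False if `timeout` seconds
    passed first. get_status is a caller-supplied callable (hypothetical).
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == wanted:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a real status source: a job that goes H -> I -> R.
observed = iter(["H", "I", "R"])
reached_running = wait_for_status(lambda: next(observed), "R",
                                  timeout=5.0, interval=0.0)
print(reached_running)
```

Using a timeout matters for this bug in particular: before the fix, a released job stayed reported as idle indefinitely, so an unbounded wait would never return.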

Comment 7 Stanislav Graf 2012-01-19 14:05:37 UTC
VERIFY AVIARY part:

RHEL5 i386
condor-aviary-7.6.5-0.12.el5.i386
condor-qmf-7.6.5-0.12.el5.i386
cumin-0.1.5184-1.el5.noarch

RHEL5 x86_64
condor-aviary-7.6.5-0.12.el5.x86_64
condor-qmf-7.6.5-0.12.el5.x86_64
cumin-0.1.5184-1.el5.noarch

RHEL6 i386
condor-aviary-7.6.5-0.12.el6.i686
condor-qmf-7.6.5-0.12.el6.i686
cumin-0.1.5184-1.el6.noarch

RHEL6 x86_64
condor-aviary-7.6.5-0.12.el6.x86_64
condor-qmf-7.6.5-0.12.el6.x86_64
cumin-0.1.5184-1.el6.noarch

Add to cumin.conf:
aviary-job-servers: http://localhost:9090
aviary-query-servers: http://localhost:9091 
aviary-suds-logs: True 
log-level: debug

# grep Aviary /var/log/cumin/web.log
DEBUG AviaryOperations: suds logging on
INFO AviaryOperations: no root certificate file specified, using client validation only for ssl connections.
INFO Enabled Aviary interface for job submission and control.
INFO Enabled Aviary interface for query operations.
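
The grep check above can also be expressed as a small script that reports which of the two "Enabled Aviary interface" messages are missing. This is a minimal sketch: the function name is hypothetical, and the message strings are copied from the log excerpt above, so they are only known to match this cumin build.

```python
# Messages copied verbatim from the web.log excerpt in this comment.
REQUIRED_MESSAGES = (
    "Enabled Aviary interface for job submission and control.",
    "Enabled Aviary interface for query operations.",
)


def missing_aviary_messages(log_text):
    """Return the required messages that do not appear in log_text."""
    return [msg for msg in REQUIRED_MESSAGES if msg not in log_text]


# Sample input: the lines shown above. In a real check, log_text would be
# read from /var/log/cumin/web.log.
sample_log = """\
DEBUG AviaryOperations: suds logging on
INFO Enabled Aviary interface for job submission and control.
INFO Enabled Aviary interface for query operations.
"""
print(missing_aviary_messages(sample_log))
```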

---CUMIN PART---
-Grid::Submission::Submit job (aaa, /bin/sleep 3600, true, /tmp)
-Grid::Submission::aaa
-Wait until job status is "Running"
-verify status also with condor_q - R
-Select job and click on "Hold"
-Wait until job status is "Held"
-verify status also with condor_q - H
-click on "Release"
-Wait until job status is "Running"
-verify status also with condor_q - R
---CONDOR PART---
-condor_hold
-Wait until job status is "Held" in cumin
-verify status also with condor_q - H
-condor_release
-Wait until job status is "Running" in cumin
-verify status also with condor_q - R
---CUMIN PART---
-click on "Remove"
-Now I hit Bug 783139 (because I used the same name for the job as in QMF test)
-When I use different name, the test passes.
-You should be now in Grid::Submission
-Wait until job disappears
-verify also with condor_q

Comment 8 Stanislav Graf 2012-01-19 15:37:04 UTC
VERIFIED
comment 6
comment 7

Comment 9 Pete MacKinnon 2012-01-31 14:38:54 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: A job is held using the Aviary or QMF job control API.

Consequence: condor_q and the Aviary and QMF API calls that check job status indicate that the job remains marked as IDLE after release, despite being restarted by the scheduler.

Fix: The condor_schedd code that represents an internal API for use by the Aviary and QMF implementations was updated to ensure that the held job's state is correctly adjusted.

Result: Once a job is held using the Aviary or QMF API, condor_q and the Aviary and QMF API calls that check job status indicate the correct job transition of HELD->IDLE->RUNNING after release.
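
The expected HELD->IDLE->RUNNING transition can be checked against a series of sampled job states. The sketch below is illustrative only: the function names and the idea of sampling states into a list are assumptions, not something the fix or the test plan in this bug prescribes.

```python
EXPECTED_AFTER_HOLD = ["HELD", "IDLE", "RUNNING"]


def collapse_repeats(states):
    """Drop consecutive duplicates, e.g. H, H, I -> H, I."""
    collapsed = []
    for state in states:
        if not collapsed or collapsed[-1] != state:
            collapsed.append(state)
    return collapsed


def follows_expected_transition(observed):
    """True if the sampled states reduce to HELD -> IDLE -> RUNNING."""
    return collapse_repeats(observed) == EXPECTED_AFTER_HOLD


# Hypothetical samples: a fixed build reaches RUNNING, while the behaviour
# reported in this bug stays at IDLE after release.
fixed = ["HELD", "HELD", "IDLE", "IDLE", "RUNNING"]
regressed = ["HELD", "IDLE", "IDLE", "IDLE"]
print(follows_expected_transition(fixed), follows_expected_transition(regressed))
```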

Comment 10 errata-xmlrpc 2012-02-06 18:19:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html