Bug 773680 - Released job doesn't start
Summary: Released job doesn't start
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor-qmf
Version: Development
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: 2.1.1
Target Release: ---
Assignee: Pete MacKinnon
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks: 739658 765607
Reported: 2012-01-12 15:08 UTC by Stanislav Graf
Modified: 2012-02-07 09:56 UTC (History)

Fixed In Version: condor-7.6.5-0.12
Doc Type: Bug Fix
Doc Text:
Cause: A job was held using the Aviary or QMF job control API. Consequence: After release, condor_q and the Aviary and QMF job status calls reported that the job remained IDLE, even though the scheduler had restarted it. Fix: The condor_schedd code that provides the internal API used by the Aviary and QMF implementations was updated so that a held job's state is adjusted correctly. Result: When a job is held using the Aviary or QMF API and then released, condor_q and the Aviary and QMF job status calls report the correct transition HELD->IDLE->RUNNING.
Clone Of:
Environment:
Last Closed: 2012-02-06 18:19:14 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 783139 0 medium CLOSED Remove job using aviary isn't handled properly 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2012:0100 0 normal SHIPPED_LIVE Moderate: MRG Grid security, bug fix, and enhancement update 2012-02-06 23:15:47 UTC

Internal Links: 783139

Description Stanislav Graf 2012-01-12 15:08:16 UTC
Description of problem:
I was verifying Bug 739658, comment 15.
In MRG 2.0.x (current RHN), hold/release of a job worked.
In cumin I could do:
submit (running) hold (held) release (running)

After a clean install of MRG 2.1.x with the same configuration applied, managing jobs in cumin gives:
submit (running) hold (held) release (idle)
And the job never gets to running again.

From the command line, however, it works:
[root@rhel6x ~]# condor_q | grep 5.0
   5.0   cumin           1/12 16:00   0+00:00:19 R  0   0.0  sleep 3600        
ecode=0
[root@rhel6x ~]# condor_hold 5.0
Job 5.0 held
ecode=0
[root@rhel6x ~]# condor_q | grep 5.0
   5.0   cumin           1/12 16:00   0+00:00:33 H  0   4.2  sleep 3600        
ecode=0
[root@rhel6x ~]# condor_release 5.0
Job 5.0 released
ecode=0
[root@rhel6x ~]# condor_q | grep 5.0
   5.0   cumin           1/12 16:00   0+00:01:24 R  0   4.2  sleep 3600        
ecode=0
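
The status checks in the transcript above can be scripted. The following is a minimal Python sketch, assuming the default condor_q column layout shown in the transcript (ID, OWNER, SUBMITTED date and time, RUN_TIME, ST, ...) and an owner name without spaces. `job_status` is a hypothetical helper for illustration, not part of Condor.

```python
def job_status(condor_q_line):
    """Return the ST column (e.g. 'R', 'H', 'I') from one condor_q job line.

    Assumes the layout seen in the transcript:
    ID OWNER SUBMITTED(date time) RUN_TIME ST PRI SIZE CMD
    """
    fields = condor_q_line.split()
    if len(fields) < 6:
        raise ValueError("not a condor_q job line: %r" % condor_q_line)
    # fields: 0=ID, 1=OWNER, 2=date, 3=time, 4=RUN_TIME, 5=ST
    return fields[5]


# Lines copied from the transcript above.
running = "   5.0   cumin           1/12 16:00   0+00:00:19 R  0   0.0  sleep 3600"
held = "   5.0   cumin           1/12 16:00   0+00:00:33 H  0   4.2  sleep 3600"
print(job_status(running), job_status(held))
```

If the condor_q output format is customized, the column index would need to be adjusted.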

It looks like a fault in the QMF plugin, because condor doesn't release the slot after the job is held, whereas everything works from the command line. However, around the same time there was a change in cumin that made these operations asynchronous.

Version-Release number of selected component (if applicable):
RHEL5 i386/x86_64
cumin-0.1.5184-1.el5.noarch
condor-qmf-7.6.5-0.11.el5.x86_64
python-qpid-qmf-0.10-11.el5.x86_64
qpid-qmf-0.10-11.el5.x86_64
qpid-qmf-devel-0.10-11.el5.x86_64
ruby-qpid-qmf-0.10-11.el5.x86_64

RHEL6 i386/x86_64
cumin-0.1.5184-1.el6.noarch
condor-qmf-7.6.5-0.11.el6.x86_64
python-qpid-qmf-0.12-6.el6.x86_64
qpid-qmf-0.12-6.el6.x86_64
ruby-qpid-qmf-0.12-6.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install cumin+condor+qmf
2. Submit job
3. Hold, wait until held
4. Release, wait to see whether it starts
  
Actual results:
submit (running) hold (held) release (idle) 
-> never gets to running again

Expected results:
submit (running) hold (held) release (running)

Additional info:

Comment 1 Pete MacKinnon 2012-01-13 14:34:37 UTC
The released job does actually run but its job state has not been correctly set back to RUNNING for some reason.

The root of the problem is that there have been some changes that have affected Scheduler::holdJobRaw, resulting in a regression. The corrective action is for the QMF & Aviary layers to call an additional scheduler.enqueueActOnJobMyself(id,JA_HOLD_JOBS,true) after the holdJob transaction.

Comment 6 Stanislav Graf 2012-01-19 13:08:32 UTC
VERIFY QMF part:
# rpm -q condor-aviary
package condor-aviary is not installed

RHEL5 i386
condor-qmf-7.6.5-0.12.el5.i386
cumin-0.1.5184-1.el5.noarch

RHEL5 x86_64
condor-qmf-7.6.5-0.12.el5.x86_64
cumin-0.1.5184-1.el5.noarch

RHEL6 i386
condor-qmf-7.6.5-0.12.el6.i686
cumin-0.1.5184-1.el6.noarch

RHEL6 x86_64
condor-qmf-7.6.5-0.12.el6.x86_64
cumin-0.1.5184-1.el6.noarch


---CUMIN PART---
-Grid::Submission::Submit job (aaa, /bin/sleep 3600, true, /tmp)
-Grid::Submission::aaa
-Wait until job status is "Running"
-verify status also with condor_q - R
-Select job and click on "Hold"
-Wait until job status is "Held"
-verify status also with condor_q - H
-click on "Release"
-Wait until job status is "Running"
-verify status also with condor_q - R
---CONDOR PART---
-condor_hold
-Wait until job status is "Held" in cumin
-verify status also with condor_q - H
-condor_release
-Wait until job status is "Running" in cumin
-verify status also with condor_q - R
---CUMIN PART---
-click on "Remove"
-You should be now in Grid::Submission
-Wait until job disappears
-verify also with condor_q
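
The "Wait until job status is ..." steps above could be automated with a small polling loop. This is only a sketch: `wait_for_status` and its parameters are hypothetical names, and `get_status` stands for whatever callable reports the current one-letter job status (for example, one that runs condor_q and reads the ST column). The timeout and interval defaults are arbitrary assumptions.

```python
import time


def wait_for_status(get_status, expected, timeout=120.0, interval=5.0):
    """Poll get_status() until it returns `expected` or `timeout` seconds pass.

    get_status: zero-argument callable returning the current status letter.
    Returns True if the expected status was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demonstration with canned statuses instead of a real condor_q call.
canned = iter(["H", "I", "R"])
print(wait_for_status(lambda: next(canned), "R", timeout=5, interval=0))
```

Passing the status source as a callable keeps the loop independent of how the status is obtained, so the same check can be used for the condor_q and cumin sides of the test.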

Comment 7 Stanislav Graf 2012-01-19 14:05:37 UTC
VERIFY AVIARY part:

RHEL5 i386
condor-aviary-7.6.5-0.12.el5.i386
condor-qmf-7.6.5-0.12.el5.i386
cumin-0.1.5184-1.el5.noarch

RHEL5 x86_64
condor-aviary-7.6.5-0.12.el5.x86_64
condor-qmf-7.6.5-0.12.el5.x86_64
cumin-0.1.5184-1.el5.noarch

RHEL6 i386
condor-aviary-7.6.5-0.12.el6.i686
condor-qmf-7.6.5-0.12.el6.i686
cumin-0.1.5184-1.el6.noarch

RHEL6 x86_64
condor-aviary-7.6.5-0.12.el6.x86_64
condor-qmf-7.6.5-0.12.el6.x86_64
cumin-0.1.5184-1.el6.noarch

Add to cumin.conf:
aviary-job-servers: http://localhost:9090
aviary-query-servers: http://localhost:9091 
aviary-suds-logs: True 
log-level: debug

# grep Aviary /var/log/cumin/web.log
DEBUG AviaryOperations: suds logging on
INFO AviaryOperations: no root certificate file specified, using client validation only for ssl connections.
INFO Enabled Aviary interface for job submission and control.
INFO Enabled Aviary interface for query operations.

---CUMIN PART---
-Grid::Submission::Submit job (aaa, /bin/sleep 3600, true, /tmp)
-Grid::Submission::aaa
-Wait until job status is "Running"
-verify status also with condor_q - R
-Select job and click on "Hold"
-Wait until job status is "Held"
-verify status also with condor_q - H
-click on "Release"
-Wait until job status is "Running"
-verify status also with condor_q - R
---CONDOR PART---
-condor_hold
-Wait until job status is "Held" in cumin
-verify status also with condor_q - H
-condor_release
-Wait until job status is "Running" in cumin
-verify status also with condor_q - R
---CUMIN PART---
-click on "Remove"
-Now I hit Bug 783139 (because I used the same name for the job as in the QMF test)
-When I use a different name, the test passes.
-You should be now in Grid::Submission
-Wait until job disappears
-verify also with condor_q

Comment 8 Stanislav Graf 2012-01-19 15:37:04 UTC
VERIFIED
comment 6
comment 7

Comment 9 Pete MacKinnon 2012-01-31 14:38:54 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: A job was held using the Aviary or QMF job control API.

Consequence: After release, condor_q and the Aviary and QMF job status calls reported that the job remained IDLE, even though the scheduler had restarted it.

Fix: The condor_schedd code that provides the internal API used by the Aviary and QMF implementations was updated so that a held job's state is adjusted correctly.

Result: When a job is held using the Aviary or QMF API and then released, condor_q and the Aviary and QMF job status calls report the correct transition HELD->IDLE->RUNNING.
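
The HELD->IDLE->RUNNING transition described in the Result could be checked against a series of sampled job states. A minimal Python sketch, assuming the states are sampled as the one-letter codes condor_q prints (H, I, R); `follows_transitions` is a hypothetical helper, and repeated or intermediate samples are tolerated.

```python
def follows_transitions(observed, expected=("H", "I", "R")):
    """True if `expected` occurs in order within `observed`.

    Other samples may appear between the expected states, and a state
    may be sampled several times in a row.
    """
    it = iter(observed)
    # `state in it` advances the iterator until it finds `state`,
    # so each expected state must appear after the previous one.
    return all(state in it for state in expected)


fixed = ["R", "H", "H", "I", "R"]   # job returns to running after release
broken = ["R", "H", "I", "I", "I"]  # job stays idle, as reported in this bug
print(follows_transitions(fixed), follows_transitions(broken))
```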

Comment 10 errata-xmlrpc 2012-02-06 18:19:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0100.html

