Description of problem:
I was verifying Bug 739658, comment 15.

In MRG 2.0.x (current RHN), hold/release of a job was working. In cumin I can do:
submit (running)
hold (held)
release (running)

After a clean install of MRG 2.1.x with the same configuration applied, managing jobs in cumin gives:
submit (running)
hold (held)
release (idle)
and the job never gets to running again.

From the command line, however:
[root@rhel6x ~]# condor_q | grep 5.0
5.0 cumin 1/12 16:00 0+00:00:19 R 0 0.0 sleep 3600
ecode=0
[root@rhel6x ~]# condor_hold 5.0
Job 5.0 held
ecode=0
[root@rhel6x ~]# condor_q | grep 5.0
5.0 cumin 1/12 16:00 0+00:00:33 H 0 4.2 sleep 3600
ecode=0
[root@rhel6x ~]# condor_release 5.0
Job 5.0 released
ecode=0
[root@rhel6x ~]# condor_q | grep 5.0
5.0 cumin 1/12 16:00 0+00:01:24 R 0 4.2 sleep 3600
ecode=0

It looks like a fault of the qmf plugin, because condor doesn't release the slot after holding the job, whereas from the command line all is OK. At the same time, there was a change in cumin making these operations asynchronous.

Version-Release number of selected component (if applicable):
RHEL5 i386/x86_64
cumin-0.1.5184-1.el5.noarch
condor-qmf-7.6.5-0.11.el5.x86_64
python-qpid-qmf-0.10-11.el5.x86_64
qpid-qmf-0.10-11.el5.x86_64
qpid-qmf-devel-0.10-11.el5.x86_64
ruby-qpid-qmf-0.10-11.el5.x86_64

RHEL6 i386/x86_64
cumin-0.1.5184-1.el6.noarch
condor-qmf-7.6.5-0.11.el6.x86_64
python-qpid-qmf-0.12-6.el6.x86_64
qpid-qmf-0.12-6.el6.x86_64
ruby-qpid-qmf-0.12-6.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Install cumin+condor+qmf
2. Submit a job
3. Hold it and wait until it is held
4. Release it and wait to see whether it starts

Actual results:
submit (running)
hold (held)
release (idle) -> never gets to running again

Expected results:
submit (running)
hold (held)
release (running)

Additional info:
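The state reported on each path can be compared by reading the ST column of the condor_q lines above. A minimal Python sketch, assuming the default condor_q column layout (ID, OWNER, SUBMITTED date and time, RUN_TIME, ST, PRI, SIZE, CMD); the function name and the state labels are illustrative only and are not part of condor or cumin:

```python
# Map condor_q ST column letters to the state names used in this report.
STATES = {"R": "running", "H": "held", "I": "idle"}


def job_state(condor_q_line):
    """Return the state of a job from one line of default condor_q output."""
    fields = condor_q_line.split()
    # Assumed layout: ID OWNER SUBMITTED(date, time) RUN_TIME ST PRI SIZE CMD
    if len(fields) < 6:
        raise ValueError("not a condor_q job line: %r" % condor_q_line)
    return STATES.get(fields[5], fields[5])


# Lines copied from the command-line session in this report.
session = [
    "5.0 cumin 1/12 16:00 0+00:00:19 R 0 0.0 sleep 3600",
    "5.0 cumin 1/12 16:00 0+00:00:33 H 0 4.2 sleep 3600",
    "5.0 cumin 1/12 16:00 0+00:01:24 R 0 4.2 sleep 3600",
]
print([job_state(line) for line in session])
# prints ['running', 'held', 'running']
```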
The released job does actually run, but its job state has not been correctly set back to RUNNING for some reason. The root of the problem is that there have been some changes affecting Scheduler::holdJobRaw, resulting in a regression. The corrective action is for the QMF & Aviary layers to call an additional scheduler.enqueueActOnJobMyself(id,JA_HOLD_JOBS,true) after the holdJob transaction.
VERIFY QMF part:

# rpm -q condor-aviary
package condor-aviary is not installed

RHEL5 i386
condor-qmf-7.6.5-0.12.el5.i386
cumin-0.1.5184-1.el5.noarch

RHEL5 x86_64
condor-qmf-7.6.5-0.12.el5.x86_64
cumin-0.1.5184-1.el5.noarch

RHEL6 i386
condor-qmf-7.6.5-0.12.el6.i686
cumin-0.1.5184-1.el6.noarch

RHEL6 x86_64
condor-qmf-7.6.5-0.12.el6.x86_64
cumin-0.1.5184-1.el6.noarch

---CUMIN PART---
-Grid::Submission::Submit job (aaa, /bin/sleep 3600, true, /tmp)
-Grid::Submission::aaa
-Wait until job status is "Running"
-verify status also with condor_q - R
-Select job and click on "Hold"
-Wait until job status is "Held"
-verify status also with condor_q - H
-click on "Release"
-Wait until job status is "Running"
-verify status also with condor_q - R

---CONDOR PART---
-condor_hold
-Wait until job status is "Held" in cumin
-verify status also with condor_q - H
-condor_release
-Wait until job status is "Running" in cumin
-verify status also with condor_q - R

---CUMIN PART---
-click on "Remove"
-You should be now in Grid::Submission
-Wait until job disappears
-verify also with condor_q
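The "wait until job status is ..." steps above can be scripted as a polling loop. A minimal Python sketch, not the procedure actually used for this verification: it assumes condor_q is on PATH and that the job's JobStatus attribute can be read with the -format option; the helper names, timeout and interval are hypothetical.

```python
import subprocess
import time

# Condor JobStatus ClassAd codes mapped to the names shown in cumin.
JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}


def query_status(job_id):
    """Read JobStatus for one job via condor_q (assumes condor_q is on PATH)."""
    out = subprocess.check_output(
        ["condor_q", job_id, "-format", "%d", "JobStatus"]
    )
    text = out.decode().strip()
    if not text:
        return "Gone"  # job is no longer in the queue
    return JOB_STATUS.get(int(text), "Unknown")


def wait_for_status(job_id, expected, query=query_status,
                    timeout=120.0, interval=2.0, sleep=time.sleep):
    """Poll until the job reports `expected`; return False on timeout."""
    waited = 0.0
    while True:
        if query(job_id) == expected:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

With a real pool, the hold/release check would be wait_for_status("5.0", "Held") after the hold and wait_for_status("5.0", "Running") after the release. The query and sleep parameters are injectable so the loop can be exercised without a running schedd.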
VERIFY AVIARY part:

RHEL5 i386
condor-aviary-7.6.5-0.12.el5.i386
condor-qmf-7.6.5-0.12.el5.i386
cumin-0.1.5184-1.el5.noarch

RHEL5 x86_64
condor-aviary-7.6.5-0.12.el5.x86_64
condor-qmf-7.6.5-0.12.el5.x86_64
cumin-0.1.5184-1.el5.noarch

RHEL6 i386
condor-aviary-7.6.5-0.12.el6.i686
condor-qmf-7.6.5-0.12.el6.i686
cumin-0.1.5184-1.el6.noarch

RHEL6 x86_64
condor-aviary-7.6.5-0.12.el6.x86_64
condor-qmf-7.6.5-0.12.el6.x86_64
cumin-0.1.5184-1.el6.noarch

Add to cumin.conf:
aviary-job-servers: http://localhost:9090
aviary-query-servers: http://localhost:9091
aviary-suds-logs: True
log-level: debug

# grep Aviary /var/log/cumin/web.log
DEBUG AviaryOperations: suds logging on
INFO AviaryOperations: no root certificate file specified, using client validation only for ssl connections.
INFO Enabled Aviary interface for job submission and control.
INFO Enabled Aviary interface for query operations.

---CUMIN PART---
-Grid::Submission::Submit job (aaa, /bin/sleep 3600, true, /tmp)
-Grid::Submission::aaa
-Wait until job status is "Running"
-verify status also with condor_q - R
-Select job and click on "Hold"
-Wait until job status is "Held"
-verify status also with condor_q - H
-click on "Release"
-Wait until job status is "Running"
-verify status also with condor_q - R

---CONDOR PART---
-condor_hold
-Wait until job status is "Held" in cumin
-verify status also with condor_q - H
-condor_release
-Wait until job status is "Running" in cumin
-verify status also with condor_q - R

---CUMIN PART---
-click on "Remove"
-Now I hit Bug 783139 (because I used the same name for the job as in QMF test)
-When I use different name, the test passes.
-You should be now in Grid::Submission
-Wait until job disappears
-verify also with condor_q
VERIFIED: see comment 6 and comment 7.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause: Hold of a job using the Aviary or QMF job control API.
Consequence: condor_q, Aviary and QMF API call to check job status indicates that the job remains marked as IDLE after release, despite being restarted by the scheduler.
Fix: The condor_schedd code that represents an internal API for use by Aviary and QMF implementations was updated to ensure that the held job's state was correctly adjusted.
Result: Once job is held using Aviary or QMF API, condor_q, Aviary and QMF API call to check job status indicates correct job transition of HELD->IDLE->RUNNING after release.
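The transition named in the Result can be checked against a series of observed job states. A minimal Python sketch, assuming the states were sampled in time order from condor_q or one of the APIs; the function name is hypothetical and the sample data is made up for illustration:

```python
def contains_in_order(observed, expected):
    """True if the states in `expected` occur in `observed` in that order.

    Repeated samples and unrelated states in between are allowed.
    """
    remaining = iter(observed)
    # Each membership test consumes `remaining` up to the match, so the
    # next expected state must appear later in the observed series.
    return all(state in remaining for state in expected)


EXPECTED = ["HELD", "IDLE", "RUNNING"]

# Illustrative samples only: a fixed system and the reported regression.
fixed = ["RUNNING", "HELD", "HELD", "IDLE", "RUNNING"]
broken = ["RUNNING", "HELD", "IDLE", "IDLE", "IDLE"]

print(contains_in_order(fixed, EXPECTED))   # prints True
print(contains_in_order(broken, EXPECTED))  # prints False
```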
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0100.html