Bug 575784
| Summary: | improper RELEASE_CLAIM after REQUEST_CLAIM rejection | ||
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Matthew Farrellee <matt> |
| Component: | condor | Assignee: | Matthew Farrellee <matt> |
| Status: | CLOSED ERRATA | QA Contact: | MRG Quality Engineering <mrgqe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 1.2 | CC: | fnadge, mkudlej |
| Target Milestone: | 1.3 | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Previously, the negotiator could hand out a single claim multiple times. The scheduler daemon would send a RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2010-10-14 16:01:07 UTC | Type: | --- |
Description
Matthew Farrellee
2010-03-22 12:39:43 UTC
Reproduce with:
config -
SCHEDD_0 = $(SCHEDD)
SCHEDD_0_ARGS = -f -local-name 0
SCHEDD.0.SCHEDD_NAME = 0@
SCHEDD.0.SPOOL = $(SPOOL)/0
SCHEDD.0.SCHEDD_LOG = $(SCHEDD_LOG).0
SCHEDD_1 = $(SCHEDD)
SCHEDD_1_ARGS = -f -local-name 1
SCHEDD.1.SCHEDD_NAME = 1@
SCHEDD.1.SPOOL = $(SPOOL)/1
SCHEDD.1.SCHEDD_LOG = $(SCHEDD_LOG).1
NUM_CPUS = 1
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD_0, SCHEDD_1
#ALLOW_ADVERTISE_STARTD = NONE
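The per-schedd settings above follow a fixed pattern, so they could be generated rather than typed by hand. The following is a minimal sketch, assuming the same naming scheme as the config above; the function name `schedd_config` is hypothetical and not part of Condor.

```shell
# Print the config stanza for one extra schedd, following the pattern above.
# $1 = local name of the schedd (0, 1, ...). Hypothetical helper, not a Condor tool.
schedd_config() {
    n=$1
    # Single quotes keep $(SCHEDD) etc. literal so Condor expands them, not the shell.
    printf 'SCHEDD_%s = $(SCHEDD)\n' "$n"
    printf 'SCHEDD_%s_ARGS = -f -local-name %s\n' "$n" "$n"
    printf 'SCHEDD.%s.SCHEDD_NAME = %s@\n' "$n" "$n"
    printf 'SCHEDD.%s.SPOOL = $(SPOOL)/%s\n' "$n" "$n"
    printf 'SCHEDD.%s.SCHEDD_LOG = $(SCHEDD_LOG).%s\n' "$n" "$n"
}

# Example: emit the stanzas for the two schedds used in this report.
for n in 0 1; do
    schedd_config "$n"
done
```

NUM_CPUS, DAEMON_LIST and ALLOW_ADVERTISE_STARTD still have to be set by hand as shown above.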
setup -
$ mkdir /var/lib/condor/spool/0
$ chown condor:condor /var/lib/condor/spool/0
$ chmod a+rx /var/lib/condor/spool/0
$ mkdir /var/lib/condor/spool/1
$ chown condor:condor /var/lib/condor/spool/1
$ chmod a+rx /var/lib/condor/spool/1
run -
$ service condor start
$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
robin.local LINUX INTEL Unclaimed Idle 0.350 2005 0+00:00:04
Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/LINUX 1 0 0 1 0 0 0
Total 1 0 0 1 0 0 0
NOTE: Uncomment ALLOW_ADVERTISE_STARTD in config
$ condor_reconfig -collector -full
$ printf 'cmd=/bin/sleep\nargs=15m\nqueue\n' | condor_submit -name 0@
Submitting job(s).
1 job(s) submitted to cluster 1.
NOTE: Wait for job to start
$ condor_status -sched
Name Machine TotalRunningJobs TotalIdleJobs TotalHeldJobs
0@ robin.loca 1 0 0
1@ robin.loca 0 0 0
TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total 1 0 0
NOTE: The key point is that the slot ad was not updated
$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
robin.local LINUX INTEL Unclaimed Idle 0.350 2005 0+00:00:04
Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/LINUX 1 0 0 1 0 0 0
Total 1 0 0 1 0 0 0
$ printf 'cmd=/bin/sleep\nargs=15m\nqueue\n' | condor_submit -name 1@
Submitting job(s).
1 job(s) submitted to cluster 2.
$ condor_status -sched
Name Machine TotalRunningJobs TotalIdleJobs TotalHeldJobs
0@ robin.loca 1 0 0
1@ robin.loca 0 1 0
TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total 1 1 0
NOTE: Wait for negotiation cycle
$ condor_q -name 0@
-- Schedd: 0@ : <127.0.0.1:38615>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 matt 3/22 09:06 0+00:00:40 I 0 2.0 sleep 15m
1 jobs; 1 idle, 0 running, 0 held
$ condor_q -name 1@
-- Schedd: 1@ : <127.0.0.1:39685>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
2.0 matt 3/22 09:07 0+00:00:00 I 0 0.0 sleep 15m
1 jobs; 1 idle, 0 running, 0 held
$ grep "exited with status" /var/log/condor/SchedLog.0
03/22 09:07:33 Shadow pid 1331 for job 1.0 exited with status 107
In a successful run the job on 0@ will continue to run and the SchedLog.1 will show "Request was NOT accepted for claim..." without a message about RELEASE_CLAIM. condor_status -direct can be used to query the actual status of the startd and verify whether jobs are running.
Fixed in 7.4.3-0.6
Tested on RHEL 5.5/4.8 x x86_64/i386 with condor-7.4.4-0.4 and it works as expected.
--> VERIFIED
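The success criterion for SchedLog.1 (the rejection is logged, but nothing about RELEASE_CLAIM follows) could be checked with a small helper. This is a rough sketch: the function name `schedlog_ok` is hypothetical, and it assumes the log contains the literal strings "Request was NOT accepted for claim" and "RELEASE_CLAIM"; the exact wording of the RELEASE_CLAIM message is not given in this report.

```shell
# Succeed only if the given SchedLog shows the claim request being rejected
# and contains no mention of RELEASE_CLAIM.
# $1 = path to the log file. Hypothetical helper; log wording is an assumption.
schedlog_ok() {
    grep -q "Request was NOT accepted for claim" "$1" &&
        ! grep -q "RELEASE_CLAIM" "$1"
}

# Usage (path as used elsewhere in this report):
#   schedlog_ok /var/log/condor/SchedLog.1 && echo "looks fixed"
```

The check only looks at the log of the second schedd; whether the job on 0@ is still running should still be confirmed separately.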
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously,the negotiator could hand out a single claim multiple times. The scheduler daemon would send a
RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run.
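The sequence described in the note can be illustrated with a toy model. This is only a sketch of the message ordering, not Condor code: the function names, the claim id, and the "buggy"/"fixed" switch are all hypothetical, and the "fixed" branch assumes the corrected scheduler daemon simply sends no RELEASE_CLAIM after a rejected REQUEST_CLAIM.

```shell
#!/bin/sh
# Toy model of one slot with a single claim. NOT Condor code.
CLAIM_ID="claim-1"   # hypothetical claim id the negotiator hands out twice
slot_owner=""        # schedd whose job currently runs on the slot

# REQUEST_CLAIM: the startd accepts only if the slot is still free.
request_claim() {    # $1 = schedd name, $2 = claim id
    if [ "$2" = "$CLAIM_ID" ] && [ -z "$slot_owner" ]; then
        slot_owner=$1
        return 0
    fi
    return 1
}

# RELEASE_CLAIM: the startd frees the claim with this id, evicting its job,
# regardless of which schedd sent the message.
release_claim() {    # $1 = claim id
    if [ "$1" = "$CLAIM_ID" ]; then
        slot_owner=""
    fi
}

simulate() {         # $1 = "buggy" or "fixed"
    slot_owner=""
    request_claim "0@" "$CLAIM_ID"              # 0@ claims the slot, job starts
    if ! request_claim "1@" "$CLAIM_ID"; then   # same claim handed to 1@, rejected
        if [ "$1" = "buggy" ]; then
            release_claim "$CLAIM_ID"           # improper release after rejection
        fi
    fi
    if [ "$slot_owner" = "0@" ]; then
        echo "job on 0@ still running"
    else
        echo "job on 0@ evicted"
    fi
}

simulate buggy
simulate fixed
```

In the buggy case the release from 1@ carries the same claim id that 0@ is using, which is why the running job is the one that gets evicted.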
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1,2 +1 @@
-Previously,the negotiator could hand out a single claim multiple times. The scheduler daemon would send a
-RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run.
+Previously,the negotiator could hand out a single claim multiple times. The scheduler daemon would send a RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2010-0773.html