Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 575784

Summary:	improper RELEASE_CLAIM after REQUEST_CLAIM rejection
Product:	Red Hat Enterprise MRG	Reporter:	Matthew Farrellee <matt>
Component:	condor	Assignee:	Matthew Farrellee <matt>
Status:	CLOSED ERRATA	QA Contact:	MRG Quality Engineering <mrgqe-bugs>
Severity:	medium	Docs Contact:
Priority:	urgent
Version:	1.2	CC:	fnadge, mkudlej
Target Milestone:	1.3
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Previously,the negotiator could hand out a single claim multiple times. The scheduler daemon would send a RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2010-10-14 16:01:07 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Matthew Farrellee 2010-03-22 12:39:43 UTC

Description of problem:

High rate of shadow exiting with status 107 (evicted) in an environment with no preemption.

With 20K slots and multi-tier Collector setup [1] it is possible for the Negotiator to hand out a single claim multiple times. The second time the Schedd's REQUEST_CLAIM will be rejected. The Schedd will then respond with a RELEASE_CLAIM, which evicts the already running job. The second Schedd will then not try to use the claim.

The second Schedd should not send a RELEASE_CLAIM when REQUEST_CLAIM is explicitly rejected.

[1] http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors


Version-Release number of selected component (if applicable):

7.4.3-0.5


How reproducible:

100% when a claim is handed out twice, getting a claim to be handed out twice is a low probability event that only occurs naturally in large pools


Steps to Reproduce:
1. submit 10K jobs to a pool of 20K slots
2. watch SchedLog for "exited with status 107"
3. investigate reason for 107 error (see "Log trace" below)

The key piece to reproducing this is to prevent the Startd's MATCHED and CLAIMED advertisements from reaching the Collector. This may be achievable by allowing the Startd to advertise initially to the Collector, then change the Collector's security configuration to prevent further updates. Something like:

1. normal two node setup, startd machine != collector machine
2. see startd machine slot (one slot!) in condor_status
3. change ALLOW_STARTD_ADVERTISE = $(FULL_HOSTNAME) on collector machine
4. submit two jobs
...

It may be necessary to run two Schedds and submit one job to each.

Log trace:

1) negotiator matches job 170429.0 with machine 00081
2) schedd9 claims and starts job on 00081
3) * * * (time passes and the collector never gets an update about 00081's new running state)
4) negotiator matches job 175396.0 with machine 00081
5) schedd6 tries to claim 00081 but is rejected
6) schedd6 notices this fact and sends a RELEASE_CLAIM
7) 00081 accepts the release claim!
8) job 170429.0 is evicted

NegotiatorLog:

03/19 19:32:18     Request 170429.00000:
03/19 19:32:18 matchmakingAlgorithm: limit 550.000000 used 37.000000 
pieLeft 8623.000000
03/19 19:32:18 Attempting to use cached MatchList: Succeeded. (MatchList 
length: 1426, Autocluster: 0, Schedd Name: billy, 
Schedd Address: <192.168.54.238:49898>)
03/19 19:32:18       Connecting to startd slot1.org 
at <10.64.46.121:50511>
03/19 19:32:18       Sending MATCH_INFO/claim id to 
slot1.org
03/19 19:32:18       (Claim ID is"<10.64.46.121:50511>#1268865314#352#..." )
03/19 19:32:18       Sending PERMISSION, claim id, startdAd to schedd
03/19 19:32:18       Matched 170429.0 billy 
<192.168.54.238:49898> preempting none <10.64.46.121:50511> 
slot1.org
03/19 19:32:18       Notifying the accountant
03/19 19:32:18       Successfully matched with slot1.org
03/19 19:32:18     Sending SEND_JOB_INFO/eom
03/19 19:32:18     Getting reply from schedd ...
03/19 19:32:18     Got JOB_INFO command; getting classad/eom
...
03/19 19:33:50     Request 175396.00000:
03/19 19:33:50 matchmakingAlgorithm: limit 559.000000 used 138.000000 
pieLeft 8091.000000
03/19 19:33:50 Attempting to use cached MatchList: Succeeded. (MatchList 
length: 1098, Autocluster: 0, Schedd Name: billy, 
Schedd Address: <192.168.54.238:33547>)
03/19 19:33:50       Connecting to startd slot1.org 
at <10.64.46.121:50511>
03/19 19:33:50       Sending MATCH_INFO/claim id to 
slot1.org
03/19 19:33:50       (Claim ID is 
"<10.64.46.121:50511>#1268865314#352#..." )
03/19 19:33:50       Sending PERMISSION, claim id, startdAd to schedd
03/19 19:33:50       Matched 175396.0 billy 
<192.168.54.238:33547> preempting none <10.64.46.121:50511> 
slot1.org
03/19 19:33:50       Notifying the accountant
03/19 19:33:50       Successfully matched with slot1.org
03/19 19:33:50     Sending SEND_JOB_INFO/eom
03/19 19:33:50     Getting reply from schedd ...
03/19 19:33:50     Got JOB_INFO command; getting classad/eom


SchedLog:
03/19 19:32:19 (pid:20832) SECMAN: command 442 REQUEST_CLAIM to startd 
slot1.org <10.64.46.121:50511> for 
billy from TCP port 53015 (non-blocking).
03/19 19:32:19 (pid:20832) SECMAN: waiting for TCP connection to startd 
slot1.org <10.64.46.121:50511> for 
billy.
...
03/19 19:32:22 (pid:20832) Starting add_shadow_birthdate(170429.0)
03/19 19:32:22 (pid:20832) Started shadow for job 170429.0 on 
slot1.org <10.64.46.121:50511> for 
billy, (shadow pid = 11352)
...
03/19 19:32:20 (pid:20832) SECMAN: resuming command 442 REQUEST_CLAIM to 
startd slot1.org <10.64.46.121:50511> for 
billy from TCP port 53015 (non-blocking).
03/19 19:32:20 (pid:20832) SECMAN: using session 
00081:811:1269047156:332 for {<10.64.46.121:50511>,<442>}.
03/19 19:32:20 (pid:20832) SECMAN: resume, other side is $CondorVersion: 
7.4.3 Mar 16 2010 BuildID: RH-7.4.3-0.5.el5 PRE-RELEASE $, NOT 
reauthenticating.
03/19 19:32:20 (pid:20832) SECMAN: successfully enabled message 
authenticator!
03/19 19:32:20 (pid:20832) SECMAN: successfully enabled encryption!
03/19 19:32:20 (pid:20832) SECMAN: startCommand succeeded.
...
03/19 19:32:20 (pid:20832) Completed REQUEST_CLAIM to startd 
slot1.org <10.64.46.121:50511> for 
billy
...
03/19 19:32:22 (pid:20832) Starting add_shadow_birthdate(170429.0)
03/19 19:32:22 (pid:20832) Started shadow for job 170429.0 on 
slot1.org <10.64.46.121:50511> for 
billy, (shadow pid = 1132): generating 3DES key for 
session schedd:20832:1269052342:67979...
...
3/19 19:33:56 (pid:20832) Shadow pid 11352 for job 170429.0 exited with 
status 107
03/19 19:33:56 (pid:20832) SECMAN: command 443 RELEASE_CLAIM to startd 
at <10.64.46.121:50511> from UDP port 45510 (non-blocking).
03/19 19:33:56 (pid:20832) SECMAN: using session 
00081:811:1269047156:332 for {<10.64.46.121:50511>,<443>}.
03/19 19:33:56 (pid:20832) SECMAN: UDP, m_have_session == 1, 
peer_can_negotiate == 1
03/19 19:33:56 (pid:20832) SECMAN: successfully enabled message 
authenticator!
03/19 19:33:56 (pid:20832) SECMAN: successfully enabled encryption.
03/19 19:33:56 (pid:20832) SECMAN: startCommand succeeded.
03/19 19:33:56 (pid:20832) Completed RELEASE_CLAIM to startd at 
<10.64.46.121:50511>
03/19 19:33:56 (pid:20832) Match record (slot1.org 
<10.64.46.121:50511> for billy, 170429.0) deleted
03/19 19:33:56 (pid:20832) IpVerify::FillHole: open count at level 
DAEMON for 10.64.46.121 now 1
03/19 19:33:56 (pid:20832) IpVerify::FillHole: open count at level WRITE 
for 10.64.46.121 now 1
03/19 19:33:56 (pid:20832) IpVerify::FillHole: open count at level READ 
for 10.64.46.121 now 3
03/19 19:33:56 (pid:20832) IpVerify::FillHole: open count at level READ 
for 10.64.46.121 now 2


SchedLog:
03/19 19:33:53 (pid:20774) SECMAN: command 442 REQUEST_CLAIM to startd 
slot1.org <10.64.46.121:50511> for 
billy from TCP port 46658 (non-blocking).
03/19 19:33:53 (pid:20774) SECMAN: waiting for TCP connection to startd 
slot1.org <10.64.46.121:50511> for 
billy.
...
03/19 19:33:54 (pid:20774) SECMAN: resuming command 442 REQUEST_CLAIM to 
startd lot1.org <10.64.46.121:50511> for 
billy rom TCP port 46658 (non-blocking).
...
03/19 19:33:54 (pid:20774) SECMAN: resuming command 442 REQUEST_CLAIM to 
startd slot1.org <10.64.46.121:50511> for 
billy from TCP port 46658 (non-blocking).
...
03/19 19:33:55 (pid:20774) SECMAN: resuming command 442 REQUEST_CLAIM to 
startd slot1.org <10.64.46.121:50511> for 
billy from TCP port 46658 (non-blocking).
03/19 19:33:55 (pid:20774) SECMAN: added session 
00081:811:1269052434:371 to cache for 86400 seconds (3600s lease).
03/19 19:33:55 (pid:20774) SECMAN: startCommand succeeded.
...
03/19 19:33:55 (pid:20774) Request was NOT accepted for claim 
slot1.org <10.64.46.121:50511> for 
billy 175396.0
03/19 19:33:55 (pid:20774) Completed REQUEST_CLAIM to startd 
slot1.org <10.64.46.121:50511> for 
billy
03/19 19:33:55 (pid:20774) SECMAN: command 443 RELEASE_CLAIM to startd 
at <10.64.46.121:50511> from UDP port 46247 (non-blocking).
03/19 19:33:55 (pid:20774) SECMAN: using session 
00081:811:1269052434:371 for {<10.64.46.121:50511>,<443>}.
03/19 19:33:55 (pid:20774) SECMAN: UDP, m_have_session == 1, 
peer_can_negotiate == 1
03/19 19:33:55 (pid:20774) SECMAN: successfully enabled message 
authenticator!
03/19 19:33:55 (pid:20774) SECMAN: successfully enabled encryption.
03/19 19:33:55 (pid:20774) SECMAN: startCommand succeeded.
03/19 19:33:55 (pid:20774) Completed RELEASE_CLAIM to startd at 
<10.64.46.121:50511>
03/19 19:33:55 (pid:20774) Match record (slot1.org 
<10.64.46.121:50511> for billy, 175396.0) deleted


ShadowLog:
03/19 19:32:22 ******************************************************
03/19 19:32:22 ** condor_shadow (CONDOR_SHADOW) STARTING UP
03/19 19:32:22 ** /usr/sbin/condor_shadow
03/19 19:32:22 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
03/19 19:32:22 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
03/19 19:32:22 ** $CondorVersion: 7.4.3 Mar 16 2010 BuildID: 
RH-7.4.3-0.5.el5 PRE-RELEASE $
03/19 19:32:22 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
03/19 19:32:22 ** PID = 11352
03/19 19:32:22 ** Log last touched 3/19 19:32:22
03/19 19:32:22 ******************************************************
03/19 19:32:22 Using config source: /etc/condor/condor_config
03/19 19:32:22 Using local config sources:
03/19 19:32:22    /var/lib/condor/condor_config.local
03/19 19:32:22    /opt/condor/bin/condor_config /opt/condor |
03/19 19:32:22 DaemonCore: Command Socket at <192.168.54.238:43235>
03/19 19:32:22 Initializing a VANILLA shadow for job 170429.0
03/19 19:32:22 (170429.0) (11352): Request to run on 
slot1.org <10.64.46.121:50511> was ACCEPTED
03/19 19:33:55 (170429.0) (11352): Job 170429.0 is being evicted from 
slot1.org
03/19 19:33:56 (170429.0) (11352): **** condor_shadow (condor_SHADOW) 
pid 11352 EXITING WITH STATUS 107


StartLog:
03/19 19:32:18 slot1: match_info called
03/19 19:32:18 slot1: Received match <10.64.46.121:50511>#1268865314#352#...
03/19 19:32:18 slot1: Started match timer (7207) for 120 seconds.
03/19 19:32:18 slot1: State change: match notification protocol successful
03/19 19:32:18 slot1: Changing state: Unclaimed -> Matched
03/19 19:32:20 slot1: Canceled match timer (7207)
03/19 19:32:20 slot1: Schedd addr = <192.168.54.238:49898>
03/19 19:32:20 slot1: Alive interval = 300
03/19 19:32:20 slot1: Received ClaimId from schedd 
(<10.64.46.121:50511>#1268865314#352#...)
03/19 19:32:20 slot1: Match requesting resources: cpus=1 memory=1 disk=1%
03/19 19:32:20 6471036 kbytes available for "/opt/condor/execute"
03/19 19:32:20 slot1: Total execute space: 6465916
03/19 19:32:20 slot1_2: New machine resource of type -1 allocated
...
03/19 19:32:20 slot1_2: Rank of this claim is: 0.000000
03/19 19:32:20 slot1_2: Request accepted.
03/19 19:32:20 slot1_2: Remote owner is billy
03/19 19:32:20 slot1_2: State change: claiming protocol successful
03/19 19:32:20 slot1_2: Changing state: Owner -> Claimed
03/19 19:32:20 slot1_2: Started ClaimLease timer (7209) w/ 1800 second 
lease duration
...
03/19 19:32:22 slot1_2: Got activate_claim request from shadow 
(<192.168.54.238:43606>)
03/19 19:32:22 slot1_2: Read request ad and starter from shadow.
...
03/19 19:32:22 slot1_2: Remote job ID is 170429.0
03/19 19:32:22 slot1_2: Remote global job ID is schedd9@#170429.0#1269051913
03/19 19:32:22 slot1_2: JobLeaseDuration defined in job ClassAd: 1200
03/19 19:32:22 slot1_2: Resetting ClaimLease timer (7209) with new duration
03/19 19:32:22 slot1_2: Sending Machine Ad to Starter
03/19 19:32:22 slot1_2: About to Create_Process "condor_starter -f -a 
slot1_2 schedd.condorproject.org"
03/19 19:32:22 Create_Process: using fast clone() to create child process.
03/19 19:32:22 slot1_2: Got RemoteUser (billy) from 
request classad
03/19 19:32:22 slot1_2: Got universe "VANILLA" (5) from request classad
03/19 19:32:22 slot1_2: State change: claim-activation protocol successful
03/19 19:32:22 slot1_2: Changing activity: Idle -> Busy
...
03/19 19:32:24 slot1_2: Sent update to 2 collector(s)
...
03/19 19:32:32 slot1_2: Received job ClassAd update from starter.
...
03/19 19:33:55 slot1_2: Got REQUEST_CLAIM while in Claimed state, ignoring.
03/19 19:33:55 slot1_2: State change: received RELEASE_CLAIM command
03/19 19:33:55 slot1_2: Changing state and activity: Claimed/Busy -> 
Preempting/Vacating
03/19 19:33:55 slot1_2: In Starter::kill() with pid 13616, sig 15 (SIGTERM)
03/19 19:33:55 Send_Signal(): Doing kill(13616,15) [SIGTERM]
03/19 19:33:55 slot1_2: Received job ClassAd update from starter.
03/19 19:33:55 slot1_2: Closing job ClassAd update socket from starter.
03/19 19:33:55 DaemonCore: No more children processes to reap.
03/19 19:33:55 Starter pid 13616 exited with status 0
03/19 19:33:55 slot1_2: State change: starter exited
03/19 19:33:55 slot1_2: Canceled ClaimLease timer (7209)
03/19 19:33:55 slot1_2: State change: No preempting claim, returning to 
owner
03/19 19:33:55 slot1_2: Changing state and activity: Preempting/Vacating 
-> Owner/Idle
03/19 19:33:55 slot1_2: State change: IS_OWNER is false
03/19 19:33:55 slot1_2: Changing state: Owner -> Unclaimed
03/19 19:33:55 slot1_2: Changing state: Unclaimed -> Delete
03/19 19:33:55 StartdPluginManager::Invalidate
03/19 19:33:55 Trying to update collector <192.168.54.238:10002>
03/19 19:33:55 Attempting to send update via TCP to collector 
schedd.condorproject.org <192.168.54.238:10002>
03/19 19:33:55 Trying to update collector <192.168.54.105:10002>
03/19 19:33:55 Attempting to send update via TCP to collector 
schedd2.condorproject.org <192.168.54.105:10002>
03/19 19:33:55 slot1_2: Resource no longer needed, deleting


StarterLog.slot1_2:
03/19 19:32:23 ******************************************************
03/19 19:32:23 ** condor_starter (CONDOR_STARTER) STARTING UP
03/19 19:32:23 ** /usr/sbin/condor_starter
03/19 19:32:23 ** SubsystemInfo: name=STARTER type=STARTER(8) 
class=DAEMON(1)
03/19 19:32:23 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
03/19 19:32:23 ** $CondorVersion: 7.4.3 Mar 16 2010 BuildID: 
RH-7.4.3-0.5.el5 PRE-RELEASE $
03/19 19:32:23 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
03/19 19:32:23 ** PID = 13616
03/19 19:32:23 ** Log last touched 3/19 19:29:14
03/19 19:32:23 ******************************************************
03/19 19:32:23 Using config source: /etc/condor/condor_config
03/19 19:32:23 Using local config sources:
03/19 19:32:23    /var/lib/condor/condor_config.local
03/19 19:32:23    /opt/condor/bin/condor_config /opt/condor |
03/19 19:32:23 DaemonCore: Command Socket at <10.64.46.121:59188>
03/19 19:32:23 Done setting resource limits
03/19 19:32:23 Communicating with shadow <192.168.54.238:43235>
03/19 19:32:23 Submitting machine is "schedd.condorproject.org"
03/19 19:32:23 setting the orig job name in starter
03/19 19:32:23 setting the orig job iwd in starter
03/19 19:32:23 Job 170429.0 set to execute immediately
03/19 19:32:23 Starting a VANILLA universe job with ID: 170429.0
03/19 19:32:23 IWD: /nfs/data/log/submissions/102185765
03/19 19:32:23 Using wrapper /opt/condor/bin/condor_wrap to exec 392
03/19 19:32:23 Create_Process succeeded, pid=13623
03/19 19:33:55 Got SIGTERM. Performing graceful shutdown.
03/19 19:33:55 ShutdownGraceful all jobs.
03/19 19:33:55 Process exited, pid=13623, signal=15
03/19 19:33:55 Last process exited, now Starter is exiting
03/19 19:33:55 **** condor_starter (condor_STARTER) pid 13616 EXITING 
WITH STATUS 0

Comment 1 Matthew Farrellee 2010-03-22 13:16:18 UTC

Reproduce with:

config -

SCHEDD_0 = $(SCHEDD)
SCHEDD_0_ARGS = -f -local-name 0
SCHEDD.0.SCHEDD_NAME = 0@
SCHEDD.0.SPOOL = $(SPOOL)/0
SCHEDD.0.SCHEDD_LOG = $(SCHEDD_LOG).0

SCHEDD_1 = $(SCHEDD)
SCHEDD_1_ARGS = -f -local-name 1
SCHEDD.1.SCHEDD_NAME = 1@
SCHEDD.1.SPOOL = $(SPOOL)/1
SCHEDD.1.SCHEDD_LOG = $(SCHEDD_LOG).1

NUM_CPUS = 1

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD_0, SCHEDD_1

#ALLOW_ADVERTISE_STARTD = NONE

setup -

$ mkdir /var/lib/condor/spool/0
$ chown condor.condor /var/lib/condor/spool/0
$ chmod a+rx /var/lib/condor/spool/0
$ mkdir /var/lib/condor/spool/1
$ chown condor.condor /var/lib/condor/spool/1
$ chmod a+rx /var/lib/condor/spool/1

run -

$ service condor start

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
robin.local        LINUX      INTEL  Unclaimed Idle     0.350  2005  0+00:00:04
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
         INTEL/LINUX     1     0       0         1       0          0        0
               Total     1     0       0         1       0          0        0

$ uncomment ALLOW_ADVERTISE_STARTD in config

$ condor_reconfig -collector -full

$ echo "cmd=/bin/sleep\nargs=15m\nqueue" | condor_submit -name 0@   
Submitting job(s).
1 job(s) submitted to cluster 1.

NOTE: Wait for job to start

$ condor_status -sched
Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs 
0@                   robin.loca                1             0              0
1@                   robin.loca                0             0              0
                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs
               Total                 1                  0                  0

NOTE: Key that the slot ad was not updates

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
robin.local        LINUX      INTEL  Unclaimed Idle     0.350  2005  0+00:00:04
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
         INTEL/LINUX     1     0       0         1       0          0        0
               Total     1     0       0         1       0          0        0

$ echo "cmd=/bin/sleep\nargs=15m\nqueue" | condor_submit -name 1@  
Submitting job(s).
1 job(s) submitted to cluster 2.

$ condor_status -sched
Name                 Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs 
0@                   robin.loca                1             0              0
1@                   robin.loca                0             1              0
                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs
               Total                 1                  1                  0

NOTE: Wait for negotiation cycle

$ condor_q -name 0@
-- Schedd: 0@ : <127.0.0.1:38615>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   matt            3/22 09:06   0+00:00:40 I  0   2.0  sleep 15m         
1 jobs; 1 idle, 0 running, 0 held

$ condor_q -name 1@
-- Schedd: 1@ : <127.0.0.1:39685>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   matt            3/22 09:07   0+00:00:00 I  0   0.0  sleep 15m         
1 jobs; 1 idle, 0 running, 0 held

$ grep "exited with status" /var/log/condor/SchedLog.0
03/22 09:07:33 Shadow pid 1331 for job 1.0 exited with status 107

Comment 2 Matthew Farrellee 2010-03-22 13:23:23 UTC

In a successful run the job on 0@ will continue to run and the SchedLog.1 will show "Request was NOT accepted for claim..." without a message about RELEASE_CLAIM.

condor_status -direct can be used to query the actual status of the startd and verify if jobs are running or not.

Comment 3 Matthew Farrellee 2010-03-22 13:35:13 UTC

Fixed in 7.4.3-0.6

Comment 4 Martin Kudlej 2010-08-03 10:52:05 UTC

Tested on RHEL 5.5/4.8 x x86_64/i386 with condor-7.4.4-0.4 and it works as expected. --> VERIFIED

Comment 5 Florian Nadge 2010-10-07 12:57:00 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously,the negotiator could hand out a single claim multiple times. The scheduler daemon would send a
RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run.

Comment 6 Martin Prpič 2010-10-10 09:23:28 UTC

    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,2 +1 @@
-Previously,the negotiator could hand out a single claim multiple times. The scheduler daemon would send a
+Previously,the negotiator could hand out a single claim multiple times. The scheduler daemon would send a RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run.-RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run.

Comment 8 errata-xmlrpc 2010-10-14 16:01:07 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html