Red Hat Bugzilla – Bug 575784
improper RELEASE_CLAIM after REQUEST_CLAIM rejection
Last modified: 2010-10-14 12:01:07 EDT
Description of problem:
High rate of shadows exiting with status 107 (evicted) in an environment with no preemption.

With 20K slots and a multi-tier Collector setup [1] it is possible for the Negotiator to hand out a single claim multiple times. The second time, the Schedd's REQUEST_CLAIM will be rejected. That Schedd then responds with a RELEASE_CLAIM, which evicts the already running job, and does not try to use the claim itself. The second Schedd should not send a RELEASE_CLAIM when its REQUEST_CLAIM is explicitly rejected.

[1] http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToConfigCollectors

Version-Release number of selected component (if applicable):
7.4.3-0.5

How reproducible:
100% when a claim is handed out twice. Getting a claim handed out twice is a low-probability event that occurs naturally only in large pools.

Steps to Reproduce:
1. Submit 10K jobs to a pool of 20K slots.
2. Watch the SchedLog for "exited with status 107".
3. Investigate the reason for the 107 error (see "Log trace" below).

The key to reproducing this is to prevent the Startd's MATCHED and CLAIMED advertisements from reaching the Collector. This may be achievable by allowing the Startd to advertise initially to the Collector, then changing the Collector's security configuration to prevent further updates. Something like:

1. Normal two-node setup, startd machine != collector machine.
2. See the startd machine's slot (one slot!) in condor_status.
3. Change ALLOW_STARTD_ADVERTISE = $(FULL_HOSTNAME) on the collector machine.
4. Submit two jobs ...

It may be necessary to run two Schedds and submit one job to each.
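The broken exchange can be modeled as a tiny state machine. This is a sketch of the behavior described above, not HTCondor source; `Startd`, `schedd_claim`, and the `buggy` flag are hypothetical names invented for illustration.

```python
class Startd:
    """Toy model of a single slot's claim state."""

    def __init__(self):
        self.claimed_by = None  # schedd currently holding the claim, if any

    def request_claim(self, schedd):
        # A REQUEST_CLAIM against an already-claimed slot is rejected.
        if self.claimed_by is None:
            self.claimed_by = schedd
            return True
        return False

    def release_claim(self):
        # The startd honors RELEASE_CLAIM, evicting whatever job is running.
        self.claimed_by = None


def schedd_claim(startd, name, buggy):
    """Model one schedd attempting to use a claim handed out by the negotiator."""
    accepted = startd.request_claim(name)
    if not accepted and buggy:
        # Buggy behavior described in this report: respond to an explicit
        # rejection with RELEASE_CLAIM, evicting the first schedd's job.
        startd.release_claim()
    return accepted


# Buggy path: the second schedd's rejected request evicts the running job.
slot = Startd()
assert schedd_claim(slot, "schedd9", buggy=True)       # claim accepted, job runs
assert not schedd_claim(slot, "schedd6", buggy=True)   # rejected, but releases anyway
assert slot.claimed_by is None                         # first job evicted (status 107)

# Fixed path: a rejected REQUEST_CLAIM is simply abandoned.
slot = Startd()
assert schedd_claim(slot, "schedd9", buggy=False)
assert not schedd_claim(slot, "schedd6", buggy=False)
assert slot.claimed_by == "schedd9"                    # first job keeps running
```

The assertions mirror the log trace below: schedd9's job survives only when the second schedd refrains from sending RELEASE_CLAIM after its rejection.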
Log trace:
1) negotiator matches job 170429.0 with machine 00081
2) schedd9 claims and starts job on 00081
3) * * * (time passes and the collector never gets an update about 00081's new running state)
4) negotiator matches job 175396.0 with machine 00081
5) schedd6 tries to claim 00081 but is rejected
6) schedd6 notices this fact and sends a RELEASE_CLAIM
7) 00081 accepts the release claim!
8) job 170429.0 is evicted

NegotiatorLog:
03/19 19:32:18 Request 170429.00000:
03/19 19:32:18 matchmakingAlgorithm: limit 550.000000 used 37.000000 pieLeft 8623.000000
03/19 19:32:18 Attempting to use cached MatchList: Succeeded. (MatchList length: 1426, Autocluster: 0, Schedd Name: billy@condorproject.org, Schedd Address: <192.168.54.238:49898>)
03/19 19:32:18 Connecting to startd slot1@00081.condorproject.org at <10.64.46.121:50511>
03/19 19:32:18 Sending MATCH_INFO/claim id to slot1@00081.condorproject.org
03/19 19:32:18 (Claim ID is "<10.64.46.121:50511>#1268865314#352#...")
03/19 19:32:18 Sending PERMISSION, claim id, startdAd to schedd
03/19 19:32:18 Matched 170429.0 billy@condorproject.org <192.168.54.238:49898> preempting none <10.64.46.121:50511> slot1@00081.condorproject.org
03/19 19:32:18 Notifying the accountant
03/19 19:32:18 Successfully matched with slot1@00081.condorproject.org
03/19 19:32:18 Sending SEND_JOB_INFO/eom
03/19 19:32:18 Getting reply from schedd
...
03/19 19:32:18 Got JOB_INFO command; getting classad/eom
...
03/19 19:33:50 Request 175396.00000:
03/19 19:33:50 matchmakingAlgorithm: limit 559.000000 used 138.000000 pieLeft 8091.000000
03/19 19:33:50 Attempting to use cached MatchList: Succeeded. (MatchList length: 1098, Autocluster: 0, Schedd Name: billy@condorproject.org, Schedd Address: <192.168.54.238:33547>)
03/19 19:33:50 Connecting to startd slot1@00081.condorproject.org at <10.64.46.121:50511>
03/19 19:33:50 Sending MATCH_INFO/claim id to slot1@00081.condorproject.org
03/19 19:33:50 (Claim ID is "<10.64.46.121:50511>#1268865314#352#...")
03/19 19:33:50 Sending PERMISSION, claim id, startdAd to schedd
03/19 19:33:50 Matched 175396.0 billy@condorproject.org <192.168.54.238:33547> preempting none <10.64.46.121:50511> slot1@00081.condorproject.org
03/19 19:33:50 Notifying the accountant
03/19 19:33:50 Successfully matched with slot1@00081.condorproject.org
03/19 19:33:50 Sending SEND_JOB_INFO/eom
03/19 19:33:50 Getting reply from schedd
...
03/19 19:33:50 Got JOB_INFO command; getting classad/eom

SchedLog:
03/19 19:32:19 (pid:20832) SECMAN: command 442 REQUEST_CLAIM to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org from TCP port 53015 (non-blocking).
03/19 19:32:19 (pid:20832) SECMAN: waiting for TCP connection to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org.
...
03/19 19:32:22 (pid:20832) Starting add_shadow_birthdate(170429.0)
03/19 19:32:22 (pid:20832) Started shadow for job 170429.0 on slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org, (shadow pid = 11352)
...
03/19 19:32:20 (pid:20832) SECMAN: resuming command 442 REQUEST_CLAIM to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org from TCP port 53015 (non-blocking).
03/19 19:32:20 (pid:20832) SECMAN: using session 00081:811:1269047156:332 for {<10.64.46.121:50511>,<442>}.
03/19 19:32:20 (pid:20832) SECMAN: resume, other side is $CondorVersion: 7.4.3 Mar 16 2010 BuildID: RH-7.4.3-0.5.el5 PRE-RELEASE $, NOT reauthenticating.
03/19 19:32:20 (pid:20832) SECMAN: successfully enabled message authenticator!
03/19 19:32:20 (pid:20832) SECMAN: successfully enabled encryption!
03/19 19:32:20 (pid:20832) SECMAN: startCommand succeeded.
...
03/19 19:32:20 (pid:20832) Completed REQUEST_CLAIM to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org
...
03/19 19:32:22 (pid:20832) Starting add_shadow_birthdate(170429.0)
03/19 19:32:22 (pid:20832) Started shadow for job 170429.0 on slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org, (shadow pid = 11352): generating 3DES key for session schedd:20832:1269052342:67979...
...
03/19 19:33:56 (pid:20832) Shadow pid 11352 for job 170429.0 exited with status 107
03/19 19:33:56 (pid:20832) SECMAN: command 443 RELEASE_CLAIM to startd at <10.64.46.121:50511> from UDP port 45510 (non-blocking).
03/19 19:33:56 (pid:20832) SECMAN: using session 00081:811:1269047156:332 for {<10.64.46.121:50511>,<443>}.
03/19 19:33:56 (pid:20832) SECMAN: UDP, m_have_session == 1, peer_can_negotiate == 1
03/19 19:33:56 (pid:20832) SECMAN: successfully enabled message authenticator!
03/19 19:33:56 (pid:20832) SECMAN: successfully enabled encryption.
03/19 19:33:56 (pid:20832) SECMAN: startCommand succeeded.
03/19 19:33:56 (pid:20832) Completed RELEASE_CLAIM to startd at <10.64.46.121:50511>
03/19 19:33:56 (pid:20832) Match record (slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org, 170429.0) deleted
03/19 19:33:56 (pid:20832) IpVerify::FillHole: open count at level DAEMON for 10.64.46.121 now 1
03/19 19:33:56 (pid:20832) IpVerify::FillHole: open count at level WRITE for 10.64.46.121 now 1
03/19 19:33:56 (pid:20832) IpVerify::FillHole: open count at level READ for 10.64.46.121 now 3
03/19 19:33:56 (pid:20832) IpVerify::FillHole: open count at level READ for 10.64.46.121 now 2

SchedLog:
03/19 19:33:53 (pid:20774) SECMAN: command 442 REQUEST_CLAIM to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org from TCP port 46658 (non-blocking).
03/19 19:33:53 (pid:20774) SECMAN: waiting for TCP connection to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org.
...
03/19 19:33:54 (pid:20774) SECMAN: resuming command 442 REQUEST_CLAIM to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org from TCP port 46658 (non-blocking).
...
03/19 19:33:54 (pid:20774) SECMAN: resuming command 442 REQUEST_CLAIM to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org from TCP port 46658 (non-blocking).
...
03/19 19:33:55 (pid:20774) SECMAN: resuming command 442 REQUEST_CLAIM to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org from TCP port 46658 (non-blocking).
03/19 19:33:55 (pid:20774) SECMAN: added session 00081:811:1269052434:371 to cache for 86400 seconds (3600s lease).
03/19 19:33:55 (pid:20774) SECMAN: startCommand succeeded.
...
03/19 19:33:55 (pid:20774) Request was NOT accepted for claim slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org 175396.0
03/19 19:33:55 (pid:20774) Completed REQUEST_CLAIM to startd slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org
03/19 19:33:55 (pid:20774) SECMAN: command 443 RELEASE_CLAIM to startd at <10.64.46.121:50511> from UDP port 46247 (non-blocking).
03/19 19:33:55 (pid:20774) SECMAN: using session 00081:811:1269052434:371 for {<10.64.46.121:50511>,<443>}.
03/19 19:33:55 (pid:20774) SECMAN: UDP, m_have_session == 1, peer_can_negotiate == 1
03/19 19:33:55 (pid:20774) SECMAN: successfully enabled message authenticator!
03/19 19:33:55 (pid:20774) SECMAN: successfully enabled encryption.
03/19 19:33:55 (pid:20774) SECMAN: startCommand succeeded.
03/19 19:33:55 (pid:20774) Completed RELEASE_CLAIM to startd at <10.64.46.121:50511>
03/19 19:33:55 (pid:20774) Match record (slot1@00081.condorproject.org <10.64.46.121:50511> for billy@condorproject.org, 175396.0) deleted

ShadowLog:
03/19 19:32:22 ******************************************************
03/19 19:32:22 ** condor_shadow (CONDOR_SHADOW) STARTING UP
03/19 19:32:22 ** /usr/sbin/condor_shadow
03/19 19:32:22 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
03/19 19:32:22 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
03/19 19:32:22 ** $CondorVersion: 7.4.3 Mar 16 2010 BuildID: RH-7.4.3-0.5.el5 PRE-RELEASE $
03/19 19:32:22 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
03/19 19:32:22 ** PID = 11352
03/19 19:32:22 ** Log last touched 3/19 19:32:22
03/19 19:32:22 ******************************************************
03/19 19:32:22 Using config source: /etc/condor/condor_config
03/19 19:32:22 Using local config sources:
03/19 19:32:22 /var/lib/condor/condor_config.local
03/19 19:32:22 /opt/condor/bin/condor_config
03/19 19:32:22 DaemonCore: Command Socket at <192.168.54.238:43235>
03/19 19:32:22 Initializing a VANILLA shadow for job 170429.0
03/19 19:32:22 (170429.0) (11352): Request to run on slot1@00081.condorproject.org <10.64.46.121:50511> was ACCEPTED
03/19 19:33:55 (170429.0) (11352): Job 170429.0 is being evicted from slot1@00081.condorproject.org
03/19 19:33:56 (170429.0) (11352): **** condor_shadow (condor_SHADOW) pid 11352 EXITING WITH STATUS 107

StartLog:
03/19 19:32:18 slot1: match_info called
03/19 19:32:18 slot1: Received match <10.64.46.121:50511>#1268865314#352#...
03/19 19:32:18 slot1: Started match timer (7207) for 120 seconds.
03/19 19:32:18 slot1: State change: match notification protocol successful
03/19 19:32:18 slot1: Changing state: Unclaimed -> Matched
03/19 19:32:20 slot1: Canceled match timer (7207)
03/19 19:32:20 slot1: Schedd addr = <192.168.54.238:49898>
03/19 19:32:20 slot1: Alive interval = 300
03/19 19:32:20 slot1: Received ClaimId from schedd (<10.64.46.121:50511>#1268865314#352#...)
03/19 19:32:20 slot1: Match requesting resources: cpus=1 memory=1 disk=1%
03/19 19:32:20 6471036 kbytes available for "/opt/condor/execute"
03/19 19:32:20 slot1: Total execute space: 6465916
03/19 19:32:20 slot1_2: New machine resource of type -1 allocated
...
03/19 19:32:20 slot1_2: Rank of this claim is: 0.000000
03/19 19:32:20 slot1_2: Request accepted.
03/19 19:32:20 slot1_2: Remote owner is billy@condorproject.org
03/19 19:32:20 slot1_2: State change: claiming protocol successful
03/19 19:32:20 slot1_2: Changing state: Owner -> Claimed
03/19 19:32:20 slot1_2: Started ClaimLease timer (7209) w/ 1800 second lease duration
...
03/19 19:32:22 slot1_2: Got activate_claim request from shadow (<192.168.54.238:43606>)
03/19 19:32:22 slot1_2: Read request ad and starter from shadow.
...
03/19 19:32:22 slot1_2: Remote job ID is 170429.0
03/19 19:32:22 slot1_2: Remote global job ID is schedd9@#170429.0#1269051913
03/19 19:32:22 slot1_2: JobLeaseDuration defined in job ClassAd: 1200
03/19 19:32:22 slot1_2: Resetting ClaimLease timer (7209) with new duration
03/19 19:32:22 slot1_2: Sending Machine Ad to Starter
03/19 19:32:22 slot1_2: About to Create_Process "condor_starter -f -a slot1_2 schedd.condorproject.org"
03/19 19:32:22 Create_Process: using fast clone() to create child process.
03/19 19:32:22 slot1_2: Got RemoteUser (billy@condorproject.org) from request classad
03/19 19:32:22 slot1_2: Got universe "VANILLA" (5) from request classad
03/19 19:32:22 slot1_2: State change: claim-activation protocol successful
03/19 19:32:22 slot1_2: Changing activity: Idle -> Busy
...
03/19 19:32:24 slot1_2: Sent update to 2 collector(s)
...
03/19 19:32:32 slot1_2: Received job ClassAd update from starter.
...
03/19 19:33:55 slot1_2: Got REQUEST_CLAIM while in Claimed state, ignoring.
03/19 19:33:55 slot1_2: State change: received RELEASE_CLAIM command
03/19 19:33:55 slot1_2: Changing state and activity: Claimed/Busy -> Preempting/Vacating
03/19 19:33:55 slot1_2: In Starter::kill() with pid 13616, sig 15 (SIGTERM)
03/19 19:33:55 Send_Signal(): Doing kill(13616,15) [SIGTERM]
03/19 19:33:55 slot1_2: Received job ClassAd update from starter.
03/19 19:33:55 slot1_2: Closing job ClassAd update socket from starter.
03/19 19:33:55 DaemonCore: No more children processes to reap.
03/19 19:33:55 Starter pid 13616 exited with status 0
03/19 19:33:55 slot1_2: State change: starter exited
03/19 19:33:55 slot1_2: Canceled ClaimLease timer (7209)
03/19 19:33:55 slot1_2: State change: No preempting claim, returning to owner
03/19 19:33:55 slot1_2: Changing state and activity: Preempting/Vacating -> Owner/Idle
03/19 19:33:55 slot1_2: State change: IS_OWNER is false
03/19 19:33:55 slot1_2: Changing state: Owner -> Unclaimed
03/19 19:33:55 slot1_2: Changing state: Unclaimed -> Delete
03/19 19:33:55 StartdPluginManager::Invalidate
03/19 19:33:55 Trying to update collector <192.168.54.238:10002>
03/19 19:33:55 Attempting to send update via TCP to collector schedd.condorproject.org <192.168.54.238:10002>
03/19 19:33:55 Trying to update collector <192.168.54.105:10002>
03/19 19:33:55 Attempting to send update via TCP to collector schedd2.condorproject.org <192.168.54.105:10002>
03/19 19:33:55 slot1_2: Resource no longer needed, deleting

StarterLog.slot1_2:
03/19 19:32:23 ******************************************************
03/19 19:32:23 ** condor_starter (CONDOR_STARTER) STARTING UP
03/19 19:32:23 ** /usr/sbin/condor_starter
03/19 19:32:23 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
03/19 19:32:23 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
03/19 19:32:23 ** $CondorVersion: 7.4.3 Mar 16 2010 BuildID: RH-7.4.3-0.5.el5 PRE-RELEASE $
03/19 19:32:23 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
03/19 19:32:23 ** PID = 13616
03/19 19:32:23 ** Log last touched 3/19 19:29:14
03/19 19:32:23 ******************************************************
03/19 19:32:23 Using config source: /etc/condor/condor_config
03/19 19:32:23 Using local config sources:
03/19 19:32:23 /var/lib/condor/condor_config.local
03/19 19:32:23 /opt/condor/bin/condor_config
03/19 19:32:23 DaemonCore: Command Socket at <10.64.46.121:59188>
03/19 19:32:23 Done setting resource limits
03/19 19:32:23 Communicating with shadow <192.168.54.238:43235>
03/19 19:32:23 Submitting machine is "schedd.condorproject.org"
03/19 19:32:23 setting the orig job name in starter
03/19 19:32:23 setting the orig job iwd in starter
03/19 19:32:23 Job 170429.0 set to execute immediately
03/19 19:32:23 Starting a VANILLA universe job with ID: 170429.0
03/19 19:32:23 IWD: /nfs/data/log/submissions/102185765
03/19 19:32:23 Using wrapper /opt/condor/bin/condor_wrap to exec 392
03/19 19:32:23 Create_Process succeeded, pid=13623
03/19 19:33:55 Got SIGTERM. Performing graceful shutdown.
03/19 19:33:55 ShutdownGraceful all jobs.
03/19 19:33:55 Process exited, pid=13623, signal=15
03/19 19:33:55 Last process exited, now Starter is exiting
03/19 19:33:55 **** condor_starter (condor_STARTER) pid 13616 EXITING WITH STATUS 0
Reproduce with:

config -
SCHEDD_0 = $(SCHEDD)
SCHEDD_0_ARGS = -f -local-name 0
SCHEDD.0.SCHEDD_NAME = 0@
SCHEDD.0.SPOOL = $(SPOOL)/0
SCHEDD.0.SCHEDD_LOG = $(SCHEDD_LOG).0
SCHEDD_1 = $(SCHEDD)
SCHEDD_1_ARGS = -f -local-name 1
SCHEDD.1.SCHEDD_NAME = 1@
SCHEDD.1.SPOOL = $(SPOOL)/1
SCHEDD.1.SCHEDD_LOG = $(SCHEDD_LOG).1
NUM_CPUS = 1
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD_0, SCHEDD_1
#ALLOW_ADVERTISE_STARTD = NONE

setup -
$ mkdir /var/lib/condor/spool/0
$ chown condor.condor /var/lib/condor/spool/0
$ chmod a+rx /var/lib/condor/spool/0
$ mkdir /var/lib/condor/spool/1
$ chown condor.condor /var/lib/condor/spool/1
$ chmod a+rx /var/lib/condor/spool/1

run -
$ service condor start
$ condor_status
Name        OpSys  Arch  State     Activity LoadAv Mem  ActvtyTime
robin.local LINUX  INTEL Unclaimed Idle     0.350  2005 0+00:00:04
            Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/LINUX     1     0       0         1       0          0        0
      Total     1     0       0         1       0          0        0
$ (uncomment ALLOW_ADVERTISE_STARTD in config)
$ condor_reconfig -collector -full
$ echo "cmd=/bin/sleep\nargs=15m\nqueue" | condor_submit -name 0@
Submitting job(s).
1 job(s) submitted to cluster 1.
NOTE: Wait for job to start
$ condor_status -sched
Name Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs
0@   robin.loca                1             0             0
1@   robin.loca                0             0             0
     TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total               1             0             0
NOTE: The key point is that the slot ad was not updated:
$ condor_status
Name        OpSys  Arch  State     Activity LoadAv Mem  ActvtyTime
robin.local LINUX  INTEL Unclaimed Idle     0.350  2005 0+00:00:04
            Total Owner Claimed Unclaimed Matched Preempting Backfill
INTEL/LINUX     1     0       0         1       0          0        0
      Total     1     0       0         1       0          0        0
$ echo "cmd=/bin/sleep\nargs=15m\nqueue" | condor_submit -name 1@
Submitting job(s).
1 job(s) submitted to cluster 2.
$ condor_status -sched
Name Machine    TotalRunningJobs TotalIdleJobs TotalHeldJobs
0@   robin.loca                1             0             0
1@   robin.loca                0             1             0
     TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total               1             1             0
NOTE: Wait for negotiation cycle
$ condor_q -name 0@
-- Schedd: 0@ : <127.0.0.1:38615>
 ID  OWNER SUBMITTED  RUN_TIME   ST PRI SIZE CMD
 1.0 matt  3/22 09:06 0+00:00:40 I  0   2.0  sleep 15m
1 jobs; 1 idle, 0 running, 0 held
$ condor_q -name 1@
-- Schedd: 1@ : <127.0.0.1:39685>
 ID  OWNER SUBMITTED  RUN_TIME   ST PRI SIZE CMD
 2.0 matt  3/22 09:07 0+00:00:00 I  0   0.0  sleep 15m
1 jobs; 1 idle, 0 running, 0 held
$ grep "exited with status" /var/log/condor/SchedLog.0
03/22 09:07:33 Shadow pid 1331 for job 1.0 exited with status 107
In a successful run the job on 0@ will continue to run, and SchedLog.1 will show "Request was NOT accepted for claim..." without any subsequent RELEASE_CLAIM message. condor_status -direct can be used to query the actual state of the startd and verify whether jobs are running.
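The pass/fail criterion above (a rejection followed by a RELEASE_CLAIM indicates the bug) can be checked mechanically against a SchedLog. This is a hypothetical helper sketch, not a Condor tool; the function name and sample lines are invented for illustration.

```python
def rejected_then_released(schedlog_text):
    """Return True if a rejected REQUEST_CLAIM was later followed by a
    RELEASE_CLAIM in the same log, i.e. the buggy pattern from this report."""
    saw_rejection = False
    for line in schedlog_text.splitlines():
        if "Request was NOT accepted for claim" in line:
            saw_rejection = True
        elif saw_rejection and "RELEASE_CLAIM" in line:
            return True
    return False


# Fixed behavior: rejection with no RELEASE_CLAIM afterwards.
good = "03/19 19:33:55 Request was NOT accepted for claim slot1@host\n"

# Buggy behavior: rejection followed by a RELEASE_CLAIM command.
bad = ("03/19 19:33:55 Request was NOT accepted for claim slot1@host\n"
       "03/19 19:33:55 SECMAN: command 443 RELEASE_CLAIM to startd\n")

assert not rejected_then_released(good)
assert rejected_then_released(bad)
```

Running this over SchedLog.1 after the reproduction steps gives a quick yes/no answer without reading the whole log by hand.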
Fixed in 7.4.3-0.6
Tested on RHEL 5.5/4.8 x86_64/i386 with condor-7.4.4-0.4 and it works as expected. --> VERIFIED
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Previously, the negotiator could hand out a single claim multiple times. When its REQUEST_CLAIM was rejected, the scheduler daemon would send a RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon no longer sends a RELEASE_CLAIM after a rejected REQUEST_CLAIM, and the job continues to run.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,2 +1 @@
-Previously,the negotiator could hand out a single claim multiple times. The scheduler daemon would send a
-RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run.
+Previously,the negotiator could hand out a single claim multiple times. The scheduler daemon would send a RELEASE_CLAIM, which evicted the already running job. With this update, the scheduler daemon sends a REQUEST_CLAIM instead of a RELEASE_CLAIM and the job continues to run.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html