SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

First submission:

executable = /bin/sleep
arguments = 6000
universe = vanilla
transfer_executable = true
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue 1

The slot partitions correctly.

Second submission:

executable = /bin/sleep
arguments = 6000
universe = vanilla
transfer_executable = true
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Requirements = cpus == 7
queue 1

The job is matched in the negotiator to the partitionable slot (slot1@); however, in the next cycle check_matches removes the match and attempts to match the job to slot1_1@ rather than slot1_2@. slot1_1@ is already matched to the first submission, so the second submission stays in the idle state.

condor_submit jrt.job
Submitting job(s).
1 job(s) submitted to cluster 575.

<I stripped hostnames, so the log might look odd>

01/26/11 09:23:32 Matched 575.0 jrthomas@<192.168.122.1:56478> preempting none <192.168.122.1:34948> slot1@
01/26/11 09:23:32 Notifying the accountant
01/26/11 09:23:32 Accountant::AddMatch - CustomerName=jrthomas@, ResourceName=slot1@@<192.168.122.1:34948>
01/26/11 09:23:32 Customername jrthomas@ GroupName is: <none>
01/26/11 09:23:32 GroupWeightedResourcesUsed=2.000000 SlotWeight=1.000000
01/26/11 09:23:32 (ACCOUNTANT) Added match between customer jrthomas@ and resource slot1@@<192.168.122.1:34948>
01/26/11 09:23:32 Successfully matched with slot1@
...
01/26/11 09:23:52 Resource slot1@@<192.168.122.1:34948> was not claimed by jrthomas@ - removing match
01/26/11 09:23:52 Accountant::RemoveMatch - ResourceName=slot1@@<192.168.122.1:34948>
01/26/11 09:23:52 Customername jrthomas@ GroupName is: <none>
01/26/11 09:23:52 GroupResourcesUsed =1.000000 GroupWeightedResourcesUsed= 1.000000 SlotWeight=0.000000
01/26/11 09:23:52 (ACCOUNTANT) Removed match between customer jrthomas@ and resource slot1@@<192.168.122.1:34948>
01/26/11 09:23:52 Accountant::AddMatch - CustomerName=jrthomas@, ResourceName=slot1_1@@<192.168.122.1:34948>
01/26/11 09:23:52 Match already existed!

-- Submitter: basin.redhat.com : <192.168.122.1:56478> : basin.redhat.com
 ID      OWNER      SUBMITTED     RUN_TIME ST PRI SIZE CMD
 574.0   jrthomas  1/26 09:22   0+00:04:07 R  0   0.0  sleep 6000
 575.0   jrthomas  1/26 09:23   0+00:00:00 I  0   0.0  sleep 6000

2 jobs; 1 idle, 1 running, 0 held

From the logs, it looks like the slot is not partitioning, as there should be 3 startd ads:

01/26/11 09:30:14 Public ads include 1 submitter, 2 startd

The cycle of matching to slot1@, not partitioning, and then having the match removed continues indefinitely.

Partitioning works as expected when using "RequestCpus = 7" instead of "Requirements = cpus == 7".

$ condor_q -better-analyze
576.000:  Request is being serviced

---
578.000:  Run analysis summary.
Of 2 machines,
      1 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match but are serving users with a better priority in the pool
      1 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
Last successful match: Wed Jan 26 09:43:38 2011

The Requirements expression for your job is:

( target.Cpus == 7 ) && ( target.Arch == "X86_64" ) &&
( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&
( ( target.Memory * 1024 ) >= ImageSize ) &&
( ( RequestMemory * 1024 ) >= ImageSize ) && ( target.HasFileTransfer )

    Condition                           Machines Matched    Suggestion
    ---------                           ----------------    ----------
1   ( target.Cpus == 7 )                1
2   ( target.Arch == "X86_64" )         2
3   ( target.OpSys == "LINUX" )         2
4   ( target.Disk >= 30 )               2
5   ( ( 1024 * target.Memory ) >= 30 )  2
6   ( ( 1024 * ceiling(ifThenElse(JobVMMemory isnt undefined,JobVMMemory,2.929687500000000E-02)) ) >= 30 )  2
7   ( target.HasFileTransfer )          2
The "Requirements = CPUs == 7" job is expected not to run unless it also has "RequestCPUs = 7". The reason is that the job may match slot1 (the partitionable slot), but it still needs to be accepted by the dynamic slot created for it. Without RequestCPUs = 7, the dynamic slot will have CPUs = 1 and the job's Requirements will fail against it. Check the StartLog to see this:

02/01/11 13:29:24 slot1: Received match <192.168.1.100:44510>#1296584763#6#...
02/01/11 13:29:24 slot1: State change: match notification protocol successful
02/01/11 13:29:24 slot1: Changing state: Unclaimed -> Matched
02/01/11 13:29:24 slot1_2: New machine resource of type -1 allocated
02/01/11 13:29:24 slot1: Changing state: Matched -> Unclaimed
***02/01/11 13:29:24 slot1_2: Job requirements not satisfied.***
02/01/11 13:29:24 slot1_2: Request to claim resource refused.
02/01/11 13:29:24 slot1_2: Claiming protocol failed
02/01/11 13:29:24 slot1_2: Changing state: Owner -> Delete
02/01/11 13:29:24 slot1_2: Resource no longer needed, deleting

The Requirements are inconsistent with the job's other resource requests, and this cannot be easily detected. The "( ( RequestMemory * 1024 ) >= ImageSize )" expression added to the Requirements is an example of one possible way to detect this situation with respect to memory, in a way that condor_q -better-analyze can report on. An RFE with a suggestion on how to help -better-analyze detect such a situation would be welcome.
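For reference, a sketch of a submit file consistent with the above (based on the reproducer in comment 1; the key point is that RequestCpus matches the value tested in Requirements, so the dynamic slot carved from the partitionable slot actually satisfies the job):

```
# Sketch of a consistent submit file: RequestCpus sizes the dynamic slot,
# so the Requirements test against Cpus can succeed on it.
executable              = /bin/sleep
arguments               = 6000
universe                = vanilla
transfer_executable     = true
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
request_cpus            = 7
requirements            = Cpus == 7
queue 1
```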
I see this differently, because the job does match in the negotiator. If it matches in the negotiator, it can starve submitters whose jobs could actually run. Hypothetically, one could flood the queue with jobs carrying "Requirements = cpus == x" such that no jobs run at all, since these jobs match against every partitionable slot with cpus > 0. I think this is a throughput problem.
A submitter can certainly starve itself. When it comes to starving others, flooding the queue with sleep 365d jobs will work just as well, so long as preemption is disabled. This is akin to a user submitting with "Requirements = random(2)": there's a chance the Negotiator will provide a match and the slot will never be properly claimed. Is there something deeper here in how starvation can occur between users?
I was thinking along the lines of slots never being used by any submitter. There is a simpler case:

runs
=====
Requirements = cpus == 8

SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_1 = 1

RequestCpus = 8

SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_1 = 1

RequestCpus = 8

SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

doesn't run
===========
Requirements = cpus == 8

SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

The 8 cpu partitionable slot doesn't behave like a "normal" 8 cpu slot.
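The mismatch in the "doesn't run" case can be sketched as a toy model (not HTCondor code) of the two-stage match described in this thread: the negotiator evaluates the job's Requirements against the partitionable slot's ad (Cpus = 8), which succeeds, but the startd re-evaluates them against the dynamic slot it carves out, whose Cpus comes from RequestCpus (defaulting to 1 when unset), which fails:

```python
def requirements(slot_cpus):
    """Toy stand-in for the job's Requirements expression: Cpus == 8."""
    return slot_cpus == 8

partitionable_slot_cpus = 8  # the SLOT_TYPE_1 = cpus=8 partitionable slot ad
request_cpus = 1             # job did not set RequestCpus; assume default of 1

# Stage 1: the negotiator matches against the partitionable slot's ad.
negotiator_match = requirements(partitionable_slot_cpus)

# Stage 2: the startd carves a dynamic slot sized by RequestCpus
# and re-evaluates the job's Requirements against it.
dynamic_slot_cpus = request_cpus
startd_accepts = requirements(dynamic_slot_cpus)

print(negotiator_match, startd_accepts)  # True False
```

Stage 1 succeeding while stage 2 fails is exactly the loop in the logs: the match is handed out, the claim is refused, and the job goes back to idle.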
A user who always gets to match first and submits RequestCPUs=1,Requirements=CPUs>1 is equivalent to a user who always matches first and submits sleep 365d, and is close to a user who matches first and submits Requirements=random(2). Long term that user's priority should decrease until they no longer match first.
I understand, but the "Requirements = cpus == x" problem only occurs with SLOT_TYPE_1_PARTITIONABLE = TRUE. "Requirements = cpus == 8" is valid in one case and not in the other, and the only difference is the partitionable flag:

runs
=====
Requirements = cpus == 8

SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_1 = 1

doesn't run
===========
Requirements = cpus == 8

SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1
That is true, and the proper fix is to make the Negotiator aware of the partitionable/dynamic slot behavior. Unfortunately, such a solution was blocked from inclusion upstream.