Bug 794660 - Partitionable slots can create more dynamic slots than CPUs
Summary: Partitionable slots can create more dynamic slots than CPUs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 2.2
Target Release: ---
Assignee: Timothy St. Clair
QA Contact: Lubos Trilety
URL:
Whiteboard: done
Depends On:
Blocks: 828434
 
Reported: 2012-02-17 08:50 UTC by Pavel Moravec
Modified: 2018-11-29 21:30 UTC (History)

Fixed In Version: condor-7.6.5-0.15
Doc Type: Bug Fix
Doc Text:
Cause: Under certain conditions, a partitionable slot can be split into too many dynamic slots. Consequence: The machine could potentially be oversubscribed. Fix: Add logic to prevent a partitionable slot from being split beyond the resources it has available. Result: The machine should not be oversubscribed.
Clone Of:
Environment:
Last Closed: 2012-09-19 17:42:50 UTC
Target Upstream Version:
Embargoed:


Attachments
backported patch (1.03 KB, patch)
2012-02-17 08:50 UTC, Pavel Moravec


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2012:1278 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Grid 2.2 security update 2012-09-19 21:40:26 UTC

Description Pavel Moravec 2012-02-17 08:50:25 UTC
Created attachment 563855
backported patch

Description of problem:
In a scenario that has not yet been identified, a partitionable slot can be split into too many dynamic slots, exceeding the available memory and/or CPU cores. See https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2043 for a snapshot of condor_status.

Please backport the fix from https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2816 to condor-7.6.5-0.12.el5.


Version-Release number of selected component (if applicable):
condor-7.6.5-0.12.el5 


How reproducible:
unknown


Steps to Reproduce:
N/A
  

Actual results:
The scheduler assigns jobs that together consume more than the available memory and/or CPU cores.


Expected results:
At any given moment, only jobs whose combined requests fit within the available memory and CPU cores are running.


Additional info:
Attaching the upstream patch backported to condor-7.6.5-0.12.el5.

Comment 3 Luigi Toscano 2012-03-07 19:40:41 UTC
Is the scenario really unknown? Are there any new clues about the conditions under which this bug shows up?

Comment 4 Timothy St. Clair 2012-03-07 21:29:18 UTC
Best insight is in the dedicated scheduler.

Comment 7 Timothy St. Clair 2012-03-19 18:52:26 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Under certain conditions, a partitionable slot can be split into too many dynamic slots.
Consequence: The machine could potentially be oversubscribed.
Fix: Add logic to prevent a partitionable slot from being split beyond the resources it has available.
Result: The machine should not be oversubscribed.
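
The fix described in the note above can be pictured as a resource guard on the split operation. The following is a minimal Python sketch, not the actual condor patch: the `PartitionableSlot` class, its fields, and the `split` method are hypothetical names chosen for illustration, and only CPUs and memory are modeled.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionableSlot:
    """Hypothetical model of a partitionable slot (not condor code)."""
    cpus: int
    memory_mb: int
    dynamic_slots: list = field(default_factory=list)

    def split(self, req_cpus, req_memory_mb):
        """Carve out a dynamic slot, or return None if it would oversubscribe."""
        # Reject nonsensical requests outright.
        if req_cpus <= 0 or req_memory_mb <= 0:
            return None
        # The guard: never hand out more than what is still unallocated.
        if req_cpus > self.cpus or req_memory_mb > self.memory_mb:
            return None
        self.cpus -= req_cpus
        self.memory_mb -= req_memory_mb
        child = (req_cpus, req_memory_mb)
        self.dynamic_slots.append(child)
        return child


# A 4-core slot asked for five 1-core dynamic slots: the fifth is refused.
slot = PartitionableSlot(cpus=4, memory_mb=8192)
results = [slot.split(1, 1024) for _ in range(5)]
```

Without the guard, the fifth request would succeed and the machine would advertise five 1-core dynamic slots on four cores, which is the oversubscription reported here.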

Comment 9 Luigi Toscano 2012-04-25 16:13:33 UTC
If I understand condor ticket #2816 correctly, the issue seems to be 100% reproducible. According to condor ticket #204, the problem was seen "sporadically". What is a realistic expectation of how reproducible it is?

Comment 10 Timothy St. Clair 2012-04-25 16:54:43 UTC
"This is because the requirements expression in the slot ad is not properly evaluated."

One would need to construct a slot_ad such that it caused a match but failed to evaluate after the claim has been given and during the split process.  

The only thing I can think of is to insert an if-then clause in the requirements expression which causes it to fail *only* when it's evaluated on the startd.

Comment 14 Lubos Trilety 2012-06-19 14:23:05 UTC
Could you please specify more precisely how to reproduce this bug? Exactly what type of ifThenElse clause can cause the bug to happen?

Comment 15 Timothy St. Clair 2012-06-19 15:48:00 UTC
An ifThenElse on an attribute that exists only on the startd and is not present in the ad published to the collector.

Comment 16 Lubos Trilety 2012-06-20 12:11:11 UTC
(In reply to comment #15)
> if then else on a attribute which only exists on the startd, but is not
> present in the ad published to the collector.

OK, that much was clear. But I am only aware of the attributes that are published, and I don't want to search the source code for others. Could you please give a specific example of an if-then-else clause that fulfils these requirements?

Comment 17 Timothy St. Clair 2012-06-20 12:46:47 UTC
in submission: 

 Requirements = ifThenElse( PithyRetort =!= UNDEFINED, FALSE, TRUE)

only on startd: 

 PithyRetort = TRUE

And make sure that PithyRetort is not part of: http://research.cs.wisc.edu/condor/manual/v7.8/3_3Configuration.html#18154
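
To illustrate why this expression is expected to match against the collector's copy of the slot ad but fail on the startd, here is a small Python sketch. It is a simplified model and not the ClassAd library: `UNDEFINED`, `is_not_identical`, `if_then_else`, and the example ads are stand-ins invented for illustration, and only the behavior needed for this one expression is modeled.

```python
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def is_not_identical(a, b):
    """Model of the ClassAd `=!=` operator: always yields a plain boolean."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is not b
    return type(a) is not type(b) or a != b


def if_then_else(cond, then_value, else_value):
    """Model of ifThenElse() for a condition that is already a boolean."""
    return then_value if cond else else_value


def job_requirements(slot_ad):
    """Requirements = ifThenElse(PithyRetort =!= UNDEFINED, FALSE, TRUE)"""
    pithy = slot_ad.get("PithyRetort", UNDEFINED)
    return if_then_else(is_not_identical(pithy, UNDEFINED), False, True)


# Hypothetical ads: the collector's copy lacks the startd-only attribute.
collector_ad = {"Name": "slot1@host", "Cpus": 4}
startd_ad = dict(collector_ad, PithyRetort=True)

print(job_requirements(collector_ad))  # True: the job matches during negotiation
print(job_requirements(startd_ad))     # False: re-evaluation on the startd fails
```

The two evaluations disagree only because `PithyRetort` is absent from the published ad, which is the condition described in comments 10 and 15.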

Comment 19 Lubos Trilety 2012-06-21 14:12:53 UTC
The suggested scenario doesn't reproduce the bug.

Currently we aren't able to reproduce it.

Comment 22 errata-xmlrpc 2012-09-19 17:42:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1278.html

