Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 794660

Summary: Partitionable slots can create more dynamic slots than CPUs
Product: Red Hat Enterprise MRG
Reporter: Pavel Moravec <pmoravec>
Component: condor
Assignee: Timothy St. Clair <tstclair>
Status: CLOSED ERRATA
QA Contact: Lubos Trilety <ltrilety>
Severity: high
Docs Contact:
Priority: high
Version: 2.1
CC: jneedle, ltoscano, ltrilety, matt, mkudlej, tstclair
Target Milestone: 2.2
Target Release: ---
Hardware: All
OS: Linux
Whiteboard: done
Fixed In Version: condor-7.6.5-0.15
Doc Type: Bug Fix
Doc Text:
C: Under certain conditions a partitionable slot can split into too many dynamic slots.
C: The machine could potentially be oversubscribed.
F: Add logic to prevent a partitionable slot from splitting more than the resources it has available to it.
R: The machine should not be oversubscribed.
Last Closed: 2012-09-19 17:42:50 UTC
Bug Blocks: 828434    
Attachments:
Description Flags
backported patch none

Description Pavel Moravec 2012-02-17 08:50:25 UTC
Created attachment 563855
backported patch

Description of problem:
Under an unknown scenario, a partitionable slot can be split into too many dynamic slots - more than available memory and/or CPU cores. See https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2043 for a snapshot of condor_status.

Please backport the fix from https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2816 to condor-7.6.5-0.12.el5.


Version-Release number of selected component (if applicable):
condor-7.6.5-0.12.el5 


How reproducible:
unknown


Steps to Reproduce:
N/A
  

Actual results:
The scheduler assigns jobs that consume more than the available memory and/or CPU cores.


Expected results:
At any moment, only jobs that together request no more than the available memory and CPU cores are running.
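
To make the expected result concrete, here is a minimal Python sketch of the check. It is not Condor code: the DynamicSlot record, the oversubscription() helper, and the slot names and sizes are all hypothetical and only illustrate the condition "sum of dynamic slot resources must not exceed the machine totals".

```python
from dataclasses import dataclass


@dataclass
class DynamicSlot:
    """Hypothetical record for one dynamic slot (not a Condor data structure)."""
    name: str
    cpus: int
    memory_mb: int


def oversubscription(total_cpus, total_memory_mb, dynamic_slots):
    """Return (cpus_over, memory_mb_over) for a machine.

    Both values are 0 when the dynamic slots fit within the machine totals.
    """
    used_cpus = sum(slot.cpus for slot in dynamic_slots)
    used_memory = sum(slot.memory_mb for slot in dynamic_slots)
    return (max(0, used_cpus - total_cpus),
            max(0, used_memory - total_memory_mb))


# Made-up example: three 1-CPU dynamic slots on a 2-CPU machine.
slots = [
    DynamicSlot("slot1_1", 1, 1024),
    DynamicSlot("slot1_2", 1, 1024),
    DynamicSlot("slot1_3", 1, 1024),
]
cpus_over, memory_over = oversubscription(2, 4096, slots)
print(cpus_over, memory_over)
```

A non-zero value in either position corresponds to the "Actual results" described above.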


Additional info:
Attaching the upstream patch backported to condor-7.6.5-0.12.el5.

Comment 3 Luigi Toscano 2012-03-07 19:40:41 UTC
Is the scenario really unknown? Any new clue about the conditions when this bug can show up?

Comment 4 Timothy St. Clair 2012-03-07 21:29:18 UTC
Best insight is in the dedicated scheduler.

Comment 7 Timothy St. Clair 2012-03-19 18:52:26 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: Under certain conditions a partitionable slot can split into too many dynamic slots.
C: The machine could potentially be oversubscribed.
F: Add logic to prevent a partitionable slot from splitting more than the resources it has available to it.
R: The machine should not be oversubscribed.
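
The kind of guard the "F:" line describes can be sketched as follows. This is an illustration only, assuming a simplified model with just CPUs and memory; the PartitionableSlot class and its split() method are hypothetical and are not taken from the attached patch or from the startd source.

```python
class PartitionableSlot:
    """Hypothetical model of a partitionable slot tracking free resources."""

    def __init__(self, cpus, memory_mb):
        self.free_cpus = cpus
        self.free_memory_mb = memory_mb
        self.dynamic_slots = []

    def split(self, request_cpus, request_memory_mb):
        """Carve out a dynamic slot, or return None if resources are short."""
        if request_cpus <= 0 or request_memory_mb <= 0:
            raise ValueError("requests must be positive")
        # The guard: refuse the split when the request exceeds what is left.
        if (request_cpus > self.free_cpus
                or request_memory_mb > self.free_memory_mb):
            return None
        self.free_cpus -= request_cpus
        self.free_memory_mb -= request_memory_mb
        dynamic_slot = (request_cpus, request_memory_mb)
        self.dynamic_slots.append(dynamic_slot)
        return dynamic_slot


# Made-up example: a 2-CPU, 2048 MB partitionable slot.
pslot = PartitionableSlot(cpus=2, memory_mb=2048)
first = pslot.split(1, 1024)
second = pslot.split(1, 1024)
third = pslot.split(1, 1024)  # refused: nothing left to carve from
print(first, second, third)
```

With the guard in place, the third request is refused instead of producing a third dynamic slot on a 2-CPU machine.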

Comment 9 Luigi Toscano 2012-04-25 16:13:33 UTC
If I understand condor ticket #2816 correctly, the issue seems to be 100% reproducible. According to condor ticket #2043, the problem was seen "sporadically". What is a realistic expectation of how reproducible it is?

Comment 10 Timothy St. Clair 2012-04-25 16:54:43 UTC
"This is because the requirements expression in the slot ad is not properly evaluated."

One would need to construct a slot_ad such that it caused a match but failed to evaluate after the claim has been given and during the split process.  

The only thing I can think of is to insert an if-then clause in the requirements expression which causes it to fail *only* when it's evaluated on the startd.

Comment 14 Lubos Trilety 2012-06-19 14:23:05 UTC
Could you please specify more precisely how to reproduce this bug? Exactly what type of ifThenElse clause can cause the bug to happen?

Comment 15 Timothy St. Clair 2012-06-19 15:48:00 UTC
if-then-else on an attribute which only exists on the startd, but is not present in the ad published to the collector.

Comment 16 Lubos Trilety 2012-06-20 12:11:11 UTC
(In reply to comment #15)
> if-then-else on an attribute which only exists on the startd, but is not
> present in the ad published to the collector.

OK, that much was clear. But I am only aware of the attributes that are published, and I don't want to parse the source code for the others. Could you please write a specific example of an if-then-else clause that fulfils these requirements?

Comment 17 Timothy St. Clair 2012-06-20 12:46:47 UTC
in submission: 

 Requirements = ifThenElse( PithyRetort =!= UNDEFINED, FALSE, TRUE)

only on startd: 

 PithyRetort = TRUE

And make sure that PithyRetort is not part of: http://research.cs.wisc.edu/condor/manual/v7.8/3_3Configuration.html#18154
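
Why this expression behaves differently in the two places can be illustrated with a small Python simulation. It does not use the ClassAd library: the ads are plain dicts, UNDEFINED is a stand-in sentinel, and the Cpus/Memory values are made up. It only mirrors the logic of ifThenElse(PithyRetort =!= UNDEFINED, FALSE, TRUE), where =!= is the ClassAd "is not identical to" operator.

```python
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def requirements(slot_ad):
    """Mimic ifThenElse(PithyRetort =!= UNDEFINED, FALSE, TRUE)."""
    pithy_retort = slot_ad.get("PithyRetort", UNDEFINED)
    return False if pithy_retort is not UNDEFINED else True


# Ad as seen via the collector: PithyRetort is not published (made-up values).
collector_ad = {"Cpus": 4, "Memory": 4096}

# Ad as seen on the startd: PithyRetort = TRUE is defined locally.
startd_ad = dict(collector_ad, PithyRetort=True)

print(requirements(collector_ad))  # match against the published ad
print(requirements(startd_ad))     # re-evaluation on the startd
```

Under these assumptions the requirements are satisfied against the published ad but fail when evaluated on the startd, which is the mismatch described in comment 10.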

Comment 19 Lubos Trilety 2012-06-21 14:12:53 UTC
The suggested scenario doesn't reproduce the bug.

Currently we aren't able to reproduce it.

Comment 22 errata-xmlrpc 2012-09-19 17:42:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1278.html