Bug 794660 - Partitionable slots can create more dynamic slots than CPUs
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: condor
Version: 2.1
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 2.2
Target Release: ---
Assigned To: Timothy St. Clair
QA Contact: Lubos Trilety
done
:
Depends On:
Blocks: 828434
Reported: 2012-02-17 03:50 EST by Pavel Moravec
Modified: 2012-09-19 14:03 EDT

See Also:
Fixed In Version: condor-7.6.5-0.15
Doc Type: Bug Fix
Doc Text:
C: Under certain conditions a partitionable slot can split into too many dynamic slots. C: The machine could potentially be oversubscribed. F: Add logic to prevent a partitionable slot from splitting into more dynamic slots than its available resources allow. R: The machine should not be oversubscribed.
Last Closed: 2012-09-19 13:42:50 EDT


Attachments
backported patch (1.03 KB, patch)
2012-02-17 03:50 EST, Pavel Moravec


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2012:1278 | normal | SHIPPED_LIVE | Moderate: Red Hat Enterprise MRG Grid 2.2 security update | 2012-09-19 17:40:26 EDT

Description Pavel Moravec 2012-02-17 03:50:25 EST
Created attachment 563855
backported patch

Description of problem:
In a scenario that has not yet been identified, a partitionable slot can be split into too many dynamic slots, so that they claim more memory and/or CPU cores than the machine has. See https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2043 for a snapshot of condor_status.

Please backport the fix from https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2816 to condor-7.6.5-0.12.el5.


Version-Release number of selected component (if applicable):
condor-7.6.5-0.12.el5 


How reproducible:
unknown


Steps to Reproduce:
N/A
  

Actual results:
The scheduler assigns jobs that together consume more than the available memory and/or CPU cores.


Expected results:
At any given moment, only jobs whose combined requests fit within the available memory and CPU cores are running.


Additional info:
Attaching the upstream patch, backported to condor-7.6.5-0.12.el5.
Comment 3 Luigi Toscano 2012-03-07 14:40:41 EST
Is the scenario really unknown? Are there any new clues about the conditions under which this bug shows up?
Comment 4 Timothy St. Clair 2012-03-07 16:29:18 EST
Best insight is in the dedicated scheduler.
Comment 7 Timothy St. Clair 2012-03-19 14:52:26 EDT
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
C: Under certain conditions a partitionable slot can split into too many dynamic slots.
C: The machine could potentially be oversubscribed.
F: Add logic to prevent a partitionable slot from splitting into more dynamic slots than its available resources allow.
R: The machine should not be oversubscribed.
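
A minimal Python sketch of the kind of guard described in the "F:" line above. It is not the actual Condor patch: the class `PartitionableSlot`, its fields, and the method `try_split` are hypothetical names used only for illustration, and only CPUs and memory are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionableSlot:
    """Hypothetical model of a partitionable slot; not Condor's real data structure."""
    cpus: int
    memory_mb: int
    dynamic_slots: list = field(default_factory=list)

    def try_split(self, req_cpus: int, req_memory_mb: int) -> bool:
        """Carve out a dynamic slot only if the remaining resources cover the request."""
        if req_cpus <= 0 or req_memory_mb <= 0:
            return False
        if req_cpus > self.cpus or req_memory_mb > self.memory_mb:
            # Refuse the split instead of oversubscribing the machine.
            return False
        self.cpus -= req_cpus
        self.memory_mb -= req_memory_mb
        self.dynamic_slots.append((req_cpus, req_memory_mb))
        return True


slot = PartitionableSlot(cpus=2, memory_mb=2048)
# Three identical requests against a 2-CPU slot: the third must be refused.
results = [slot.try_split(1, 1024) for _ in range(3)]
print(results)
```

The point of the sketch is only that the check happens at split time, against what the slot still has left, regardless of what was matched earlier.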
Comment 9 Luigi Toscano 2012-04-25 12:13:33 EDT
If I understand condor ticket #2816 correctly, the issue seems to be 100% reproducible. According to condor ticket #204, the problem was seen "sporadically". What is a realistic expectation of how reproducible it is?
Comment 10 Timothy St. Clair 2012-04-25 12:54:43 EDT
"This is because the requirements expression in the slot ad is not properly evaluated."

One would need to construct a slot ad whose requirements expression produces a match but then fails to evaluate after the claim has been given out, during the split process.

The only thing I can think of is to insert an if-then clause in the requirements expression which causes it to fail *only* when it's evaluated on the startd.
Comment 14 Lubos Trilety 2012-06-19 10:23:05 EDT
Could you please specify more precisely how to reproduce this bug? Exactly what type of ifThenElse clause can cause the bug to happen?
Comment 15 Timothy St. Clair 2012-06-19 11:48:00 EDT
An ifThenElse on an attribute that exists only on the startd and is not present in the ad published to the collector.
Comment 16 Lubos Trilety 2012-06-20 08:11:11 EDT
(In reply to comment #15)
> if then else on a attribute which only exists on the startd, but is not
> present in the ad published to the collector.

OK, that much was clear. But I only know the attributes that are published, and I don't want to dig through the source code to find the others. Could you please write a specific example of an if-then-else clause that fulfils these requirements?
Comment 17 Timothy St. Clair 2012-06-20 08:46:47 EDT
in submission: 

 Requirements = ifThenElse( PithyRetort =!= UNDEFINED, FALSE, TRUE)

only on startd: 

 PithyRetort = TRUE

And make sure that PithyRetort is not part of: http://research.cs.wisc.edu/condor/manual/v7.8/3_3Configuration.html#18154
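
To illustrate why this expression is expected to behave differently in the two places, here is a small Python sketch that mimics the evaluation. It is not ClassAd code: `UNDEFINED`, `requirements`, and the two dictionaries are stand-ins, and the `Cpus`/`Memory` values are made up. It assumes the ad seen at match time lacks `PithyRetort` while the startd's own ad defines it.

```python
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def requirements(slot_ad: dict) -> bool:
    """Mimics: ifThenElse(PithyRetort =!= UNDEFINED, FALSE, TRUE)."""
    pithy = slot_ad.get("PithyRetort", UNDEFINED)
    is_defined = pithy is not UNDEFINED  # PithyRetort =!= UNDEFINED
    return False if is_defined else True


# Ad as published to the collector: PithyRetort is absent (illustrative values).
collector_ad = {"Cpus": 4, "Memory": 4096}
# Ad as seen on the startd: PithyRetort is set locally.
startd_ad = {**collector_ad, "PithyRetort": True}

matched_by_negotiator = requirements(collector_ad)
accepted_on_startd = requirements(startd_ad)
print(matched_by_negotiator, accepted_on_startd)
```

The sketch shows only the evaluation asymmetry (match succeeds, later evaluation on the startd fails); whether that asymmetry actually triggers the over-splitting is a separate question.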
Comment 19 Lubos Trilety 2012-06-21 10:12:53 EDT
The suggested scenario doesn't reproduce the bug.

Currently we aren't able to reproduce it.
Comment 22 errata-xmlrpc 2012-09-19 13:42:50 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1278.html
