Created attachment 563855
backported patch

Description of problem:
Under an unknown scenario, a partitionable slot can be split into too many dynamic slots, exceeding the available memory and/or CPU cores. See https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2043 for a snapshot of condor_status. This is a request to backport the fix from https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2816 to condor-7.6.5-0.12.el5.

Version-Release number of selected component (if applicable):
condor-7.6.5-0.12.el5

How reproducible:
unknown

Steps to Reproduce:
N/A

Actual results:
The scheduler assigns jobs that together consume more than the available memory and/or CPU cores.

Expected results:
At any moment, only jobs requesting no more than the available memory and/or CPU cores are run.

Additional info:
Attaching the upstream patch backported to condor-7.6.5-0.12.el5.
Is the scenario really unknown? Any new clue about the conditions when this bug can show up?
Best insight is in the dedicated scheduler.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
C: Under certain conditions a partitionable slot can split into too many dynamic slots.
C: The machine could potentially be oversubscribed.
F: Add logic to prevent a partitionable slot from being split into more dynamic slots than its available resources allow.
R: The machine should not be oversubscribed.
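To illustrate the kind of guard the technical note describes, here is a minimal Python sketch. It is not the upstream patch: the class `PartitionableSlot`, its fields, and the method `carve` are hypothetical names chosen for this example, and the real startd logic tracks more resources than CPUs and memory.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionableSlot:
    """Hypothetical model of a partitionable slot (not Condor's real data structure)."""
    cpus: int
    memory_mb: int
    dynamic_slots: list = field(default_factory=list)

    def carve(self, req_cpus: int, req_memory_mb: int) -> bool:
        """Split off a dynamic slot only if the remaining resources cover the request."""
        if req_cpus <= 0 or req_memory_mb <= 0:
            return False  # reject nonsensical requests
        if req_cpus > self.cpus or req_memory_mb > self.memory_mb:
            return False  # refusing here is what prevents oversubscription
        self.cpus -= req_cpus
        self.memory_mb -= req_memory_mb
        self.dynamic_slots.append((req_cpus, req_memory_mb))
        return True


# Assumed machine size for illustration: 4 cores, 8192 MB.
slot = PartitionableSlot(cpus=4, memory_mb=8192)
first = slot.carve(2, 4096)   # fits
second = slot.carve(2, 4096)  # fits, slot is now exhausted
third = slot.carve(1, 1024)   # must be refused
```

The point of the sketch is only that the check happens at split time, against what is left in the parent slot, rather than relying on the earlier match alone.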
If I understand condor ticket #2816 correctly, the issue seems to be 100% reproducible. According to condor ticket #204 the problem was seen "sporadically". What is a realistic expectation of how reproducible it is?
"This is because the requirements expression in the slot ad is not properly evaluated." One would need to construct a slot_ad such that it caused a match but failed to evaluate after the claim has been given and during the split process. The only thing I can think of is to insert an if-then clause in the requirements expression which causes it to fail *only* when it's evaluated on the startd.
Could you please specify more precisely how to reproduce this bug? Exactly what type of ifThenElse clause can cause the bug to happen?
if then else on an attribute which exists only on the startd, but is not present in the ad published to the collector.
(In reply to comment #15)
> if then else on a attribute which only exists on the startd, but is not
> present in the ad published to the collector.

OK, that much was clear. But I am only aware of the attributes that are published, and I don't want to parse the source code for others. Could you please write a specific example of an if-then-else clause which fulfils these requirements?
In the submission:

    Requirements = ifThenElse(PithyRetort =!= UNDEFINED, FALSE, TRUE)

Only on the startd:

    PithyRetort = TRUE

And make sure that PithyRetort is not part of: http://research.cs.wisc.edu/condor/manual/v7.8/3_3Configuration.html#18154
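The intent of the expression above can be sketched in Python: it models how the same Requirements expression gives different answers depending on whether PithyRetort is present in the ad being evaluated. This is a toy model, not the ClassAd library; the helper names and the sample Cpus/Memory values are assumptions made for illustration.

```python
# Sentinel standing in for the ClassAd UNDEFINED value.
UNDEFINED = object()


def is_defined(ad: dict, attr: str) -> bool:
    """Model of `attr =!= UNDEFINED`: always a strict TRUE/FALSE, never UNDEFINED."""
    return ad.get(attr, UNDEFINED) is not UNDEFINED


def job_requirements(machine_ad: dict) -> bool:
    """Model of ifThenElse(PithyRetort =!= UNDEFINED, FALSE, TRUE)."""
    return False if is_defined(machine_ad, "PithyRetort") else True


# Ad as published to the collector: PithyRetort is absent (sample values are assumed).
collector_ad = {"Cpus": 4, "Memory": 8192}
# Ad as seen locally on the startd: PithyRetort is set.
startd_ad = dict(collector_ad, PithyRetort=True)

matches_at_negotiation = job_requirements(collector_ad)  # job matches the slot
passes_on_startd = job_requirements(startd_ad)           # re-evaluation fails
```

Under these assumptions the job matches during negotiation but its requirements fail when re-evaluated on the startd, which is the mismatch the suggested reproducer tries to provoke.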
The suggested scenario doesn't reproduce the bug. Currently we aren't able to reproduce it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-1278.html