Description of problem:
When there are multiple multi-host jobs queued, scheduling quite often ends in a deadlock.
From my observation this is caused by the way the recipes are scheduled.
They are first sorted by (priority, id), and then a machine is looked up for each of them in turn. If any suitable machine is found, it is assigned.
The problem is that with, say, 3 different jobs the following can happen: job 1 is running (and thus holding some systems); job 2 reaches the scheduling loop and does not find any suitable system. A microsecond later job 1 finishes and returns its systems. Job 3 now finds free systems and takes them.
This timing-dependent processing ruins any scheduling predictability with respect to jobs of the same priority. Worse, it can lead to deadlocks, because any job can take any systems provided they happen to be available just in time for its scheduling SELECT. Many jobs get at least one system, but it may end up that each of them holds resources needed for the completion of at least one other job.
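The deadlock mechanism described above can be reduced to a small sketch. This is not the actual beakerd code; the job names, system names, and event timeline are invented purely to illustrate how per-recipe SELECTs, each seeing only what is free at its own instant, can leave every job with a partial allocation:

```python
# Two multi-host jobs, J2 and J3, each need 2 systems to run.
free = []                      # nothing is free when the pass starts
held = {"J2": [], "J3": []}
needs = {"J2": 2, "J3": 2}

# Timeline of the scheduling pass: each per-recipe SELECT sees only
# what happens to be free at that instant, and a running job returns
# its systems in the middle of the pass.
events = [
    ("select", "J2"),          # J2's recipe: nothing free yet, skipped
    ("free", "sys-a"),         # running job finishes, returns a system
    ("select", "J3"),          # J3's recipe grabs it, jumping the queue
    ("free", "sys-b"),         # another system comes back
    ("select", "J2"),          # J2 gets only one of the two it needs
    ("select", "J3"),          # nothing left for J3's second recipe
]
for kind, arg in events:
    if kind == "free":
        free.append(arg)
    elif free:
        held[arg].append(free.pop(0))

# Each job holds one system but needs two: neither can ever start.
print(held)  # {'J2': ['sys-b'], 'J3': ['sys-a']}
```

Nothing here is wrong with any single SELECT; the deadlock emerges purely from the interleaving of SELECTs and system returns.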
The only way I can see to avoid this would be NOT to have a separate session for each recipe. That would guarantee consistent reads when the systems are looked up, so systems returned during the scheduling loop would not be counted until the next loop starts. (beakerd.py, line 325)
Maybe it would be worth trying to reverse the logic a bit: first look up the free systems, then try to match them to the recipes (sorted as they are now). The first eligible recipe would then be assigned any newly available system(s); all competing recipes would be behind it, which would force FCFS scheduling on recipes with the same priority and shared systems.
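The reversed logic could look roughly like the sketch below. Again, the function name, the recipe tuple shape, and the system names are all invented for illustration; the point is only that a single snapshot of free systems is taken once, and recipes sorted by (priority, id) pick from it in order, so the lower id always wins a contested system:

```python
def schedule_pass(free_systems, recipes):
    """One scheduling pass over a single consistent snapshot.

    recipes: list of (priority, recipe_id, acceptable_systems) tuples.
    Returns a dict mapping recipe_id -> assigned system.
    """
    assignments = {}
    available = set(free_systems)          # one snapshot for the whole pass
    # Walk recipes in (priority, id) order: same sort as the current code,
    # but now every recipe sees the same set of free systems.
    for prio, rid, acceptable in sorted(recipes, key=lambda r: (r[0], r[1])):
        match = next((s for s in sorted(available) if s in acceptable), None)
        if match is not None:
            assignments[rid] = match
            available.discard(match)       # later recipes queue behind
    return assignments

# Two same-priority recipes competing for one shared system:
# the lower id wins, regardless of when the system became free.
print(schedule_pass(
    ["sys-a"],
    [(1, 20, {"sys-a"}), (1, 30, {"sys-a"})],
))  # {20: 'sys-a'}
```

Because systems returned mid-pass are not in the snapshot, they simply wait for the next pass, where the lowest-id eligible recipe picks them up first.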
Version-Release number of selected component (if applicable):
0.9.4; other versions are probably also vulnerable, as the code is the same.
Steps to Reproduce:
1. Schedule many multi-host jobs.
2. Observe that they are sometimes not assigned systems according to their ID.
Actual results:
Scheduling is not consistent and may lead to deadlocks.
Expected results:
No deadlocks; sequenced and predictable scheduling. A job with a lower ID must be
finished (or scheduled) before a job with the same priority and a higher ID gets any system from the shared set.
In this logic it would also make sense that when a new high-priority job appears while a low-priority job already has a system scheduled, the system would be taken back and assigned to the higher-priority job.
There are many flaws in our current scheduling algorithm; this is one of the more serious ones.
*** Bug 889065 has been marked as a duplicate of this bug. ***
The draft of our scheduler redesign (which will fix this along with several other problems) is at http://beaker-project.org/dev/proposals/event-driven-scheduler.html
The "sweeping wave" scheduling model we currently use (where all recipes are advanced through a given state change in parallel, rather than each recipe being pushed through as many states as possible, as they will be in the new design) is unfortunately fundamentally broken :(
*** Bug 884683 has been marked as a duplicate of this bug. ***
Beaker 0.12 has been released.