Bug 872187 - Remove scheduling deadlocks
Summary: Remove scheduling deadlocks
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Community
Component: scheduler
Version: 0.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: 0.12
Assignee: Raymond Mancy
QA Contact: Dan Callaghan
URL:
Whiteboard: Scheduler
Keywords:
Duplicates: 884683
Depends On:
Blocks:
Reported: 2012-11-01 14:11 UTC by Jaroslav Kortus
Modified: 2018-02-06 00:41 UTC (History)

Clone Of:
Last Closed: 2013-04-11 04:56:15 UTC



Description Jaroslav Kortus 2012-11-01 14:11:03 UTC
Description of problem:
When several multi-host jobs are queued, scheduling quite often ends in a deadlock.

From my observation, this happens because of how the recipes are scheduled.
They are first sorted by (prio, id), and then machines are looked up for each recipe in turn. If any suitable machine is found, it is assigned.

The problem shows up with, say, 3 different jobs. Job 1 is running (and thus holding some systems). Job 2 reaches the schedule loop and does not find any suitable system. A microsecond later job 1 finishes and returns its systems. Job 3 now finds free systems and takes them, even though job 2 was ahead of it.

This behaviour is very unfortunate and ruins any scheduling predictability among jobs with the same priority. Worse, it can lead to deadlocks, because any job can take any system that happens to be available at the moment of its scheduling SELECT. Many jobs then get at least one system, and each may end up holding resources needed for the completion of at least one other job.
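The race described above can be reproduced with a small simulation. This is only a sketch and not Beaker's actual code: the function names, the tuple layout of a recipe, and the `released_after` hook (which models another job returning a system in the middle of a pass) are all made up for illustration.

```python
def run_schedule_pass(recipes, free, released_after=None):
    """One pass of a greedy per-recipe loop (hypothetical model).

    recipes: list of (recipe_id, job_id, candidate_systems), already
             sorted by (prio, id).
    free: set of system names that are free when the pass starts.
    released_after: {recipe_id: system} for systems that become free right
             after that recipe has been examined.
    Returns {recipe_id: system} for every recipe that got a system.
    """
    released_after = released_after or {}
    free = set(free)  # work on a copy so the caller's set is untouched
    holding = {}
    for recipe_id, _job_id, candidates in recipes:
        usable = sorted(candidates & free)
        if usable:
            # Greedy: take the first suitable system immediately.
            holding[recipe_id] = usable[0]
            free.discard(usable[0])
        if recipe_id in released_after:
            free.add(released_after[recipe_id])
    return holding


def stuck_jobs(recipes, holding):
    """Jobs that hold at least one system while still missing another."""
    got_system = {}
    for recipe_id, job_id, _candidates in recipes:
        got_system.setdefault(job_id, []).append(recipe_id in holding)
    return sorted(job for job, got in got_system.items()
                  if any(got) and not all(got))


# Two 2-host jobs competing for the only two systems, x and y.
recipes = [
    (1, "A", {"x", "y"}), (2, "A", {"x", "y"}),
    (3, "B", {"x", "y"}), (4, "B", {"x", "y"}),
]
# Only x is free at first; y is returned just after recipe 2 was examined.
holding = run_schedule_pass(recipes, {"x"}, released_after={2: "y"})
print(holding)
print(stuck_jobs(recipes, holding))
```

In this run job A ends up holding x and job B holding y. Each needs the system the other one holds, so neither multi-host job can ever start.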

The only way I can see to avoid this would be NOT to have separate sessions for each recipe. This would guarantee consistent reads when the systems are looked up (so systems returned during a schedule loop would not be counted until the next loop starts). (beakerd.py, line 325)
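A rough illustration of the difference between per-recipe reads and one consistent read per loop follows. It is a sketch under assumptions: `Pool` is a hypothetical stand-in for the database whose answer changes between reads, and `schedule_pass` is not the function in beakerd.py.

```python
class Pool:
    """Hypothetical stand-in for the database: each read may see more systems."""

    def __init__(self, states):
        self.states = list(states)
        self.calls = 0

    def read(self):
        # Return the next state; stay on the last state once exhausted.
        state = self.states[min(self.calls, len(self.states) - 1)]
        self.calls += 1
        return set(state)


def schedule_pass(recipes, read_free_systems, use_snapshot):
    """recipes: list of (recipe_id, candidate_systems) sorted by (prio, id)."""
    snapshot = read_free_systems() if use_snapshot else None
    taken = {}
    for recipe_id, candidates in recipes:
        # Either reuse the single read from the start of the pass,
        # or read again for every recipe (separate session per recipe).
        visible = snapshot if use_snapshot else read_free_systems()
        usable = sorted((candidates & visible) - set(taken.values()))
        if usable:
            taken[recipe_id] = usable[0]
    return taken


recipes = [(1, {"x", "y"}), (2, {"x", "y"}), (3, {"x", "y"})]
states = [{"x"}, {"x"}, {"x", "y"}]  # y is returned before the third read

live = schedule_pass(recipes, Pool(states).read, use_snapshot=False)
frozen = schedule_pass(recipes, Pool(states).read, use_snapshot=True)
print(live)    # per-recipe reads: recipe 3 overtakes recipe 2
print(frozen)  # single read: y waits for the next pass
```

With the snapshot, the system returned mid-pass stays unassigned until the next loop, where recipe 2 is examined before recipe 3.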

Maybe it would be worth trying to reverse the logic a bit: first look up the free systems and then try to match them to recipes (sorted as they are now). The first eligible recipe would then be assigned any newly available system(s). All competing recipes would be behind it, and this would force FCFS scheduling on recipes with the same prio and shared systems.
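The reversed, systems-first matching suggested here could look roughly like the sketch below. The dictionary keys, the function name, and the convention that a lower `priority` number means more urgent are assumptions for the example and are not taken from Beaker.

```python
def match_systems_first(free_systems, recipes):
    """Give each free system to the first queued recipe that can use it.

    recipes: list of dicts with 'id', 'priority' and 'candidates' (a set of
             system names). Lower 'priority' sorts first (assumption).
    Returns {recipe id: system name}.
    """
    queue = sorted(recipes, key=lambda r: (r["priority"], r["id"]))
    assignment = {}
    for system in sorted(free_systems):
        for recipe in queue:
            if recipe["id"] in assignment:
                continue  # this recipe already got a system in this pass
            if system in recipe["candidates"]:
                assignment[recipe["id"]] = system
                break  # system is gone, move on to the next free system
    return assignment


recipes = [
    {"id": 3, "priority": 1, "candidates": {"x", "y"}},
    {"id": 1, "priority": 1, "candidates": {"x", "y"}},
    {"id": 2, "priority": 1, "candidates": {"y"}},
]
result = match_systems_first({"x", "y"}, recipes)
print(result)
```

Recipes 1 and 2 are served before recipe 3 regardless of when each system became free. Note that this sketch orders single assignments only; it does not by itself reserve all systems of a multi-host job at once.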

Version-Release number of selected component (if applicable):
0.9.3 (tested)
0.9.4 also probably vulnerable as the code is the same

How reproducible:
eventually 100%

Steps to Reproduce:
1. schedule many multihost jobs
2. observe that they are not assigned systems according to their ID sometimes
  
Actual results:
scheduling is not consistent and may lead to deadlocks

Expected results:
No deadlocks; sequenced and predictable scheduling. A job with a lower ID must be finished (or scheduled) before a job with the same prio and a higher ID gets any system from the shared set.

It would also make sense under this logic that when a new high prio job appears and a low prio job already has a system scheduled, that system is taken back and assigned to the higher prio job.

Additional info:

Comment 1 Dan Callaghan 2012-11-01 22:25:18 UTC
There are many flaws in our current scheduling algorithm; this is one of the more serious.

Comment 2 Marian Csontos 2013-01-01 19:37:57 UTC
*** Bug 889065 has been marked as a duplicate of this bug. ***

Comment 3 Nick Coghlan 2013-02-08 01:39:26 UTC
The draft of our scheduler redesign (which will fix this along with several other problems) is at http://beaker-project.org/dev/proposals/event-driven-scheduler.html

The "sweeping wave" scheduling model we currently use (where all recipes are advanced through a given state change in parallel, rather than each recipe being pushed through as many states as possible, as they will be in the new design) is unfortunately fundamentally broken :(

Comment 7 Raymond Mancy 2013-02-13 05:35:20 UTC
http://gerrit.beaker-project.org/#/c/1718/

Comment 8 Raymond Mancy 2013-02-14 00:47:53 UTC
http://gerrit.beaker-project.org/#/c/1720/

Comment 9 Raymond Mancy 2013-03-20 03:55:48 UTC
*** Bug 884683 has been marked as a duplicate of this bug. ***

Comment 12 Dan Callaghan 2013-04-11 04:56:15 UTC
Beaker 0.12 has been released.

