Red Hat Bugzilla – Bug 1470959
deadlock between multi-host recipe sets with host requirements can sometimes occur if jobs are dirty
Last modified: 2017-09-19 23:53:06 EDT
Bug 889065 tried to avoid deadlocks between multi-host recipe sets with related host requirements by using a very blunt hammer: any recipe belonging to a recipe set in which at least one recipe has already been scheduled is given priority over any recipe whose recipe set has not started scheduling.
This was achieved by adding .order_by(RecipeSet.lab_controller == None) in the schedule_queued_recipes routine, when it is looking for candidate recipes ready to be scheduled. This effectively sorts any recipe where recipe_set.lab_controller has been set to the top of the list. This attribute is set when the first recipe in the set is scheduled.
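To illustrate the ordering trick, here is a minimal plain-Python sketch of what that sort key does. It is not the actual beakerd query: the QueuedRecipe class, the candidate_order helper, and the lab controller hostname are hypothetical names made up for illustration. In SQL a false comparison sorts before a true one, so rows whose recipe set already has a lab controller come first; Python's sorted() behaves the same way with booleans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueuedRecipe:
    # Hypothetical stand-in for a queued recipe row joined to its recipe set.
    recipe_id: int
    recipe_set_id: int
    lab_controller: Optional[str]  # set once any recipe in the set is scheduled


def candidate_order(recipes):
    # Mimics .order_by(RecipeSet.lab_controller == None):
    # False sorts before True, so recipes whose recipe set already has a
    # lab controller come first. recipe_id is only a tie-breaker here.
    return sorted(recipes, key=lambda r: (r.lab_controller is None, r.recipe_id))


queued = [
    QueuedRecipe(2, 1, None),
    QueuedRecipe(3, 1, None),
    QueuedRecipe(6, 4, "lab.example.com"),  # its recipe set has already started
]
ordered_ids = [r.recipe_id for r in candidate_order(queued)]
print(ordered_ids)
```

Recipe 6 jumps ahead of the lower-numbered recipes only because its recipe set has already been assigned a lab controller.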
As an example:
RS:1
    R:2 (requires system-a)
    R:3 (requires system-b)
RS:4
    R:5 (requires system-a)
    R:6 (requires system-b)
Initially system-a and system-b are both busy.
Then system-a becomes free, and R:2 is scheduled onto it (since its id is lower, assuming RS:1 and RS:4 are equal priority).
Then system-b becomes free, and R:3 *should* be scheduled onto it, regardless of any other factors such as priority. This is the .order_by() fix for bug 889065.
*However* this fix assumes that all potentially deadlocking recipes will be selected by the query -- and thus that just tweaking the ordering of the results is enough to ensure the correct recipe is picked.
One of the filter criteria in all scheduler queries is to exclude dirty jobs. This dates to when beakerd was running update_dirty_jobs concurrently with the scheduling routines (bug 807237), although we later had to serialize it to avoid database deadlocks even between unrelated jobs (bug 952587).
In the example above, if RS:1 and RS:4 are each part of a much larger job which includes unrelated recipe sets that have already started running (and thus the harness is marking those jobs dirty as it sends new results), then at each scheduling pass either R:2 or R:5 could be excluded from the query, effectively at random. If R:2 is excluded because its job is dirty but R:5's job is not dirty, then the scheduler will pick R:5 instead. The .order_by() cannot help, because it only reorders the rows the query returns, and this leads to deadlock.
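The failure mode can be reproduced with a small simulation. This is only a sketch under stated assumptions, not beakerd code: the Recipe class, the run_passes helper, and the job ids 10 and 20 are hypothetical, and it assumes RS:1 and RS:4 belong to two different jobs and that a system stays reserved until its whole recipe set has been scheduled. Each pass names the system that just became free and the set of jobs that happen to be dirty at that moment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recipe:
    recipe_id: int
    recipe_set_id: int
    job_id: int  # hypothetical job ids, not taken from the bug report
    requires: str
    system: Optional[str] = None  # system the recipe was scheduled onto


def run_passes(passes, skip_dirty):
    recipes = [
        Recipe(2, 1, 10, "system-a"),
        Recipe(3, 1, 10, "system-b"),
        Recipe(5, 4, 20, "system-a"),
        Recipe(6, 4, 20, "system-b"),
    ]
    started_sets = set()  # recipe sets whose lab_controller would be set
    for free_system, dirty_jobs in passes:
        # The "query": queued recipes that can use the free system,
        # optionally excluding recipes whose job is currently dirty.
        candidates = [
            r for r in recipes
            if r.system is None
            and r.requires == free_system
            and not (skip_dirty and r.job_id in dirty_jobs)
        ]
        # The bug 889065 ordering: already-started recipe sets first.
        candidates.sort(
            key=lambda r: (r.recipe_set_id not in started_sets, r.recipe_id)
        )
        if candidates:
            chosen = candidates[0]
            chosen.system = free_system
            started_sets.add(chosen.recipe_set_id)
    return {r.recipe_id: r.system for r in recipes}


# Pass 1: system-a frees up while job 10 (RS:1) is dirty.
# Pass 2: system-b frees up while job 20 (RS:4) is dirty.
passes = [("system-a", {10}), ("system-b", {20})]
with_filter = run_passes(passes, skip_dirty=True)
without_filter = run_passes(passes, skip_dirty=False)
print(with_filter)
print(without_filter)
```

With the dirty filter in place, R:5 ends up on system-a and R:3 on system-b, so each recipe set holds the system the other one needs. With the filter removed, the same two passes schedule R:2 and then R:3, and RS:1 can run to completion.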
The obvious solution is to just remove the criterion which is filtering out dirty jobs from the scheduler queries. By definition the harness cannot be sending updates for a recipe that is still queued. But I'm not sure if that could break things in some other way.