Bug 1519589
| Summary: | schedule_queued_recipes() query is too complicated and fragile | | |
|---|---|---|---|
| Product: | [Retired] Beaker | Reporter: | Dan Callaghan <dcallagh> |
| Component: | general | Assignee: | Dan Callaghan <dcallagh> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | tools-bugs <tools-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 24 | CC: | achatter, dcallagh, mjia, rjoost |
| Target Milestone: | 25.0 | Keywords: | FutureFeature, Patch |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-03-19 04:18:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 911515 | | |
Description (Dan Callaghan, 2017-12-01 00:36:55 UTC)
Our best bet for solving this is still the old "event-driven scheduler" proposal: https://beaker-project.org/dev/proposals/event-driven-scheduler.html or at least its key idea, which is to have the scheduler look at newly-freed systems and find a recipe which can run on each system, instead of constantly polling for queued recipes which have runnable systems as it does now. This requires us to track a new piece of state on a system: whether the system has become available and there might be a queued recipe which can be scheduled onto it (this is the "pending" state for systems in the design proposal doc). Rather than continuing to abuse the already-overloaded system "status" / "condition", this might require us to also clean up that mess (see bug 1039842).

There is actually another possible type of "event" which the original design proposal didn't consider. When a system becomes free and we look for any queued recipes which can run on it, we need to account for lab controller restrictions (that is, a recipe is only runnable if its distro, and all its guest recipe distros, are available in the lab where the system is located). However, if those restrictions mean that we *don't* find any runnable queued recipe and we put the system back into the free state, then a distro later being imported into that lab could make some queued recipes runnable again. So effectively, importing a distro is also an event which could have scheduling implications. The problem is that the design assumes there is a simple way to poll for things needing action -- either recipes in New state, or systems in Pending state. But there is no nice way to write a loop which polls the database looking for "a tree is now in a lab that it wasn't in before"...

Another "event" to consider is when a system loan is returned. While it's loaned, a system is only eligible to run recipes owned by the loanee.
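The core idea from the proposal can be sketched as a loop over systems in the new "pending" state, rather than over all queued recipes. This is only a minimal illustration, not Beaker's actual code: the `SystemStatus`, `System`, `Recipe`, and `runnable` names here are hypothetical stand-ins.

```python
from dataclasses import dataclass
from enum import Enum

class SystemStatus(Enum):
    idle = "idle"        # free, nothing for the scheduler to do
    pending = "pending"  # freed or changed; scheduler must re-examine
    busy = "busy"        # running a recipe

@dataclass
class System:
    fqdn: str
    lab: str
    status: SystemStatus = SystemStatus.idle

@dataclass
class Recipe:
    id: int
    # Labs in which this recipe's distro tree (and all of its guest
    # recipes' distro trees) are available.
    distro_labs: frozenset

def runnable(recipe: Recipe, system: System) -> bool:
    # Lab controller restriction: the recipe's distros must be
    # available in the lab where the system is located.
    return system.lab in recipe.distro_labs

def schedule_pending_systems(systems, queued_recipes):
    """Instead of polling all queued recipes for runnable systems,
    look only at systems flagged 'pending' and try to find one queued
    recipe for each."""
    assignments = []
    for system in [s for s in systems if s.status is SystemStatus.pending]:
        recipe = next((r for r in queued_recipes if runnable(r, system)), None)
        if recipe is not None:
            system.status = SystemStatus.busy
            queued_recipes.remove(recipe)
            assignments.append((recipe.id, system.fqdn))
        else:
            # Nothing runnable right now: go back to idle. Some later
            # event (distro import, loan return) must flip the system
            # back to pending, or matching queued recipes could wait
            # forever -- which is exactly the corner case discussed below.
            system.status = SystemStatus.idle
    return assignments
```

The `else` branch is where the distro-import problem bites: once the system drops back to idle, nothing re-examines it unless some event flips it to pending again.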
But once the loan is returned, it may be eligible to run other recipes. That one can easily be handled by just flipping the system status back to "pending" when the loan is returned, to force the scheduler to reconsider it.

This implements sort of the bare minimum idea from the "event driven scheduler" proposal: https://gerrit.beaker-project.org/5939 Most of the existing beakerd mechanisms are left as-is (messy as they are). There are opportunities for further cleanup down the road.

I still need to find a solution for the problem in comment 2 (it's broken in my initial version of the patch I just posted). I am thinking to hijack the "dirty" state of a job as a way of flagging beakerd to re-examine queued recipes after their distro tree has been added to a new lab. Not sure how ugly that will be; I'll try a version of it tomorrow.

(In reply to Dan Callaghan from comment #4)
> I still need to find a solution for the problem in comment 2 (it's broken in
> my initial version of the patch I just posted). I am thinking to hijack the
> "dirty" state of a job as a way of flagging beakerd to re-examine queued
> recipes after their distro tree has been added to a new lab. Not sure how
> ugly that will be, I'll try a version of it tomorrow.

Instead of messing around with dirty jobs (which would also require a bunch of special handling in the _update_status methods, just for this case), I realised a simpler approach would be to use the same @event.listens_for trick as I have in the patch for the other cases, like where a system's loan is changed. Flip any potentially affected systems back to "pending" and let the scheduler re-evaluate them to see if there is now any queued recipe for them; otherwise they will go back to "idle".

Looks like we have a viable solution here. The latest version of the patch covers every possible corner case I can think of and seems to work. We will know for sure once it soaks on beaker-devel for a while.

Beaker 25.0 has been released.
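For the loan-return case, the @event.listens_for trick could look roughly like this. This is a minimal sketch using SQLAlchemy attribute events, not Beaker's actual model: the `System` table, its column names, and the string status values are all illustrative assumptions.

```python
from sqlalchemy import Column, Integer, String, event
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class System(Base):
    __tablename__ = 'system'
    id = Column(Integer, primary_key=True)
    fqdn = Column(String)
    status = Column(String)            # 'idle' / 'pending' / 'busy'
    loaned_to = Column(String, nullable=True)

# active_history=True makes SQLAlchemy pass the real previous value to
# the listener instead of a NO_VALUE placeholder for unloaded attributes.
@event.listens_for(System.loaned_to, 'set', active_history=True)
def loan_returned(target, value, oldvalue, initiator):
    # Returning a loan may make the system eligible to run other queued
    # recipes, so flip it back to 'pending' and let the scheduler
    # re-evaluate it; if nothing is runnable it goes back to 'idle'.
    if value is None and oldvalue is not None and target.status == 'idle':
        target.status = 'pending'
```

The same pattern would be registered on whatever attribute changes when a distro tree becomes available in a new lab, flipping the affected lab's idle systems back to "pending" rather than hijacking the job "dirty" flag.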
Release notes are available upstream: https://beaker-project.org/docs/whats-new/release-25.html