Bug 709012 - multi-host recipe sets can be assigned to suboptimal labs
Summary: multi-host recipe sets can be assigned to suboptimal labs
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Beaker
Classification: Retired
Component: scheduler
Version: 0.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: beaker-dev-list
QA Contact:
URL:
Whiteboard: Scheduler
Duplicates: 1014218
Depends On: 1127129
Blocks:
 
Reported: 2011-05-30 11:58 UTC by Martin Kudlej
Modified: 2023-09-14 01:23 UTC
CC List: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-10-21 14:14:40 UTC
Embargoed:



Comment 2 Bill Peck 2011-05-31 12:53:13 UTC
It may be possible to sort the systems by the number of systems in a lab.

Comment 3 Marian Csontos 2011-05-31 13:31:48 UTC
IIUC no sorting will help us when:
- there are no free matching machines, as is often the case
- and, while the job is first in the queue, a machine from a small lab becomes available first

What we should try to do is:

- first fix (assign) any free systems in a given lab to matching recipes
- and then maximize the lowest number of available systems for the remaining recipes in that lab

Example:

An RS asks for 2 machines, one of type A and one of type B.
lab-1 contains 1 machine of type A and 4 machines of type B.
lab-2 contains 2 machines of type A and 2 machines of type B.

If there are no free machines, we would prefer lab-2, as its lowest number of available machines is 2 (for type A, where lab-1 has only 1 such machine).

However, if there is a free machine of type A in both lab-1 and lab-2, lab-1 becomes the best choice, as we then only need a machine of type B, where lab-1 wins 4:2.
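The selection rule in the example above can be sketched as a "maximize the bottleneck" score. This is a hypothetical illustration, not Beaker's actual scheduler code; the lab and machine-type names mirror the example:

```python
# Hypothetical sketch of the rule described above: score each lab by its
# scarcest required machine type, then pick the lab whose bottleneck is
# largest.

def pick_lab(labs, requirements):
    """labs: {lab_name: {machine_type: available_count}}
    requirements: one machine type per remaining recipe in the set."""
    def bottleneck(counts):
        # A lab is only as good as its scarcest required machine type.
        return min(counts.get(req, 0) for req in requirements)
    return max(labs, key=lambda name: bottleneck(labs[name]))

labs = {
    "lab-1": {"A": 1, "B": 4},
    "lab-2": {"A": 2, "B": 2},
}
# No machines free yet: lab-2 wins, its bottleneck is 2 vs lab-1's 1.
print(pick_lab(labs, ["A", "B"]))  # lab-2
# Type A already satisfied by a free machine: only B remains, lab-1 wins 4:2.
print(pick_lab(labs, ["B"]))       # lab-1
```

Note how the preferred lab flips depending on which requirements are still outstanding, which is exactly why a static sort of systems cannot capture this.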

Another option would be rescheduling MH recipe sets when a better alternative becomes available. (But we would still need the ability to identify better alternatives.)

Also, this does not make much sense for tasks using widely available machines; picking at random would be good enough for them.

This looks like a non-trivial task, and I am inclined towards a solution using human intelligence, where adding an interface to pick the LC could be a good enough solution.

Comment 4 Nick Coghlan 2012-10-17 04:34:57 UTC
Bulk reassignment of issues as Bill has moved to another team.

Comment 6 Dan Callaghan 2013-10-02 01:29:59 UTC
*** Bug 1014218 has been marked as a duplicate of this bug. ***

Comment 7 Nick Coghlan 2013-10-02 04:24:07 UTC
Note that the LC can already be forced through hostRequires. However, it may make sense to provide a straightforward option to pick an LC when using the workflow commands in the bkr client.

Comment 8 Nick Coghlan 2014-08-08 01:47:10 UTC
This is on hold until we evaluate the possibility of switching to a more capable scheduling engine.

Comment 10 Nick Coghlan 2014-08-11 06:42:22 UTC
Dan, Jaroslav suggested a possible heuristic that would favour labs with the most free systems, as a general rule for *all* recipes. It costs an additional database query per recipe scheduled, but it might be worthwhile, since it means the first recipe scheduled in a recipe set will favour the lab with the most "headroom" for running the recipes that don't have any strict hardware requirements.
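The suggested heuristic amounts to ordering candidate labs by their current free-system count. An illustrative sketch (not Beaker's actual query; lab names and counts are invented):

```python
# Illustrative sketch of the "most headroom first" heuristic: among
# candidate labs, try the one with the most free systems first, leaving
# room for the later recipes in the set.

def order_labs_by_headroom(free_counts):
    """free_counts: {lab_name: number of currently free systems}.
    Returns lab names sorted with the most free systems first."""
    return sorted(free_counts, key=free_counts.get, reverse=True)

print(order_labs_by_headroom({"lab-1": 3, "lab-2": 7, "lab-3": 5}))
# ['lab-2', 'lab-3', 'lab-1']
```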

Comment 11 Nick Coghlan 2014-08-11 07:38:25 UTC
Although as Marian discusses in comment 3, there's definitely a chance for this to go wrong when dealing with recipe sets containing more restrictive host requirements. Thus the effectiveness of Jaroslav's suggested heuristic (which mirrors Bill's original suggestion) may depend heavily on the specific kinds of jobs submitted.

I wonder if we could figure out a way to analyse the activity history of our main instance to identify cases where the additional heuristic would have helped and where it would have hindered.

Comment 12 Dan Callaghan 2014-08-18 01:52:54 UTC
Indeed, I don't think Jaroslav's idea will help with this problem in the general case. The real problem here is that we pin a recipe set to a lab controller as soon as the first recipe in the set is scheduled, without any way to pick a different lab. It's the same problem we keep having with the scheduler's greedy approach of picking a system as soon as one is free.

Comment 13 Ales Zelinka 2015-01-31 15:12:35 UTC
(In reply to Dan Callaghan from comment #12)
> The real problem here is that we pin a recipe set to a lab
> controller as soon as the first recipe in the set is scheduled

Getting bit by this repeatedly:

Multihost recipe set: the 1st recipe gets assigned to a lab in which the requirements on the second machine can't be satisfied. Deadlock: a broken recipe and wasted resources.

> https://beaker.engineering.redhat.com/jobs/864396
> https://beaker.engineering.redhat.com/jobs/864395

Related:
> https://bugzilla.redhat.com/show_bug.cgi?id=1071364#c8

Comment 14 Nick Coghlan 2015-02-02 05:24:24 UTC
When selecting the initial set of candidate systems, the scheduling algorithm assumes that if a new distro is missing it will eventually show up in all labs, allowing the recipe set to proceed even if it has to wait for the sync to finish.

Has anything happened recently to make that assumption invalid? Are some of the arch specific trees not getting replicated to the other labs properly?

Comment 16 Stefan Kremen 2016-10-03 08:52:48 UTC
Increasing priority, we are blocked by this every week for our Tier testing.
Please fix ASAP!

Comment 17 Roman Joost 2016-10-04 00:26:13 UTC
Dear Stefan,

I'm sorry this is causing a lot of pain for you. At this point in time, the item Nick raised in comment 8 still stands: we haven't evaluated a different scheduling engine, and the current one in use will not suffice to address this problem. The best way to address this is to use the workaround Jan posted to limit the recipes to a single lab controller using:

 <hostlabcontroller value="<lab host>" op="="/>
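For context, a sketch of how that element sits inside a recipe's <hostRequires> block in the job XML (the lab controller hostname here is invented for illustration):

```xml
<recipe>
  <hostRequires>
    <!-- Pin this recipe to a single lab controller; hostname is illustrative -->
    <hostlabcontroller op="=" value="lab-controller-1.example.com"/>
  </hostRequires>
</recipe>
```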

Comment 18 Stefan Kremen 2016-10-05 13:05:19 UTC
Hi Roman,

we have tested the aforementioned workaround, and for our use case we ended up with the following one:

<hostname op="!=" value="amd-seattle-04.ml3.eng.bos.redhat.com"/>

This will exclude the disputed machine (the only aarch64 machine in the BOS lab) from being scheduled, and at the same time we aren't limited to a single lab controller (for example RDU).

Thanks, and good luck with your work on the new scheduler.
Stefan

Comment 19 Ales Zelinka 2017-01-09 14:50:11 UTC
It's been almost a year since my last comment here; this warrants a refresh :)

The usual still keeps happening: a ppc64 machine was picked from BOS, and the recipe set is now waiting for an aarch64 machine to be bought and racked in BOS as well, while the RDU aarch64 machines are idling.

> RS:2728217

Comment 22 Red Hat Bugzilla 2023-09-14 01:23:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

