It may be possible to sort the systems by the number of systems in a lab.
IIUC, no sorting will help us when:
- there are no matching machines free, as is often the case, and
- the job is first in the queue when a machine from a small lab becomes available.

What we should try to do is:
- fix any free systems in the given lab, and
- maximize the lowest number of available systems for the remaining recipes in that lab.

Example: an RS asks for 2 machines, one meeting A and one meeting B.
- lab-1 contains 1 machine of type A and 4 machines of type B
- lab-2 contains 2 machines of type A and 2 machines of type B

If there are no free machines, we would prefer lab-2, as its lowest number of available machines is 2 (for type A, lab-1 has only 1 such machine). However, if there is a free machine of type A in lab-1 and a free machine of type A in lab-2, lab-1 becomes the best choice, as we then only need a machine of type B, where lab-1 wins 4:2.

Another option would be rescheduling MH recipe sets when a better alternative becomes available. (But we would still need the ability to identify better alternatives.) Also, this does not make much sense for tasks using widely available machines; picking at random would be good enough for them.

This looks like a non-trivial task, and I lean towards a solution using human intelligence, where adding an interface to pick the LC could be a good enough solution.
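To make the example concrete, here is a minimal Python sketch of the "fix free systems, then maximize the minimum" idea. It is not Beaker code: `lab_score`, `pick_lab`, and the plain dictionaries of per-type machine counts are hypothetical names and data shapes chosen for illustration, and it assumes each recipe needs exactly one machine of a single named type.

```python
import math
from collections import Counter


def lab_score(recipe_types, total, free):
    """Score one lab for a recipe set (hypothetical helper, not a Beaker API).

    recipe_types: list of machine types, one per recipe, e.g. ["A", "B"]
    total: dict mapping type -> number of machines of that type in the lab
    free: dict mapping type -> number of currently free machines of that type
    """
    free_left = Counter(free)
    remaining = []
    for machine_type in recipe_types:
        if free_left[machine_type] > 0:
            # "Fix" a free system for this recipe.
            free_left[machine_type] -= 1
        else:
            remaining.append(machine_type)
    if not remaining:
        # Every recipe can start right away in this lab.
        return math.inf
    # The scarcest machine type among the recipes that still have to wait.
    return min(total.get(machine_type, 0) for machine_type in remaining)


def pick_lab(recipe_types, labs, free_by_lab):
    """Return the lab name with the highest score (first one wins on ties)."""
    return max(
        labs,
        key=lambda lab: lab_score(recipe_types, labs[lab], free_by_lab.get(lab, {})),
    )


labs = {
    "lab-1": {"A": 1, "B": 4},
    "lab-2": {"A": 2, "B": 2},
}

# No free machines anywhere: lab-2 wins, because its scarcest type has 2 machines.
print(pick_lab(["A", "B"], labs, {}))

# One free type-A machine in each lab: only B is left to wait for, lab-1 wins 4:2.
print(pick_lab(["A", "B"], labs, {"lab-1": {"A": 1}, "lab-2": {"A": 1}}))
```

With real host requirements a system can match several recipes at once, so the simple per-type counting above would need to become a proper matching step; the sketch only reproduces the two cases from the example.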
Bulk reassignment of issues as Bill has moved to another team.
*** Bug 1014218 has been marked as a duplicate of this bug. ***
Note that the LC can already be forced through hostRequires. However, it may make sense to provide a straightforward option to pick an LC when using the workflow commands in the bkr client.
This is on hold until we evaluate the possibility of switching to a more capable scheduling engine.
Dan, Jaroslav suggested a possible heuristic that would favour labs with the most free systems as a general rule for *all* recipes. It's an additional database query per recipe scheduled, but it might be worthwhile, since it means the first recipe scheduled in a recipe set will favour the lab with the most "headroom" to run recipes that don't have any strict hardware requirements.
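A rough Python sketch of what that heuristic amounts to, assuming the extra query returns one `(system, lab)` pair per currently free system. The function name and the input shape are hypothetical, not part of the scheduler:

```python
from collections import Counter


def lab_with_most_free_systems(free_systems):
    """Pick the lab that currently has the most free systems.

    free_systems: iterable of (system_name, lab_name) pairs, standing in for
    the result of the additional per-recipe database query (an assumption).
    Returns None when nothing is free.
    """
    counts = Counter(lab for _system, lab in free_systems)
    if not counts:
        return None
    # On a tie, the lab seen first in the input wins.
    return max(counts, key=counts.get)


free = [
    ("sys-1", "lab-1"),
    ("sys-2", "lab-2"),
    ("sys-3", "lab-2"),
]
print(lab_with_most_free_systems(free))
```

Note that this only counts free systems overall; it does not check whether the chosen lab can satisfy the other recipes in the set.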
However, as Marian discusses in comment 3, there's definitely a chance for this to go wrong when dealing with recipe sets containing more restrictive host requirements. Thus the effectiveness of Jaroslav's suggested heuristic (which mirrors Bill's original suggestion) may depend heavily on the specific kinds of jobs submitted. I wonder if we could figure out a way to analyse the activity history of our main instance to identify cases where the additional heuristic would have helped and where it would have hindered.
Indeed, I don't think Jaroslav's idea will help with this problem in the general case. The real problem here is that we pin a recipe set to a lab controller as soon as the first recipe in the set is scheduled, without any way to pick a different lab. It's the same problem we keep having with the scheduler's greedy approach of picking a system as soon as one is free.
(In reply to Dan Callaghan from comment #12)
> The real problem here is that we pin a recipe set to a lab
> controller as soon as the first recipe in the set is scheduled

Getting bit by this repeatedly: Multihost recipeset, 1st recipe gets assigned to a lab in which the requirements on second machine can't be satisfied. Deadlock; broken recipe and wasted resources.
> https://beaker.engineering.redhat.com/jobs/864396
> https://beaker.engineering.redhat.com/jobs/864395

Related:
> https://bugzilla.redhat.com/show_bug.cgi?id=1071364#c8
When selecting the initial set of candidate systems, the scheduling algorithm assumes that if a new distro is missing it will eventually show up in all labs, allowing the recipe set to proceed even if it has to wait for the sync to finish. Has anything happened recently to make that assumption invalid? Are some of the arch specific trees not getting replicated to the other labs properly?
Increasing priority, we are blocked by this every week for our Tier testing. Please fix ASAP!
Dear Stefan, I'm sorry that this is causing a lot of pain for you. At this point in time, the item Nick raised in comment 8 still stands. We haven't evaluated a different scheduling engine, and the one currently in use will not suffice to address this problem. The best way to address this is to use the workaround Jan posted, which limits the recipes to a single lab controller:
<hostlabcontroller value="<lab host>" op="="/>
Hi Roman, we have tested the aforementioned workaround, and for our use case we ended up with the following one:
<hostname op="!=" value="amd-seattle-04.ml3.eng.bos.redhat.com"/>
This will exclude the disputed machine (the only aarch64 in the BOS controller) from being scheduled, and at the same time we aren't limited to a single lab controller (for example RDU). Thanks, and good luck with your work on the new scheduler. Stefan
It's been almost a year since my last comment here, so this warrants a refresh :) The usual keeps happening: a ppc64 gets picked from BOS, and the recipe set is now waiting for an aarch64 to be bought and racked in BOS as well, while the RDU aarch64s are idling.
> RS:2728217
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days