Bug 967479

Summary: Failed to provision guestrecipe because of missing distro
Product: [Retired] Beaker
Reporter: Marian Ganisin <mganisin>
Component: scheduler
Assignee: Raymond Mancy <rmancy>
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: unspecified
Version: 0.12
CC: aigao, asaha, atodorov, dcallagh, ebaak, llim, qwan, rmancy, xjia
Target Milestone: 0.14.3
Keywords: Regression
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-12-19 06:11:09 UTC
Attachments:
    scheduled_queued_recipes modified SQL
    SQL generated by http://gerrit.beaker-project.org/#/c/2457/2

Description Marian Ganisin 2013-05-27 09:00:19 UTC
Description of problem:

A job including a guestrecipe is assigned to a lab despite a distro required by the guestrecipe being missing there. After that it fails with the message:

Failed to provision recipeid 890896, No usable URL found for DistroTree(distro=Distro(name=u'RHEL-6.4'), variant=u'Client', arch=x86_64) in lab

where the distro name is the one defined in the guestrecipe and is not available in the mentioned lab.

Version-Release number of selected component (if applicable):
Version - 0.12.1

Expected results:
All recipes, including guestrecipes, are scheduled in a lab where the required distro is available.

Comment 1 Nick Coghlan 2013-05-28 00:54:19 UTC
This appears to be a regression caused by the new, more dynamic approach to guest provisioning (first used in 0.10, I believe).

If the host recipes in a recipe set have different distro requirements from one or more of the guest recipes, then it seems the recipe set may be scheduled in a lab that doesn't have access to all of the relevant distro trees for the guests.

Possible workaround for some cases: this situation shouldn't be possible if at least one host recipe in the recipe set needs the same distro tree as the affected guests. (This clearly isn't a good workaround for recipe sets where one host is running several guests that need different distro trees)

The real fix is to make sure we correctly take the distro tree requirements of the guest recipes into account when determining eligible labs for the recipe set.

Comment 2 Dan Callaghan 2013-05-28 23:06:39 UTC
Another workaround is to lock the host recipe to a lab which is known to have the necessary distros:

    <hostRequires>
        <labcontroller value="goodlab.example.com" />
    </hostRequires>

Comment 3 Dan Callaghan 2013-06-19 05:21:22 UTC
(In reply to Marian Ganisin from comment #0)
> Failed to provision recipeid 890896, No usable URL found for
> DistroTree(distro=Distro(name=u'RHEL-6.4'), variant=u'Client', arch=x86_64)
> in lab

Incidentally, why was RHEL-6.4 Client x86_64 not in the lab?

Comment 4 Dan Callaghan 2013-06-19 07:19:21 UTC
I don't think there's any reasonable way to fix this in 0.13.x.

We can add extra filter criteria easily enough in schedule_queued_recipe when picking systems. The problem is that the massive query for the outer loop in schedule_queued_recipes has to have exactly matching filter criteria (expressed in a single, massive query), otherwise beakerd will get into a busy loop. Adding an extra filter to that massive query to say "only systems in labs where every guest recipe of the host recipe is in the lab" is the hard part.
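
To make that concrete, the per-host-recipe check would need roughly the following shape (a sketch only; the table and column names are illustrative and not taken from the real Beaker schema or the actual scheduler queries):

    -- Candidate systems for a given host recipe, limited to labs that carry
    -- the distro tree of every guest recipe of that host recipe.
    -- (Illustrative table/column names, not the real Beaker schema.)
    SELECT system.id
    FROM system
    WHERE NOT EXISTS (
        SELECT 1
        FROM guest_recipe g
        WHERE g.host_recipe_id = :host_recipe_id
          AND NOT EXISTS (
              SELECT 1
              FROM distro_tree_lab_controller_map m
              WHERE m.distro_tree_id = g.distro_tree_id
                AND m.lab_controller_id = system.lab_controller_id));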

We could fix this once the "event driven scheduling" design is in place.

Comment 6 Dan Callaghan 2013-09-22 23:34:35 UTC
*** Bug 1010218 has been marked as a duplicate of this bug. ***

Comment 8 Nick Coghlan 2013-10-25 05:34:46 UTC
I think there may be two potentially simpler solutions to this. Currently, the scheduling loop goes to great effort to ensure that labs are still considered if the required distro shows up *after* the job is submitted.

We could just stop doing that. If the distro is available in the lab when the job is submitted, great, we consider that lab a candidate for running the recipe. If it isn't there yet, then tough, recipes in that job won't consider that lab.

Since this query is only run when queueing the recipe, it can take guest recipe distros into account fairly readily.


Then, once the event driven scheduling is in place, we can go back to more dynamic detection of distro availability. This option trades the current behaviour (aborted recipes) for potentially delayed execution (since jobs submitted shortly after a new tree is imported may not consider all available labs). Since an aborted job needs to be rescheduled and goes to the back of the queue *anyway* (and may abort again if the trees still haven't been imported in all labs), that would be an improvement on the status quo.

A variant on this idea would be to make a pass through the queued recipes whenever a distro is imported, updating the candidate systems for queued recipes appropriately.

I think this is substantially simpler than running the full query on every scheduling pass.

Comment 9 Raymond Mancy 2013-10-28 12:33:27 UTC
(In reply to Nick Coghlan from comment #8)
> I think there may be two potentially simpler solutions to this. Currently,
> the scheduling loop goes to great effort to ensure that labs are still
> considered if the required distro shows up *after* the job is submitted.
> 
> We could just stop doing that. If the distro is available in the lab when
> the job is submitted, great, we consider that lab a candidate for running
> the recipe. If it isn't there yet, then tough, recipes in that job won't
> consider that lab.
> 

Yes we could, and it would be similar to the kind of query we would have to do if we added it to the schedule_queued_recipes query (i.e. if recipes have a guest recipe, don't return them if the guest recipe distro is also not satisfied).

So if we are determined to add the check, adding it only at submission time doesn't gain us much and loses us a lot of that flexibility.

> Since this query is only run when queueing the recipe, it can take guest
> recipe distros into account fairly readily.
> 

I've been playing with the SQL, and getting it to take the guest_recipe distros into account in the outer schedule_queued_recipes query is quite doable (in that I've written the query, and it seems to work in the testing I've done). It does add a bit more girth to the existing schedule_queued_recipes query, but not much, and the optimizer seems to handle it just fine (tested on prod data as well as my own beaker env).

The problem now is getting it into SQLAlchemy without having it do unnecessary things (like eager-loading columns when joining an aliased table), which might add a bit of bloat. Once I figure that out I'll put a patch up for review.


> <snip>
> I think this is substantially simpler than running the full query on every
> scheduling pass.

The SQL itself is really not that much more complicated, it turns out. I've attached the SQL, with the additional lines commented to indicate them as such.

Comment 10 Raymond Mancy 2013-10-28 13:14:13 UTC
Created attachment 816806 [details]
scheduled_queued_recipes modified SQL

Comment 11 Nick Coghlan 2013-10-29 03:02:30 UTC
That SQL doesn't look right to me - for recipes with multiple guests, won't it return the host recipe/guest recipe pairs where both the host distro and the guest distro exist in the same lab, without any guarantee that *all* the guest distros exist in that lab?

In English terms, the query we need to run is:

  Tell me the recipes where:
  - the host distro exists in at least one lab
  - in at least one lab where the host distro exists, all the guest recipe distros also exist

Or, to put it the other way, we need to find at least one lab controller where the host recipe and *all* the guest recipes are available. Otherwise, the recipe can't be scheduled anywhere.
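
Spelled out as SQL, that condition looks roughly like the following (a sketch only; the table and column names are illustrative and not the actual schedule_queued_recipes query):

    -- Recipes that can be scheduled somewhere: there is at least one lab
    -- controller holding the host recipe's distro tree and the distro tree
    -- of every one of its guest recipes. (Illustrative names only.)
    SELECT r.id
    FROM recipe r
    WHERE EXISTS (
        SELECT 1
        FROM lab_controller lc
        WHERE EXISTS (
              SELECT 1 FROM distro_tree_lab_controller_map m
              WHERE m.distro_tree_id = r.distro_tree_id
                AND m.lab_controller_id = lc.id)
          AND NOT EXISTS (
              SELECT 1 FROM guest_recipe g
              WHERE g.host_recipe_id = r.id
                AND NOT EXISTS (
                    SELECT 1 FROM distro_tree_lab_controller_map m2
                    WHERE m2.distro_tree_id = g.distro_tree_id
                      AND m2.lab_controller_id = lc.id)));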

It's that "all distros exist in the lab" part that Dan was worried would be complex and expensive to determine on every scheduling pass. However, that's the ideal fix, since we'd automatically wait until the distros finished propagating and then run the recipe.

The status quo is that we try to run the recipe as soon as the host distro is available, which may cause it to fail when provisioning the guests rather than waiting for the mirroring to complete.

However, I realised that changing to *only* check at the start would mean that, instead of maybe failing after provisioning the host (if the host distro becomes available in the lab before the guest distro is available), such recipes would just fail immediately. So that's not a good idea.

Here's a completely different possibility: what if we changed the scheduling loop to include a separate "update candidate systems" step?

Then new->processed could skip the "all_systems" query, and we'd just update recipe.systems in a new "update_candidate_systems_for_queued_recipes" pass as the last step in the scheduling loop.

Comment 12 Raymond Mancy 2013-10-29 04:23:54 UTC
Yes of course you're right, I was only testing against hosts that had one guest recipe. Off the top of my head I'm not sure what the query to test for all guests would look like. It may or may not be too slow for us; I couldn't say until we had seen and tested the query.

Do you mean not attaching any candidate systems to the recipe in new->processed?
There are intermediate steps between new->processed and queued->scheduled that currently rely on recipe.systems (i.e. priority bumping, finding bad lab controllers), and schedule_queued_recipes itself relies on knowing when a candidate system is free at the start of the outer loop. I guess it's possible we could rearrange all of that, but I'm not sure if that's what we really want.

Comment 13 Nick Coghlan 2013-10-29 06:37:45 UTC
At the moment, we do two queries on the new->processed step:

systems = systems in labs where the distro is available. We require at least one of these or we abort the recipe.

all_systems = systems that meet the requirements, but may not have the relevant distro available yet. These are the ones we actually add to recipe.systems, and hence require that the "doesn't actually have the relevant distro yet" labs be filtered out. If this only returns one system, we bump the job priority.

You're right, though, processed->queued also uses that list to determine the set of eligible lab controllers for multi-host tests, and limiting it to just labs that already had the relevant distros could cause problems just after a new tree is made available.

Comment 16 Nick Coghlan 2013-10-30 07:23:58 UTC
Ray and I discussed this one with the aid of the whiteboard, and realised it can actually be split into two distinct problems:

1. Consistent guest distros

If all the guest recipes are using the same distro (e.g. using a known good version on the host to run a nightly build in one or more guests), then Ray's original SQL should do the right thing in the outer query.

2. Inconsistent guest distros

If one or more guests are running a *different* distro from the other guests, then Ray's first query update won't help. In particular, if any guest is running the same distro as the host, then problems may occur in exactly the same cases as they do now.

However, checking *all* guest recipes is a much more complex addition to the outer query. Ray had the idea of tweaking his original proposal to check specifically for the host recipe distro and the *newest* guest recipe distro existing in the same lab, rather than checking for *any* guest distro (as the original SQL does).

The inner query in schedule_queued_recipe can then do the more comprehensive check that ensures all guest distros for that host recipe are available rather than just the newest one.
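
A rough sketch of what that tweaked outer-query condition might look like (illustrative table and column names only; "newest" is approximated here by the highest distro tree id, and the real change is the Gerrit patch linked in comment 17):

    -- Candidate systems whose lab holds both the host recipe's distro tree
    -- and the newest guest recipe distro tree for that host recipe.
    -- (Illustrative names; applies to host recipes with at least one guest.)
    SELECT system.id
    FROM system
    WHERE EXISTS (
          SELECT 1 FROM distro_tree_lab_controller_map m
          WHERE m.distro_tree_id = :host_distro_tree_id
            AND m.lab_controller_id = system.lab_controller_id)
      AND EXISTS (
          SELECT 1 FROM distro_tree_lab_controller_map m2
          WHERE m2.lab_controller_id = system.lab_controller_id
            AND m2.distro_tree_id = (
                SELECT g.distro_tree_id
                FROM guest_recipe g
                WHERE g.host_recipe_id = :host_recipe_id
                ORDER BY g.distro_tree_id DESC
                LIMIT 1));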


Core assumption:

We also identified that one of our core assumptions is that the main case where this bug may cause problems is when:

- the host distro has already been mirrored to all labs (e.g. it's a released version of a distro)
- the guest distro is available in at least one lab (so the recipe gets queued in the first place), but is still being mirrored to other labs (e.g. it's a new nightly build)

The outcome we *want* is for Beaker to only consider the lab controllers where both distros needed are present, while still waiting for the mirroring to occur if the labs that already have both trees are in high demand.

Comment 17 Raymond Mancy 2013-11-08 01:03:46 UTC
Patch with the 'latest guest distro' fix. http://gerrit.beaker-project.org/#/c/2457/

Comment 18 Raymond Mancy 2013-11-08 06:31:35 UTC
Created attachment 821432 [details]
SQL generated by http://gerrit.beaker-project.org/#/c/2457/2

Comment 21 xjia 2013-11-12 12:16:12 UTC
Question:
How does the user get to know that the latest distro is missing on the lab controller? There's no message to tell them.

Comment 22 Dan Callaghan 2013-11-12 21:18:41 UTC
(In reply to xjia from comment #21)
> How does the user get to know that the latest distro is missing on the lab
> controller? There's no message to tell them.

If you have a specific distro tree in mind, you can find it in Beaker (Distros -> Trees from the menu, then search) and the Lab Controllers tab on the distro tree page will show which labs it is present in. Is that what you mean?

Comment 23 Raymond Mancy 2013-11-13 00:23:49 UTC
(In reply to xjia from comment #21)
> Question:
> How does the user get to know that the latest distro is missing on the lab
> controller? There's no message to tell them.

If the distro is missing, it is only ever supposed to be a temporary issue.
The idea is that the previously missing distro will soon appear (i.e. be synced) and thus the user shouldn't have to worry about it. The code comments specifically address this issue with similar wording.

However, we also decided that we should deal with the scenario where the syncing service fails, and the distro is never fully synced to all lab controllers. I've created BZ#1029706.