Red Hat Bugzilla – Bug 851354
RFE: Allow scheduling of jobs against Manual and Broken systems
Last modified: 2015-07-26 18:14:51 EDT
Description of problem:
A system on loan was rendered useless because it was marked Broken after it failed to install a stable distro. Recovering from this should not require the lab admin's intervention.
Use case: how is a kernel/anaconda developer supposed to examine bugs that prevent a system from installing or booting?
Version-Release number of selected component (if applicable):
Bulk reassignment of issues as Bill has moved to another team.
#790492 is possibly related (the solution proposed there is to provide a way to opt in to allowing a system to be provisioned while Broken when that system is explicitly requested, in order to be able to diagnose/fix issues with the hardware).
Given the discussion on bug 1062706, I'm proposing that we add this to the development list for Beaker 0.16. My suggestion is that we do something like the following.
1. Allow hostRequires to explicitly specify a system instead of host requirements:
This is mutually exclusive with normal host selection criteria.
2. If a recipe specifies a specific system in this way, then the scheduler allows the recipe to be scheduled as long as the system is either free or already reserved by the submitting user.
3. If the user already had the system reserved before the recipe started, then the scheduler ensures it does not get powered off at the end of the recipe execution.
Amit pointed out the context of this proposal wasn't clear. Some key use cases:
1. More easily checking if a Broken system is working again before putting it back to Automated (e.g. by running an inventory scan)
2. Marian's original use case above, running further tests on a Broken system to figure out why it broke.
3. Running inventory scans on Manual systems
4. Running other automated tests on Manual systems without having to make them generally available to satisfy arbitrary hostRequires checks
Dan also noted that points two and three in comment 4 aren't really feasible without implementing bug 639938 first (and perhaps not even then), so existing reservations by the current user should instead cause the submitted recipe to be queued, just as they do for reservations of Automated systems today.
Preliminary patch: http://gerrit.beaker-project.org/#/c/2864/1
To summarize what is being changed: If the recipe's <hostRequires> has something like this:
<name op="=" value="testsystem1.test.fqdn" />
<numanodes op="=" value="5" />
<memory op="=" value="1000" />
Then the specified system's status is not taken into account. The other criteria are ignored as well; they are simply not taken into account.
It does look like this will enable the use cases mentioned above. However, does it allow anything that we don't want to allow?
When I suggested using a <system/> element, it didn't occur to me that we already had a system element. I'd like this to use a *new* subelement of hostRequires and update the Relax-NG spec to make it mutually exclusive with the normal search criteria.
Since <system/> is taken, I suggest:
<force name="system.fqdn"/>
The idea is that "Automated" now means "is available to satisfy criteria based system requests through the scheduler", while "Manual" systems can still be accessed through the scheduler, but only if you force scheduling on that particular system.
If the new selection mechanism isn't mutually exclusive with the old one, we would lose that Automated/Manual distinction entirely, and that would be a serious problem.
(In reply to Nick Coghlan from comment #7)
> When I suggested using a <system/> element, it didn't occur to me that we
> already had a system element. I'd like this to use a *new* subelement of
> hostRequires and update the Relax-NG spec to make it mutually exclusive with
> the normal search criteria.
> Since <system/> is taken, I suggest:
> <force name="system.fqdn"/>
I'm not sure if it's a good idea to have the verb as the element name.
<system value='system.fqdn' force='1' />
This way you don't need to worry about any mutual-exclusion logic for the case where both <force/> and <system/> are present.
Also, I worry that equating 'force' with 'Manual' is not intuitive. Unless we're likely to have other system modes that could fall under the 'force' umbrella, why not just use:
<system value='system.fqdn' allow_manual='1' />
> The idea is that "Automated" now means "is available to satisfy criteria
> based system requests through the scheduler", while "Manual" systems can
> still be accessed through the scheduler, but only if you force scheduling on
> that particular system.
> If the new selection mechanism isn't mutually exclusive with the old one, we
> would lose that Automated/Manual distinction entirely, and that would be a
> serious problem.
After some discussion on IRC, the preferred syntax for this feature is now:
The Relax-NG schema will be updated to ensure that using the "force" attribute is mutually exclusive with using any of the filter elements.
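For illustration only, here is a minimal Python sketch of the mutual-exclusion rule described above (it is not the Relax-NG change itself). It assumes the "force" attribute sits on the <hostRequires> element and that any child element counts as a filter. The function name check_host_requires and the sample FQDN are hypothetical and not part of Beaker.

```python
import xml.etree.ElementTree as ET


def check_host_requires(xml_text):
    """Return the forced FQDN, or None if normal filtering applies.

    Assumption: 'force' is an attribute of <hostRequires>, and every
    child element of <hostRequires> is a filter element.
    Raises ValueError if 'force' is combined with any filter element.
    """
    host_requires = ET.fromstring(xml_text)
    if host_requires.tag != "hostRequires":
        raise ValueError("expected a <hostRequires> element")

    forced_fqdn = host_requires.get("force")
    if forced_fqdn is None:
        # No 'force' attribute: ordinary criteria-based selection.
        return None

    # len(element) is the number of direct child elements.
    if len(host_requires) > 0:
        raise ValueError("'force' cannot be combined with filter elements")
    return forced_fqdn
```

Under these assumptions, `check_host_requires('<hostRequires force="system.fqdn"/>')` returns the FQDN, while adding a `<memory/>` child to the same element raises `ValueError`.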
The docs updates for this should clearly explain that the intended use case for this feature is *not* the same as the use case for the existing automated scheduling support.
In the existing automated scheduling, the general intent is that the user *shouldn't care* exactly which system their job runs on - they may constrain it to a particular lab, or a particular architecture, or a particular set of systems, or systems with a particular piece of hardware, but the case of selecting a particular system by name is a degenerate one that is only supported incidentally, rather than being what the filtering system is *for*. Setting a system to "Automated" is a matter of saying "this system is available to satisfy arbitrary host selection criteria for recipes".
By contrast, the new mechanism is designed to allow users to say "I want this job to run on that specific system, right there, so long as I have permission to do so, and as soon as nobody else is using it". It's the kind of behaviour we want in order to be able to properly automate inventory scans and machine health tests, even if the system is nominally in Manual or Broken mode.
This fundamentally changes the nature of the "Manual" and "Broken" states - they no longer mean "this system is not available to the scheduler". Instead, they mean "this system is not available to satisfy arbitrary host selection criteria, it must be specifically requested". This makes the manual and broken states far more useful - a system in Manual mode will still allow users with access to run recipes on it when they want to, without making it a candidate system for *all* of the jobs they submit, while Broken systems will still support running jobs that are designed to check whether or not they're actually Broken, or if they were flagged as such due to a transient environmental issue or a software failure.
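To make the new meaning of the states concrete, here is a rough Python sketch of the selection semantics described above. It is not Beaker's scheduler code: the System class, the candidate_systems function and the matches callback are hypothetical names, and permission checks and the "as soon as nobody else is using it" queueing are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class System:
    fqdn: str
    status: str  # "Automated", "Manual" or "Broken"


def candidate_systems(systems, forced_fqdn=None, matches=lambda system: True):
    """Return the systems a recipe may be scheduled on.

    forced_fqdn: FQDN explicitly requested by the recipe, if any.
    matches: hypothetical stand-in for the ordinary host filter criteria.
    """
    if forced_fqdn is not None:
        # Explicit request: the system's status is not consulted,
        # so Manual and Broken systems are eligible too.
        return [s for s in systems if s.fqdn == forced_fqdn]
    # Criteria-based request: only Automated systems may satisfy it.
    return [s for s in systems if s.status == "Automated" and matches(s)]
```

With one Automated and one Broken system in the list, a criteria-based request only ever sees the Automated one, while a request that names the Broken system gets exactly that system back.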
Fixed review link: http://gerrit.beaker-project.org/#/c/2864 (the new Gerrit doesn't make it obvious when you're not looking at the most recent version of a patch)
Beaker 0.17.0 has been released.