Bug 851354 - RFE: Allow scheduling of jobs against Manual and Broken systems
Status: CLOSED CURRENTRELEASE
Product: Beaker
Classification: Community
Component: scheduler
Version: 0.9
Hardware: Unspecified Unspecified
Priority: unspecified, Severity: unspecified
Target Milestone: 0.17
Target Release: ---
Assigned To: Amit Saha
Keywords: FutureFeature
Depends On:
Blocks: 790492 846185 1093224 1093226
Reported: 2012-08-23 18:06 EDT by Marian Csontos
Modified: 2015-07-26 18:14 EDT (History)
Doc Type: Enhancement
Last Closed: 2014-06-10 19:28:12 EDT
Type: Bug

Description Marian Csontos 2012-08-23 18:06:21 EDT
Description of problem:
A system on loan becomes useless once it is marked Broken after failing to install a stable distro. Recovering from this should not require a lab admin's intervention.

Use case: how is a kernel/anaconda developer supposed to examine bugs that prevent a system from installing or booting?

Version-Release number of selected component (if applicable):
0.9.2
Comment 1 Nick Coghlan 2012-10-17 00:39:53 EDT
Bulk reassignment of issues as Bill has moved to another team.
Comment 2 Nick Coghlan 2012-11-01 21:05:12 EDT
#790492 is possibly related (the proposed solution there is to provide a way to opt in to allowing a system to be provisioned while Broken when that system is explicitly requested, in order to be able to diagnose/fix issues with the hardware).
Comment 4 Nick Coghlan 2014-02-11 05:26:54 EST
Given the discussion on bug 1062706, I'm proposing that we add this to the development list for Beaker 0.16. My suggestion is that we do something like the following.

1. Allow hostRequires to explicitly specify a system instead of host requirements:

    <hostRequires>
       <system fqdn="system.example.com"/>
    </hostRequires>

This is mutually exclusive with normal host selection criteria.

2. If a recipe specifies a specific system in this way, then the scheduler allows the recipe to be scheduled as long as the system is either free or already reserved by the submitting user.

3. If the user already had the system reserved before the recipe started, then the scheduler ensures it does not get powered off at the end of the recipe execution.
Comment 5 Nick Coghlan 2014-02-25 01:45:50 EST
Amit pointed out the context of this proposal wasn't clear. Some key use cases:

1. More easily checking if a Broken system is working again before putting it back to Automated (e.g. by running an inventory scan)
2. Marian's original use case above, running further tests on a Broken system to figure out why it broke.
3. Running inventory scans on Manual systems
4. Running other automated tests on Manual systems without having to make them generally available to satisfy arbitrary hostRequires checks

Dan also noted that points two and three in comment 4 aren't really feasible without implementing bug 639938 first (and perhaps not even then), so existing reservations by the current user should instead cause the submitted recipe to be queued, just as they do for reservations of Automated systems today.
Comment 6 Amit Saha 2014-03-02 22:10:50 EST
Preliminary patch: http://gerrit.beaker-project.org/#/c/2864/1

To summarize what is being changed: If the recipe's <hostRequires> has something like this:
    <hostRequires>
        <system>
            <name op="=" value="testsystem1.test.fqdn" />
            <numanodes op="=" value="5" />
            <memory op="=" value="1000" />
        </system>
    </hostRequires>


Then, the specified system's status is not taken into account. The other criteria are also ignored; they are simply not taken into account.

It does look like this will enable the use cases mentioned above. However, does this allow something that we don't want to allow?
Comment 7 Nick Coghlan 2014-03-02 23:25:37 EST
When I suggested using a <system/> element, it didn't occur to me that we already had a system element. I'd like this to use a *new* subelement of hostRequires and update the Relax-NG spec to make it mutually exclusive with the normal search criteria.

Since <system/> is taken, I suggest:

    <force name="system.fqdn"/>

The idea is that "Automated" now means "is available to satisfy criteria based system requests through the scheduler", while "Manual" systems can still be accessed through the scheduler, but only if you force scheduling on that particular system.

If the new selection mechanism isn't mutually exclusive with the old one, we would lose that Automated/Manual distinction entirely, and that would be a serious problem.
Comment 8 Raymond Mancy 2014-03-03 19:00:19 EST
(In reply to Nick Coghlan from comment #7)
> When I suggested using a <system/> element, it didn't occur to me that we
> already had a system element. I'd like this to use a *new* subelement of
> hostRequires and update the Relax-NG spec to make it mutually exclusive with
> the normal search criteria.
> 
> Since <system/> is taken, I suggest:
> 
>     <force name="system.fqdn"/>

I'm not sure if it's a good idea to have the verb as the element name.

How about: 

  <system value='system.fqdn' force='1' />


This way you don't need to worry about any mutual-exclusion logic for the case where both <force/> and <system/> are present.

Also, I worry that equating 'force' with 'Manual' is not intuitive. Unless we're likely to have other system modes that could fall under the 'force' umbrella, why not just use:

  <system value='system.fqdn' allow_manual='1' />


> 
> The idea is that "Automated" now means "is available to satisfy criteria
> based system requests through the scheduler", while "Manual" systems can
> still be accessed through the scheduler, but only if you force scheduling on
> that particular system.
> 
> If the new selection mechanism isn't mutually exclusive with the old one, we
> would lose that Automated/Manual distinction entirely, and that would be a
> serious problem.
Comment 9 Nick Coghlan 2014-03-03 20:29:09 EST
After some discussion on IRC, the preferred syntax for this feature is now:

    <hostRequires force="system.fqdn"/>

The Relax-NG schema will be updated to ensure that using the "force" attribute is mutually exclusive with using any of the filter elements.
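To illustrate the mutual exclusion described above, here is a minimal Python sketch of the same check done in code rather than in the Relax-NG schema. The function name check_host_requires is hypothetical and is not part of Beaker; the only things taken from this bug are the hostRequires element and its "force" attribute.

```python
import xml.etree.ElementTree as ET


def check_host_requires(xml_text):
    """Return the forced FQDN, or None when normal filter criteria are used.

    Hypothetical helper, not Beaker code. Raises ValueError if the
    'force' attribute is combined with any filter child element.
    """
    elem = ET.fromstring(xml_text)
    if elem.tag != "hostRequires":
        raise ValueError("expected a <hostRequires> element")
    forced = elem.get("force")
    # len(elem) is the number of direct child elements (the filters).
    if forced is not None and len(elem) > 0:
        raise ValueError("'force' cannot be combined with filter elements")
    return forced
```

With this sketch, a hostRequires element carrying only force="system.fqdn" yields that name, a hostRequires with only filter elements yields None, and a mix of the two is rejected.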

The docs updates for this should clearly explain that the intended use case for this feature is *not* the same as the use case for the existing automated scheduling support.

In the existing automated scheduling, the general intent is that the user *shouldn't care* exactly which system their job runs on - they may constrain it to a particular lab, or a particular architecture, or a particular set of systems, or systems with a particular piece of hardware, but the case of selecting a particular system by name is a degenerate one that is only supported incidentally, rather than being what the filtering system is *for*. Setting a system to "Automated" is a matter of saying "this system is available to satisfy arbitrary host selection criteria for recipes".

By contrast, the new mechanism is designed to allow users to say "I want this job to run on that specific system, right there, so long as I have permission to do so, and as soon as nobody else is using it". It's the kind of behaviour we want in order to be able to properly automate inventory scans and machine health tests, even if the system is nominally in Manual or Broken mode.

This fundamentally changes the nature of the "Manual" and "Broken" states - they no longer mean "this system is not available to the scheduler". Instead, they mean "this system is not available to satisfy arbitrary host selection criteria, it must be specifically requested". This makes the manual and broken states far more useful - a system in Manual mode will still allow users with access to run recipes on it when they want to, without making it a candidate system for *all* of the jobs they submit, while Broken systems will still support running jobs that are designed to check whether or not they're actually Broken, or if they were flagged as such due to a transient environmental issue or a software failure.
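The revised meaning of the three states can be sketched as a small eligibility check. This is only an illustration under stated assumptions, not the scheduler's actual code: the System class, the is_candidate function and its parameters are all hypothetical names, and the queueing behaviour ("as soon as nobody else is using it") is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class System:
    fqdn: str
    status: str  # "Automated", "Manual" or "Broken"


def is_candidate(system: System, forced_fqdn: Optional[str],
                 matches_filters: bool, user_has_access: bool) -> bool:
    """Decide whether a system may be considered for a recipe (hypothetical)."""
    if not user_has_access:
        return False
    if forced_fqdn is not None:
        # Forced scheduling: status is ignored, only the name has to match.
        return system.fqdn == forced_fqdn
    # Criteria-based scheduling: only Automated systems are candidates.
    return system.status == "Automated" and matches_filters
```

Under this sketch, a Broken system is never picked for a criteria-based recipe, but it is a candidate for a recipe that forces its FQDN, provided the submitting user has access.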
Comment 10 Nick Coghlan 2014-03-04 23:59:32 EST
Fixed review link: http://gerrit.beaker-project.org/#/c/2864 (the new Gerrit doesn't make it obvious when you're not looking at the most recent version of a patch)
Comment 13 Dan Callaghan 2014-06-10 19:28:12 EDT
Beaker 0.17.0 has been released.
