Bug 639938 - [RFE] The reservesys step should be separate from the rest of the tests
Summary: [RFE] The reservesys step should be separate from the rest of the tests
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: scheduler
Version: 0.5
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: 0.17
Assignee: Amit Saha
QA Contact:
URL:
Whiteboard: Misc
Duplicates: 601235 745960 903691 913339 979840
Depends On:
Blocks: 1100593 1119221
 
Reported: 2010-10-04 11:17 UTC by Jan Pazdziora
Modified: 2018-02-06 00:41 UTC
CC: 16 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Clones: 1100593
Environment:
Last Closed: 2014-06-10 23:28:18 UTC
Embargoed:


Attachments: none


Links:
- Red Hat Bugzilla 651479 (medium, CLOSED): [RFE] Support time limits for system loans (last updated 2021-05-10 19:25:56 UTC)
- Red Hat Bugzilla 989294 (unspecified, CLOSED): reservesys task may not detect failure of immediately preceding task (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 651479 989294

Description Jan Pazdziora 2010-10-04 11:17:01 UTC
Description of problem:

We frequently use Beaker to have a machine populated with an OS and an application suite installation, and perhaps to have the application suite populated with some data. The last test is /distribution/reservesys. When I then want to release the system and cancel it from the WebUI, the overall result is Warn, because that is what Beaker marks there.

It would be nice if the reservesys step could be done not via a test but via some other mechanism, from the scheduler's point of view, so that the overall result of all the tests (sans the reservesys) could be Pass, the tests would be done and finished, and the fact that the machine is reserved would be tracked by Beaker separately, optionally based on the results of the previous tests.

Version-Release number of selected component (if applicable):

Version - 0.5.58 

How reproducible:

Deterministic.

Steps to Reproduce:
1. Run a job which has /distribution/reservesys as the last test.
  
Actual results:

The recipe is in state Running until the reservesys task finishes. If you cancel it (to release the machine) from the WebUI, the status is Warn.

Expected results:

Make it possible (maybe via some attribute on recipeSet) to reserve the machines used for the test in such a way that the reservation would not count toward the overall state of the job. It would just be a reservation made after the tests passed.

Actually, it would be even better if the condition under which the system(s) should be reserved could be specified. Frequently, when running the daily or weekly regression tests, we do not want the machine reserved if the tests passed, but we do want it reserved if some of the tests failed.

So the attributes could be

   reservesys-time="12h"

and

   reservesys-cond="failed"

with the accepted values for reservesys-time being some time-like specification, and reservesys-cond taking the values "passed" or "failed"; if reservesys-cond is specified, reservesys-time would default to "1d", i.e. one day.
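
For illustration, a recipe set using these hypothetical attributes might look like this (a sketch only; neither attribute exists in Beaker, and the task name is a placeholder):

    <recipeSet reservesys-time="12h" reservesys-cond="failed">
        <recipe>
            <!-- hypothetical: reserve the machine for 12 hours, but only if a test fails -->
            <task name="/examples/regression-suite" role="STANDALONE"/>
        </recipe>
    </recipeSet>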

Additional info:

Comment 1 Marian Csontos 2010-10-04 12:20:30 UTC
*** Bug 601235 has been marked as a duplicate of this bug. ***

Comment 2 Jan Pazdziora 2010-10-04 13:00:59 UTC
(In reply to comment #0)
> or weekly regression tests, we do not want the machine reserved if the tests
> passed, but we want it reserved if some of the tests failed.

I should have added: ... to investigate the cause of the failure, on the machine where it happened, without having to re-run the whole test set again.

Comment 3 Marian Csontos 2010-10-05 04:38:14 UTC
+1

It should also keep the machine "reserved" for a given amount of time [1] after the recipe was Aborted (e.g. by the external watchdog) and *after the notification* was sent to the job owner. This cannot be implemented by the harness, as the machine/harness may be corrupted.

This should allow extending the watchdog.

As the watchdog currently requires a task, this could be implemented using an empty task, i.e. a task without an rpm element (see the sketch at the end of this comment). The harness would handle that situation correctly and would not touch the task.

[1] Default configured on the server optionally overridden by the recipe.
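
A rough sketch of such an empty placeholder task in the job XML (hypothetical; the task name is made up, and whether the harness really leaves it alone is exactly the question above):

    <task name="/reservation/placeholder" role="STANDALONE">
        <!-- no rpm element, so the harness has nothing to install or run -->
    </task>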

Comment 4 Dan Callaghan 2012-10-04 06:16:20 UTC
Jan, does the new RESERVE_IF_FAIL=1 parameter for /distribution/reservesys meet your needs here?
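
For reference, in standard Beaker job XML that parameter is passed as a task param, along these lines:

    <task name="/distribution/reservesys" role="STANDALONE">
        <params>
            <param name="RESERVE_IF_FAIL" value="1"/>
        </params>
    </task>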

Comment 5 Jan Pazdziora 2012-10-04 07:18:15 UTC
(In reply to comment #4)
> Jan, does the new RESERVE_IF_FAIL=1 parameter for /distribution/reservesys
> meet your needs here?

It does, mostly. There's still the issue of distinguishing jobs in state Running (== executing all the tests) from jobs that have Finished all the tests but are Reserved for a certain period of time, and of not counting the result of the reservesys toward the overall job result. My assumption is that no one (except maybe Beaker and BeakerLib developers) really puts /distribution/reservesys into their jobs as a test that should influence the job's result -- it's there to get access to the machine. If Beaker could account for that, it would be great.

Comment 6 Nick Coghlan 2012-10-17 04:37:17 UTC
Bulk reassignment of issues as Bill has moved to another team.

Comment 7 Dan Callaghan 2013-02-21 03:55:18 UTC
*** Bug 913339 has been marked as a duplicate of this bug. ***

Comment 8 Dan Callaghan 2013-02-21 03:55:22 UTC
*** Bug 903691 has been marked as a duplicate of this bug. ***

Comment 9 Dan Callaghan 2013-02-21 03:56:10 UTC
Some good suggestions from Nick on how to represent this in the job XML are in the duped bug 903691.

Comment 10 Nick Coghlan 2013-02-26 01:50:01 UTC
My suggestions are actually in bug 913339. Reproducing the suggestion here for convenient reference:

A more elegant alternative may be to define an actual "reservesys" element, which could be specified at the job level, the recipe set level, or the individual recipe level (applying to all recipes covered by the scope where it appears). (A simpler starting point would be to allow the new element only at the recipe level.)

A possible definition of such an element (ignoring things like making the attributes optional and giving them specific data types):

    <element name="reservesys">
        <attribute name="onabort"/>
        <attribute name="onfail"/>
        <attribute name="onwarn"/>
        <attribute name="onpass"/>
        <attribute name="duration"/>
    </element>

By default, the system would be reserved regardless of how the recipe finishes. The duration setting would control how long the automatic reservation would be for (defaulting to 24 hours).

onabort/fail/warn/pass would control whether or not the system is actually reserved in the relevant situations:

- onabort = reserve system if a task aborts
- onfail = reserve system if at least one task reports a failed result
- onwarn = reserve system if at least one task reports a warning
- onpass = reserve system if none of the above occur

All four settings would default to "true", so the reservation is unconditional by default.
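
As a concrete instance of the proposal (hypothetical syntax, since none of this is implemented yet), reserving for 24 hours only when something goes wrong:

    <!-- attribute value formats, especially for duration, are not pinned down in this proposal -->
    <reservesys onabort="true" onfail="true" onwarn="true" onpass="false" duration="24h"/>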

Comment 11 Nick Coghlan 2013-02-26 01:56:14 UTC
Since the reservation task currently only works at the recipe level, it would make sense for the <reservesys/> element to initially be allowed only at the recipe level. Supporting it at higher levels would then be a possible future enhancement.

Comment 12 Nick Coghlan 2013-04-15 04:07:44 UTC
Ack'ed for the version described in comments 10 & 11.

The "onabort" handling should also cover the case of External Watchdog triggering, which means this is a substantial improvement over the existing reservesys task.

It will also complement the new alternative harness support, as it will be usable even with harnesses that don't support execution of RHTS-style tasks.

Comment 13 Nick Coghlan 2013-04-15 05:17:08 UTC
*** Bug 745960 has been marked as a duplicate of this bug. ***

Comment 14 Nick Coghlan 2013-04-15 05:20:34 UTC
One more detail is that when a recipe or job is explicitly cancelled through the web UI or CLI/XML-RPC API, the system should *not* be reserved.

Comment 15 Raymond Mancy 2013-07-01 03:05:53 UTC
*** Bug 979840 has been marked as a duplicate of this bug. ***

Comment 16 Nick Coghlan 2013-10-01 08:52:50 UTC
Added dependency on bug 651479, since it makes sense to build this mechanism on top of the changes planned to make that possible.

Comment 17 Amit Saha 2014-05-11 12:54:18 UTC
http://gerrit.beaker-project.org/#/c/3075/

Comment 18 Nick Coghlan 2014-05-14 04:20:58 UTC
Removed the dependency, since Amit came up with an approach that tied in more cleanly with the existing watchdog mechanism.

Comment 19 Amit Saha 2014-05-28 06:51:51 UTC
The initial implementation of this feature allows adding a new XML element ``<reservesys>`` to a recipe. This will reserve the system after the tests have completed. An optional attribute, ``duration``, can be used to specify the length of the reservation in seconds. By default it is 86400 seconds, or 24 hours.
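
A minimal recipe fragment using the new element (the task name here is just a placeholder):

    <recipe>
        <task name="/examples/smoke-test" role="STANDALONE"/>
        <!-- reserve the system for 1 hour (duration is in seconds) once all tasks finish -->
        <reservesys duration="3600"/>
    </recipe>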

Comment 21 Amit Saha 2014-05-28 07:23:34 UTC
Tips for verification:

As stated in comment 19 above, this feature adds a new element <reservesys> with an optional attribute 'duration'. If a recipe has <reservesys/>, the system will be reserved for 24 hours after all the tests have completed.
If a duration is specified, such as <reservesys duration="3600"/>, the system will be reserved for 1 hour, and so on.

When the system is reserved, the recipe's status will show "Reserved" along with the time remaining. The system can be returned early using the "Release System" button in the Web UI; otherwise it will be reclaimed by Beaker after the reservation period is over.

PS: Note that the existing /distribution/reservesys task will continue working the way it always has.

Comment 23 Nick Coghlan 2014-05-28 07:54:14 UTC
Note that one key step for verification will be to use some bad settings that cause the distro installation step to fail. The existing reservation mechanism is skipped in that case (the reservesys task never gets to run), while the new mechanism will still reserve the system.

Similarly, when a test emits a fake kernel panic message (triggering an abort), the system should still be reserved with the new mechanism.

If a job or recipe set is explicitly *cancelled* by the user, then a recipe with <reservesys> should still be cancelled without being reserved. That can be tested while the recipe is waiting for installation (for example).

Comment 25 David Kutálek 2014-05-28 10:34:07 UTC
Is there a way to release the system from the reserved system's own command line? Going to the UI to release systems individually is overhead compared to releasing them from the command line just after finishing the work there.

Comment 26 Amit Saha 2014-05-28 23:38:06 UTC
I understand your point and we are aware of it. However, this is not possible in the initial release; we hope to have it soon.

Comment 27 Nick Coghlan 2014-05-29 01:13:16 UTC
Bug #1102442 now covers adding the ability to release these systems via "bkr system-release". We're still considering whether or not we want to offer the ability to release reserved systems as part of the harness API (and add an appropriate command to the beah harness or RHTS).

Comment 28 David Kutálek 2014-05-29 10:04:56 UTC
(In reply to Nick Coghlan from comment #27)
> Bug #1102442 now covers adding the ability to release these systems via "bkr
> system-release". We're still considering whether or not we want to offer the
> ability to release reserved systems as part of the harness API (and add an
> appropriate command to the beah harness or RHTS).

Thanks for the info. I definitely see the added value of releasing from the reserved machine, as explained above.

What are the reasons not to implement it?

Comment 29 Nick Coghlan 2014-05-30 06:49:58 UTC
Complexity, mainly - when the release command is run from the reserved machine, it needs to be supported in every test harness (some of which aren't maintained by the core Beaker team), as well as in the harness API on the lab controller. You also *can't* run the release command on the target system in cases where there's no test harness installed in the first place (as occurs with manual provisioning). In the case of Manual reservations and aborted installations, there may not even *be* a usable test harness on the target system.

By contrast, once bug #1102442 is addressed, following up a "ssh <FQDN>" command with "bkr system-release <FQDN>" when you're done will work regardless of the original reservation mechanism (bkr system-reserve, taking the system in the web UI, /distribution/reservesys task, <reservesys/> element) and independently of the test harness (if any) installed on the target system.

Comment 30 David Kutálek 2014-05-30 11:34:31 UTC
(In reply to Nick Coghlan from comment #29)
> Complexity, mainly - when the release command is run from the reserved
> machine, it needs to be supported in every test harness (some of which
> aren't maintained by the core Beaker team), as well as in the harness API on
> the lab controller. You also *can't* run the release command on the target
> system in cases where there's no test harness installed in the first place
> (as occurs with manual provisioning). In the case of Manual reservations and
> aborted installations, there may not even *be* a usable test harness on the
> target system.

Thank you for the explanation. From my POV, having it supported in the API and in one major harness could be enough, if bkr system-release is available at the same time. Other harnesses could then use the API as they see fit.
 
> By contrast, once bug #1102442 is addressed, following up a "ssh <FQDN>"
> command with "bkr system-release <FQDN>" when you're done will work
> regardless of the original reservation mechanism (bkr system-reserve, taking
> the system in the web UI, /distribution/reservesys task, <reservesys/>
> element) and independently of the test harness (if any) installed on the
> target system.

In most cases I connect to several machines at once using cssh (usually all arches for a given product). Sometimes I release them one by one, sometimes all at once. Using an external command is possible but not as comfortable.

At least, can we consider having bkr system-release handle more than one FQDN at once?
I will post this idea to the bug you mentioned.

Comment 31 David Kutálek 2014-05-30 11:45:23 UTC
I actually created a separate bug for releasing more hosts at once: bug 1103156.

Comment 32 Dan Callaghan 2014-06-10 23:28:18 UTC
Beaker 0.17.0 has been released.

