Bug 591652 - [RFE] automatic removal of systems that do not power cycle
Summary: [RFE] automatic removal of systems that do not power cycle
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: inventory
Version: 0.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Raymond Mancy
QA Contact:
URL:
Whiteboard:
Duplicates: 631931
Depends On:
Blocks: 632609 637260
 
Reported: 2010-05-12 19:29 UTC by Cameron Meadors
Modified: 2019-05-22 13:34 UTC
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 637260
Environment:
Last Closed: 2010-09-30 04:56:55 UTC
Embargoed:


Attachments
Patch: automatically mark systems as broken if cobbler task fails (6.18 KB, patch)
2010-09-15 07:36 UTC, Dan Callaghan
Patch: automatically mark systems as broken if cobbler task fails (6.49 KB, patch)
2010-09-15 23:00 UTC, Dan Callaghan

Description Cameron Meadors 2010-05-12 19:29:35 UTC
It would be nice for machines to be removed (or blacklisted) automatically from the pool if installs keep failing.  This would be per release of the OS.  There would need to be a process to add the machine back after the problem has been determined.

Alternatively, the system could be marked as suspect.  The warning would allow people to select a different machine.

Comment 1 Matt Brodeur 2010-08-10 19:33:54 UTC
We had this feature in legacy RHTS, at least for failed power resets.  That process would also generate an email so we'd know to fix the system.  In this post-RHTS world I suppose an email to the system owner would be appropriate.

Comment 2 Marian Csontos 2010-08-11 03:53:04 UTC
> We had this feature in legacy RHTS, at least for failed power resets.

Doesn't it qualify as a blocker then?

+ This would be really nice to have: aborted installations are frustrating, especially in the case of multihost tests, where one sits waiting for days to get all the machines...

Comment 3 Matt Brodeur 2010-08-11 16:25:01 UTC
(In reply to comment #2)
> > We had this feature in legacy RHTS, at least for failed power resets.
> 
> Doesn't it qualify as a blocker then?

The feature was implemented in the legacy lab controllers, the last of which was decommissioned early this year.  We've been running without it since.  Retiring the legacy scheduler won't affect this feature.

Comment 4 Marian Csontos 2010-08-11 16:56:08 UTC
Re: Comment 3: then by definition it is not a blocker. Taking back my request.

Comment 5 Raymond Mancy 2010-08-12 01:26:24 UTC
Cameron,
As far as your alternative goes ("the system could be marked as suspect"), this is already possible via the 'Broken' option in the system status. I believe this would be the correct way to deal with it, as the system would actually appear to be broken.

Comment 7 Marian Csontos 2010-08-12 09:53:26 UTC
Another issue is that a regular user cannot mark a machine as broken: Bug 623603

Comment 8 Bill Peck 2010-09-07 16:10:04 UTC
Bumping the priority of this bug.

This will be implemented in two ways:

- If the scheduler (beakerd) fails to provision a system, it should change the status to Broken and email the owner with the failure log.

- For failures after provisioning (i.e., power succeeded but PXE failed) we need a cron job which takes the following args:

   bkr-health-check --threshold 10 --distro-tag STABLE --task '/distribution/install' --email-owners  --owner admin --owner user1

The above would report a failure to the owner of any system which is owned by admin or user1 and has failed the last 10 installs of distros tagged STABLE.
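
For illustration only, here is a minimal Python sketch of how such a bkr-health-check cron job might parse those arguments. The flag names come from the proposed invocation above; the defaults and the query/notification steps are assumptions, not existing Beaker code.

    import argparse

    def parse_args(argv=None):
        # Flag names mirror the proposed invocation above; defaults are guesses.
        parser = argparse.ArgumentParser(
            description='Flag systems that keep failing installs')
        parser.add_argument('--threshold', type=int, default=10,
                            help='consecutive failed installs before a system is flagged')
        parser.add_argument('--distro-tag', default='STABLE',
                            help='only consider distros carrying this tag')
        parser.add_argument('--task', default='/distribution/install',
                            help='task whose results are examined')
        parser.add_argument('--email-owners', action='store_true',
                            help='email the owner of each flagged system')
        parser.add_argument('--owner', action='append', default=[],
                            help='restrict the check to systems owned by these users')
        return parser.parse_args(argv)

    if __name__ == '__main__':
        args = parse_args()
        # The real job would query the last <threshold> results of <task> for
        # each matching system and notify the owner when all of them failed.
        print(args)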

Comment 9 Bill Peck 2010-09-08 17:47:56 UTC
*** Bug 631931 has been marked as a duplicate of this bug. ***

Comment 10 Raymond Mancy 2010-09-10 06:04:58 UTC
(In reply to comment #8)
> Bumping the priority of this bug.
> 
> This will be implemented in two ways:
> 
> - If the scheduler (beakerd) fails to provision a system, it should change
> the status to Broken and email the owner with the failure log.
> 

Am I right in thinking that we don't yet have a reasonable way of determining whether a machine has failed due to its own fault or the controller's, and that we should therefore wait until we have decided how to do that first?

> - For failures after provisioning (i.e., power succeeded but PXE failed) we
> need a cron job which takes the following args:
> 
>    bkr-health-check --threshold 10 --distro-tag STABLE --task
> '/distribution/install' --email-owners  --owner admin --owner
> user1
> 
> The above would report a failure to the owner of any system which is owned
> by admin or user1 and has failed the last 10 installs of distros tagged
> STABLE.

Comment 11 Dan Callaghan 2010-09-15 07:36:44 UTC
Created an attachment (id=447402)
Patch: automatically mark systems as broken if cobbler task fails

Comment 12 Dan Callaghan 2010-09-15 07:45:51 UTC
Spoke to rmancy about this; we figured that the most conservative thing we could do would be to mark the system as "Broken" only if the cobbler power task reports failure. Anything else (like an XMLRPC fault) might mean that the lab controller is screwed up, so to be safe we don't want to go marking great swathes of systems as broken when they aren't.
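
As a rough sketch of that conservative rule (the names CobblerTaskFailure, mark_broken() and run_power_task are stand-ins for this example, not the real Beaker/Cobbler API):

    import xmlrpc.client

    class CobblerTaskFailure(Exception):
        """Cobbler itself reported that the power task failed."""

    def handle_power_task(system, recipe, run_power_task):
        try:
            run_power_task()
        except CobblerTaskFailure as e:
            # Cobbler clearly reported the failure, so the system itself looks
            # broken: take it out of the pool.
            system.mark_broken(reason=str(e), recipe=recipe)
        except xmlrpc.client.Fault:
            # An XML-RPC fault may just mean the lab controller is unhealthy,
            # so do not mark the system broken; let the error propagate.
            raise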

Bill, can you please review the attached patch? I'm not sure how I can test it out myself (apart from hacking it up by hand in a Python interpreter).

Comment 13 Bill Peck 2010-09-15 13:16:26 UTC
Hi Dan,

This looks good except I don't see how recipe would ever be defined in mark_broken().  Should it be recipe.system.mark_broken(recipe)?

Comment 14 Bill Peck 2010-09-15 13:18:06 UTC
Actually it looks like you need to pass in a reason as well, and since that's not optional I think it would fail as it currently is.
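
A toy illustration of the call shape being discussed in the two comments above, with stand-in classes rather than Beaker's real model code: reason is required, and the call reaches the system via the recipe.

    class System(object):
        def __init__(self, fqdn):
            self.fqdn = fqdn
            self.status = 'Automated'

        def mark_broken(self, reason, recipe=None):
            # 'reason' is mandatory, as noted above
            self.status = 'Broken'
            print('%s marked Broken: %s' % (self.fqdn, reason))

    class Recipe(object):
        def __init__(self, system):
            self.system = system

    recipe = Recipe(System('host1.example.com'))
    recipe.system.mark_broken(reason='cobbler power task failed', recipe=recipe)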

Comment 15 Dan Callaghan 2010-09-15 23:00:48 UTC
Created attachment 447587
Patch: automatically mark systems as broken if cobbler task fails

Oops, I screwed that up. Attaching revised patch.

Comment 16 Dan Callaghan 2010-09-15 23:05:37 UTC
I also forgot to add an entry to system activity; the revised patch does that now.
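
Not the real patch, but a sketch of the kind of activity record the revised patch presumably adds when the status changes; the field names here are guesses based on how Beaker displays system activity.

    from datetime import datetime

    def record_status_change(activity_log, old_status, new_status,
                             service='Scheduler'):
        # Append a status-change record to a system's activity log.
        entry = {
            'created': datetime.utcnow(),
            'service': service,
            'field': 'Status',
            'action': 'Changed',
            'old_value': old_status,
            'new_value': new_status,
        }
        activity_log.append(entry)
        return entry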

Comment 17 Dan Callaghan 2010-09-20 06:03:34 UTC
Do we need to do anything else here, or should I push this as is for 0.5.58?

Comment 19 Bill Peck 2010-09-22 13:50:38 UTC
Patch looks good.  What testing have you done?  Do you have any tests which simulate these failures?

Comment 20 Dan Callaghan 2010-09-23 06:55:01 UTC
(In reply to comment #19)
> Patch looks good.  What testing have you done?  Do you have any tests which
> simulate these failures?

I was just going to wait and test it on beaker-stage, but I realised there's no reason I couldn't put it on beaker-devel right now and test it there. I worked with Ray today on getting beaker-devel into a usable state (with a lab controller), and this afternoon I tested a few different failure scenarios with my code. I had some bugs which I've now pushed fixes for.

Comment 22 Dan Callaghan 2010-09-27 01:27:19 UTC
Merged into develop branch.

