Bug 591652
| Field | Value |
| --- | --- |
| Summary | [RFE] automatic removal of systems that do not power cycle |
| Product | [Retired] Beaker |
| Component | inventory |
| Version | 0.5 |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Reporter | Cameron Meadors <cmeadors> |
| Assignee | Raymond Mancy <rmancy> |
| CC | bpeck, dcallagh, ebaak, jburke, kbaker, mbrodeur, mcsontos, rmancy |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Clones | 637260 (view as bug list) |
| Bug Blocks | 632609, 637260 |
| Last Closed | 2010-09-30 04:56:55 UTC |
Description
Cameron Meadors
2010-05-12 19:29:35 UTC
We had this feature in legacy RHTS, at least for failed power resets. That process would also generate an email so we'd know to fix the system. In this post-RHTS world I suppose an email to the system owner would be appropriate.

---

> We had this feature in legacy RHTS, at least for failed power resets.
Doesn't it qualify as a blocker then?
---

+ This would be really nice to have: aborted installations are frustrating, especially in the case of multihost tests, where one sits waiting for days to get all the machines...

---

(In reply to comment #2)
> > We had this feature in legacy RHTS, at least for failed power resets.
>
> Doesn't it qualify as a blocker then?

The feature was implemented in the legacy lab controllers, the last of which was decommissioned early this year. We've been running without it since. Retiring the legacy scheduler won't affect this feature.

---

Re: comment 3: then by definition it is not a blocker. Taking back my request.

---

Cameron, as for your alternative, "they [sic] could be marked as suspect": this is already possible via the 'Broken' option in the system status. I believe this would be the correct way to deal with it, as the system would actually appear to be broken.

---

Another thing: a regular user cannot mark a machine broken: Bug 623603

---

Bumping the priority of this bug. This will be implemented in two ways:

- If the scheduler (beakerd) fails to provision a system, it should change the status to Broken and email the owner with the failure log.
- For failures after provisioning (i.e. power succeeded but PXE failed), we need a cron job which takes the following arguments:

      bkr-health-check --threshold 10 --distro-tag STABLE --task '/distribution/install' --email-owners --owner admin --owner user1

The above would report the failure to the owner of any system which is owned by admin or user1 and failed the last 10 installs of distros tagged STABLE.
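For concreteness, here is a minimal sketch of what such a cron job might look like. `bkr-health-check` does not exist yet; `get_systems_owned_by()` and `get_recent_results()` below are invented placeholders for whatever query layer the real implementation would use, and only the option names come from the proposal above.

```python
# Hypothetical sketch of the proposed bkr-health-check cron job.
# get_systems_owned_by() and get_recent_results() are invented placeholders,
# not real Beaker APIs.
import argparse
import smtplib
from email.mime.text import MIMEText

def main():
    parser = argparse.ArgumentParser(
        description='Report systems that keep failing installs')
    parser.add_argument('--threshold', type=int, default=10)
    parser.add_argument('--distro-tag', default='STABLE')
    parser.add_argument('--task', default='/distribution/install')
    parser.add_argument('--email-owners', action='store_true')
    parser.add_argument('--owner', action='append', default=[])
    args = parser.parse_args()

    for system in get_systems_owned_by(args.owner):  # placeholder
        results = get_recent_results(system, task=args.task,  # placeholder
                                     distro_tag=args.distro_tag,
                                     limit=args.threshold)
        # Only report when we have a full window of results and all failed.
        if len(results) == args.threshold and all(r == 'Fail' for r in results):
            if args.email_owners:
                msg = MIMEText('%s failed its last %d installs of %s distros'
                               % (system.fqdn, args.threshold, args.distro_tag))
                msg['Subject'] = 'Beaker health check: %s' % system.fqdn
                msg['To'] = system.owner.email_address
                smtplib.SMTP('localhost').sendmail(
                    'beaker@localhost', [system.owner.email_address],
                    msg.as_string())

if __name__ == '__main__':
    main()
```

Run periodically from cron with the arguments shown above, something like this would nudge owners about chronically flaky machines without touching system status itself.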
---

*** Bug 631931 has been marked as a duplicate of this bug. ***

---

(In reply to comment #8)
> - If the scheduler (beakerd) fails to provision a system, it should change
> the status to Broken and email the owner with the failure log.

Am I right in thinking that we don't yet have a reasonable way of determining whether a machine has failed due to its own fault or the controller's, and thus we should probably wait until we have decided how to do that first?

> - For failures after provisioning (i.e. power succeeded but PXE failed), we
> need a cron job which takes the following arguments:
>
> bkr-health-check --threshold 10 --distro-tag STABLE --task
> '/distribution/install' --email-owners --owner admin --owner user1
>
> The above would report the failure to the owner of any system which is owned
> by admin or user1 and failed the last 10 installs of distros tagged STABLE.

---

Created an attachment (id=447402)
Patch: automatically mark systems as broken if cobbler task fails

Spoke to rmancy about this; we figured the most conservative thing we could do would be to mark the system as "Broken" only if the cobbler power task reports failure. Anything else (like an XMLRPC fault) might mean that the lab controller is screwed up, so to be safe we don't want to go marking great swathes of systems as broken when they aren't.

Bill, can you please review the attached patch? I'm not sure how I can test it out myself (apart from hacking it up by hand in a Python interpreter).
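As a rough illustration of that conservative policy (this is not the attached patch), the check might be shaped like the following; the cobbler method name and the `mark_broken()` signature are assumptions here.

```python
# Sketch only: mark a system Broken solely on an explicit cobbler task
# failure. The cobbler method name and mark_broken() signature are assumed,
# not taken from the real patch.
import xmlrpc.client  # the 2010-era code would have used xmlrpclib

def check_power_task(remote, system, recipe, task_id):
    try:
        status = remote.get_task_status(task_id)  # assumed cobbler-style call
    except xmlrpc.client.Fault:
        # A fault may mean the lab controller itself is unwell, so we
        # deliberately do not blame (or mark) the system.
        return
    if status == 'failed':
        system.mark_broken(reason=u'cobbler task %s failed' % task_id,
                           recipe=recipe)
```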
---

Hi Dan, this looks good, except I don't see how recipe would ever be defined in mark_broken(). Should it be recipe.system.mark_broken(recipe)? Actually, it looks like you need to pass in a reason as well, and since that's not optional I think it would fail as it currently is.
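Presumably the fix is something along these lines; the keyword arguments are inferred from the review comment above, not copied from the actual patch.

```python
# Sketch of the suggested correction: 'recipe' is what is actually in scope
# in the failure handler, so reach the system through it, and supply the
# mandatory reason argument (inferred signature, not the real patch).
recipe.system.mark_broken(reason=u'Power command failed', recipe=recipe)
```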
---

Created attachment 447587 [details]
Patch: automatically mark systems as broken if cobbler task fails

Oops, I screwed that up. Attaching a revised patch.
I also forgot to add an entry to system activity; the revised patch does that now. Do we need to do anything else here, or should I push this as-is for 0.5.58?

---

Pushed branch bz591652 for review.

http://git.fedorahosted.org/git/?p=beaker.git;a=commitdiff;h=af13c7897167c8acaffc983c62f79e804df95fc4

---

Patch looks good. What testing have you done? Do you have any tests which simulate these failures?

---

(In reply to comment #19)
> Patch looks good. What testing have you done? Do you have any tests which
> simulate these failures?

I was just going to wait and test it on beaker-stage, but I realised there's no reason I couldn't put it on beaker-devel right now and test it there.

---

I worked with Ray today on getting beaker-devel into a usable state (with a lab controller), and this afternoon I tested a few different failure scenarios with my code. I had some bugs which I've now pushed fixes for.

---

Merged into develop branch.