Bug 591652
| Field | Value |
| --- | --- |
| Summary | [RFE] automatic removal of systems that do not power cycle |
| Product | [Retired] Beaker |
| Component | inventory |
| Version | 0.5 |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Reporter | Cameron Meadors <cmeadors> |
| Assignee | Raymond Mancy <rmancy> |
| CC | bpeck, dcallagh, ebaak, jburke, kbaker, mbrodeur, mcsontos, rmancy |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Clones | 637260 (view as bug list) |
| Bug Blocks | 632609, 637260 |
| Last Closed | 2010-09-30 04:56:55 UTC |
Description
Cameron Meadors
2010-05-12 19:29:35 UTC
We had this feature in legacy RHTS, at least for failed power resets. That process would also generate an email so we'd know to fix the system. In this post-RHTS world I suppose an email to the system owner would be appropriate.

---

> We had this feature in legacy RHTS, at least for failed power resets.
Doesn't it qualify as a blocker then?
---

+ This would be really nice to have: aborted installations are frustrating, especially in the case of multihost tests, where one sits waiting for days to get all the machines...

---

(In reply to comment #2)
> > We had this feature in legacy RHTS, at least for failed power resets.
>
> Doesn't it qualify as a blocker then?

The feature was implemented in the legacy lab controllers, the last of which was decommissioned early this year. We've been running without it since. Retiring the legacy scheduler won't affect this feature.

---

Re: comment 3: then by definition it is not a blocker. Taking back my request.

---

Cameron, as for your alternative, "they [sic] could be marked as suspect": this is already possible via the 'Broken' option in the system status. I believe this would be the correct way to deal with it, as the system would actually appear to be broken.

---

Another thing: a regular user cannot mark a machine broken: Bug 623603

---

Bumping the priority of this bug. This will be implemented in two ways:

- If the scheduler (beakerd) fails to provision a system, it should change the status to Broken and email the owner with the failure log.
- For failures after provisioning (i.e. power succeeded but PXE failed), we need a cron job which takes the following arguments:

      bkr-health-check --threshold 10 --distro-tag STABLE --task '/distribution/install' --email-owners --owner admin --owner user1

The above would report the failure to the owner of any system which is owned by admin or user1 and failed the last 10 installs of distros tagged STABLE.
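For concreteness, here is a minimal sketch of what such a cron job might look like. `bkr-health-check` does not exist yet; `get_systems_owned_by()` and `get_recent_results()` below are invented placeholders for whatever query layer the real implementation would use, and only the option names come from the proposal above.

```python
# Hypothetical sketch of the proposed bkr-health-check cron job.
# get_systems_owned_by() and get_recent_results() are invented placeholders,
# not real Beaker APIs.
import argparse
import smtplib
from email.mime.text import MIMEText

def main():
    parser = argparse.ArgumentParser(
        description='Report systems that keep failing installs')
    parser.add_argument('--threshold', type=int, default=10)
    parser.add_argument('--distro-tag', default='STABLE')
    parser.add_argument('--task', default='/distribution/install')
    parser.add_argument('--email-owners', action='store_true')
    parser.add_argument('--owner', action='append', default=[])
    args = parser.parse_args()

    for system in get_systems_owned_by(args.owner):  # placeholder
        results = get_recent_results(system, task=args.task,  # placeholder
                                     distro_tag=args.distro_tag,
                                     limit=args.threshold)
        # Only report when we have a full window of results and all failed.
        if len(results) == args.threshold and all(r == 'Fail' for r in results):
            if args.email_owners:
                msg = MIMEText('%s failed its last %d installs of %s distros'
                               % (system.fqdn, args.threshold, args.distro_tag))
                msg['Subject'] = 'Beaker health check: %s' % system.fqdn
                msg['To'] = system.owner.email_address
                smtplib.SMTP('localhost').sendmail(
                    'beaker@localhost', [system.owner.email_address],
                    msg.as_string())

if __name__ == '__main__':
    main()
```

Run periodically from cron with the arguments shown above, something like this would nudge owners about chronically flaky machines without touching system status itself.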
---

*** Bug 631931 has been marked as a duplicate of this bug. ***

---

(In reply to comment #8)
> - If the scheduler (beakerd) fails to provision a system, it should change
> the status to Broken and email the owner with the failure log.

Am I right in thinking that we don't yet have a reasonable way of determining whether a machine has failed due to its own fault or the controller's, and thus we should probably wait until we have decided how to do that first?

> - For failures after provisioning (i.e. power succeeded but PXE failed), we
> need a cron job which takes the following arguments:
>
> bkr-health-check --threshold 10 --distro-tag STABLE --task
> '/distribution/install' --email-owners --owner admin --owner user1
>
> The above would report the failure to the owner of any system which is owned
> by admin or user1 and failed the last 10 installs of distros tagged STABLE.

---

Created an attachment (id=447402)
Patch: automatically mark systems as broken if cobbler task fails

Spoke to rmancy about this; we figured the most conservative thing we could do would be to mark the system as "Broken" only if the cobbler power task reports failure. Anything else (like an XMLRPC fault) might mean that the lab controller is screwed up, so to be safe we don't want to go marking great swathes of systems as broken when they aren't.

Bill, can you please review the attached patch? I'm not sure how I can test it out myself (apart from hacking it up by hand in a Python interpreter).
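As a rough illustration of that conservative policy (this is not the attached patch), the check might be shaped like the following; the cobbler method name and the `mark_broken()` signature are assumptions here.

```python
# Sketch only: mark a system Broken solely on an explicit cobbler task
# failure. The cobbler method name and mark_broken() signature are assumed,
# not taken from the real patch.
import xmlrpc.client  # the 2010-era code would have used xmlrpclib

def check_power_task(remote, system, recipe, task_id):
    try:
        status = remote.get_task_status(task_id)  # assumed cobbler-style call
    except xmlrpc.client.Fault:
        # A fault may mean the lab controller itself is unwell, so we
        # deliberately do not blame (or mark) the system.
        return
    if status == 'failed':
        system.mark_broken(reason=u'cobbler task %s failed' % task_id,
                           recipe=recipe)
```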
---

Hi Dan, this looks good, except I don't see how recipe would ever be defined in mark_broken(). Should it be recipe.system.mark_broken(recipe)? Actually, it looks like you need to pass in a reason as well, and since that's not optional I think it would fail as it currently is.
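Presumably the fix is something along these lines; the keyword arguments are inferred from the review comment above, not copied from the actual patch.

```python
# Sketch of the suggested correction: 'recipe' is what is actually in scope
# in the failure handler, so reach the system through it, and supply the
# mandatory reason argument (inferred signature, not the real patch).
recipe.system.mark_broken(reason=u'Power command failed', recipe=recipe)
```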
---

Created attachment 447587 [details]
Patch: automatically mark systems as broken if cobbler task fails

Oops, I screwed that up. Attaching a revised patch.
I also forgot to add an entry to system activity; the revised patch does that now. Do we need to do anything else here, or should I push this as-is for 0.5.58?

---

Pushed branch bz591652 for review.

http://git.fedorahosted.org/git/?p=beaker.git;a=commitdiff;h=af13c7897167c8acaffc983c62f79e804df95fc4

---

Patch looks good. What testing have you done? Do you have any tests which simulate these failures?

---

(In reply to comment #19)
> Patch looks good. What testing have you done? Do you have any tests which
> simulate these failures?

I was just going to wait and test it on beaker-stage, but I realised there's no reason I couldn't put it on beaker-devel right now and test it there.

---

I worked with Ray today on getting beaker-devel into a usable state (with a lab controller), and this afternoon I tested a few different failure scenarios with my code. I had some bugs which I've now pushed fixes for.

---

Merged into develop branch.