Bug 994970 - [RFE] Actively fight machine pool depletion
Status: NEW
Product: Beaker
Classification: Community
Component: general
Version: 0.13
Hardware: Unspecified Unspecified
Priority: medium
Severity: medium
Assigned To: beaker-dev-list
QA Contact: tools-bugs
Keywords: FutureFeature
Depends On: 991236
Blocks:
Reported: 2013-08-08 06:20 EDT by Hubert Kario
Modified: 2017-10-16 05:10 EDT
Doc Type: Enhancement
Type: Bug
Description Hubert Kario 2013-08-08 06:20:41 EDT
Automated processes can remove machines from the pool because of transient problems with the infrastructure, which depletes the pool.

Beaker should schedule a "hardware check" job every few ours that installs a supported distro and runs some hardware checks (a simple memory check, a look at SMART errors, checks that free space is available and that NFS servers are reachable for both reading and writing, etc.). If this hardware check job passes on all distros supported by the specific machine, the system should automatically put it back in the pool.
Comment 2 Hubert Kario 2013-08-08 06:28:26 EDT
(In reply to Hubert Kario from comment #0)
> Beaker should schedule a "hardware check" job every few ours that installs

That should obviously be:

> Beaker should schedule a "hardware check" job every few hours that installs
                                                          ^^^^^
Comment 3 Raymond Mancy 2013-08-19 01:16:26 EDT
This is an interesting idea. Is this primarily so then you don't waste time reserving a system that would have otherwise failed this hardware test? Has this been a common problem for you?

The shortage of resources is already quite acute, and having more machines taken out of circulation while they have basic hardware tests performed on them would cause even further strain. I wonder if it might make more sense to do similar testing before/after a recipe is run on a system.
Comment 4 Nick Coghlan 2013-08-25 22:56:42 EDT
The proposal is to do this on *Broken* machines, to see if they can be set back to Automated. The idea is to automatically pick up systems that actually failed due to some external problem in the lab, rather than anything being inherently wrong with the system itself.

We need a more comprehensive machine health check that *maintainers* can initiate before we can consider automating any such check, though.
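
The recovery rule discussed so far could be sketched roughly as below. This is only an illustration of the idea, not Beaker code: `Condition`, `CheckResult`, and `next_condition` are hypothetical names, and the sketch assumes each hardware check job reports a single pass/fail result per distro.

```python
from dataclasses import dataclass
from enum import Enum


class Condition(Enum):
    # Only the two system conditions mentioned in this bug are modelled.
    AUTOMATED = "Automated"
    BROKEN = "Broken"


@dataclass
class CheckResult:
    distro: str   # distro the hardware check job installed
    passed: bool  # True if every check in that job passed


def next_condition(current, supported_distros, results):
    """Return the condition a system should have after a round of check jobs."""
    # Only Broken systems are candidates; everything else is left alone.
    if current is not Condition.BROKEN:
        return current
    passed = {r.distro for r in results if r.passed}
    failed = {r.distro for r in results if not r.passed}
    # Require a clean pass on every supported distro, with no failures.
    required = set(supported_distros)
    if required and required <= (passed - failed):
        return Condition.AUTOMATED
    return Condition.BROKEN
```

A system with a missing or failed result for any supported distro stays Broken, so a partial run never returns a machine to the pool.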
Comment 5 Hubert Kario 2013-08-26 04:23:28 EDT
(In reply to Raymond Mancy from comment #3)
> This is an interesting idea. Is this primarily so then you don't waste time
> reserving a system that would have otherwise failed this hardware test? Has
> this been a common problem for you?

This is a different problem, but yes, I've seen a 7-8% provisioning (/distribution/install task) failure rate.

> The shortage of resources is already quite acute, and having more machines
> taken out of circulation while they have basic hardware tests performed on
> them would cause even further strain.

As Nick said, the proposal at hand is meant to alleviate the shortage of machines.
Comment 6 Raymond Mancy 2013-08-26 05:27:51 EDT
(In reply to Nick Coghlan from comment #4)
> The proposal is to do this on *Broken* machines

Oh, ok. That makes more sense.
