Bug 994970

Summary:	[RFE] Actively fight machine pool depletion
Product:	[Retired] Beaker	Reporter:	Hubert Kario <hkario>
Component:	general	Assignee:	beaker-dev-list
Status:	CLOSED WONTFIX	QA Contact:	tools-bugs <tools-bugs>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	0.13	CC:	cbouchar, fedora, pholica, qwan, tools-bugs
Target Milestone:	---	Keywords:	FutureFeature
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-11-19 21:55:10 UTC	Type:	Bug
Bug Depends On:	991236
Bug Blocks:

Description Hubert Kario 2013-08-08 10:20:41 UTC

Automated processes can remove machines from the pool because of transient infrastructure problems, which depletes the pool.

Beaker should schedule a "hardware check" job every few ours that installs a supported distro and runs some hardware checks (a simple memory check, a look at SMART errors, checks that free space is available and that NFS servers are available for both reading and writing, etc.). If this hardware check job passes on all distros supported by the specific machine, the system should automatically be put back in the pool.
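
The pass/fail rule described above could be sketched roughly as follows. This is only an illustration in plain Python, not Beaker code: `CheckResult`, `check_free_space`, and `should_return_to_pool` are hypothetical names, and the 1 GiB free-space threshold is an arbitrary assumption.

```python
import shutil
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CheckResult:
    """Outcome of one hardware check (hypothetical structure, not a Beaker type)."""
    name: str
    passed: bool
    detail: str = ""


def check_free_space(path: str = "/", min_free_bytes: int = 1 << 30) -> CheckResult:
    """Pass if at least min_free_bytes are free on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return CheckResult(
        name="free_space",
        passed=usage.free >= min_free_bytes,
        detail=f"{usage.free} bytes free on {path}",
    )


def should_return_to_pool(results_by_distro: Dict[str, List[CheckResult]]) -> bool:
    """Return True only if every supported distro ran at least one check
    and every check on every distro passed."""
    if not results_by_distro:
        # No distro was tested, so there is no evidence the machine is healthy.
        return False
    for results in results_by_distro.values():
        if not results:
            return False
        if not all(result.passed for result in results):
            return False
    return True
```

Memory, SMART, and NFS read/write checks would return the same `CheckResult` shape. A single failing check on any distro keeps the machine out of the pool.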

Comment 2 Hubert Kario 2013-08-08 10:28:26 UTC

(In reply to Hubert Kario from comment #0)
> Beaker should schedule a "hardware check" job every few ours that installs

That should obviously be:

> Beaker should schedule a "hardware check" job every few hours that installs
                                                          ^^^^^

Comment 3 Raymond Mancy 2013-08-19 05:16:26 UTC

This is an interesting idea. Is this primarily so then you don't waste time reserving a system that would have otherwise failed this hardware test? Has this been a common problem for you?

The shortage of resources is already quite acute, and having more machines taken out of circulation while they have basic hardware tests performed on them would cause even further strain. I wonder if it might make more sense to do similar testing before/after a recipe is run on a system.

Comment 4 Nick Coghlan 2013-08-26 02:56:42 UTC

The proposal is to do this on *Broken* machines, to see if they can be set back to Automated. The idea is to automatically pick up systems that actually failed due to some external problem in the lab, rather than anything being inherently wrong with the system itself.

We need a more comprehensive machine health check that *maintainers* can initiate before we can consider automating any such check, though.
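
A minimal sketch of the restriction described in this comment, under stated assumptions: only systems marked Broken are ever re-checked, and only a passing health check moves them back to Automated. The `Condition` enum and `next_condition` function are hypothetical, and the `MANUAL` member is included only to show that other conditions are left alone.

```python
from enum import Enum


class Condition(Enum):
    """System conditions used in this sketch (hypothetical enum, not Beaker's model)."""
    AUTOMATED = "Automated"
    MANUAL = "Manual"
    BROKEN = "Broken"


def is_check_candidate(condition: Condition) -> bool:
    """Only Broken systems are candidates for the health check."""
    return condition is Condition.BROKEN


def next_condition(current: Condition, health_check_passed: bool) -> Condition:
    """Move a Broken system back to Automated if its health check passed.

    Systems in any other condition are never changed, so machines in
    active use are not taken out of circulation by the check.
    """
    if is_check_candidate(current) and health_check_passed:
        return Condition.AUTOMATED
    return current
```

The same function could back either a maintainer-initiated check or a scheduled one, since the trigger is kept separate from the transition rule.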

Comment 5 Hubert Kario 2013-08-26 08:23:28 UTC

(In reply to Raymond Mancy from comment #3)
> This is an interesting idea. Is this primarily so then you don't waste time
> reserving a system that would have otherwise failed this hardware test? Has
> this been a common problem for you?

This is a different problem, but yes, I've seen a 7-8% provisioning (/distribution/install task) failure rate.

> The shortage of resources is already quite acute, and having more machines
> taken out of circulation while they have basic hardware tests performed on
> them would cause even further strain.

As Nick said, the feature requested here is meant to alleviate the shortage of machines.

Comment 6 Raymond Mancy 2013-08-26 09:27:51 UTC

(In reply to Nick Coghlan from comment #4)
> The proposal is to do this on *Broken* machines

Oh, ok. That makes more sense.