Bug 790492 - [RFE] Provide a simple mechanism for admins to check a system is working properly
Summary: [RFE] Provide a simple mechanism for admins to check a system is working properly
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Beaker
Classification: Retired
Component: general
Version: 0.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: beaker-dev-list
QA Contact:
URL:
Whiteboard: Scheduler
Depends On: 851354
Blocks:
 
Reported: 2012-02-14 16:38 UTC by Sean Waite
Modified: 2020-11-19 22:10 UTC
CC List: 8 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-11-19 22:10:26 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 851354 0 unspecified CLOSED RFE: Allow scheduling of jobs against Manual and Broken systems 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1062706 0 unspecified CLOSED Take procedure is worse than in previous versions of Beaker 2021-02-22 00:41:40 UTC

Internal Links: 851354 1062706

Description Sean Waite 2012-02-14 16:38:05 UTC
Description of problem:
When testing whether a system is working again after it was marked Broken, we are forced to mark it Automated. If it is still broken, it will auto-disable again, which sends an email and (in the case of admin-owned systems) opens a ticket. We'd like a way to mark the system as "work in progress" so that we can freely work on it without changing the owner, and without having to clean up spurious, redundant tickets.

Alternatively, perhaps add another checkbox, like the Secret and Shared boxes, to indicate that the system isn't happy.

Comment 1 Bill Peck 2012-02-14 16:57:38 UTC
How about adding an "In Test" status?

We would also need to support selecting the system by status, with it defaulting to "Automated".

Comment 2 Sean Waite 2012-02-14 17:32:01 UTC
An "In Test" status would work, I'd worry slightly about the name, though. Users "test" software on the systems, and we've had issues in the past with some users being confused by status (the old "Working" status). 

Perhaps "Repairing" or "Test Broken"?

Comment 3 Sean Waite 2012-02-14 18:20:10 UTC
Phil had a good idea for a name - "Maintenance"

Comment 4 Marian Csontos 2012-02-14 21:02:15 UTC
What's wrong with "Broken"? I guess systems are checked one by one and it's the host name that matters; a status override is only needed to allow scheduling on a broken machine. Am I missing something?

Would "selecting by status" allow scheduling on "Manual"? Rather not, or it should be forbidden to anyone but system owner.

Comment 5 Raymond Mancy 2012-02-14 22:53:14 UTC
Is the problem that it is automatically marked Broken again, or that it sends all the emails etc.?

It probably shouldn't matter to us why (i.e. 'testing'); the point is that we don't want the system to send all the emails and make the automatic status change. We could just have something like an 'Ignore warnings' checkbox available in Automated mode, and perhaps even have it prevent the switch back to Broken if that's something that gets in the way.

We should also make sure that while the machine is in that state it is not available for general use (i.e. it should be loaned); otherwise, if someone forgets about it, it could end up back in the general pool and become a source of complaints and more tickets about the broken system.

Comment 6 Bill Peck 2012-02-15 14:09:52 UTC
We don't need any more check boxes.

After thinking about it some more, I agree with Marian. The current Broken status is fine if we allow the job XML to specify the required system status when the job is scheduled.

The default system status filter would be "Automated", but admins could submit a job for a specific machine with system status == "Broken" and the scheduler would still schedule the job.
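
As a rough sketch of that idea (not an existing Beaker schema: the <system_status/> filter below is hypothetical, and host01.example.com is a placeholder), the job XML might look like:

    <job>
      <whiteboard>Verify host01.example.com after repair</whiteboard>
      <recipeSet>
        <recipe>
          <distroRequires>
            <and>
              <distro_family op="=" value="RedHatEnterpriseLinux6"/>
              <distro_arch op="=" value="x86_64"/>
            </and>
          </distroRequires>
          <hostRequires>
            <and>
              <hostname op="=" value="host01.example.com"/>
              <!-- hypothetical filter: explicitly allow a Broken system -->
              <system_status op="=" value="Broken"/>
            </and>
          </hostRequires>
          <task name="/distribution/install" role="STANDALONE"/>
        </recipe>
      </recipeSet>
    </job>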

We could make a link "Verify system is Working" on the page if status is set to "Broken" and the current user == Owner.

Comment 7 Sean Waite 2012-02-15 15:09:26 UTC
There's nothing in particular wrong with "Broken", beyond the fact that we can't schedule tests against it; and when it's set to Automated, a system that is actually still broken will switch back to Broken and send an email.

I'm fine with allowing explicit scheduling against Broken systems; that would accomplish the same thing.

We generally loan the systems to ourselves, anyway, while we are working on them.

Comment 10 Nick Coghlan 2013-08-01 01:03:18 UTC
Revisiting this, since it's still a problem :)

Currently, if a machine is marked Broken, an admin has to go through a lot of manual steps to test it properly before releasing it back for general use. Because this process is painful and not properly documented, it inevitably leads to flaky machines being returned to the pool, which leads to more failures and unhappy users. We should do something about that :)

The current process for a Beaker admin to properly test a machine marked as Broken:

1. Investigate the original failure, attempt to determine the root cause, and fix it. Don't *assume* a transient environmental glitch caused the original problem, or there's a risk of returning a system with flaky hardware to the pool. However, transient environmental glitches do happen, so check for external incidents around the time of the failure, particularly those that may have affected this system.
2. Loan it to yourself
3. Set the status back to Automated
4. Schedule a "bkr machine-test" job for that system against all stable distros that support its architecture (see the example commands after this list)
5. If all the jobs pass, return the loan to release the system back for general use
6. If any of them fail, clean up the spurious failure reports, and continue investigation.
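
For step 4, the per-family commands would look something like the following (host and family names are placeholders; this assumes the standard bkr workflow options --machine and --family, which bkr machine-test shares with the other workflow commands):

    bkr machine-test --machine=host01.example.com --family=RedHatEnterpriseLinux5
    bkr machine-test --machine=host01.example.com --family=RedHatEnterpriseLinux6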

Rather than allowing users to specify the status of systems, I would prefer to add a mechanism to hostRequires that allows the normal filtering logic to be bypassed entirely: add an optional "system" attribute. If "system" is set on a hostRequires element, then Beaker will translate that into a direct lookup on the FQDN, and completely ignore any filtering conditions other than whether or not the user has permission to schedule jobs for that system.
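
As a sketch of that proposal (the "system" attribute is the name suggested here and does not exist yet; the FQDN is a placeholder), the hostRequires element would reduce to a direct lookup:

    <hostRequires system="host01.example.com"/>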

Secondly, it should be simple for a user to supply a system name to create and run a more comprehensive "machine-test" job containing several recipe sets, one for each distro family that supports the system's architecture, to ensure they all install correctly. The current model risks admins checking that one common distro installs correctly and assuming the machine is fine, when there may be other problems.
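
A rough sketch of the kind of job such a command might generate, with one recipe set per supported distro family (family names and the FQDN are placeholders, and the hostRequires form assumes the direct-lookup attribute proposed above):

    <job>
      <whiteboard>Comprehensive machine test for host01.example.com</whiteboard>
      <recipeSet>
        <recipe>
          <distroRequires>
            <and>
              <distro_family op="=" value="RedHatEnterpriseLinux5"/>
              <distro_arch op="=" value="x86_64"/>
            </and>
          </distroRequires>
          <hostRequires system="host01.example.com"/>
          <task name="/distribution/install" role="STANDALONE"/>
        </recipe>
      </recipeSet>
      <recipeSet>
        <recipe>
          <distroRequires>
            <and>
              <distro_family op="=" value="RedHatEnterpriseLinux6"/>
              <distro_arch op="=" value="x86_64"/>
            </and>
          </distroRequires>
          <hostRequires system="host01.example.com"/>
          <task name="/distribution/install" role="STANDALONE"/>
        </recipe>
      </recipeSet>
    </job>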

With just these two changes, steps 2 to 6 above could be simplified to something like:

2. Run "bkr machine-test --all-families --force --machine=<FQDN>"
3. If that job passes, set the machine back to Automated
4. If it fails, investigate further

Longer term, we may also consider the following two additions:

* provide a button in the web UI to trigger a machine test
* trigger a machine test automatically when a system is marked Broken

Comment 11 Nick Coghlan 2014-02-25 07:26:25 UTC
Bug 851354 adds the ability to run recipes on Manual and Broken systems by explicitly specifying a system rather than using normal host filtering, which is one of the requirements to make it easier to properly test systems before putting them back into service.

That leaves this issue to cover making it easier to run a more comprehensive test of a particular system via "bkr machine-test".

