Bug 1269076
| Summary: | Beaker should not "break" system aborted by installation or rhts-abort | | |
|---|---|---|---|
| Product: | [Retired] Beaker | Reporter: | Marian Ganisin <mganisin> |
| Component: | lab controller | Assignee: | Roman Joost <rjoost> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | tools-bugs <tools-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 21 | CC: | bpeck, dcallagh, dowang, draeuftl, jstancek, mjia, rjoost |
| Target Milestone: | 21.2 | Keywords: | NeedsTestCase, Patch |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-11-23 03:54:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Marian Ganisin 2015-10-06 09:11:34 UTC
A good starting point seems to be to count either:

* only aborts caused by the "installation watchdog", or
* only aborts caused by the "booting watchdog".

Eventually both of these watchdogs could be counted, but nothing else.

(In reply to Marian Ganisin from comment #0)
> Beaker is able to detect installation failure. This is great candidate to
> exclude it from "breaking rule" as it is likely installer bug.

There are many cases where the system really is broken or misconfigured which manifest as installer errors: for example, a hard disk or RAID controller that is dead or not supported by this kernel version, incorrect DHCP settings, or an incorrect ksdevice= argument in the system install options. So it is not always an installer bug. But we don't have any numbers which would tell us how likely one or the other is. The broken system detection is always a trade-off between false positives and false negatives. Certainly we can try removing installation failures from the broken system logic and see whether that makes things overall better or worse.

> Another case to exclude from "breaking rule" is rhts-abort called from
> recipe itself (if it isn't yet). That seems to be pretty clear.

Yes, this is a good point. Explicitly triggered aborts should not trigger the broken system detection logic. Currently Beaker doesn't distinguish between an abort from a watchdog timeout and an abort from rhts-abort on the system itself, so we will need to add something to distinguish them.

Bill pointed out that the broken system detection seems to be firing for recipes which abort during /distribution/reservesys, which is expected to happen all the time (someone ignores or forgets about their reservation, so it EWD's). So that is either a regression in the logic or a serious bug we never noticed before. Expected behaviour is that if any task is Completed then a subsequent Aborted task will not mark the system as broken (because it indicates that the system successfully installed, rebooted, and started the harness).
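The exemption rule just described (any Completed task means later Aborted tasks should not mark the system broken) can be sketched roughly as follows. This is a toy model with assumed class and status names, not Beaker's actual data model:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative stand-ins for Beaker's recipe/task model (names assumed).
@dataclass
class Task:
    name: str
    status: str  # e.g. "Completed", "Aborted"

@dataclass
class Recipe:
    status: str
    tasks: List[Task] = field(default_factory=list)

def abort_is_suspicious(recipe: Recipe) -> bool:
    """An Aborted recipe is not suspicious if any of its tasks Completed,
    since that implies install, reboot, and harness start-up all worked."""
    if recipe.status != "Aborted":
        return False
    return not any(t.status == "Completed" for t in recipe.tasks)
```

Under this rule, a /distribution/reservesys recipe whose install task Completed but which later aborted on a watchdog timeout would not count against the system.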
(In reply to Dan Callaghan from comment #3)
> Bill pointed out that the broken system detection seems to be firing for
> recipes which abort during /distribution/reservesys,

Actually this was also my concern and I wanted to raise a question about it. Luckily it is already answered.

> Expected behaviour is if any task is Completed then a subsequent Aborted
> task will not mark the system as broken (because it indicates that it
> successfully installed and rebooted and started the harness).

Well, this applies to those people who keep /distribution/install in the recipe. I might know at least one who does not do that... Actually I cannot imagine that a significant number of /distribution/install tasks could be aborted because of a hardware issue (certainly there is some portion). Most of them would be aborted because of a booting issue: a boot loader problem, a badly generated configuration, maybe a missing initrd, an (unlikely) kernel bug, a problem with init, a bug in the network configuration, an issue with the harness... a bug, in short. Really, only an unsuccessful installation should be counted (everything that happens before any task starts). Once the installation is marked as finished, the "breaking counter" should be cleared.

I didn't change my opinion on identified installer issues (kernel panics, etc.): they should be excluded entirely. It's pretty clear that an installer failure can be a hardware issue; as already mentioned, though, the certainty of that is quite low. All this started with an installer issue that "broke" machines just because of a file conflict in RHEL. Unless the "breaking rule" recognizes various anaconda failures (by the way, what about different languages?), it should not make any decision based on them. Such events could instead be detected by a different pattern; let's say 13 subsequent aborts could really mean something... So far the anaconda issue detection seems to be a really great feature, and aborting based on it is highly desirable behavior.
It seems to be quite a good approach to have the "starting" implementation quite tight and to extend it later as the failing patterns are identified and well described.

(In reply to Dan Callaghan from comment #3)
> Bill pointed out that the broken system detection seems to be firing for
> recipes which abort during /distribution/reservesys, which is expected to
> happen all the time (someone ignores/forgets about their reservation so it
> EWD's). So that is either a regression in the logic or a serious bug we
> never noticed before.

Okay, I figured out what was going on here. It *is* a regression in Beaker 21, due to bug 714937. Previously, if a recipe had some Completed tasks and some Aborted tasks, it would be Completed; now it is Aborted, which is why it is triggering the broken system logic. I filed a new bug for this specifically, since it's quite a serious regression: bug 1270649. We will leave this bug just for the changes in the broken system logic, as originally requested by Marian.

So back to the original request: to summarize, we want to change the broken system logic so that a recipe aborted by (a) beaker-watchdog's install failure detection or (b) rhts-abort will not count as suspicious. We can cheat a little bit and say instead that an aborted recipe is not suspicious if its install_started timestamp is non-NULL, meaning that Anaconda successfully started and executed %pre. By definition, installation failures only happen after Anaconda has started, and rhts-abort can only happen after the installation has started (and completed). So that lets us keep the logic relatively simple.

(In reply to Marian Ganisin from comment #4)
> It seems to be quite good approach to have "starting" implementation quite
> tight and to extend it later as the failing patterns will be identified and
> well described.
BTW, the broken system detection has been in Beaker since 0.5.59 (September 2010) and the logic hasn't changed since then, apart from this regression in 21.0 and the recent configuration change on beaker.engineering.redhat.com to use RTT_ACCEPTED instead of RELEASED.

The broken system detection is not implemented using a counter. Each time there is a suspicious abort, there is a query to check for a run of consecutive suspicious aborts leading up to the latest one. That does make things a bit more complicated, because we need to keep the condition for triggering the check the same as the condition applied in the query. It might help to introduce a new hybrid method/property on Recipe for deciding whether it aborted suspiciously; we could then use it in both places. That might also help to simplify the query in System.suspicious_abort(), which is already a bit of a nightmare. If that isn't possible, it might be simpler to switch to a counter-based implementation, but that would need to go into a major release since it would require a database change.

(In reply to Dan Callaghan from comment #6)
> We can cheat a little bit and say instead that an aborted recipe is not
> suspicious if its install_started timestamp is non-NULL, meaning that
> Anaconda successfully started and executed %pre.

This seems to be equal to what I codenamed the "booting watchdog". I think it could be a good trigger: the machine has to boot (...get images...), configure the network, and access Beaker, so after this step the confidence that the machine is in good shape (more or less) is at a reasonable level. Obviously it doesn't catch the storage issues mentioned above, etc. I believe that the detection of "broken" systems will still be quite good, and I'm pretty sure that the ratio of invalid switches to the broken state will be lowered significantly, especially with respect to what we have started to do and what I mentioned earlier.
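Combining the two ideas above, the install_started short-cut from comment #6 and the run-of-consecutive-suspicious-aborts check, gives roughly the following sketch. This is a toy model with assumed names and an assumed threshold; the real check is a SQL query in System.suspicious_abort(), not a Python loop:

```python
from datetime import datetime
from typing import List, Optional

def abort_is_suspicious(status: str,
                        install_started: Optional[datetime]) -> bool:
    # Proposed short-cut: if Anaconda got far enough to run %pre,
    # install_started is set, and neither installer failures nor
    # rhts-abort (both of which happen later) should count.
    return status == "Aborted" and install_started is None

def consecutive_suspicious_aborts(history: List[bool]) -> int:
    """history: suspicious flag per recipe on the system, oldest first."""
    run = 0
    for suspicious in reversed(history):
        if not suspicious:
            break  # any non-suspicious recipe resets the run
        run += 1
    return run

def system_looks_broken(history: List[bool], threshold: int = 2) -> bool:
    # The threshold is purely illustrative; Beaker's actual trigger
    # condition lives in the System.suspicious_abort() query.
    return consecutive_suspicious_aborts(history) >= threshold
```

Keeping abort_is_suspicious() as a single shared predicate is the point of the "hybrid method/property" suggestion: the trigger condition and the query condition then cannot drift apart.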
We started to reduce the "installation watchdog" in some jobs, we also exclude /distribution/install quite often, and our tasks might sometimes call rhts-abort, which can result in a job with no task passed. It could happen that such tasks run even on "stable" distros (actually they often do; we handle upgrades this way), and then the final mix could be quite explosive. Of course this is just my point of view. I hope that I didn't overlook any aspect; however, I might be missing something. :)

Patch on gerrit: https://gerrit.beaker-project.org/#/c/4447/

Tips for verifying:

* If the installation has started but leaves all tasks aborted, the system should not be marked as broken.
* To verify the intermittent failure case which does not leave a broken system, you can use three recipes, where the second is the one which aborts but at least starts the installation: if the first running recipe aborts without starting the installation, the second recipe starts the installation but leaves all tasks aborted, and the third recipe (like the first) aborts without starting the installation, the system is not marked as broken.

Beaker 21.2 has been released.
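The three-recipe verification tip can be walked through as a small simulation. The helper below re-implements the install_started rule under assumed names, and a threshold of two consecutive suspicious aborts is assumed purely for illustration:

```python
from datetime import datetime
from typing import List, Optional

def suspicious(install_started: Optional[datetime]) -> bool:
    # Each recipe here is Aborted; it is suspicious only if the
    # installation never started (install_started stayed NULL).
    return install_started is None

# Three aborted recipes on the same system, oldest first:
#   1. aborts before the installation starts     -> suspicious
#   2. starts installing, then aborts all tasks  -> not suspicious
#   3. aborts before the installation starts     -> suspicious
history = [suspicious(None),
           suspicious(datetime(2015, 10, 6, 9, 11)),
           suspicious(None)]

def run_length(history: List[bool]) -> int:
    """Length of the run of suspicious aborts ending at the latest recipe."""
    run = 0
    for flag in reversed(history):
        if not flag:
            break
        run += 1
    return run

broken = run_length(history) >= 2  # illustrative threshold
```

The middle recipe breaks the run, so the system is not marked broken, which is exactly the expected outcome of the verification scenario.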