Bug 1296552

Summary: [Beaker 21.2] system not marked broken, after two consecutive released distros fail to install
Product: [Retired] Beaker Reporter: PaulB <pbunyan>
Component: web UI Assignee: Roman Joost <rjoost>
Status: CLOSED CURRENTRELEASE QA Contact: tools-bugs <tools-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 21 CC: bpeck, btherrie, dcallagh, dowang, jburke, mganisin, mjia, pbunyan, rjoost
Target Milestone: 22.3 Keywords: Patch
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-04-04 05:34:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description PaulB 2016-01-07 14:03:26 UTC
Description of problem:
 system not marked broken, after two consecutive released distros fail to install

Version-Release number of selected component (if applicable):
 Beaker 21.2

Actual results:
T:36995509 	/distribution/install 	RHEL-6.5 Server x86_64	2016-01-06 08:29:45 -05:00	2016-01-06 08:50:59 -05:00

T:36995425 	/distribution/install 	RHEL-6.7 Server i386	2016-01-06 07:48:36 -05:00	2016-01-06 07:49:49 -05:00

Both released distros failed to install.
I thought this met the criteria for Beaker to automatically mark the system
broken.

Expected results:
Beaker automagically sets system to broken - if two consecutive released distros
fail to install.
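
The expected rule can be sketched roughly as follows. This is only an illustration of the behaviour described in this report, not Beaker's actual implementation: the `Recipe` class, its fields, and `looks_broken` are hypothetical names, and the real check may consider more conditions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Recipe:
    distro: str
    released: bool  # assumption: True if the distro carries the RELEASED tag
    aborted: bool   # True if the recipe aborted, e.g. /distribution/install failed


def looks_broken(history: List[Recipe], threshold: int = 2) -> bool:
    """Return True if the last `threshold` recipes on released distros all aborted.

    `history` is ordered oldest to newest. Recipes on unreleased distros are
    ignored, since a failure there may be the distro's fault.
    """
    released_runs = [r for r in history if r.released]
    recent = released_runs[-threshold:]
    return len(recent) == threshold and all(r.aborted for r in recent)


# The two recipes from this report, oldest first.
history = [
    Recipe("RHEL-6.7 Server i386", released=True, aborted=True),
    Recipe("RHEL-6.5 Server x86_64", released=True, aborted=True),
]
print(looks_broken(history))
```

Under this sketch, a single successful recipe on a released distro after the failures would clear the suspicion again.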


Additional info:

Comment 1 PaulB 2016-01-07 14:06:27 UTC
All,
I had also clicked "Report a problem" for this host Wed Jan 06 09:34:13 2016 
and opened the following RT:
[] RT#385756: hp-bl685cg6-01.rhts.eng.bos.redhat.com: Input/output error during 
   read on /dev/cciss/c0d0
    https://engineering.redhat.com/rt/Ticket/Display.html?id=385756

If I report a problem for the host - should the host be set to broken??

Regardless, I had to contact an admin and request system be marked broken due to 
RT#385756.

Best,
-pbunyan

Comment 2 PaulB 2016-01-07 14:25:06 UTC
*** Bug 1296551 has been marked as a duplicate of this bug. ***

Comment 3 PaulB 2016-01-07 14:31:38 UTC
All,
These are the actual jobs I mentioned in the opening comment:

[] RHEL-6.5 Server x86_64
   https://beaker.engineering.redhat.com/jobs/1184642
   T:36995509 	/distribution/install 	RHEL-6.5 Server x86_64	2016-01-06 08:29:45 -05:00	2016-01-06 08:50:59 -05:00


[] RHEL-6.7 Server i386
   https://beaker.engineering.redhat.com/jobs/1184615
   T:36995425 	/distribution/install 	RHEL-6.7 Server i386	2016-01-06 07:48:36 -05:00	2016-01-06 07:49:49 -05:00


Best,
-pbunyan

Comment 4 Roman Joost 2016-01-13 01:37:45 UTC
Dear Paul,

thanks for your bug report. We actually changed the way installer aborts are treated when implementing Bug 1269076. Both installer tasks started the installation, so Beaker doesn't treat the failures as suspicious.
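
As a rough sketch of the distinction described here (hypothetical names throughout, not the code from Bug 1269076): an aborted recipe only counts towards the broken-system check if the installation never started.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class AbortedRecipe:
    # Assumption: None means the installer never reported that it started.
    install_started: Optional[datetime]


def counts_towards_broken(recipe: AbortedRecipe, skip_started_installs: bool) -> bool:
    """Decide whether an aborted recipe is treated as a suspicious abort.

    With skip_started_installs=True (the behaviour after Bug 1269076 as
    described in this comment), aborts that happen after the installation
    started are not counted.
    """
    if skip_started_installs and recipe.install_started is not None:
        return False
    return True


# Illustrative timestamp only.
started = AbortedRecipe(install_started=datetime(2016, 1, 6, 8, 29, 45))
never_started = AbortedRecipe(install_started=None)

print(counts_towards_broken(started, skip_started_installs=True))
print(counts_towards_broken(started, skip_started_installs=False))
```

With the flag set, the two recipes in this report would not count, which matches what Paul observed. With the flag cleared, every abort counts again.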

In the conversation in Bug 1269076 comment 4, Marian pointed out that, from his point of view, installer failures are most likely not a hardware issue. Yet you've encountered an installer failure precisely because of a hardware issue. It seems that keeping this patch will most likely affect more people, who have to deal with broken hardware not being marked as such, than reverting it would through hardware being marked broken too aggressively.

It would be interesting to know Marian's point of view on reverting to the more aggressive way of marking systems as broken.

Comment 5 Marian Ganisin 2016-01-15 15:06:48 UTC
The reason for Bug 1269076 was a quite critical issue that caused many machines to be falsely marked as broken.

There's no doubt the issue here is pretty valid and can negatively impact running tests as well (especially if the machine is fast enough to do the turnaround quickly).

Switching it back:

1) Will force us (my team) to find some kind of workaround to avoid that "frenzy" hunt for possibly broken machines (should be easy in some cases but difficult in others).

2) Might be insufficient and potentially dangerous at this moment. According to information from Dan Callaghan, the relevant tag for detection of broken systems was changed from RELEASED to RTT_ACCEPTED. I'm afraid RTT_ACCEPTED isn't sufficient and could cause more systems to be falsely marked as broken.

I'd like to keep the new algorithm in place; however, I'm not going to block any decision. If it is switched back, then I'd suggest changing the relevant tag back to RELEASED (I expect that has to be done on the production instance, not in the code).

So far, the best solution in the longer term could be a redesign of the algorithm. For example, the jobs mentioned in comment 3 report:

Input/output error during read on /dev/cciss/c0d0

This seems to be a clear indicator of a broken system.

Another example demonstrating that improvement is necessary: the current algorithm with RTT_ACCEPTED as the decisive tag can have a devastating impact on machines in a lab with a temporary outage. It can mark all machines as broken quite quickly if install images or kickstarts can't be accessed because of the outage (we saw this in the past; it might be much faster now "thanks" to the implemented changes). Once the outage is resolved, the machines will still be marked broken...
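
To illustrate the outage scenario, here is a small simulation. It is a sketch under assumptions: `broken_systems`, the host names, and the data layout are made up for this example, and the threshold of two consecutive aborts is taken from the bug summary.

```python
from typing import Dict, List, Set, Tuple

# One entry per recipe, oldest first: (tags on the distro, did the recipe abort?)
RecipeRun = Tuple[Set[str], bool]


def broken_systems(
    runs_by_system: Dict[str, List[RecipeRun]],
    reliable_tag: str,
    threshold: int = 2,
) -> List[str]:
    """Return systems whose last `threshold` recipes on distros carrying
    `reliable_tag` all aborted."""
    broken = []
    for system, runs in runs_by_system.items():
        relevant = [aborted for tags, aborted in runs if reliable_tag in tags]
        recent = relevant[-threshold:]
        if len(recent) == threshold and all(recent):
            broken.append(system)
    return sorted(broken)


# Hypothetical lab during an outage: install images are unreachable, so every
# host aborts twice in a row on a compose that is only tagged RTT_ACCEPTED.
outage = {
    f"host-{n:02d}": [({"RTT_ACCEPTED"}, True), ({"RTT_ACCEPTED"}, True)]
    for n in range(1, 4)
}

print(broken_systems(outage, reliable_tag="RTT_ACCEPTED"))
print(broken_systems(outage, reliable_tag="RELEASED"))
```

With RTT_ACCEPTED as the decisive tag, every host in the simulated lab ends up flagged. With RELEASED, none of these aborts count, because the compose does not carry that tag.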

No answer, just some thoughts. Sorry ;) For the short term I'm (more or less) fine with whatever decision you make (either keep the current algorithm, which I prefer, or switch back to the old one). If you decide to switch back, just let me know in advance please.

Comment 6 Roman Joost 2016-01-20 04:58:58 UTC
Our team had another chat about this issue.

1) In the cited example, the installation task correctly aborted. The previous logic would have marked the system as suspiciously broken, hence this bug report. Extending the algorithm to catch faulty hardware based on what we can find in the console log (e.g. Input/output error) would currently lead to many false positives, since any test/task could produce this kind of output for trivial reasons.

2) The initial idea of switching to RTT_ACCEPTED was to be more aggressive in finding broken systems, because most testing happens on RTT_ACCEPTED and not RELEASED composes. Back in January the broken system detection didn't fire for the Seattle machines just because of that. By the way, we're happy to discuss this further, but perhaps the mailing list is a better place.
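
The false-positive concern in point 1 can be shown with a naive console-log matcher. This is a hypothetical sketch, not anything Beaker does: the function name is made up, and the second log line is an invented example of harmless test output.

```python
import re

# Naive pattern for the error text quoted in comment 5.
IO_ERROR = re.compile(r"Input/output error", re.IGNORECASE)


def console_suggests_bad_hardware(console_log: str) -> bool:
    """Return True if the console log contains an I/O error message."""
    return IO_ERROR.search(console_log) is not None


# The message from the RT ticket in comment 1: a genuine disk problem.
real_failure = "Input/output error during read on /dev/cciss/c0d0"

# Invented example: a test that deliberately provokes an I/O error.
harmless_test = "dd: error reading '/dev/test-faulty-dev': Input/output error"

print(console_suggests_bad_hardware(real_failure))
print(console_suggests_bad_hardware(harmless_test))
```

Both lines match, so a matcher like this cannot tell the broken disk from the deliberate test case without more context about which task produced the output.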

@Marian we can't think of a short-term fix which helps both your team and Paul's team. I'm sorry. Yet you mentioned that you and your team would want to look for a workaround. Maybe this is something both our teams could work on together? I don't know what you had in mind, but perhaps you can find a better way of detecting broken hardware. I think the mailing list would be a good place for a more detailed discussion.

Conclusion: We'll revert the patch introduced by Bug 1269076. I'll provide notice to Marian on the timeline.

Comment 7 Marian Ganisin 2016-01-21 07:05:54 UTC
FYI: The first test execution on a RHEL-7.3 nightly is in progress right now. The compose is good enough to gain RTT_ACCEPTED, but it then suffered several ABORTS in consecutive jobs. We usually do not tag nightly composes. However, this can also happen to a rel-eng compose, and we do tag those. In that case, using the RTT_ACCEPTED tag as part of the "breaking algorithm" would lead to many falsely broken systems.

From my point of view there isn't much to discuss. It seems to be a good idea to avoid such a scenario, for example by reverting back to RELEASED.

Comment 8 Roman Joost 2016-03-18 05:00:26 UTC
Reverted patch available on gerrit:

https://gerrit.beaker-project.org/#/c/4754/

Comment 11 Roman Joost 2016-04-04 05:34:54 UTC
Beaker 22.3 has been released. Release Notes can be found here: https://beaker-project.org/docs/whats-new/release-22.html#beaker-22-3