Red Hat Bugzilla – Bug 1296552
[Beaker 21.2] system not marked broken, after two consecutive released distros fail to install
Last modified: 2016-04-04 01:34:54 EDT
Description of problem:
system not marked broken, after two consecutive released distros fail to install
Version-Release number of selected component (if applicable):
T:36995509 /distribution/install RHEL-6.5 Server x86_64 2016-01-06 08:29:45 -05:00 2016-01-06 08:50:59 -05:00
T:36995425 /distribution/install RHEL-6.7 Server i386 2016-01-06 07:48:36 -05:00 2016-01-06 07:49:49 -05:00
Both released distros failed to install.
I thought this met the criteria for Beaker to automatically mark the system broken:
Beaker automagically sets system to broken - if two consecutive released distros
fail to install.
I had also clicked "Report a problem" for this host on Wed Jan 06 09:34:13 2016
and opened the following RT:
 RT#385756: hp-bl685cg6-01.rhts.eng.bos.redhat.com: Input/output error during read on /dev/cciss/c0d0
If I report a problem for the host, should the host be set to broken?
Regardless, I had to contact an admin and request system be marked broken due to
*** Bug 1296551 has been marked as a duplicate of this bug. ***
These are the actual jobs I mentioned in the opening comment:
 RHEL-6.5 Server x86_64
T:36995509 /distribution/install RHEL-6.5 Server x86_64 2016-01-06 08:29:45 -05:00 2016-01-06 08:50:59 -05:00
 RHEL-6.7 Server i386
T:36995425 /distribution/install RHEL-6.7 Server i386 2016-01-06 07:48:36 -05:00 2016-01-06 07:49:49 -05:00
Thanks for your bug report. We actually changed the way installer aborts are treated when implementing Bug 1269076. Both installer tasks started the installation, so Beaker doesn't treat the failure as suspicious.
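As a rough illustration of the two behaviours being compared (this is not Beaker's actual code), the rule can be sketched as below. The `Recipe` class, the `looks_broken` function and the `install_started` and `ignore_started_installs` names are hypothetical, and the sketch assumes "consecutive" means the most recent recipes run on distros carrying the trusted tag.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Recipe:
    """Hypothetical, simplified record of one recipe run on a system."""
    distro_tags: FrozenSet[str]  # e.g. {"RELEASED"} or {"RTT_ACCEPTED"}
    aborted: bool                # True if the recipe aborted
    install_started: bool        # True if the installer got as far as starting


def looks_broken(history: List[Recipe],
                 trusted_tag: str = "RELEASED",
                 ignore_started_installs: bool = False,
                 threshold: int = 2) -> bool:
    """Return True if the last `threshold` recipes on trusted distros all aborted.

    `history` is ordered oldest first. Recipes on distros without
    `trusted_tag` are ignored entirely (an assumption of this sketch).
    With ignore_started_installs=True, an abort after the installation
    started is not counted as suspicious.
    """
    trusted = [r for r in history if trusted_tag in r.distro_tags]
    recent = trusted[-threshold:]
    if len(recent) < threshold:
        return False  # not enough evidence yet
    for recipe in recent:
        if not recipe.aborted:
            return False
        if ignore_started_installs and recipe.install_started:
            return False
    return True
```

With two aborted installs on RELEASED distros where the installer had started, `looks_broken(history)` returns True, while `looks_broken(history, ignore_started_installs=True)` returns False, which corresponds to the situation in this report.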
In the conversation in Bug 1269076 comment 4, Marian pointed out that, from his point of view, installer failures are most likely not a hardware issue. Yet you encountered an installer failure precisely because of a hardware issue. It seems that keeping this patch will most likely hurt more people, who have to deal with broken hardware not being marked as such, than too aggressively marking hardware as broken would.
It would be interesting to know Marian's point of view on reverting to the more aggressive way of marking systems as broken.
The reason for Bug 1269076 was a quite critical issue that caused many machines to be falsely marked as broken.
There's no doubt that the issue here is valid and can negatively impact running tests as well (especially if the machine is fast enough to turn around quickly).
Switching it back:
1) Will force us (my team) to find some kind of workaround to avoid that "frenzy" hunt for possibly broken machines (should be easy in some cases but difficult in others).
2) Might be insufficient and potentially dangerous at this moment. According to information from Dan Callaghan, the tag relevant for detecting a broken system was changed from RELEASED to RTT_ACCEPTED. I'm afraid RTT_ACCEPTED isn't sufficient and could cause more systems to be falsely marked as broken.
I'd like to keep the new algorithm in place; however, I'm not going to block any decision. If it is switched back, I'd suggest changing the relevant tag back to RELEASED (I expect that has to be done on the production instance, not in the code).
So far, the best long-term solution could be a redesign of the algorithm. For example, the jobs mentioned in comment 3 report:
Input/output error during read on /dev/cciss/c0d0
This seems to be a clear indicator of a broken system.
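A minimal sketch of what such a check could look like, assuming the console log is available as plain text and that the message has the same form as in the RT above. The `IO_ERROR` pattern and the `find_io_errors` helper are hypothetical names for illustration only. A match should be treated as a hint rather than proof of broken hardware.

```python
import re
from typing import List

# Pattern based solely on the message quoted in this report; real console
# logs may word I/O errors differently (assumption).
IO_ERROR = re.compile(r"Input/output error during read on (/dev/\S+)")


def find_io_errors(console_log: str) -> List[str]:
    """Return the device paths named in I/O read errors found in the log."""
    return IO_ERROR.findall(console_log)
```

For a log containing the line quoted above, `find_io_errors(log)` returns a list with the single entry `/dev/cciss/c0d0`.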
Another example demonstrating that improvement is necessary: the current algorithm with RTT_ACCEPTED as the decisive tag can have a devastating impact on machines in a lab with a temporary outage. It can mark all machines as broken quite quickly if install images or kickstarts cannot be accessed because of the outage (we saw this in the past; it might be much faster now "thanks" to the implemented changes). Once the outage is resolved, the machines will still be marked broken...
No answer, just some thoughts. Sorry ;) For the short term I'm (more or less) fine with whatever decision you make (either keep the current algorithm, which I prefer, or switch back to the old one). If you decide to switch back, please just let me know in advance.
Our team had another chat about this issue.
1) In the cited example, the installation task correctly aborted. The previous logic would have marked the system as suspiciously broken, hence this bug report. Extending the algorithm to catch faulty hardware based on what we can find in the console log (e.g. Input/output error) would currently lead to many false positives, since any test/task could produce this kind of output in trivial cases.
2) The initial idea behind switching to RTT_ACCEPTED was to make it more aggressive in finding broken systems, since most testing happens on RTT_ACCEPTED and not RELEASED composes. Back in January, the broken system detection didn't fire for the Seattle machines just because of that. Btw., happy to discuss this further, but perhaps the mailing list is a better place.
@Marian, we can't think of a short-term fix which helps both you and Paul's team. I'm sorry. Yet you mentioned that you and your team would want to look for a workaround. Maybe this is something both our teams could work on together? I don't know what you had in mind, but perhaps you guys can find a better way of detecting broken hardware? I think the mailing list would be a good place for an elaborate discussion.
Conclusion: We'll revert the patch introduced by Bug 1269076. I'll provide notice to Marian on the timeline.
FYI: The first test execution on a RHEL-7.3 nightly is in progress right now. The compose was good enough to gain RTT_ACCEPTED but then suffered several ABORTS in consecutive jobs. We usually do not tag nightly composes. However, this can also happen to a rel-eng compose, and we do tag rel-eng composes. In that case, using the RTT_ACCEPTED tag as part of the "breaking algorithm" would lead to many systems being falsely marked as broken.
From my point of view there isn't much to discuss. It seems to be a good idea to avoid such a scenario, for example by reverting back to RELEASED.
Reverted patch available on gerrit:
Beaker 22.3 has been released. Release Notes can be found here: https://beaker-project.org/docs/whats-new/release-22.html#beaker-22-3