Bug 221228 - [RFE] randomize selection of system in reserve workflow
Status: CLOSED DUPLICATE of bug 738701
Alias: None
Product: Beaker
Classification: Community
Component: scheduler
Version: 0.7
Hardware: All
OS: Linux
Importance: medium
Target Milestone: ---
Assignee: Dan Callaghan
QA Contact:
Whiteboard: UX
Keywords: FutureFeature, Reopened
Depends On:
Reported: 2007-01-03 00:52 UTC by Jan Kratochvil
Modified: 2014-08-12 04:34 UTC
CC: 8 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
: 738701
Last Closed: 2012-10-03 06:15:54 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Jan Kratochvil 2007-01-03 00:52:04 UTC
Description of problem:
Sometimes the reservesys installation hits the EWD (external watchdog) timeout.
I found it is best to schedule the same installation/reservation multiple times,
as one of the machines may succeed.  If you instead wait until the job Aborts,
the same incompatible combination gets scheduled again on the retry.

Sorry for the lack of log references; the existing jobs are too messy to dig them out of (Bug

Steps to Reproduce:
1. Schedule an OS/arch combination which does not report
   "No (Machines|Distros) match for recipe" but times out.
2. Wait until it hits the EWD timeout after about 2 hours.
3. Schedule the same OS/arch combination.
Actual results:
The same machine gets assigned and times out again.

Expected results:
A different machine gets assigned, in round-robin fashion.

Additional info:
I would expect someone to check these failed installations and blacklist the
incompatible OS/arch/machine pairs, either manually or even automatically
(until a lab admin rechecks and corrects them).  Judging from the repeating
EWD failures, I do not believe this is happening.

Comment 1 Bill Peck 2010-04-07 12:37:49 UTC

Legacy RHTS is soon to be retired and replaced by Beaker.  As part of
this migration process all RHTS bugs need to be re-verified against a
Beaker instance by the cut-off date:

   5pm Monday April 12th UTC-4. 

Please confirm this bug is still relevant to Beaker by re-verifying it
against the stage deployment of Beaker at https://beaker-stage.app.eng.bos.redhat.com.

To keep this bug open, please comment on it.

If it has not received a comment by that date, the bug will be closed WONTFIX.

After the cutoff date all commented bugs will be moved to the Beaker product.

Thank you.

Comment 2 Bill Peck 2010-04-14 14:51:55 UTC
The RHTS product is going away.  If any of these bugs still apply to the new Beaker system, which is replacing RHTS, please open a new bug with Beaker as the component.


Comment 3 Jan Kratochvil 2011-04-23 05:42:45 UTC
I tried Beaker to see if it would be better, but this exact problem still happens:

Everything: RHEL5.7-Server-20110413.1_nfs

J:76729 R:157025 all   ibm-js21-01.rhts.englab.brq.redhat.com -> Aborted
  blade: connection failed, retrying after 30 second delay

Therefore I clicked "Report problem with system", which created:
  [engineering.redhat.com #108428] AutoReply: Problem reported for

J:76739 R:157049 ppc64 ibm-js21-01.rhts.englab.brq.redhat.com -> Running...
  blade: connection failed, retrying after 30 second delay

Why do I get assigned the same system when others are free, even though I have reported a problem with this specific system via the Beaker GUI?

So I had to create a new Job _before_ J:76739 Aborted; it succeeded with the same distro/arch:

J:76740 R:157050 ppc64 ibm-js12-vios-01-lp3.rhts.eng.bos.redhat.com -> reserved

And then J:76739 Aborted, but this way I was able to avoid the broken ibm-js21-01.rhts.englab.brq.redhat.com.

Comment 5 Bill Peck 2011-09-15 14:28:34 UTC
Possible solutions to the "report problem with system":

1 - In addition to the email sent to the owner, we also move the status to "Broken".
    The downside is that someone could DoS Beaker by moving all systems
    to Broken, but we have a record in History that says who did it.

2 - Instead of sending the email we could auto-submit a job
    that runs a set of released distros on that machine (with a high priority).
    If this job failed it would automatically email the owner and flip the 
    machine to broken.

I think number 2 is a great solution, in that it gives the admin some data up front about whether the system installs properly on released distros.

I can bring this up in the stakeholder meeting.

Comment 6 Frank Ch. Eigler 2011-09-15 14:42:23 UTC
Those sound good, Bill.  Have you considered also randomizing the
assignment of jobs to systems, so that problematic machines are less
likely to be hit again and again while someone is retrying, or the
above procedure is under way?

Comment 7 Bill Peck 2011-09-15 14:45:56 UTC
I spoke to the admins and the suggestion is to leave the choice up to the user; we could present options 1 and 2 together:

Describe the problem you are having with system:
[                                         ]
[                                         ]
[                                         ]

(X) Leave system Active, the issue is annoying but not enough to prevent
    other jobs from running.
( ) Move system to broken status.
( ) Submit a job of released distros to verify issue with system.
    (if more than 2 released distros fail the system will automatically
     be marked broken)

[Submit Report]

Also, you can randomize the machine choice by adding the following to your job XML:

<autopick random="true"/>

Comment 8 Dan Callaghan 2012-10-03 05:31:52 UTC
This is implemented in Beaker now:

* If powering a system fails, we immediately mark it as Broken and inform the user.
* Users can add <autopick random="True" /> inside <recipe/> in their job XML to make the scheduler pick a random system.
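
A minimal job XML sketch showing where the element sits inside a recipe (the whiteboard text, distro filter values, and task list below are illustrative assumptions, not taken from this bug):

```xml
<job>
  <whiteboard>reserve with random system selection</whiteboard>
  <recipeSet>
    <recipe>
      <!-- ask the scheduler to pick a random eligible system
           rather than always preferring the same one -->
      <autopick random="true"/>
      <distroRequires>
        <distro_family op="=" value="RedHatEnterpriseLinuxServer5"/>
        <distro_arch op="=" value="ppc64"/>
      </distroRequires>
      <task name="/distribution/install" role="STANDALONE"/>
      <task name="/distribution/reservesys" role="STANDALONE"/>
    </recipe>
  </recipeSet>
</job>
```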

Comment 9 Jan Kratochvil 2012-10-03 05:35:21 UTC
(In reply to comment #8)
> * Users can add <autopick random="True" /> inside <recipe/> in their job XML
> to make the scheduler pick a random system.

How can I do it from https://beaker.engineering.redhat.com/reserveworkflow ?

Comment 10 Dan Callaghan 2012-10-03 06:13:12 UTC
(In reply to comment #9)
> How can I do it from https://beaker.engineering.redhat.com/reserveworkflow ?

Currently you can't. We could add a checkbox to the Reserve Workflow for randomizing the system selection.

Comment 11 Dan Callaghan 2012-10-03 06:15:54 UTC
Oops, just noticed that we already have this.

*** This bug has been marked as a duplicate of bug 738701 ***
