Bug 221228 - [RFE] randomize selection of system in reserve workflow
Status: CLOSED DUPLICATE of bug 738701
Product: Beaker
Classification: Community
Component: scheduler
Version: 0.7
Hardware: All Linux
Priority: medium Severity: medium
Assigned To: Dan Callaghan
UX
Keywords: FutureFeature, Reopened
Depends On:
Blocks:
Reported: 2007-01-02 19:52 EST by Jan Kratochvil
Modified: 2014-08-12 00:34 EDT

See Also:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
: 738701 (view as bug list)
Environment:
Last Closed: 2012-10-03 02:15:54 EDT


Attachments: None
Description Jan Kratochvil 2007-01-02 19:52:04 EST
Description of problem:
Sometimes the reservesys installation hits an EWD timeout.  I have found it is
best to schedule the same installation/reservation multiple times, as one of the
machines may succeed.  If you wait until it aborts, the retry gets scheduled on
the same incompatible combination.

Sorry for the lack of log references; the existing jobs are too messy to find
them in (Bug 221226).

Steps to Reproduce:
1. Schedule an OS/arch combination which does not report
   "No (Machines|Distros) match for recipe" but times out.
2. Wait till it EWDs after about 2 hours.
3. Schedule the same OS/arch combination.
  
Actual results:
The same machine gets assigned and times out again.

Expected results:
A different machine would get assigned, in a round-robin mode.

Additional info:
I would expect someone to be checking these failed installations and blacklisting
the incompatible OS/arch/machine pairs, either manually or even automatically
(before they get rechecked and corrected by a lab admin). Judging by the
repeated EWD failures, I do not think that is happening.
Comment 1 Bill Peck 2010-04-07 08:37:49 EDT
Notice:

Legacy RHTS is soon to be retired and replaced by Beaker. As part of
this migration process all RHTS bugs need to be re-verified against a
Beaker instance by the cut-off date

   5pm Monday April 12th UTC-4. 

Please confirm this bug is still relevant to Beaker by re-verifying it
against the stage deployment of Beaker https://beaker-stage.app.eng.bos.redhat.com.

To keep this bug open, please comment on it.

If it has not received a comment by that date the bug will be closed/wontfix.

After the cutoff date all commented bugs will be moved to the Beaker
product.


thank you
Comment 2 Bill Peck 2010-04-14 10:51:55 EDT
The RHTS product is going away.  If any of these bugs still apply to the new Beaker system which is replacing RHTS then please open a new bug with Beaker as the component.

Thanks!
Comment 3 Jan Kratochvil 2011-04-23 01:42:45 EDT
I tried Beaker to see whether it would be better, but this exact problem still happens:

Everything: RHEL5.7-Server-20110413.1_nfs

J:76729 R:157025 all   ibm-js21-01.rhts.englab.brq.redhat.com -> Aborted
  blade: connection failed, retrying after 30 second delay

Therefore I clicked `Report problem with system', which created:
  [engineering.redhat.com #108428] AutoReply: Problem reported for
                                   ibm-js21-01.rhts.englab.brq.redhat.com

J:76739 R:157049 ppc64 ibm-js21-01.rhts.englab.brq.redhat.com -> Running...
  blade: connection failed, retrying after 30 second delay

Why do I get assigned the same system when others are free, and when I have even reported a problem with this specific system via the Beaker GUI?

So I had to create a new job _before_ J:76739 aborted, and it succeeded with the same distro/arch:

J:76740 R:157050 ppc64 ibm-js12-vios-01-lp3.rhts.eng.bos.redhat.com -> reserved

And then J:76739 Aborted, but this way I was able to avoid the broken ibm-js21-01.rhts.englab.brq.redhat.com.
Comment 5 Bill Peck 2011-09-15 10:28:34 EDT
Possible solutions to the "report problem with system":

1 - In addition to the email sent to the owner, we also move the system status
   to "Broken". The downside is that someone could DoS Beaker by moving all
   systems to Broken, but we would have a record in History showing who did it.

2 - Instead of sending the email we could auto-submit a job
    that runs a set of released distros on that machine (with a high priority).
    If this job failed it would automatically email the owner and flip the 
    machine to broken.


I think number 2 is a great solution in that it already gives the admin some data about whether the system installs released distros properly.

I can bring this up in the stakeholder meeting.
Comment 6 Frank Ch. Eigler 2011-09-15 10:42:23 EDT
Those sound good, Bill.  Have you considered also randomizing the
assignment of jobs to systems, so that problematic machines are less
likely to be hit again and again while someone is retrying, or the
above procedure is under way?
Comment 7 Bill Peck 2011-09-15 10:45:56 EDT
I spoke to the admins, and the suggestion is to present options 1 and 2 together and leave the choice up to the user:

Describe the problem you are having with system:
[                                         ]
[                                         ]
[                                         ]

(X) Leave system Active, the issue is annoying but not enough to prevent
    other jobs from running.
( ) Move system to broken status.
( ) Submit a job of released distros to verify issue with system.
    (if more than 2 released distros fail the system will automatically
     be marked broken)

[Submit Report]



Also, you can randomize your machine choice by adding the following to your job XML:

<autopick random="true"/>
Comment 8 Dan Callaghan 2012-10-03 01:31:52 EDT
This is implemented in Beaker now:

* If powering a system fails, we immediately mark it as Broken and inform the user.
* Users can add <autopick random="True" /> inside <recipe/> in their job XML to make the scheduler pick a random system.
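
For illustration, a minimal sketch of where the element sits in a reservation job. Only the <autopick random="True" /> element inside <recipe/> comes from this bug. The whiteboard text, the ppc64 architecture filter, and the two task names are assumptions about a typical reserve job and should be adjusted to match your own job XML:

```xml
<job>
  <!-- Whiteboard text is illustrative only -->
  <whiteboard>Reserve a ppc64 system, picked at random</whiteboard>
  <recipeSet>
    <recipe>
      <!-- Ask the scheduler to pick randomly among the matching systems -->
      <autopick random="True" />
      <!-- Assumed filter: any distro available for ppc64 -->
      <distroRequires>
        <distro_arch op="=" value="ppc64" />
      </distroRequires>
      <hostRequires />
      <!-- Assumed tasks: install the distro, then hold the system -->
      <task name="/distribution/install" />
      <task name="/distribution/reservesys" />
    </recipe>
  </recipeSet>
</job>
```

Random picking does not exclude a broken system. It only makes it less likely that a retry lands on the same one.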
Comment 9 Jan Kratochvil 2012-10-03 01:35:21 EDT
(In reply to comment #8)
> * Users can add <autopick random="True" /> inside <recipe/> in their job XML
> to make the scheduler pick a random system.

How can I do it from https://beaker.engineering.redhat.com/reserveworkflow ?
Comment 10 Dan Callaghan 2012-10-03 02:13:12 EDT
(In reply to comment #9)
> How can I do it from https://beaker.engineering.redhat.com/reserveworkflow ?

Currently you can't. We could add a checkbox to the Reserve Workflow for randomizing the system selection.
Comment 11 Dan Callaghan 2012-10-03 02:15:54 EDT
Oops, just noticed that we already have this.

*** This bug has been marked as a duplicate of bug 738701 ***
