Beaker can leave guest recipes in the Waiting state even when their host recipe is Cancelled, Completed, or Aborted. (I have examples of each.) Still need to figure out if this is another symptom of the status update race condition issues (bug 807237 etc), or if there is an actual flaw in our logic somewhere.
(In reply to comment #0)
> Beaker can leave guest recipes in the Waiting state even when their host
> recipe is Cancelled, Completed, or Aborted. (I have examples of each.)

Scratch that, I only have examples where the host Aborted, so it might be a problem specific to that.
So we just need to abort the guest recipes when a host recipe is aborted. Cancelling is fine because it only happens at the job or recipe-set level, and completion we explicitly *don't* want to propagate to guests because of the case where the host "completes" but the guests are still running. But it probably makes sense to leave this bug until bug 807237 is done, since all the status updating code will (hopefully) get cleaned up for that bug.
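A rough sketch of the rule described above, to make the intended behaviour concrete. This is not Beaker's actual model code: the `Recipe` class, the `Status` enum, and the `abort()`/`complete()` methods here are hypothetical stand-ins. The only point is that an abort on the host cascades to any unfinished guests, while completion of the host is deliberately not propagated.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    WAITING = "Waiting"
    RUNNING = "Running"
    COMPLETED = "Completed"
    CANCELLED = "Cancelled"
    ABORTED = "Aborted"


# Statuses from which a recipe should never be moved again.
FINISHED = {Status.COMPLETED, Status.CANCELLED, Status.ABORTED}


@dataclass
class Recipe:
    """Hypothetical stand-in for a host or guest recipe."""
    name: str
    status: Status = Status.WAITING
    guests: List["Recipe"] = field(default_factory=list)

    def abort(self) -> None:
        """Abort this recipe and cascade the abort to unfinished guests."""
        if self.status in FINISHED:
            return
        self.status = Status.ABORTED
        for guest in self.guests:
            guest.abort()  # guests that already finished are left alone

    def complete(self) -> None:
        """Complete this recipe only; guests may legitimately still be running."""
        if self.status in FINISHED:
            return
        self.status = Status.COMPLETED


host = Recipe("host", Status.RUNNING,
              guests=[Recipe("guest1"), Recipe("guest2", Status.COMPLETED)])
host.abort()
for r in [host] + host.guests:
    print(r.name, r.status.value)
```

With this rule a guest can no longer sit in Waiting behind an Aborted host, and a guest that had already finished keeps its original status.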
*** Bug 911670 has been marked as a duplicate of this bug. ***
As pointed out on bug 911670, the host system is never returned if this bug is hit, which makes this quite a serious waste of system time.
Dan,

We are hitting this issue daily. I don't have permission to cancel these jobs. They have been stale for two days now. I am not sure if they will ever time out and return these hosts.

Currently Jarod has a machine in this funky state:
https://beaker.engineering.redhat.com/jobs/384963
RecipeSet ID RS:658080 System dell-per710-01.lab.bos.redhat.com

Phillip has several hosts that have been tied up since the 26th:
https://beaker.engineering.redhat.com/jobs/384256
RecipeSet ID RS:656971 hp-z620-01.lab.bos.redhat.com
RecipeSet ID RS:656981 intel-canoepass-03.lab.bos.redhat.com
RecipeSet ID RS:656982 dell-per820-02.lab.bos.redhat.com
RecipeSet ID RS:656983 amd-dinar-06.lab.bos.redhat.com
RecipeSet ID RS:656984 amd-pike-02.lab.bos.redhat.com
RecipeSet ID RS:656991 hp-rx8640-02.rhts.eng.bos.redhat.com

I think the only thing I can do at this point is have the maintainers cancel each one of the stale recipes. Otherwise I am not sure how the hosts will get used for additional jobs.

Thanks,
Jeff
Min,

Can you please evaluate this BZ for inclusion in 0.12 or a hotfix? The bug that this one depends on is scheduled for 0.12.

Thanks,
Jeff
Please excuse my ugly query, which shows outstanding machines that are stuck in limbo:

>>> recipesets = set([watchdog.recipe.recipeset for watchdog in Watchdog.query.filter(Watchdog.kill_time==None)])
>>> for rs in recipesets:
...     for recipe in rs.recipes:
...         abort = Watchdog.query.filter(Watchdog.recipe_id == recipe.id).filter(Watchdog.kill_time != None).first()
...         if abort:
...             print abort.recipe.recipeset.id, abort.recipe.id, abort.recipe.finish_time
...
656982 807244 2013-02-26 19:24:34
658080 808555 2013-02-28 00:51:06
656983 807247 2013-02-26 19:30:35
656971 807227 2013-02-26 19:35:36
658095 808578 2013-02-28 02:29:38
658232 808763 2013-02-28 04:22:36
646757 794424 2013-02-09 00:42:40
648948 797182 2013-02-13 19:05:23
658322 808876 2013-02-28 10:08:31
656984 807250 2013-02-26 17:47:15
658224 808743 2013-02-28 06:21:40
656981 807241 2013-02-26 19:33:37
658143 808630 2013-02-28 05:27:55
654598 804213 2013-02-23 00:44:02
654598 804214 2013-02-23 00:44:02
647649 795485 2013-02-11 15:48:58
656991 807263 2013-02-27 00:47:21
646756 794421 2013-02-09 00:35:04
656636 806779 2013-02-26 13:03:43
649704 798121 2013-02-15 18:03:49
On Gerrit: http://gerrit.beaker-project.org/1814
Beaker 0.12 has been released.