903935 – guest recipes remain stuck in Waiting even though their host recipe is finished

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	scheduler
Sub Component:
Version:	0.11
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	0.12
Assignee:	Dan Callaghan
QA Contact:	Raymond Mancy
Docs Contact:
URL:
Whiteboard:	Misc
Duplicates (1):	911670
Depends On:	807237
Blocks:

Reported:	2013-01-25 05:07 UTC by Dan Callaghan
Modified:	2018-02-06 00:41 UTC
CC List:	10 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-04-11 04:56:46 UTC
Embargoed:

Description Dan Callaghan 2013-01-25 05:07:39 UTC

Beaker can leave guest recipes in the Waiting state even when their host recipe is Cancelled, Completed, or Aborted. (I have examples of each.)

Still need to figure out if this is another symptom of the status update race condition issues (bug 807237 etc), or if there is an actual flaw in our logic somewhere.

Comment 1 Dan Callaghan 2013-01-25 05:40:43 UTC

(In reply to comment #0)
> Beaker can leave guest recipes in the Waiting state even when their host
> recipe is Cancelled, Completed, or Aborted. (I have examples of each.)

Scratch that, I only have examples where the host Aborted, so it might be a problem specific to that.

Comment 3 Dan Callaghan 2013-02-01 00:48:17 UTC

So we just need to abort the guest recipes when a host recipe is aborted. Cancelling is fine because it only happens at the job or recipe-set level, and completion we explicitly *don't* want to propagate to guests because of the case where the host "completes" but the guests are still running.

But it probably makes sense to leave this bug until bug 807237 is done, since all the status updating code will (hopefully) get cleaned up for that bug.
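
The propagation rule described in this comment can be sketched as follows. This is only an illustration under stated assumptions, not Beaker's actual scheduler code: the `Recipe` class, the `finish_host` function, and the status strings are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List

# Statuses treated as final in this sketch (assumed names, not Beaker's enum).
FINISHED = {"Completed", "Cancelled", "Aborted"}


@dataclass
class Recipe:
    """Hypothetical stand-in for a recipe; not the real Beaker model class."""
    name: str
    status: str = "Waiting"
    guests: List["Recipe"] = field(default_factory=list)


def finish_host(host: Recipe, new_status: str) -> None:
    """Give the host a final status and propagate it to guests only on abort."""
    if new_status not in FINISHED:
        raise ValueError("not a final status: %r" % new_status)
    host.status = new_status
    if new_status == "Aborted":
        # Abort every guest that has not already finished.
        for guest in host.guests:
            if guest.status not in FINISHED:
                guest.status = "Aborted"
    # "Completed" is deliberately not propagated: guests may still be running.
    # "Cancelled" is not handled here: cancellation is applied at the job or
    # recipe-set level, which already covers the guests.
```

With this rule, a guest that is still Waiting when its host aborts ends up Aborted, while a guest that is still running when its host completes is left alone.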

Comment 5 Dan Callaghan 2013-02-17 21:56:39 UTC

*** Bug 911670 has been marked as a duplicate of this bug. ***

Comment 6 Dan Callaghan 2013-02-17 22:00:04 UTC

As pointed out on bug 911670, the host system is never returned if this bug is hit, which makes this quite a serious waste of system time.

Comment 7 Jeff Burke 2013-02-28 15:21:49 UTC

Dan,
 We are hitting this issue daily. I don't have permission to cancel these jobs. They have been stale for two days now. I am not sure if they will ever time out and return these hosts.

Currently Jarod has a machine in this funky state.
 https://beaker.engineering.redhat.com/jobs/384963
  RecipeSet ID RS:658080
  System dell-per710-01.lab.bos.redhat.com

Phillip has several hosts that have been tied up since the 26th.
 https://beaker.engineering.redhat.com/jobs/384256
  RecipeSet ID RS:656971
  hp-z620-01.lab.bos.redhat.com

  RecipeSet ID RS:656981
  intel-canoepass-03.lab.bos.redhat.com

  RecipeSet ID RS:656982
  dell-per820-02.lab.bos.redhat.com

  RecipeSet ID RS:656983
  amd-dinar-06.lab.bos.redhat.com

  RecipeSet ID RS:656984
  amd-pike-02.lab.bos.redhat.com

  RecipeSet ID RS:656991
  hp-rx8640-02.rhts.eng.bos.redhat.com

I think the only thing I can do at this point is have the maintainers cancel each one of the recipes that are stale. Otherwise I am not sure how the hosts will get used for additional jobs.

Thanks,
Jeff

Comment 8 Jeff Burke 2013-02-28 15:28:26 UTC

Min,
 Can you please evaluate this BZ for inclusion in 0.12 or a hotfix? The bug that this one depends on is scheduled for 0.12.

Thanks,
Jeff

Comment 9 Bill Peck 2013-02-28 16:53:16 UTC

Please excuse my ugly query, which shows the outstanding machines that are stuck in limbo:

>>> recipesets = set([watchdog.recipe.recipeset for watchdog in Watchdog.query.filter(Watchdog.kill_time == None)])
>>> for rs in recipesets:
...     for recipe in rs.recipes:
...         # look for a watchdog in the same recipe set that does have a kill_time
...         abort = Watchdog.query.filter(Watchdog.recipe_id == recipe.id).filter(Watchdog.kill_time != None).first()
...         if abort:
...             # columns: recipe set id, recipe id, recipe finish time
...             print abort.recipe.recipeset.id, abort.recipe.id, abort.recipe.finish_time
...
656982 807244 2013-02-26 19:24:34
658080 808555 2013-02-28 00:51:06
656983 807247 2013-02-26 19:30:35
656971 807227 2013-02-26 19:35:36
658095 808578 2013-02-28 02:29:38
658232 808763 2013-02-28 04:22:36
646757 794424 2013-02-09 00:42:40
648948 797182 2013-02-13 19:05:23
658322 808876 2013-02-28 10:08:31
656984 807250 2013-02-26 17:47:15
658224 808743 2013-02-28 06:21:40
656981 807241 2013-02-26 19:33:37
658143 808630 2013-02-28 05:27:55
654598 804213 2013-02-23 00:44:02
654598 804214 2013-02-23 00:44:02
647649 795485 2013-02-11 15:48:58
656991 807263 2013-02-27 00:47:21
646756 794421 2013-02-09 00:35:04
656636 806779 2013-02-26 13:03:43
649704 798121 2013-02-15 18:03:49
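
The selection logic of the query above can be restated on plain in-memory data, which may make it easier to follow without a Beaker database. This is a rough sketch only: `WatchdogRow` and `stuck_recipes` are hypothetical names, and the IDs in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class WatchdogRow:
    """Hypothetical flattened view of one watchdog; not the Beaker model."""
    recipeset_id: int
    recipe_id: int
    kill_time: Optional[datetime]


def stuck_recipes(rows: List[WatchdogRow]) -> List[Tuple[int, int]]:
    """Return (recipeset_id, recipe_id) pairs for watchdogs that have a
    kill_time but share a recipe set with a watchdog that has none."""
    # Recipe sets that contain at least one watchdog without a kill_time.
    pending_sets = {r.recipeset_id for r in rows if r.kill_time is None}
    return sorted(
        (r.recipeset_id, r.recipe_id)
        for r in rows
        if r.kill_time is not None and r.recipeset_id in pending_sets
    )
```

For example, given one recipe set with a watchdog that has a kill time and another watchdog that has none, `stuck_recipes` reports the former; a recipe set whose watchdogs all have kill times is not reported.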

Comment 10 Dan Callaghan 2013-03-21 10:14:15 UTC

On Gerrit: http://gerrit.beaker-project.org/1814

Comment 13 Dan Callaghan 2013-04-11 04:56:46 UTC

Beaker 0.12 has been released.
