589325 – Failed to provision recipeid 8, 'No watchdog exists for recipe 8'

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Beaker
Classification:	Retired
Component:	scheduler
Sub Component:
Version:	0.5
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Bill Peck
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	545868

Reported:	2010-05-05 20:42 UTC by Zack Cerza
Modified:	2011-09-28 15:34 UTC
CC List:	8 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2010-10-14 03:13:07 UTC
Embargoed:

Description Zack Cerza 2010-05-05 20:42:06 UTC

A job of mine aborted with this message: "Failed to provision recipeid 8, 'No watchdog exists for recipe 8'"

I have no idea what this means. I guess either the job shouldn't have failed, or the error message should be more informative.

https://beaker.engineering.redhat.com/jobs/6

Comment 1 Bill Peck 2010-05-05 21:27:16 UTC

I am aware of the problem. I believe it was due to a misconfiguration on the DB server. Verifying now.

Comment 2 Bill Peck 2010-05-20 15:10:16 UTC

This was due to a misconfiguration in the DB: it was not configured for transactions. Now it is.

Comment 3 Han Pingtian 2010-07-27 07:29:19 UTC

I hit this problem with some jobs, such as this one:

https://beaker.engineering.redhat.com/recipes/16452

Comment 4 Marian Csontos 2010-08-12 13:13:15 UTC

It's still around:

https://beaker.engineering.redhat.com/recipes/22029

Comment 5 Raymond Mancy 2010-08-12 23:19:42 UTC

Hmmm. Unless it's urgent I might wait for Bill to get back to have a look at this.

Comment 6 Marian Csontos 2010-08-13 04:37:50 UTC

It happens here and there; urgent looks different.

Comment 7 Jeff Burke 2010-08-13 13:10:08 UTC

I am seeing the same issue. It happened in the Xen testing in the KT1 tests.
See RecipeSet ID RS:19422.

Comment 8 Marian Csontos 2010-08-13 13:17:07 UTC

Is it RHTS taking machine from under our ...?
...control?

Workaround: reschedule the job.

Comment 9 Jan Hutař 2010-08-23 07:18:40 UTC

Hello,
I still see the issue:

https://beaker.engineering.redhat.com/recipes/24423
https://beaker.engineering.redhat.com/recipes/24424
https://beaker.engineering.redhat.com/recipes/24425
https://beaker.engineering.redhat.com/recipes/24424

Would it be possible to somehow fix this so it is transparent for users? 

Thanks in advance,
Jan

Comment 10 Bill Peck 2010-09-29 19:22:00 UTC

I finally tracked this one down.  Not an easy one to debug.

Here are two scenarios. Scenario one works because it's the only recipe being acted on in the loop. Scenario two fails because, when there are multiple recipes, session.close() doesn't get called until we leave the loop.

1)
- The scheduler notices a free system for a recipe.
- Between the time it enters the loop and the time it does the atomic operation to reserve the system, the system is taken by another user.
- The atomic operation fails and we call session.rollback().
- We leave the loop and call session.close().

2)
- The scheduler notices a free system for each of a couple of recipes.
- The same thing happens as above, except that after the rollback for the first recipe, the reservation succeeds on the second recipe.
- The problem is that we don't call session.close() until we are outside of the loop.
Here is the progression of the calls:
session.begin()
session.rollback()
session.begin()
session.commit()
session.close()

That last commit and close seem to revert our previous rollback!

The correct calls are:

session.begin()
session.rollback()
session.close()
session.begin()
session.commit()
session.close()

But now we hit another problem. The original recipes object is from outside of this session, so when we try to save anything back to a recipe object that originated from the outside session, it bombs!

The correct thing seems to be this:

recipes = sqlalchemy query of recipes that have a matching free system
for _recipe in recipes:
    session.begin()
    recipe = Recipe.by_id(_recipe.id)
    if atomic operation to reserve system:
        session.commit()
    else:
        session.rollback()
    session.close()

Notice that we create the recipe object inside our new session and only use the id from the original list to get it. The original list is fine to query from; we just can't save anything back with it.
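
For anyone who wants to play with the pattern, here is a minimal runnable sketch of the loop described above. It is not the Beaker code: it uses Python's standard-library sqlite3 instead of SQLAlchemy so it is self-contained, and the table layout, the function names (connect, create_demo_db, schedule_queued) and the before_reserve hook are all made up for illustration. The conditional UPDATE stands in for the "atomic operation to reserve system", and opening and closing a connection per recipe stands in for session.begin() and session.close().

```python
import os
import sqlite3
import tempfile


def connect(path):
    # isolation_level=None disables implicit transactions, so every
    # BEGIN / COMMIT / ROLLBACK below is issued explicitly.
    return sqlite3.connect(path, isolation_level=None)


def create_demo_db(path):
    # Hypothetical schema: two free systems, two queued recipes.
    conn = connect(path)
    conn.executescript(
        """
        CREATE TABLE system (id INTEGER PRIMARY KEY, owner TEXT);
        CREATE TABLE recipe (id INTEGER PRIMARY KEY, system_id INTEGER, status TEXT);
        INSERT INTO system VALUES (1, NULL), (2, NULL);
        INSERT INTO recipe VALUES (8, 1, 'queued'), (9, 2, 'queued');
        """
    )
    conn.close()


def schedule_queued(path, before_reserve=None):
    # Outer query: only the ids are kept; nothing is written back through it.
    conn = connect(path)
    recipe_ids = [
        row[0]
        for row in conn.execute(
            "SELECT r.id FROM recipe r JOIN system s ON s.id = r.system_id "
            "WHERE r.status = 'queued' AND s.owner IS NULL ORDER BY r.id"
        )
    ]
    conn.close()

    results = {}
    for recipe_id in recipe_ids:
        if before_reserve is not None:
            before_reserve(recipe_id)  # hook to simulate another user racing us

        conn = connect(path)  # fresh "session" for this recipe only
        try:
            conn.execute("BEGIN IMMEDIATE")
            # Reload the recipe by id inside the new transaction.
            (system_id,) = conn.execute(
                "SELECT system_id FROM recipe WHERE id = ?", (recipe_id,)
            ).fetchone()
            conn.execute(
                "UPDATE recipe SET status = 'scheduled' WHERE id = ?", (recipe_id,)
            )
            # Atomic reservation: succeeds only if the system is still free.
            cur = conn.execute(
                "UPDATE system SET owner = ? WHERE id = ? AND owner IS NULL",
                (f"recipe-{recipe_id}", system_id),
            )
            if cur.rowcount == 1:
                conn.execute("COMMIT")
                results[recipe_id] = True
            else:
                conn.execute("ROLLBACK")  # also undoes the status change
                results[recipe_id] = False
        finally:
            conn.close()  # closed on every iteration, not after the loop
    return results
```

To reproduce scenario 2, pass a before_reserve hook that sets the owner of system 1 just before recipe 8 is processed. Recipe 8 is then rolled back and stays queued, while recipe 9 commits independently in its own transaction.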

Here is the working diff:

http://git.fedorahosted.org/git/?p=beaker.git;a=commitdiff;h=e3318d7b9932c522ff6da3a6d051d1a831cd70ff
