Bug 589325 - Failed to provision recipeid 8, 'No watchdog exists for recipe 8'
Product: Beaker
Classification: Community
Component: scheduler
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Bill Peck
Status: Reopened
Depends On:
Blocks: 545868
Reported: 2010-05-05 16:42 EDT by Zack Cerza
Modified: 2011-09-28 11:34 EDT
CC: 8 users

Doc Type: Bug Fix
Last Closed: 2010-10-13 23:13:07 EDT

Attachments: None
Description Zack Cerza 2010-05-05 16:42:06 EDT
A job of mine aborted with this message: "Failed to provision recipeid 8, 'No watchdog exists for recipe 8'"

No clue what this means; I guess either it shouldn't have failed, or the error message should be more informative.

Comment 1 Bill Peck 2010-05-05 17:27:16 EDT
I am aware of the problem. I believe it was due to a misconfiguration on the DB server. Verifying now.
Comment 2 Bill Peck 2010-05-20 11:10:16 EDT
This was due to a misconfiguration in the DB: the DB was not configured for transactions. Now it is.
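The comment doesn't say what the misconfiguration was, but assuming a MySQL backend (Beaker's usual deployment), a common cause of a database "not configured for transactions" is tables created with the non-transactional MyISAM engine instead of InnoDB. A check and fix might look like this (the `beaker` schema name and `recipe` table are assumptions):

```sql
-- List any non-transactional tables in the schema
SELECT table_name, engine
FROM information_schema.tables
WHERE table_schema = 'beaker' AND engine <> 'InnoDB';

-- Convert an offending table to the transactional InnoDB engine
ALTER TABLE recipe ENGINE = InnoDB;
```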
Comment 3 Han Pingtian 2010-07-27 03:29:19 EDT
I hit this problem with some jobs, such as this one:

Comment 4 Marian Csontos 2010-08-12 09:13:15 EDT
It's still around:

Comment 5 Raymond Mancy 2010-08-12 19:19:42 EDT
Hmmm. Unless it's urgent I might wait for Bill to get back to have a look at this.
Comment 6 Marian Csontos 2010-08-13 00:37:50 EDT
Happens here and there, urgent looks different.
Comment 7 Jeff Burke 2010-08-13 09:10:08 EDT
I am seeing the same issue. It happened on the xen testing in the KT1 tests.
See RecipeSet ID RS:19422
Comment 8 Marian Csontos 2010-08-13 09:17:07 EDT
Is it RHTS taking machine from under our ...?

Workaround: reschedule the job.
Comment 9 Jan Hutař 2010-08-23 03:18:40 EDT
I still see the issue:


Would it be possible to somehow fix this so it is transparent for users? 

Thanks in advance,
Comment 10 Bill Peck 2010-09-29 15:22:00 EDT
I finally tracked this one down. Not an easy one to debug.

Here are two scenarios. Scenario one works because it's the only recipe being acted on in the loop. Scenario two fails because, with multiple recipes, session.close() doesn't get called until we leave the loop.

Scenario 1:
- The scheduler notices a free system for a recipe.
- Between the time it enters the loop and the time it performs the atomic operation to reserve the system, the system is taken by another user.
- The atomic operation fails and we call session.rollback().
- We leave the loop and call session.close().

Scenario 2:
- The scheduler notices free systems for a couple of recipes.
- The same thing happens as above, except that after the rollback for the first recipe, the reservation succeeds for the second recipe.
- The problem is that we don't call session.close() until we are outside the loop.
Here is the progression of the calls:

That last commit and close seem to revert our previous rollback!

The correct calls are:
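A minimal sketch of that per-iteration discipline, using Python's stdlib sqlite3 as a stand-in for the scheduler's session (the systems table and the try_reserve helper are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE systems (id INTEGER PRIMARY KEY, reserved_by INTEGER)")
conn.execute("INSERT INTO systems VALUES (1, NULL), (2, NULL)")
conn.commit()

def try_reserve(conn, system_id, recipe_id):
    """Stand-in for the atomic reservation; recipe 8 loses the race."""
    if recipe_id == 8:
        return False
    conn.execute("UPDATE systems SET reserved_by = ? WHERE id = ?",
                 (recipe_id, system_id))
    return True

# Commit or roll back inside the loop, once per recipe, so a later
# success cannot disturb an earlier failure; close only after the loop.
for system_id, recipe_id in [(1, 8), (2, 9)]:
    if try_reserve(conn, system_id, recipe_id):
        conn.commit()
    else:
        conn.rollback()

rows = conn.execute("SELECT id, reserved_by FROM systems ORDER BY id").fetchall()
print(rows)  # [(1, None), (2, 9)] -- recipe 8's system stays free
conn.close()
```

With the buggy ordering, the single commit after the loop was issued on a session that had already rolled back work for the failed recipe, which is what produced the "revert our previous rollback" effect described above.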


But now we hit another problem: the original recipes object comes from outside this session, so when we try to save anything back to a recipe object that originated from the outside session, it bombs!

The correct thing seems to be this:

recipes = sqlalchemy query of recipes that have a matching free system
for _recipe in recipes:
 recipe = Recipe.by_id(_recipe.id)
 if atomic operation to reserve system:

Notice we create the recipe object from inside our new session and only use the id from the original list to fetch it. The original list is fine to query from; we just can't save anything back through it.
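That pattern can be sketched with SQLAlchemy; the Recipe model, the queued/scheduled statuses, and the in-memory SQLite engine below are illustrative stand-ins, not Beaker's real schema:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Recipe(Base):
    __tablename__ = "recipe"
    id = Column(Integer, primary_key=True)
    status = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

# The candidate list comes from an outer session; querying through it is fine.
outer = Session()
outer.add_all([Recipe(id=8, status="queued"), Recipe(id=9, status="queued")])
outer.commit()
candidates = outer.query(Recipe).filter_by(status="queued").all()

# Inside the scheduler's own session, re-fetch each recipe by id before
# mutating it; never write through the outer session's objects.
session = Session()
for _recipe in candidates:
    recipe = session.get(Recipe, _recipe.id)  # fresh copy owned by this session
    recipe.status = "scheduled"
    session.commit()
session.close()
```

Each ORM object belongs to one session's identity map, which is why writes through the outer session's objects bomb while reads of their ids are harmless.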

Here is the working diff:

