A job of mine aborted with this message: "Failed to provision recipeid 8, 'No watchdog exists for recipe 8'"
I have no idea what this means. Either the job shouldn't have failed, or the error message should be more informative.
I am aware of the problem. I believe it was due to a misconfiguration on the DB server; verifying now.
This was due to a misconfiguration in the DB. The DB was not configured for transactions; now it is.
I hit this problem with some jobs. Such as this one:
It's still around:
Hmmm. Unless it's urgent I might wait for Bill to get back to have a look at this.
Happens here and there, urgent looks different.
I am seeing the same issue. It happened during the xen testing in the KT1 tests.
See RecipeSet ID RS:19422
Is it RHTS taking machine from under our ...?
Workaround: reschedule the job.
I still see the issue:
Would it be possible to somehow fix this so it is transparent for users?
Thanks in advance,
I finally tracked this one down. Not an easy one to debug.
Here are two scenarios. Scenario one works because it's the only recipe being acted on in the loop. Scenario two fails because, when there are multiple recipes, session.close() doesn't get called until we leave the loop.
Scenario one:
- Scheduler notices a free system for a recipe.
- Between the time it enters the loop and the time it does the atomic operation to reserve the system, the system is taken by another user.
- The atomic operation fails and we call session.rollback().
- We leave the loop and call session.close().
Scenario two:
- Scheduler notices a free system for a couple of recipes.
- The same thing happens as above, except that after the rollback for the first recipe, the reservation succeeds for the second recipe.
- The problem is that we don't call session.close() until we are outside of the loop.
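To make the "atomic operation to reserve the system" step concrete, here is a minimal sketch of one way such a reservation can be written. It uses the standard-library sqlite3 module rather than SQLAlchemy, and the `system` table, the `owner` column and the `reserve_system` helper are hypothetical names chosen for illustration, not the scheduler's real schema or code.

```python
import sqlite3


def reserve_system(conn, system_id, owner):
    """Try to claim a system; return True only if this caller got it."""
    # The WHERE clause makes the claim conditional: the row is updated
    # only if nobody owns the system at the moment the statement runs.
    cur = conn.execute(
        "UPDATE system SET owner = ? WHERE id = ? AND owner IS NULL",
        (owner, system_id),
    )
    if cur.rowcount == 1:
        conn.commit()
        return True
    # Somebody else took the system first: undo and report failure.
    conn.rollback()
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE system (id INTEGER PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO system (id, owner) VALUES (1, NULL)")
conn.commit()

first = reserve_system(conn, 1, "scheduler")      # system is free
second = reserve_system(conn, 1, "another_user")  # system already taken
```

In this sketch the second caller loses the race and its transaction is rolled back, which corresponds to the failed reservation in both scenarios.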
Here is the progression of the calls:
That last commit and close seem to revert our previous rollback!
The correct calls are:
But now we hit another problem. The original recipes object is from outside of this session, so when we try to save anything back to a recipe object that originated from the outside session, it fails!
The correct thing seems to be this:
recipes = sqlalchemy query of recipes that have a matching free system
for _recipe in recipes:
    recipe = Recipe.by_id(_recipe.id)
    if atomic operation to reserve system:
Notice that we create the recipe object inside our new session and use only the id from the original list to fetch it. The original list is fine to query from; we just can't save anything back through it.
Here is the working diff