The fact that this recipe has been deleted is just making it slightly more inconvenient to see what's going on, it's not the cause of the problem.
I think this is the same problem as bug 663698 and bug 645873, where a recipe is cancelled but the system and watchdog are not removed.
Looking in the database, we noticed that the recipe was cancelled, and had NULL start_time and finish_time, which means that it never reached the 'Running' state. However a system had been assigned and a watchdog created, which means it had been through the the queued_recipes routine in beakerd. So the recipe was probably cancelled in the 'Scheduled' state, before the scheduled_recipes routine had a chance to actually start it.
What I still can't figure out is why the system was not released when the recipe was cancelled. That should be happening in RecipeSet._update_status.
One possible way to mitigate these problems would be to let the external watchdog clean them up eventually. We noticed that the watchdog row for this recipe had NULL kill_time, because the queued_recipes routine in beakerd does not set the kill_time when it initially creates the watchdog. Bill, is there any reason we can't set a kill_time of (say) 2 hours in the future when we first create the watchdog?
I've returned the two systems in question.
The workaround is to use the bkr job-cancel command to cancel the already-cancelled recipe sets. Unfortunately, once the recipe has been deleted the only way to find its corresponding recipe set ID is to look in the beaker database.
thanks for returning the machines for me. Would it make sense to scan the DB for other recipes with NULL kill_time, just in case other machines are currently blocked by this issue?
(In reply to comment #3)
Yes, we will keep an eye out for similarly "stuck" systems in the database, until we can fix this properly.
This SQL query will find watchdogs corresponding to completed recipes:
SELECT watchdog.*, recipe.status_id FROM watchdog INNER JOIN recipe ON watchdog.recipe_id = recipe.id WHERE kill_time IS NULL AND recipe.status_id >= 7;
In our Beaker instance that finds 13 "stuck" watchdogs, which I have just cleaned up. So this is definitely a persistent problem.
Could this be a race condition due to transaction isolation? In one transaction beakerd is scheduling the recipe and assigning it a system, while at the same time the user has cancelled the recipe so another transaction calls recipe.release_system() -- but that sees no system has been assigned yet, so it becomes a no-op. Then the transactions are serialised and we get a Cancelled recipe with a system and watchdog still assigned to it.
At the very least, we should probably lock the recipe row in release_system() before we check whether it has a system. Though this makes me wonder where else we might have similar race conditions -- should we actually be SELECTing recipes FOR UPDATE at the top of every loop in beakerd? And what about recipe sets and jobs?...
*** Bug 740397 has been marked as a duplicate of this bug. ***
I just triggered this on stage, cancelling a job via the web UI as beakerd was scheduling it.
2011-09-22 21:19:54,252 beakerd INFO recipe ID 7441 moved from Processed to Queued
2011-09-22 21:19:54,336 beakerd DEBUG Exiting processed_recipes routine
2011-09-22 21:20:00,634 beakerd DEBUG Entering queued_recipes routine
2011-09-22 21:20:00,704 beakerd DEBUG System : a is available for Recipe 7441
2011-09-22 21:20:05,513 beakerd DEBUG Created watchdog for recipe id: 7441 and system: a
2011-09-22 21:20:05,533 beakerd INFO recipe ID 7441 moved from Queued to Scheduled
2011-09-22 21:20:05,636 beakerd DEBUG Exiting queued_recipes routine
2011-09-22 21:20:05,672 beakerd DEBUG Entering scheduled_recipes routine
2011-09-22 21:20:05,672 beakerd INFO scheduled_recipes: RS:5303
2011-09-22 21:20:06,935 beakerd ERROR Failed to commit in scheduled_recipes
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') u'UPDATE job SET status_id=%s WHERE job.id = %s' [6L, 3813L]
Cancelling the RS freed up the machine and got it moving again.
*** Bug 746955 has been marked as a duplicate of this bug. ***
*** Bug 749743 has been marked as a duplicate of this bug. ***
The reason we are prone to deadlocks here is that when we update the status of a job or any of its components, we are not always operating at the same level in the hierarchy. Cancelling happens at the job or recipe set level, and so the status change is made there first and then bubbles down, whereas beakerd operates (mostly) at the recipe level and so the change begins at the recipe and bubbles up and down.
A golden rule of writing concurrent code is to always acquire resources in the same order in each thread, to avoid deadlock. Unfortunately we are not doing that at present.
I think the cleanest solution is to store and update the status only at the recipe-task level, and make the status of everything else (recipe, recipe set, job) be computed from those. It remains to be seen if we can do that without compromising performance or efficiency.
(In reply to comment #11)
> I think the cleanest solution is to store and update the status only at the
> recipe-task level, and make the status of everything else (recipe, recipe set,
> job) be computed from those. It remains to be seen if we can do that without
> compromising performance or efficiency.
This would also allow us to re-order our code such that operations on recipes are serialized properly, preventing the situation where beakerd assigns a system to a recipe at the same time as a user cancels it, which is what this bug is really about.
*** Bug 663698 has been marked as a duplicate of this bug. ***
*** Bug 711686 has been marked as a duplicate of this bug. ***
*** Bug 921898 has been marked as a duplicate of this bug. ***
I am going to dupe this onto bug 807237. It's the same underlying cause (race condition in updating recipe status) even though the end result is different in this bug. But I have a test case which reproduces this, and the solution for bug 807237 fixes this one too.
*** This bug has been marked as a duplicate of bug 807237 ***