Since 0.12.0 was released, beakerd has occasionally been reporting deadlocks. There appear to be two distinct exceptions, which suggests they are the two halves of the deadlock.

2013-04-30 15:03:01,354 beakerd ERROR Error in process_new_recipe(865041)
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 131, in process_new_recipes
    session.commit()
[...]
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO system_recipe_map (system_id, recipe_id) VALUES (%s, %s)' ((2566L, 865041L), (3190L, 865041L)) ... and a total of 96 bound parameter sets

2013-04-30 10:48:22,281 beakerd ERROR Error in update_dirty_job(410965)
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 102, in update_dirty_jobs
    update_dirty_job(job_id)
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 117, in update_dirty_job
    job.update_status()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4421, in update_status
    self._update_status()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4445, in _update_status
    recipeset._update_status()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4788, in _update_status
    recipe.cleanup()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5429, in cleanup
    self.resource.release()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 6214, in release
    user=self.recipe.recipeset.job.owner)
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 2700, in unreserve
    session.flush()
[...]
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE system SET user_id=%s WHERE system.id = %s' (None, 4467L)
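For context, the "try restarting transaction" advice in the error message amounts to catching MySQL error 1213, rolling back and re-running the unit of work. A rough sketch of what that looks like in application code (this is not how beakerd is structured today; run_with_deadlock_retry, the session and the work callable are placeholders for illustration only):

import time

from sqlalchemy.exc import OperationalError

MYSQL_DEADLOCK = 1213

def run_with_deadlock_retry(session, work, attempts=3):
    # 'session' is a SQLAlchemy session and 'work' is a callable performing
    # one unit of work against it; both are placeholders for this sketch.
    for attempt in range(attempts):
        try:
            work(session)
            session.commit()
            return
        except OperationalError as e:
            session.rollback()
            # For MySQLdb/pymysql the DBAPI error code is args[0] on the
            # wrapped exception; anything other than a deadlock is re-raised.
            if e.orig.args[0] != MYSQL_DEADLOCK or attempt == attempts - 1:
                raise
            time.sleep(0.1 * (attempt + 1))    # brief back-off before retrying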
Trawling through this (and assuming I am reading it correctly), the main suspicious(!) piece of code I found is the call to self.resource.system.suspicious_abort() in the update code, which may then trigger a call to self.mark_broken(). The "session.flush()" in the second traceback is the first flush after that call. If this is accurate, we would expect to see a report of a suspicious abort in the logs just before each deadlock occurs.
We have some more recent occurrences of this, and there's no sign of any suspicious-abort messages in the preceding log entries. Something else is going on :P
Created attachment 748048 [details]
latest deadlock
I've seen a couple of outputs from SHOW ENGINE INNODB STATUS now, but they have all been the same as the above attachment. So my theory is that there is a session S1 that updates system A, then tries to update system B and waits. At the same time there is another session S2 (I'm assuming this is process_new_recipes) that gets an S lock on system B (via adding a row in system_recipe_map), then tries to get an S lock on system A and waits. This is what gives us the deadlock. I'm still unsure of the code path that creates S1.
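For reference, the LATEST DETECTED DEADLOCK section can be pulled out of the monitor output with something like the following (a generic diagnostic snippet, not part of Beaker; it assumes the PROCESS privilege, the connection URL is a placeholder, and the section parsing is only approximate):

import re

from sqlalchemy import create_engine, text

engine = create_engine('mysql://beaker:beaker@localhost/beaker')

def latest_deadlock():
    with engine.connect() as conn:
        # SHOW ENGINE INNODB STATUS returns a single row; the monitor text is
        # in the third column (Status).
        status = conn.execute(text('SHOW ENGINE INNODB STATUS')).fetchone()[2]
    # Grab everything between the LATEST DETECTED DEADLOCK header and the
    # next section header (a line of dashes followed by an uppercase title).
    match = re.search(r'LATEST DETECTED DEADLOCK\n-+\n(.*?)\n-+\n[A-Z]',
                      status, re.S)
    return match.group(1) if match else None

if __name__ == '__main__':
    print(latest_deadlock() or 'no deadlock recorded since last restart')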
OK, ncoghlan just enlightened me that it is most likely in schedule_queued_recipes(). As of 0.12 we now wrap the whole block in one big session, so all it takes is for S1 to schedule two systems that are also being inserted into system_recipe_map as part of S2. The timing could easily be such that we end up with a deadlock. The above traceback (in update_dirty_jobs) happens for the exact same reason: updating multiple system rows in one transaction.
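To make the suspected interleaving concrete, here is a minimal standalone sketch (this is not the posted reproducer; it assumes a scratch MySQL/InnoDB database with simplified system, recipe and system_recipe_map tables, foreign keys from system_recipe_map to the other two, and pre-existing rows for the hard-coded ids; the connection URL is a placeholder):

import threading
import time

from sqlalchemy import create_engine, text

# Placeholders: a scratch database and row ids that already exist in it.
engine = create_engine('mysql://beaker:beaker@localhost/beaker_scratch')
SYSTEM_A, SYSTEM_B, RECIPE = 1, 2, 100

def session_s1():
    # Plays the role of schedule_queued_recipes / update_dirty_jobs:
    # several system rows updated inside one transaction.
    with engine.connect() as conn:
        with conn.begin():
            conn.execute(text('UPDATE system SET user_id = NULL WHERE id = :id'),
                         {'id': SYSTEM_A})              # X lock on system A
            time.sleep(1)                               # let S2 lock system B
            conn.execute(text('UPDATE system SET user_id = NULL WHERE id = :id'),
                         {'id': SYSTEM_B})              # blocks on S2's S lock

def session_s2():
    # Plays the role of process_new_recipes: each INSERT takes a shared lock
    # on the referenced system row for the foreign key check.
    with engine.connect() as conn:
        with conn.begin():
            time.sleep(0.5)                             # let S1 lock system A
            conn.execute(text('INSERT INTO system_recipe_map (system_id, recipe_id) '
                              'VALUES (:sid, :rid)'),
                         {'sid': SYSTEM_B, 'rid': RECIPE})   # S lock on system B
            time.sleep(1)
            conn.execute(text('INSERT INTO system_recipe_map (system_id, recipe_id) '
                              'VALUES (:sid, :rid)'),
                         {'sid': SYSTEM_A, 'rid': RECIPE})   # waits on S1's X lock

t1 = threading.Thread(target=session_s1)
t2 = threading.Thread(target=session_s2)
t1.start(); t2.start()
t1.join(); t2.join()    # one of the two threads dies with OperationalError 1213

With the sleeps forcing S1 to lock system A before S2 locks system B, the two transactions end up waiting on each other, InnoDB detects the cycle and rolls one of them back with error 1213, matching the tracebacks above.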
Reproducer posted as http://gerrit.beaker-project.org/#/c/2023/
This appears to be a consequence of the way we currently do scheduling, which requires a cached mapping of recipes to possible systems to attain reasonable performance (the system_recipe_map table). Eliminating the need for that cache is one of the goals of the event-driven scheduling [1] redesign. We can live with this misbehaviour until that is in place.

[1] http://beaker-project.org/dev/proposals/event-driven-scheduler.html
Closing this issue. We are not planning to address this problem within the Beaker development lifecycle; instead, we are focusing our efforts on building Beaker.NEXT. If you have any questions, feel free to reach out to me.

Best regards,
Martin Styk