Bug 958362 - database deadlock in beakerd
Summary: database deadlock in beakerd
Keywords:
Status: CLOSED EOL
Alias: None
Product: Beaker
Classification: Retired
Component: scheduler
Version: 0.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: beaker-dev-list
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-05-01 07:04 UTC by Dan Callaghan
Modified: 2020-04-21 08:59 UTC
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-04-21 08:47:41 UTC
Embargoed:


Attachments
latest deadlock (4.88 KB, text/plain)
2013-05-15 03:43 UTC, Raymond Mancy


Links
Red Hat Bugzilla 950895 (high, CLOSED): database error in schedule_queued_recipe causes beakerd to die: SAVEPOINT sa_savepoint_9 does not exist (last updated 2021-02-22 00:41:40 UTC)
Red Hat Bugzilla 962920 (private): last updated 2021-01-20 06:05:38 UTC

Internal Links: 950895 962920

Description Dan Callaghan 2013-05-01 07:04:34 UTC
Since 0.12.0 was released, beakerd has occasionally been reporting deadlocks. There appear to be two distinct exceptions, which suggests they are the two halves of the deadlock.

2013-04-30 15:03:01,354 beakerd ERROR Error in process_new_recipe(865041)
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 131, in process_new_recipes
    session.commit()
[...]
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO system_recipe_map (system_id, recipe_id) VALUES (%s, %s)' ((2566L, 865041L), (3190L, 865041L)) ... and a total of 96 bound parameter sets

2013-04-30 10:48:22,281 beakerd ERROR Error in update_dirty_job(410965)
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 102, in update_dirty_jobs
    update_dirty_job(job_id)
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 117, in update_dirty_job
    job.update_status()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4421, in update_status
    self._update_status()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4445, in _update_status
    recipeset._update_status()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4788, in _update_status
    recipe.cleanup()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5429, in cleanup
    self.resource.release()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 6214, in release
    user=self.recipe.recipeset.job.owner)
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 2700, in unreserve
    session.flush()
[...]
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE system SET user_id=%s WHERE system.id = %s' (None, 4467L)
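
For illustration only, and not Beaker's actual fix: the usual generic mitigation for MySQL error 1213 is to roll back and retry the whole unit of work. A minimal sketch, assuming SQLAlchemy with a MySQLdb-style driver; the session factory and retry policy here are hypothetical:

    import time

    from sqlalchemy.exc import OperationalError

    MYSQL_DEADLOCK = 1213  # 'Deadlock found when trying to get lock'

    def run_with_deadlock_retry(session_factory, work, max_attempts=3):
        """Run work(session) in its own transaction, retrying on InnoDB deadlock."""
        for attempt in range(1, max_attempts + 1):
            session = session_factory()
            try:
                work(session)
                session.commit()
                return
            except OperationalError as e:
                session.rollback()
                # For MySQLdb-style drivers, e.orig.args[0] is the MySQL error code.
                if e.orig.args[0] == MYSQL_DEADLOCK and attempt < max_attempts:
                    time.sleep(0.5 * attempt)  # brief back-off before retrying
                    continue
                raise
            finally:
                session.close()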

Comment 3 Nick Coghlan 2013-05-02 08:07:41 UTC
Trawling through this (and assuming I am reading it correctly), the main suspicious(!) piece of code I found is the call to self.resource.system.suspicious_abort() in the update code, which may then trigger a call to self.mark_broken(). The "session.flush()" in the second traceback is the first flush after that call.

If this is accurate, we would see a report of a suspicious abort in the logs just before the deadlocks occur.

Comment 4 Nick Coghlan 2013-05-15 01:54:32 UTC
We have some more recent occurrences of this and there's no sign of any suspicious abort messages in the preceding log entries. Something else is going on :P

Comment 5 Raymond Mancy 2013-05-15 03:43:51 UTC
Created attachment 748048 [details]
latest deadlock

Comment 6 Raymond Mancy 2013-05-15 03:59:21 UTC
I've seen a couple of outputs from SHOW ENGINE INNODB STATUS now, but they have all been the same as the above attachment.

So my theory is that there is a session S1 that updates system A, then tries to update system B and waits. At the same time, there is another session S2 (I'm assuming this is process_new_recipes) that gets an S lock on system B (via adding a row to system_recipe_map), then tries to get an S lock on system A and waits. This is what gives us the deadlock.

Still unsure of the code path that creates S1.
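
To make that interleaving concrete, here is a minimal sketch of the lock-ordering pattern, assuming SQLAlchemy 1.4+ and a hypothetical DSN. It uses plain UPDATEs on both sides for simplicity; in the real S2 the shared locks come from the foreign-key checks on the INSERTs into system_recipe_map, but the lock cycle has the same shape:

    import threading

    from sqlalchemy import create_engine, text

    engine = create_engine("mysql://user:pass@localhost/example")  # hypothetical DSN
    barrier = threading.Barrier(2)

    def touch_systems(first_id, second_id):
        with engine.connect() as conn:
            with conn.begin():
                conn.execute(text("UPDATE system SET user_id = NULL WHERE id = :id"),
                             {"id": first_id})
                barrier.wait()  # both transactions now hold one row lock each
                # Each transaction now requests the row the other one holds;
                # InnoDB detects the cycle and aborts one of them with error 1213.
                conn.execute(text("UPDATE system SET user_id = NULL WHERE id = :id"),
                             {"id": second_id})

    s1 = threading.Thread(target=touch_systems, args=(1, 2))  # plays the role of S1
    s2 = threading.Thread(target=touch_systems, args=(2, 1))  # plays the role of S2
    s1.start(); s2.start()
    s1.join(); s2.join()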

Comment 7 Raymond Mancy 2013-05-15 04:25:32 UTC
OK, ncoghlan just enlightened me that it is most likely in schedule_queued_recipes(). As of 0.12 we wrap the whole block in one big session, so all it takes is for S1 to schedule two systems that are also being inserted into system_recipe_map as part of S2. The timing could easily be such that we end up with a deadlock.

The above traceback (in update_dirty_jobs) happens for the exact same reason: updating multiple system rows in one transaction.
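
Not what was done here (see comment 9 below), but as an illustration of why the ordering matters: one generic way to avoid this class of deadlock is to make every transaction acquire its system row locks in a consistent order, e.g. sorted by primary key. A hypothetical sketch, where the helper name and the recipe.systems relationship are stand-ins rather than Beaker's scheduler code:

    def map_candidate_systems(session, recipe, candidate_systems):
        # Sorting by primary key means any two concurrent transactions request
        # the same row locks in the same order, so one simply queues behind the
        # other instead of forming a cycle that InnoDB has to break.
        for system in sorted(candidate_systems, key=lambda s: s.id):
            recipe.systems.append(system)
        session.flush()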

Comment 8 Nick Coghlan 2013-06-13 07:23:03 UTC
Reproducer posted at http://gerrit.beaker-project.org/#/c/2023/

Comment 9 Nick Coghlan 2013-06-14 00:35:10 UTC
This appears to be a consequence of the way we currently do scheduling, which requires a cached mapping of recipes to possible systems to attain reasonable performance (the system_recipe_map table).

Eliminating the need for that cache is one of the goals of the event-driven scheduling [1] redesign. We can live with this misbehaviour until that is in place.

[1] http://beaker-project.org/dev/proposals/event-driven-scheduler.html

Comment 10 Martin Styk 2020-04-21 08:47:41 UTC
Closing this issue.
We are not planning to address this problem within the Beaker development lifecycle.

Instead, we plan to continue our efforts on building Beaker.NEXT.
If you have any questions, feel free to reach out to me.

Best regards,
Martin Styk

