Bug 958362

Summary: database deadlock in beakerd
Product: [Retired] Beaker
Component: scheduler
Version: 0.12
Hardware: Unspecified
OS: Unspecified
Reporter: Dan Callaghan <dcallagh>
Assignee: beaker-dev-list
CC: mastyk, mboswell, qwan, tools-bugs
Status: CLOSED EOL
Severity: unspecified
Priority: unspecified
Keywords: Triaged
Type: Bug
Doc Type: Bug Fix
Last Closed: 2020-04-21 08:47:41 UTC

Attachments:
latest deadlock (flags: none)

Description Dan Callaghan 2013-05-01 07:04:34 UTC
Since 0.12.0 was released, beakerd has occasionally been reporting deadlocks. There appear to be two distinct exceptions, which suggests they are the two halves of the deadlock.

2013-04-30 15:03:01,354 beakerd ERROR Error in process_new_recipe(865041)
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 131, in process_new_recipes
    session.commit()
[...]
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'INSERT INTO system_recipe_map (system_id, recipe_id) VALUES (%s, %s)' ((2566L, 865041L), (3190L, 865041L)) ... and a total of 96 bound parameter sets

2013-04-30 10:48:22,281 beakerd ERROR Error in update_dirty_job(410965)
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 102, in update_dirty_jobs
    update_dirty_job(job_id)
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 117, in update_dirty_job
    job.update_status()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4421, in update_status
    self._update_status()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4445, in _update_status
    recipeset._update_status()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4788, in _update_status
    recipe.cleanup()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5429, in cleanup
    self.resource.release()
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 6214, in release
    user=self.recipe.recipeset.job.owner)
  File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 2700, in unreserve
    session.flush()
[...]
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE system SET user_id=%s WHERE system.id = %s' (None, 4467L)

Comment 3 Nick Coghlan 2013-05-02 08:07:41 UTC
Trawling through this (and assuming I am reading it correctly), the main suspicious(!) piece of code I found is the call to self.resource.system.suspicious_abort() in the update code, which may then trigger a call to self.mark_broken(). The "session.flush()" in the second traceback is the first flush after that call.

If this is accurate, we would see a report of a suspicious abort in the logs just before the deadlocks occur.

Comment 4 Nick Coghlan 2013-05-15 01:54:32 UTC
We have some more recent occurrences of this, but there's no sign of any suspicious abort messages in the preceding log entries. Something else is going on :P

Comment 5 Raymond Mancy 2013-05-15 03:43:51 UTC
Created attachment 748048 [details]
latest deadlock

Comment 6 Raymond Mancy 2013-05-15 03:59:21 UTC
I've seen a couple of outputs from SHOW ENGINE INNODB STATUS now, but they have all been the same as the above attachment.

So my theory is that there is a session S1 that updates system A, then tries to update system B and waits. At the same time there is another session S2 (I'm assuming this is process_new_recipes) that gets an S lock on system B (by adding a row to system_recipe_map), then tries to get an S lock on system A and waits. This is what gives us the deadlock.

Still unsure of the code path that creates S1.
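
For the record, here is a minimal standalone sketch of that interleaving. It uses two toy tables standing in for system and system_recipe_map and a hypothetical DSN for a scratch MySQL database, so it is not Beaker's schema or the real code path, just the lock ordering described above: S2's INSERTs take shared (S) locks on the parent system rows through the foreign key check, S1's UPDATEs take exclusive (X) locks, and taking them in opposite order produces error 1213.

import threading
import time

from sqlalchemy import create_engine, text

# Hypothetical DSN for a throwaway database; any MySQL/InnoDB instance will do.
engine = create_engine("mysql://scratch:scratch@localhost/deadlock_demo")

with engine.begin() as conn:
    conn.execute(text("DROP TABLE IF EXISTS system_recipe_map"))
    conn.execute(text("DROP TABLE IF EXISTS `system`"))
    conn.execute(text("CREATE TABLE `system` (id INT PRIMARY KEY, user_id INT) ENGINE=InnoDB"))
    conn.execute(text(
        "CREATE TABLE system_recipe_map ("
        "  system_id INT, recipe_id INT,"
        "  FOREIGN KEY (system_id) REFERENCES `system` (id)) ENGINE=InnoDB"))
    conn.execute(text("INSERT INTO `system` (id, user_id) VALUES (1, NULL), (2, NULL)"))

def s1_like_update_dirty_job():
    # X lock on system 1, pause, then try for an X lock on system 2
    with engine.begin() as conn:
        conn.execute(text("UPDATE `system` SET user_id = 42 WHERE id = 1"))
        time.sleep(1)
        conn.execute(text("UPDATE `system` SET user_id = 42 WHERE id = 2"))

def s2_like_process_new_recipes():
    # S lock on system 2 via the FK check, pause, then try for an S lock on system 1
    with engine.begin() as conn:
        conn.execute(text("INSERT INTO system_recipe_map (system_id, recipe_id) VALUES (2, 100)"))
        time.sleep(1)
        conn.execute(text("INSERT INTO system_recipe_map (system_id, recipe_id) VALUES (1, 100)"))

t1 = threading.Thread(target=s1_like_update_dirty_job)
t2 = threading.Thread(target=s2_like_process_new_recipes)
t1.start(); t2.start()
t1.join(); t2.join()
# InnoDB detects the cycle and rolls back one side with
# OperationalError (1213, 'Deadlock found when trying to get lock; ...')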

Comment 7 Raymond Mancy 2013-05-15 04:25:32 UTC
OK, ncoghlan just enlightened me that it is most likely schedule_queued_recipes(). As of 0.12 we wrap the whole block in one big session, so all it takes is for S1 to schedule two systems that are also being inserted into system_recipe_map as part of S2. The timing could easily be such that we end up with a deadlock.

The above traceback (in update_dirty_jobs) happens for the exact same reason: updating multiple system rows in one transaction.
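
(As an aside, the MySQL error text literally says "try restarting transaction", so the generic band-aid for this class of error is to catch 1213 and re-run the whole unit of work in a fresh transaction. The sketch below is purely illustrative of that pattern, not Beaker's code; the decorator and the stub names are made up.)

import functools
import time

from sqlalchemy.exc import OperationalError

MYSQL_DEADLOCK = 1213

def retry_on_deadlock(retries=3, delay=0.5):
    """Re-run the wrapped unit of work if InnoDB picks it as the deadlock victim."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except OperationalError as e:
                    # e.orig is the DB-API exception; for MySQLdb, args[0] is the error code
                    if e.orig.args[0] != MYSQL_DEADLOCK or attempt == retries - 1:
                        raise
                    # back off a little; the wrapped function must start a
                    # fresh session/transaction on each call for this to work
                    time.sleep(delay * (attempt + 1))
        return wrapper
    return decorator

@retry_on_deadlock()
def update_dirty_job_illustrative(job_id):
    ...  # open a new session, load the job, call update_status(), commit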

Comment 8 Nick Coghlan 2013-06-13 07:23:03 UTC
Reproducer posted as http://gerrit.beaker-project.org/#/c/2023/

Comment 9 Nick Coghlan 2013-06-14 00:35:10 UTC
This appears to be a consequence of the way we currently do scheduling, which requires a cached mapping of recipes to possible systems to attain reasonable performance (the system_recipe_map table).

Eliminating the need for that cache is one of the goals of the event-driven scheduling [1] redesign. We can live with this misbehaviour until that is in place.

[1] http://beaker-project.org/dev/proposals/event-driven-scheduler.html

Comment 10 Martin Styk 2020-04-21 08:47:41 UTC
Closing this issue.
We are not planning to address this problem in the Beaker development lifecycle.

Instead, we plan to continue our effort on building Beaker.NEXT.
If you have any questions, feel free to reach out to me.

Best regards,
Martin Styk