Bug 1065202 - database deadlock between updating recipe.status and recipe.log_server
Summary: database deadlock between updating recipe.status and recipe.log_server
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: scheduler
Version: 0.15
Hardware: Unspecified
OS: Unspecified
medium
low
Target Milestone: 25.5
Assignee: Dan Callaghan
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-02-14 05:36 UTC by xjia
Modified: 2018-07-23 06:40 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-23 06:40:53 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Beaker Project Gerrit 6131 0 None None None 2018-06-07 07:46:42 UTC

Description xjia 2014-02-14 05:36:48 UTC
Description of problem:
Go through beakerd.log on beaker-devel env. Find one deadlock error. Please analyse it. We may get more information from production environment.

Feb 12 08:49:13 beaker-devel beakerd[17795]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(3293)
Feb 12 08:49:13 beaker-devel beakerd[17795]:  Traceback (most recent call last):
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 97, in update_dirty_jobs
Feb 12 08:49:13 beaker-devel beakerd[17795]:      update_dirty_job(job_id)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 112, in update_dirty_job
Feb 12 08:49:13 beaker-devel beakerd[17795]:      job.update_status()
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5053, in update_status
Feb 12 08:49:13 beaker-devel beakerd[17795]:      self._update_status()
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5077, in _update_status
Feb 12 08:49:13 beaker-devel beakerd[17795]:      recipeset._update_status()
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5482, in _update_status
Feb 12 08:49:13 beaker-devel beakerd[17795]:      recipe._update_status()
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5988, in _update_status
Feb 12 08:49:13 beaker-devel beakerd[17795]:      status_changed = self._change_status(min_status)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4406, in _change_status
Feb 12 08:49:13 beaker-devel beakerd[17795]:      .update({'status': new_status}, synchronize_session=False) \
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2313, in update
Feb 12 08:49:13 beaker-devel beakerd[17795]:      result = session.execute(update_stmt, params=self._params)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 724, in execute
Feb 12 08:49:13 beaker-devel beakerd[17795]:      clause, params or {})
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1191, in execute
Feb 12 08:49:13 beaker-devel beakerd[17795]:      params)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
Feb 12 08:49:13 beaker-devel beakerd[17795]:      return self.__execute_context(context)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
Feb 12 08:49:13 beaker-devel beakerd[17795]:      context.parameters[0], context=context)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
Feb 12 08:49:13 beaker-devel beakerd[17795]:      context)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute
Feb 12 08:49:13 beaker-devel beakerd[17795]:      context)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py", line 299, in do_execute
Feb 12 08:49:13 beaker-devel beakerd[17795]:      cursor.execute(statement, parameters)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 173, in execute
Feb 12 08:49:13 beaker-devel beakerd[17795]:      self.errorhandler(self, exc, value)
Feb 12 08:49:13 beaker-devel beakerd[17795]:    File "/usr/lib64/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
Feb 12 08:49:13 beaker-devel beakerd[17795]:      raise errorclass, errorvalue
Feb 12 08:49:13 beaker-devel beakerd[17795]:  OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE recipe SET status=%s WHERE recipe.status = %s AND recipe.id = %s' ('Running', 'Waiting', 5375L)


Version-Release number of selected component (if applicable):
0.15.3

How reproducible:
Hardly

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Dan Callaghan 2014-02-14 06:20:11 UTC
InnoDB deadlock info follows:

------------------------
LATEST DETECTED DEADLOCK
------------------------
140212  3:49:13
*** (1) TRANSACTION:
TRANSACTION 0 200444509, ACTIVE 0 sec, process no 1534, OS thread id 140639819372288 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 368, 1 row lock(s)
MySQL thread id 22230, query id 14316996 beaker-devel.app.eng.bos.redhat.com 10.16.101.20 beaker_devel Updating
UPDATE recipe SET status='Running' WHERE recipe.status = 'Waiting' AND recipe.id = 5375
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 117 page no 253 n bits 72 index `PRIMARY` of table `beaker_devel`.`recipe` trx id 0 200444509 lock_mode X locks rec but not gap waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 29; compact format; info bits 0
 0: len 4; hex 800014ff; asc     ;; 1: len 6; hex 00000bf28a24; asc      $;; 2: len 7; hex 00000000390f61; asc     9 a;; 3: len 4; hex 8000120c; asc     ;; 4: len 4; hex 80001dcd; asc     ;; 5: len 4; hex 80000e60; asc    `;; 6: len 1; hex 01; asc  ;; 7: len 1; hex 05; asc  ;; 8: SQL NULL; 9: SQL NULL; 10: len 30; hex 3c686f737452657175697265733e3c73797374656d5f747970652076616c; asc <hostRequires><system_type val;...(truncated); 11: len 30; hex 3c64697374726f52657175697265733e3c64697374726f5f66616d696c79; asc <distroRequires><distro_family;...(truncated); 12: SQL NULL; 13: len 14; hex 6d616368696e655f726563697065; asc machine_recipe;; 14: len 4; hex 80000003; asc     ;; 15: len 4; hex 80000000; asc     ;; 16: len 4; hex 80000000; asc     ;; 17: len 4; hex 80000000; asc     ;; 18: len 4; hex 80000000; asc     ;; 19: len 10; hex 4e6f7365205465737473; asc Nose Tests;; 20: len 16; hex 73656c696e75783d64697361626c6564; asc selinux=disabled;; 21: SQL NULL; 22: SQL NULL; 23: len 4; hex 4e6f6e65; asc None;; 24: SQL NULL; 25: len 25; hex 3c706172746974696f6e733e3c2f706172746974696f6e733e; asc <partitions></partitions>;; 26: len 1; hex 80; asc  ;; 27: SQL NULL; 28: len 1; hex 02; asc  ;;

*** (2) TRANSACTION:
TRANSACTION 0 200444506, ACTIVE 0 sec, process no 1534, OS thread id 140639512151808 starting index read, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
9 lock struct(s), heap size 1216, 7 row lock(s), undo log entries 2
MySQL thread id 22234, query id 14317000 beaker-devel.app.eng.bos.redhat.com 10.16.101.20 beaker_devel Updating
UPDATE recipe SET log_server='lab-devel-02.rhts.eng.bos.redhat.com' WHERE recipe.id = 5375
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 117 page no 253 n bits 72 index `PRIMARY` of table `beaker_devel`.`recipe` trx id 0 200444506 lock mode S locks rec but not gap
Record lock, heap no 3 PHYSICAL RECORD: n_fields 29; compact format; info bits 0
 0: len 4; hex 800014ff; asc     ;; 1: len 6; hex 00000bf28a24; asc      $;; 2: len 7; hex 00000000390f61; asc     9 a;; 3: len 4; hex 8000120c; asc     ;; 4: len 4; hex 80001dcd; asc     ;; 5: len 4; hex 80000e60; asc    `;; 6: len 1; hex 01; asc  ;; 7: len 1; hex 05; asc  ;; 8: SQL NULL; 9: SQL NULL; 10: len 30; hex 3c686f737452657175697265733e3c73797374656d5f747970652076616c; asc <hostRequires><system_type val;...(truncated); 11: len 30; hex 3c64697374726f52657175697265733e3c64697374726f5f66616d696c79; asc <distroRequires><distro_family;...(truncated); 12: SQL NULL; 13: len 14; hex 6d616368696e655f726563697065; asc machine_recipe;; 14: len 4; hex 80000003; asc     ;; 15: len 4; hex 80000000; asc     ;; 16: len 4; hex 80000000; asc     ;; 17: len 4; hex 80000000; asc     ;; 18: len 4; hex 80000000; asc     ;; 19: len 10; hex 4e6f7365205465737473; asc Nose Tests;; 20: len 16; hex 73656c696e75783d64697361626c6564; asc selinux=disabled;; 21: SQL NULL; 22: SQL NULL; 23: len 4; hex 4e6f6e65; asc None;; 24: SQL NULL; 25: len 25; hex 3c706172746974696f6e733e3c2f706172746974696f6e733e; asc <partitions></partitions>;; 26: len 1; hex 80; asc  ;; 27: SQL NULL; 28: len 1; hex 02; asc  ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 117 page no 253 n bits 72 index `PRIMARY` of table `beaker_devel`.`recipe` trx id 0 200444506 lock_mode X locks rec but not gap waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 29; compact format; info bits 0
 0: len 4; hex 800014ff; asc     ;; 1: len 6; hex 00000bf28a24; asc      $;; 2: len 7; hex 00000000390f61; asc     9 a;; 3: len 4; hex 8000120c; asc     ;; 4: len 4; hex 80001dcd; asc     ;; 5: len 4; hex 80000e60; asc    `;; 6: len 1; hex 01; asc  ;; 7: len 1; hex 05; asc  ;; 8: SQL NULL; 9: SQL NULL; 10: len 30; hex 3c686f737452657175697265733e3c73797374656d5f747970652076616c; asc <hostRequires><system_type val;...(truncated); 11: len 30; hex 3c64697374726f52657175697265733e3c64697374726f5f66616d696c79; asc <distroRequires><distro_family;...(truncated); 12: SQL NULL; 13: len 14; hex 6d616368696e655f726563697065; asc machine_recipe;; 14: len 4; hex 80000003; asc     ;; 15: len 4; hex 80000000; asc     ;; 16: len 4; hex 80000000; asc     ;; 17: len 4; hex 80000000; asc     ;; 18: len 4; hex 80000000; asc     ;; 19: len 10; hex 4e6f7365205465737473; asc Nose Tests;; 20: len 16; hex 73656c696e75783d64697361626c6564; asc selinux=disabled;; 21: SQL NULL; 22: SQL NULL; 23: len 4; hex 4e6f6e65; asc None;; 24: SQL NULL; 25: len 25; hex 3c706172746974696f6e733e3c2f706172746974696f6e733e; asc <partitions></partitions>;; 26: len 1; hex 80; asc  ;; 27: SQL NULL; 28: len 1; hex 02; asc  ;;

*** WE ROLL BACK TRANSACTION (1)

Comment 3 Nick Coghlan 2014-02-14 07:12:07 UTC
Dan has analysed this, and as per the updated BZ description, there's currently a genuine race condition between beaker-watchdog setting the log_server and the job status being updated.

If it is the job status update that gets aborted, this is harmless: the status will just be updated on the next iteration of the scheduler loop.

If it is the log_server update that is aborted, then the logs for that job will remain on the lab controller instead of eventually being moved to the archive server by beaker-transfer.

We should still try to eliminate the deadlock, but the practical impact of the fault is currently low.

Comment 5 xjia 2014-02-14 07:19:38 UTC
OK,due to #comment 3, lower the priority and severity.

Comment 7 Dan Callaghan 2018-06-07 07:46:43 UTC
Looking again at comment 2 -- the deadlock is between update_dirty_job() and one of the XMLRPC methods to register a log file (recipes.register_file, recipetasks.register_file, recipetasks.register_result_file).

The register_file XMLRPC method already holds a shared lock on the recipe row, because it SELECT'ed the recipe row before UPDATE'ing it. Then update_dirty_job() wants an exclusive lock on the recipe row, and at the same time the register_file method wants to upgrade its lock to exclusive to complete its UPDATE.

The register_file XMLRPC methods should either avoid SELECT'ing the row out first, or else SELECT...FOR UPDATE it. It should really be the latter, because they are doing checks based on the state of the recipe anyway.

https://gerrit.beaker-project.org/6131

Comment 8 Dan Callaghan 2018-06-07 23:33:16 UTC
Alternatively, we could just consider this WONTFIX.

The patch is fairly low risk, but then changes like this have a tendency to produce unanticipated side effects because it is exceedingly difficult to reason about the interaction between all the different concurrent transactions in flight in a busy Beaker instance.

Also it's worth noting that this is *cosmetic only*.

In comment 3 Nick mentioned that beaker-watchdog was on the other side of the race (and it is) but then concluded that beaker-transfer will fix it later (beaker-transfer is unrelated). In fact if the register_files XMLRPC call is on the losing side of the deadlock detector it means we have lost a log file and it will not be recovered. *But* in this deadlock that will never happen. The register_files XMLRPC call already holds a shared lock that update_dirty_job() wants, so the deadlock will always be decided in *favour* of register_files, update_dirty_job() will always be the loser. And that is harmless because it will just be retried on the next loop iteration.

So the only effect of this bug is a scary looking exception in beakerd.log.

Comment 9 Roman Joost 2018-06-11 23:45:38 UTC
Yeah I think it'll help for clarity reasons to get away with the scary looking exception. This will help us and admins looking in our looks to not chase up things which are not a real problem. So in all I think it'll be good to merge and ship it.

Comment 12 Roman Joost 2018-07-23 06:40:53 UTC
Released with Beaker 25.5: https://beaker-project.org/docs/whats-new/release-25.html#beaker-25-5


Note You need to log in before you can comment on or make changes to this bug.