Description of problem: Go through beakerd.log on beaker-devel env. Find one deadlock error. Please analyse it. We may get more information from production environment. Feb 12 08:49:13 beaker-devel beakerd[17795]: bkr.server.tools.beakerd ERROR Error in update_dirty_job(3293) Feb 12 08:49:13 beaker-devel beakerd[17795]: Traceback (most recent call last): Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 97, in update_dirty_jobs Feb 12 08:49:13 beaker-devel beakerd[17795]: update_dirty_job(job_id) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 112, in update_dirty_job Feb 12 08:49:13 beaker-devel beakerd[17795]: job.update_status() Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5053, in update_status Feb 12 08:49:13 beaker-devel beakerd[17795]: self._update_status() Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5077, in _update_status Feb 12 08:49:13 beaker-devel beakerd[17795]: recipeset._update_status() Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5482, in _update_status Feb 12 08:49:13 beaker-devel beakerd[17795]: recipe._update_status() Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 5988, in _update_status Feb 12 08:49:13 beaker-devel beakerd[17795]: status_changed = self._change_status(min_status) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib/python2.6/site-packages/bkr/server/model.py", line 4406, in _change_status Feb 12 08:49:13 beaker-devel beakerd[17795]: .update({'status': new_status}, synchronize_session=False) \ Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2313, in update Feb 12 08:49:13 beaker-devel beakerd[17795]: result = session.execute(update_stmt, params=self._params) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 724, in execute Feb 12 08:49:13 beaker-devel beakerd[17795]: clause, params or {}) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1191, in execute Feb 12 08:49:13 beaker-devel beakerd[17795]: params) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement Feb 12 08:49:13 beaker-devel beakerd[17795]: return self.__execute_context(context) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context Feb 12 08:49:13 beaker-devel beakerd[17795]: context.parameters[0], context=context) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute Feb 12 08:49:13 beaker-devel beakerd[17795]: context) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute Feb 12 08:49:13 beaker-devel beakerd[17795]: context) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py", line 299, in do_execute Feb 12 08:49:13 beaker-devel beakerd[17795]: cursor.execute(statement, parameters) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/MySQLdb/cursors.py", line 173, in execute Feb 12 08:49:13 beaker-devel beakerd[17795]: self.errorhandler(self, exc, value) Feb 12 08:49:13 beaker-devel beakerd[17795]: File "/usr/lib64/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler Feb 12 08:49:13 beaker-devel beakerd[17795]: raise errorclass, errorvalue Feb 12 08:49:13 beaker-devel beakerd[17795]: OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE recipe SET status=%s WHERE recipe.status = %s AND recipe.id = %s' ('Running', 'Waiting', 5375L) Version-Release number of selected component (if applicable): 0.15.3 How reproducible: Hardly Steps to Reproduce: 1. 2. 3. Actual results: Expected results: Additional info:
InnoDB deadlock info follows: ------------------------ LATEST DETECTED DEADLOCK ------------------------ 140212 3:49:13 *** (1) TRANSACTION: TRANSACTION 0 200444509, ACTIVE 0 sec, process no 1534, OS thread id 140639819372288 starting index read mysql tables in use 1, locked 1 LOCK WAIT 2 lock struct(s), heap size 368, 1 row lock(s) MySQL thread id 22230, query id 14316996 beaker-devel.app.eng.bos.redhat.com 10.16.101.20 beaker_devel Updating UPDATE recipe SET status='Running' WHERE recipe.status = 'Waiting' AND recipe.id = 5375 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 117 page no 253 n bits 72 index `PRIMARY` of table `beaker_devel`.`recipe` trx id 0 200444509 lock_mode X locks rec but not gap waiting Record lock, heap no 3 PHYSICAL RECORD: n_fields 29; compact format; info bits 0 0: len 4; hex 800014ff; asc ;; 1: len 6; hex 00000bf28a24; asc $;; 2: len 7; hex 00000000390f61; asc 9 a;; 3: len 4; hex 8000120c; asc ;; 4: len 4; hex 80001dcd; asc ;; 5: len 4; hex 80000e60; asc `;; 6: len 1; hex 01; asc ;; 7: len 1; hex 05; asc ;; 8: SQL NULL; 9: SQL NULL; 10: len 30; hex 3c686f737452657175697265733e3c73797374656d5f747970652076616c; asc <hostRequires><system_type val;...(truncated); 11: len 30; hex 3c64697374726f52657175697265733e3c64697374726f5f66616d696c79; asc <distroRequires><distro_family;...(truncated); 12: SQL NULL; 13: len 14; hex 6d616368696e655f726563697065; asc machine_recipe;; 14: len 4; hex 80000003; asc ;; 15: len 4; hex 80000000; asc ;; 16: len 4; hex 80000000; asc ;; 17: len 4; hex 80000000; asc ;; 18: len 4; hex 80000000; asc ;; 19: len 10; hex 4e6f7365205465737473; asc Nose Tests;; 20: len 16; hex 73656c696e75783d64697361626c6564; asc selinux=disabled;; 21: SQL NULL; 22: SQL NULL; 23: len 4; hex 4e6f6e65; asc None;; 24: SQL NULL; 25: len 25; hex 3c706172746974696f6e733e3c2f706172746974696f6e733e; asc <partitions></partitions>;; 26: len 1; hex 80; asc ;; 27: SQL NULL; 28: len 1; hex 02; asc ;; *** (2) TRANSACTION: TRANSACTION 0 200444506, ACTIVE 0 sec, process no 1534, OS thread id 140639512151808 starting index read, thread declared inside InnoDB 500 mysql tables in use 1, locked 1 9 lock struct(s), heap size 1216, 7 row lock(s), undo log entries 2 MySQL thread id 22234, query id 14317000 beaker-devel.app.eng.bos.redhat.com 10.16.101.20 beaker_devel Updating UPDATE recipe SET log_server='lab-devel-02.rhts.eng.bos.redhat.com' WHERE recipe.id = 5375 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 117 page no 253 n bits 72 index `PRIMARY` of table `beaker_devel`.`recipe` trx id 0 200444506 lock mode S locks rec but not gap Record lock, heap no 3 PHYSICAL RECORD: n_fields 29; compact format; info bits 0 0: len 4; hex 800014ff; asc ;; 1: len 6; hex 00000bf28a24; asc $;; 2: len 7; hex 00000000390f61; asc 9 a;; 3: len 4; hex 8000120c; asc ;; 4: len 4; hex 80001dcd; asc ;; 5: len 4; hex 80000e60; asc `;; 6: len 1; hex 01; asc ;; 7: len 1; hex 05; asc ;; 8: SQL NULL; 9: SQL NULL; 10: len 30; hex 3c686f737452657175697265733e3c73797374656d5f747970652076616c; asc <hostRequires><system_type val;...(truncated); 11: len 30; hex 3c64697374726f52657175697265733e3c64697374726f5f66616d696c79; asc <distroRequires><distro_family;...(truncated); 12: SQL NULL; 13: len 14; hex 6d616368696e655f726563697065; asc machine_recipe;; 14: len 4; hex 80000003; asc ;; 15: len 4; hex 80000000; asc ;; 16: len 4; hex 80000000; asc ;; 17: len 4; hex 80000000; asc ;; 18: len 4; hex 80000000; asc ;; 19: len 10; hex 4e6f7365205465737473; asc Nose Tests;; 20: len 16; hex 73656c696e75783d64697361626c6564; asc selinux=disabled;; 21: SQL NULL; 22: SQL NULL; 23: len 4; hex 4e6f6e65; asc None;; 24: SQL NULL; 25: len 25; hex 3c706172746974696f6e733e3c2f706172746974696f6e733e; asc <partitions></partitions>;; 26: len 1; hex 80; asc ;; 27: SQL NULL; 28: len 1; hex 02; asc ;; *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 117 page no 253 n bits 72 index `PRIMARY` of table `beaker_devel`.`recipe` trx id 0 200444506 lock_mode X locks rec but not gap waiting Record lock, heap no 3 PHYSICAL RECORD: n_fields 29; compact format; info bits 0 0: len 4; hex 800014ff; asc ;; 1: len 6; hex 00000bf28a24; asc $;; 2: len 7; hex 00000000390f61; asc 9 a;; 3: len 4; hex 8000120c; asc ;; 4: len 4; hex 80001dcd; asc ;; 5: len 4; hex 80000e60; asc `;; 6: len 1; hex 01; asc ;; 7: len 1; hex 05; asc ;; 8: SQL NULL; 9: SQL NULL; 10: len 30; hex 3c686f737452657175697265733e3c73797374656d5f747970652076616c; asc <hostRequires><system_type val;...(truncated); 11: len 30; hex 3c64697374726f52657175697265733e3c64697374726f5f66616d696c79; asc <distroRequires><distro_family;...(truncated); 12: SQL NULL; 13: len 14; hex 6d616368696e655f726563697065; asc machine_recipe;; 14: len 4; hex 80000003; asc ;; 15: len 4; hex 80000000; asc ;; 16: len 4; hex 80000000; asc ;; 17: len 4; hex 80000000; asc ;; 18: len 4; hex 80000000; asc ;; 19: len 10; hex 4e6f7365205465737473; asc Nose Tests;; 20: len 16; hex 73656c696e75783d64697361626c6564; asc selinux=disabled;; 21: SQL NULL; 22: SQL NULL; 23: len 4; hex 4e6f6e65; asc None;; 24: SQL NULL; 25: len 25; hex 3c706172746974696f6e733e3c2f706172746974696f6e733e; asc <partitions></partitions>;; 26: len 1; hex 80; asc ;; 27: SQL NULL; 28: len 1; hex 02; asc ;; *** WE ROLL BACK TRANSACTION (1)
Dan has analysed this, and as per the updated BZ description, there's currently a genuine race condition between beaker-watchdog setting the log_server and the job status being updated. If it is the job status update that gets aborted, this is harmless: the status will just be updated on the next iteration of the scheduler loop. If it is the log_server update that is aborted, then the logs for that job will remain on the lab controller instead of eventually being moved to the archive server by beaker-transfer. We should still try to eliminate the deadlock, but the practical impact of the fault is currently low.
OK,due to #comment 3, lower the priority and severity.
Looking again at comment 2 -- the deadlock is between update_dirty_job() and one of the XMLRPC methods to register a log file (recipes.register_file, recipetasks.register_file, recipetasks.register_result_file). The register_file XMLRPC method already holds a shared lock on the recipe row, because it SELECT'ed the recipe row before UPDATE'ing it. Then update_dirty_job() wants an exclusive lock on the recipe row, and at the same time the register_file method wants to upgrade its lock to exclusive to complete its UPDATE. The register_file XMLRPC methods should either avoid SELECT'ing the row out first, or else SELECT...FOR UPDATE it. It should really be the latter, because they are doing checks based on the state of the recipe anyway. https://gerrit.beaker-project.org/6131
Alternatively, we could just consider this WONTFIX. The patch is fairly low risk, but then changes like this have a tendency to produce unanticipated side effects because it is exceedingly difficult to reason about the interaction between all the different concurrent transactions in flight in a busy Beaker instance. Also it's worth noting that this is *cosmetic only*. In comment 3 Nick mentioned that beaker-watchdog was on the other side of the race (and it is) but then concluded that beaker-transfer will fix it later (beaker-transfer is unrelated). In fact if the register_files XMLRPC call is on the losing side of the deadlock detector it means we have lost a log file and it will not be recovered. *But* in this deadlock that will never happen. The register_files XMLRPC call already holds a shared lock that update_dirty_job() wants, so the deadlock will always be decided in *favour* of register_files, update_dirty_job() will always be the loser. And that is harmless because it will just be retried on the next loop iteration. So the only effect of this bug is a scary looking exception in beakerd.log.
Yeah I think it'll help for clarity reasons to get away with the scary looking exception. This will help us and admins looking in our looks to not chase up things which are not a real problem. So in all I think it'll be good to merge and ship it.
Released with Beaker 25.5: https://beaker-project.org/docs/whats-new/release-25.html#beaker-25-5