2012-11-08 13:44:59,069 bkr.server.xmlrpccontroller DEBUG Time: 0:00:00.101242 labcontrollers.add_distro_tree ({'osmajor': 'RedHatEnterpriseLinux6', 'name': 'RH 2012-11-08 13:44:59,123 bkr.server.xmlrpccontroller DEBUG Time: 0:00:00.143379 labcontrollers.add_distro_tree ({'osmajor': 'RedHatEnterpriseLinux6', 'name': 'RH 2012-11-08 13:44:59,159 cherrypy.msg INFO HTTP: Page handler: <bound method Root.RPC2 of <bkr.server.controllers.Root object at 0x7fab72d5b610>> Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/CherryPy-2.3.0-py2.6.egg/cherrypy/_cphttptools.py", line 121, in _run self.main() File "/usr/lib/python2.6/site-packages/CherryPy-2.3.0-py2.6.egg/cherrypy/_cphttptools.py", line 264, in main body = page_handler(*virtual_path, **self.params) File "<string>", line 3, in RPC2 File "/usr/lib/python2.6/site-packages/turbogears/controllers.py", line 361, in expose *args, **kw) File "<generated code>", line 0, in run_with_transaction File "/usr/lib/python2.6/site-packages/peak/rules/core.py", line 153, in __call__ return self.body(*args, **kw) File "/usr/lib/python2.6/site-packages/turbogears/database.py", line 478, in sa_rwt session.commit() [...] OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE distro SET date_created=%s WHERE distro.id = %s' (datetime.datetime(2012, 11, 2, 6, 6, 44, 874126), 261L)
*** Bug 919350 has been marked as a duplicate of this bug. ***
Here is the relevant SHOW INNODB STATUS output. ------------------------ LATEST DETECTED DEADLOCK ------------------------ 130802 2:07:23 *** (1) TRANSACTION: TRANSACTION 0 170632547, ACTIVE 0 sec, process no 1729, OS thread id 140072535258880 inserting mysql tables in use 1, locked 1 LOCK WAIT 12 lock struct(s), heap size 3024, 7 row lock(s), undo log entries 1 MySQL thread id 20998, query id 12654512 beaker-devel.CENSORED.redhat.com 10.16.101.20 beaker_devel update INSERT INTO distro_tree (distro_id, arch_id, variant, ks_meta, kernel_options, kernel_options_post, date_created) VALUES (5594, 2, '', NULL, NULL, NULL, '2013-08-02 06:07:23') *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 16 page no 28 n bits 680 index `distro_id` of table `beaker_devel`.`distro_tree` trx id 0 170632547 lock mode S waiting Record lock, heap no 612 PHYSICAL RECORD: n_fields 4; compact format; info bits 0 0: len 4; hex 800015da; asc ;; 1: len 4; hex 80000002; asc ;; 2: len 0; hex ; asc ;; 3: len 4; hex 80001ef5; asc ;; *** (2) TRANSACTION: TRANSACTION 0 170632546, ACTIVE 0 sec, process no 1729, OS thread id 140072534460160 starting index read, thread declared inside InnoDB 500 mysql tables in use 1, locked 1 14 lock struct(s), heap size 3024, 9 row lock(s), undo log entries 1 MySQL thread id 20973, query id 12654527 beaker-devel.CENSORED.redhat.com 10.16.101.20 beaker_devel Updating UPDATE distro SET date_created='2013-08-02 04:45:07' WHERE distro.id = 5594 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 16 page no 28 n bits 680 index `distro_id` of table `beaker_devel`.`distro_tree` trx id 0 170632546 lock_mode X locks rec but not gap Record lock, heap no 612 PHYSICAL RECORD: n_fields 4; compact format; info bits 0 0: len 4; hex 800015da; asc ;; 1: len 4; hex 80000002; asc ;; 2: len 0; hex ; asc ;; 3: len 4; hex 80001ef5; asc ;; *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 12 page no 8 n bits 320 index `PRIMARY` of table `beaker_devel`.`distro` trx id 0 170632546 lock_mode X locks rec but not gap waiting Record lock, heap no 250 PHYSICAL RECORD: n_fields 6; compact format; info bits 0 0: len 4; hex 800015da; asc ;; 1: len 6; hex 00000a2ba55c; asc + \;; 2: len 7; hex 000000010027bf; asc ' ;; 3: len 28; hex 5248454c352e31302d436c69656e742d32303133303830322e302e6e; asc RHEL5.10-Client-20130802.0.n;; 4: len 4; hex 800000f4; asc ;; 5: len 8; hex 8000124f114e0bfe; asc O N ;;
There is a whole bunch of related races when importing distro trees concurrently, which I have uncovered while writing a reproducer for this one. The deadlock above is only one possible failure. The other failures appear as "NoResultFound: No row was found for one()" from Distro.lazy_create, OSVersion.lazy_create, or OSMajor.lazy_create. These ones happen because the lazily created row *didn't* exist at the start of the transaction, but was since created in another transaction, so the INSERT fails (as expected). But SELECTing the row back also fails due to transaction isolation. This is actually a condition which lazy_create in general fails to handle. The other failure I can reproduce is: OperationalError: (OperationalError) (1305, 'SAVEPOINT sa_savepoint_5 does not exist') 'ROLLBACK TO SAVEPOINT sa_savepoint_5' () but I don't understand it, since there is no indication of why MySQL rolled back the enclosing transaction.
The deadlock happens because we are updating distro.date_created after we call DistroTreee.lazy_create, so we are violating the rule of deadlock avoidance (always lock in a consistent order) because we can end up locking distro and distro_tree in a different order depending on whether the distro_tree row existed or not. I think we can fix all deadlocks in that method by reordering all our update operations to group them so that there is no intervening lazy_creates for unrelated rows, and also making sure dependent rows are not updated after their dependencies. That is: # OSMajor.lazy_create # update osmajor # OSVersion.lazy_create # update osversion # Distro.lazy_create # update distro # DistroTree.lazy_create # update distro_tree instead of the mish-mash of operations in that method at present. Reordering the operations like this is correct, and necessary, but it doesn't fully solve the problem :-( because it just converts the deadlocks into the NoResultFound failure due to tx isolation. So we will need a solution for that also.
(In reply to Dan Callaghan from comment #4) > The other failures appear as "NoResultFound: No row was found for one()" > from Distro.lazy_create, OSVersion.lazy_create, or OSMajor.lazy_create. > These ones happen because the lazily created row *didn't* exist at the start > of the transaction, but was since created in another transaction, so the > INSERT fails (as expected). But SELECTing the row back also fails due to > transaction isolation. This is actually a condition which lazy_create in > general fails to handle. I think the solution here is to do SELECT ... FOR UPDATE in the SELECT-back path of lazy_create, to force our snapshot view to be upgraded to the latest state (which includes the row which has been newly inserted by the other transaction). But I'm not sure whether SELECT ... FOR UPDATE is the best way to do this, since I can't find any docs about repeatable-read semantics for SELECT ... FOR UPDATE or how to upgrade the isolated view.
(In reply to Dan Callaghan from comment #4) > The other failures appear as "NoResultFound: No row was found for one()" > from Distro.lazy_create, OSVersion.lazy_create, or OSMajor.lazy_create. > These ones happen because the lazily created row *didn't* exist at the start > of the transaction, but was since created in another transaction, so the > INSERT fails (as expected). But SELECTing the row back also fails due to > transaction isolation. This is actually a condition which lazy_create in > general fails to handle. As a simple illustration, here is how to reproduce this using two terminals with mysql client connected to the same Beaker database: In terminal 1: start a transaction, issue an unrelated SELECT mysql> begin; Query OK, 0 rows affected (0.00 sec) mysql> select user_name from tg_user where user_id = 1; +-----------+ | user_name | +-----------+ | dcallagh | +-----------+ 1 row in set (0.00 sec) In terminal 2: start a transaction, issue an unrelated SELECT mysql> begin; Query OK, 0 rows affected (0.00 sec) mysql> select user_name from tg_user where user_id = 1; +-----------+ | user_name | +-----------+ | dcallagh | +-----------+ 1 row in set (0.00 sec) In terminal 1: insert a row mysql> insert into osmajor (osmajor) values ('Fedora21'); Query OK, 1 row affected (0.00 sec) In terminal 2: insert the same row (this will block) mysql> insert into osmajor (osmajor) values ('Fedora21'); In terminal 1: commit mysql> commit; Query OK, 0 rows affected (0.04 sec) In terminal 2: the insert fails, try SELECTing it back ERROR 1062 (23000): Duplicate entry 'Fedora21' for key 2 mysql> select * from osmajor where osmajor = 'Fedora21'; Empty set (0.00 sec) This is the "No row was found for one()" exception. But, SELECT FOR UPDATE apparently upgrades the snapshot view to include the new row: mysql> select * from osmajor where osmajor = 'Fedora21' for update; +----+----------+-------+ | id | osmajor | alias | +----+----------+-------+ | 22 | Fedora21 | NULL | +----+----------+-------+ 1 row in set (0.00 sec)
Created attachment 783211 [details] test cases to reproduce race conditions I almost got this working properly, by using SELECT ... LOCK IN SHARED MODE in the SELECT-back path of lazy_create. But we end up needing to use lazy_create *everywhere* in add_distro_tree, even for the one-to-many relationships like osversion.arches which ends up being a complete nightmare. So I think the most reasonable approach here is to just completely serialize all distro imports, by locking something (not sure what exactly, maybe a row in a dummy table) at the start of add_distro_tree. For reference I'm attaching a patch with the test cases I am using to reproduce these. Note that the time.sleep() calls in the patch *cannot* be commented out at this time because testutil.set_identity_user() is not thread-safe (sigh!). But my intention is that once we have our own identity implementation (coming in 0.15) we can use a thread-safe approach to testutil.set_identity_user() and avoid the sleeps, which slow down the tests.
On Gerrit: http://gerrit.beaker-project.org/2157 This approach uses conditional INSERT + UPDATE, so it avoids all nested transactions (and their associated overhead) and is also completely race-free I believe.
Increasing priority, this one is important for us because we hit it very often in our devel environment.
Steps to reproduce: not reproducible by hand, however the devel environment typically hits it several times during each nightly distro import. We can verify this fix by putting it on beaker-devel for a while and making sure we have no more 500s/deadlocks during distro import.
It's easy to reproduce, just start import scripts at the same time from multiple LCs. My success rate was nearly 100% :).
Not fixed: Aug 21 10:05:18 beaker-devel beaker-server[9045]: bkr.server ERROR Exception on /client/ [POST] Aug 21 10:05:18 beaker-devel beaker-server[9045]: Traceback (most recent call last): Aug 21 10:05:18 beaker-devel beaker-server[9045]: File "/usr/lib/python2.6/site-packages/flask/app.py", line 1817, in wsgi_app Aug 21 10:05:18 beaker-devel beaker-server[9045]: response = self.full_dispatch_request() Aug 21 10:05:18 beaker-devel beaker-server[9045]: File "/usr/lib/python2.6/site-packages/flask/app.py", line 1479, in full_dispatch_request Aug 21 10:05:18 beaker-devel beaker-server[9045]: response = self.process_response(response) Aug 21 10:05:18 beaker-devel beaker-server[9045]: File "/usr/lib/python2.6/site-packages/flask/app.py", line 1691, in process_response Aug 21 10:05:18 beaker-devel beaker-server[9045]: response = handler(response) Aug 21 10:05:18 beaker-devel beaker-server[9045]: File "/usr/lib/python2.6/site-packages/bkr/server/wsgi.py", line 90, in commit_or_rollback_session Aug 21 10:05:18 beaker-devel beaker-server[9045]: session.commit() ... Aug 21 10:05:18 beaker-devel beaker-server[9045]: IntegrityError: (IntegrityError) (1062, "Duplicate entry '8079-36' for key 'PRIMARY'") 'INSERT INTO distro_tag_map (distro_id, distro_tag_id) VALUES (%s, %s)' (8079L, 36L)
New patch is on Gerrit: http://gerrit.beaker-project.org/2180
Verify: Check last 24 hour log in beaker server, didn't find any error. So verify this bug.
Beaker 0.15 has been released.