Bug 874386 - database deadlock when distro tree is imported simultaneously in two labs
Summary: database deadlock when distro tree is imported simultaneously in two labs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: scheduler
Version: 0.9
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: 0.15
Assignee: Dan Callaghan
QA Contact: tools-bugs
URL:
Whiteboard: DistroManagement
: 919350 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-11-08 03:48 UTC by Dan Callaghan
Modified: 2018-02-06 00:41 UTC (History)
7 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2013-10-03 02:30:08 UTC
Embargoed:


Attachments (Terms of Use)
test cases to reproduce race conditions (7.36 KB, patch)
2013-08-06 07:03 UTC, Dan Callaghan
no flags Details | Diff

Description Dan Callaghan 2012-11-08 03:48:43 UTC
2012-11-08 13:44:59,069 bkr.server.xmlrpccontroller DEBUG Time: 0:00:00.101242 labcontrollers.add_distro_tree ({'osmajor': 'RedHatEnterpriseLinux6', 'name': 'RH
2012-11-08 13:44:59,123 bkr.server.xmlrpccontroller DEBUG Time: 0:00:00.143379 labcontrollers.add_distro_tree ({'osmajor': 'RedHatEnterpriseLinux6', 'name': 'RH
2012-11-08 13:44:59,159 cherrypy.msg INFO HTTP: Page handler: <bound method Root.RPC2 of <bkr.server.controllers.Root object at 0x7fab72d5b610>>
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/CherryPy-2.3.0-py2.6.egg/cherrypy/_cphttptools.py", line 121, in _run
    self.main()
  File "/usr/lib/python2.6/site-packages/CherryPy-2.3.0-py2.6.egg/cherrypy/_cphttptools.py", line 264, in main
    body = page_handler(*virtual_path, **self.params)
  File "<string>", line 3, in RPC2
  File "/usr/lib/python2.6/site-packages/turbogears/controllers.py", line 361, in expose
    *args, **kw)
  File "<generated code>", line 0, in run_with_transaction
  File "/usr/lib/python2.6/site-packages/peak/rules/core.py", line 153, in __call__
    return self.body(*args, **kw)
  File "/usr/lib/python2.6/site-packages/turbogears/database.py", line 478, in sa_rwt
    session.commit()
[...]
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') 'UPDATE distro SET date_created=%s WHERE distro.id = %s' (datetime.datetime(2012, 11, 2, 6, 6, 44, 874126), 261L)

Comment 2 Dan Callaghan 2013-08-02 06:26:37 UTC
*** Bug 919350 has been marked as a duplicate of this bug. ***

Comment 3 Dan Callaghan 2013-08-02 06:41:54 UTC
Here is the relevant SHOW INNODB STATUS output.

------------------------
LATEST DETECTED DEADLOCK
------------------------
130802  2:07:23
*** (1) TRANSACTION:
TRANSACTION 0 170632547, ACTIVE 0 sec, process no 1729, OS thread id 140072535258880 inserting
mysql tables in use 1, locked 1
LOCK WAIT 12 lock struct(s), heap size 3024, 7 row lock(s), undo log entries 1
MySQL thread id 20998, query id 12654512 beaker-devel.CENSORED.redhat.com 10.16.101.20 beaker_devel update
INSERT INTO distro_tree (distro_id, arch_id, variant, ks_meta, kernel_options, kernel_options_post, date_created) VALUES (5594, 2, '', NULL, NULL, NULL, '2013-08-02 06:07:23')
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 16 page no 28 n bits 680 index `distro_id` of table `beaker_devel`.`distro_tree` trx id 0 170632547 lock mode S waiting
Record lock, heap no 612 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 800015da; asc     ;; 1: len 4; hex 80000002; asc     ;; 2: len 0; hex ; asc ;; 3: len 4; hex 80001ef5; asc     ;;
 
*** (2) TRANSACTION:
TRANSACTION 0 170632546, ACTIVE 0 sec, process no 1729, OS thread id 140072534460160 starting index read, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
14 lock struct(s), heap size 3024, 9 row lock(s), undo log entries 1
MySQL thread id 20973, query id 12654527 beaker-devel.CENSORED.redhat.com 10.16.101.20 beaker_devel Updating
UPDATE distro SET date_created='2013-08-02 04:45:07' WHERE distro.id = 5594
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 16 page no 28 n bits 680 index `distro_id` of table `beaker_devel`.`distro_tree` trx id 0 170632546 lock_mode X locks rec but not gap
Record lock, heap no 612 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 800015da; asc     ;; 1: len 4; hex 80000002; asc     ;; 2: len 0; hex ; asc ;; 3: len 4; hex 80001ef5; asc     ;;
 
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 12 page no 8 n bits 320 index `PRIMARY` of table `beaker_devel`.`distro` trx id 0 170632546 lock_mode X locks rec but not gap waiting
Record lock, heap no 250 PHYSICAL RECORD: n_fields 6; compact format; info bits 0
 0: len 4; hex 800015da; asc     ;; 1: len 6; hex 00000a2ba55c; asc    + \;; 2: len 7; hex 000000010027bf; asc      ' ;; 3: len 28; hex 5248454c352e31302d436c69656e742d32303133303830322e302e6e; asc RHEL5.10-Client-20130802.0.n;; 4: len 4; hex 800000f4; asc     ;; 5: len 8; hex 8000124f114e0bfe; asc    O N  ;;

Comment 4 Dan Callaghan 2013-08-05 08:15:55 UTC
There is a whole bunch of related races when importing distro trees concurrently, which I have uncovered while writing a reproducer for this one. The deadlock above is only one possible failure.

The other failures appear as "NoResultFound: No row was found for one()" from Distro.lazy_create, OSVersion.lazy_create, or OSMajor.lazy_create. These ones happen because the lazily created row *didn't* exist at the start of the transaction, but was since created in another transaction, so the INSERT fails (as expected). But SELECTing the row back also fails due to transaction isolation. This is actually a condition which lazy_create in general fails to handle.

The other failure I can reproduce is:

OperationalError: (OperationalError) (1305, 'SAVEPOINT sa_savepoint_5 does not exist') 'ROLLBACK TO SAVEPOINT sa_savepoint_5' ()

but I don't understand it, since there is no indication of why MySQL rolled back the enclosing transaction.

Comment 5 Dan Callaghan 2013-08-05 08:33:53 UTC
The deadlock happens because we are updating distro.date_created after we call DistroTreee.lazy_create, so we are violating the rule of deadlock avoidance (always lock in a consistent order) because we can end up locking distro and distro_tree in a different order depending on whether the distro_tree row existed or not.

I think we can fix all deadlocks in that method by reordering all our update operations to group them so that there is no intervening lazy_creates for unrelated rows, and also making sure dependent rows are not updated after their dependencies. That is:

    # OSMajor.lazy_create
    # update osmajor
    # OSVersion.lazy_create
    # update osversion
    # Distro.lazy_create
    # update distro
    # DistroTree.lazy_create
    # update distro_tree

instead of the mish-mash of operations in that method at present.

Reordering the operations like this is correct, and necessary, but it doesn't fully solve the problem :-( because it just converts the deadlocks into the NoResultFound failure due to tx isolation. So we will need a solution for that also.

Comment 6 Dan Callaghan 2013-08-06 00:27:40 UTC
(In reply to Dan Callaghan from comment #4)
> The other failures appear as "NoResultFound: No row was found for one()"
> from Distro.lazy_create, OSVersion.lazy_create, or OSMajor.lazy_create.
> These ones happen because the lazily created row *didn't* exist at the start
> of the transaction, but was since created in another transaction, so the
> INSERT fails (as expected). But SELECTing the row back also fails due to
> transaction isolation. This is actually a condition which lazy_create in
> general fails to handle.

I think the solution here is to do SELECT ... FOR UPDATE in the SELECT-back path of lazy_create, to force our snapshot view to be upgraded to the latest state (which includes the row which has been newly inserted by the other transaction). But I'm not sure whether SELECT ... FOR UPDATE is the best way to do this, since I can't find any docs about repeatable-read semantics for SELECT ... FOR UPDATE or how to upgrade the isolated view.

Comment 7 Dan Callaghan 2013-08-06 03:11:44 UTC
(In reply to Dan Callaghan from comment #4)
> The other failures appear as "NoResultFound: No row was found for one()"
> from Distro.lazy_create, OSVersion.lazy_create, or OSMajor.lazy_create.
> These ones happen because the lazily created row *didn't* exist at the start
> of the transaction, but was since created in another transaction, so the
> INSERT fails (as expected). But SELECTing the row back also fails due to
> transaction isolation. This is actually a condition which lazy_create in
> general fails to handle.

As a simple illustration, here is how to reproduce this using two terminals with mysql client connected to the same Beaker database:

In terminal 1: start a transaction, issue an unrelated SELECT

    mysql> begin;
    Query OK, 0 rows affected (0.00 sec)

    mysql> select user_name from tg_user where user_id = 1;
    +-----------+
    | user_name |
    +-----------+
    | dcallagh  | 
    +-----------+
    1 row in set (0.00 sec)

In terminal 2: start a transaction, issue an unrelated SELECT

    mysql> begin;
    Query OK, 0 rows affected (0.00 sec)

    mysql> select user_name from tg_user where user_id = 1;
    +-----------+
    | user_name |
    +-----------+
    | dcallagh  | 
    +-----------+
    1 row in set (0.00 sec)

In terminal 1: insert a row

    mysql> insert into osmajor (osmajor) values ('Fedora21');
    Query OK, 1 row affected (0.00 sec)

In terminal 2: insert the same row (this will block)

    mysql> insert into osmajor (osmajor) values ('Fedora21');

In terminal 1: commit

    mysql> commit;
    Query OK, 0 rows affected (0.04 sec)

In terminal 2: the insert fails, try SELECTing it back

    ERROR 1062 (23000): Duplicate entry 'Fedora21' for key 2
    mysql> select * from osmajor where osmajor = 'Fedora21';
    Empty set (0.00 sec)

This is the "No row was found for one()" exception. But, SELECT FOR UPDATE apparently upgrades the snapshot view to include the new row:

    mysql> select * from osmajor where osmajor = 'Fedora21' for update;
    +----+----------+-------+
    | id | osmajor  | alias |
    +----+----------+-------+
    | 22 | Fedora21 | NULL  | 
    +----+----------+-------+
    1 row in set (0.00 sec)

Comment 8 Dan Callaghan 2013-08-06 07:03:08 UTC
Created attachment 783211 [details]
test cases to reproduce race conditions

I almost got this working properly, by using SELECT ... LOCK IN SHARED MODE in the SELECT-back path of lazy_create. But we end up needing to use lazy_create *everywhere* in add_distro_tree, even for the one-to-many relationships like osversion.arches which ends up being a complete nightmare. So I think the most reasonable approach here is to just completely serialize all distro imports, by locking something (not sure what exactly, maybe a row in a dummy table) at the start of add_distro_tree.

For reference I'm attaching a patch with the test cases I am using to reproduce these. Note that the time.sleep() calls in the patch *cannot* be commented out at this time because testutil.set_identity_user() is not thread-safe (sigh!). But my intention is that once we have our own identity implementation (coming in 0.15) we can use a thread-safe approach to testutil.set_identity_user() and avoid the sleeps, which slow down the tests.

Comment 9 Dan Callaghan 2013-08-12 04:39:56 UTC
On Gerrit: http://gerrit.beaker-project.org/2157

This approach uses conditional INSERT + UPDATE, so it avoids all nested transactions (and their associated overhead) and is also completely race-free I believe.

Comment 10 Dan Callaghan 2013-08-19 01:55:31 UTC
Increasing priority, this one is important for us because we hit it very often in our devel environment.

Comment 11 Dan Callaghan 2013-08-19 01:57:35 UTC
Steps to reproduce:
not reproducible by hand, however the devel environment typically hits it several times during each nightly distro import.

We can verify this fix by putting it on beaker-devel for a while and making sure we have no more 500s/deadlocks during distro import.

Comment 12 Jaroslav Kortus 2013-08-19 07:14:26 UTC
It's easy to reproduce, just start import scripts at the same time from multiple LCs. My success rate was nearly 100% :).

Comment 15 Dan Callaghan 2013-08-21 23:22:23 UTC
Not fixed:

Aug 21 10:05:18 beaker-devel beaker-server[9045]: bkr.server ERROR Exception on /client/ [POST]
Aug 21 10:05:18 beaker-devel beaker-server[9045]:  Traceback (most recent call last):
Aug 21 10:05:18 beaker-devel beaker-server[9045]:    File "/usr/lib/python2.6/site-packages/flask/app.py", line 1817, in wsgi_app
Aug 21 10:05:18 beaker-devel beaker-server[9045]:      response = self.full_dispatch_request()
Aug 21 10:05:18 beaker-devel beaker-server[9045]:    File "/usr/lib/python2.6/site-packages/flask/app.py", line 1479, in full_dispatch_request
Aug 21 10:05:18 beaker-devel beaker-server[9045]:      response = self.process_response(response)
Aug 21 10:05:18 beaker-devel beaker-server[9045]:    File "/usr/lib/python2.6/site-packages/flask/app.py", line 1691, in process_response
Aug 21 10:05:18 beaker-devel beaker-server[9045]:      response = handler(response)
Aug 21 10:05:18 beaker-devel beaker-server[9045]:    File "/usr/lib/python2.6/site-packages/bkr/server/wsgi.py", line 90, in commit_or_rollback_session
Aug 21 10:05:18 beaker-devel beaker-server[9045]:      session.commit()
...
Aug 21 10:05:18 beaker-devel beaker-server[9045]:  IntegrityError: (IntegrityError) (1062, "Duplicate entry '8079-36' for key 'PRIMARY'") 'INSERT INTO distro_tag_map (distro_id, distro_tag_id) VALUES (%s, %s)' (8079L, 36L)

Comment 16 Dan Callaghan 2013-08-23 01:37:41 UTC
New patch is on Gerrit: http://gerrit.beaker-project.org/2180

Comment 18 xjia 2013-08-30 02:19:17 UTC
Verify:
Check last 24 hour log in beaker server, didn't find any error.
So verify this bug.

Comment 19 Nick Coghlan 2013-10-03 02:30:08 UTC
Beaker 0.15 has been released.


Note You need to log in before you can comment on or make changes to this bug.