Bug 1362439 - DELETE ... IN (SELECT...) queries from beaker-log-delete are too slow, cause lock wait timeouts in beakerd
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Retired
Component: general
Version: 23
Hardware: Unspecified
OS: Unspecified
Target Milestone: 23.1
Assignee: Dan Callaghan
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-02 08:06 UTC by Dan Callaghan
Modified: 2016-08-11 01:25 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-08-11 01:25:02 UTC
Embargoed:



Description Dan Callaghan 2016-08-02 08:06:16 UTC
Version-Release number of selected component (if applicable):
23.0

How reproducible:
Rarely; it needs a large, heavily loaded production instance.

Steps to Reproduce:
1. On a large, heavily loaded production Beaker instance run: beaker-log-delete
2. Watch beakerd logs

Actual results:
bkr.server.tools.beakerd DEBUG Updating dirty job 1426531
bkr.server.tools.beakerd ERROR Error in update_dirty_job(1426531)
 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 92, in update_dirty_jobs
     session.commit()
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/scoping.py", line 149, in do
     return getattr(self.registry(), name)(*args, **kwargs)
   File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 721, in commit
     self.transaction.commit()
[...]
 OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE recipe_task SET result=%s WHERE recipe_task.id = %s' ('Pass', 43855295L)

beakerd doesn't die from this one (it recovers and continues iterating), but the timeouts prevent it from making any real progress.

We also observed that INSERT INTO recipe_task_result queries from the recipes.tasks.result XMLRPC method were being held up, in some cases hitting lock wait timeouts:

bkr.server.xmlrpccontroller ERROR Error handling XML-RPC method
 Traceback (most recent call last):
   File "/usr/lib/python2.6/site-packages/bkr/server/xmlrpccontroller.py", line 53, in RPC2
     response = self.process_rpc(method,params)
   File "/usr/lib/python2.6/site-packages/bkr/server/xmlrpccontroller.py", line 42, in process_rpc
     response = obj(*params)
   File "<string>", line 2, in result
   File "/usr/lib/python2.6/site-packages/bkr/server/identity.py", line 288, in require
     return func(*args, **kwargs)
   File "/usr/lib/python2.6/site-packages/bkr/server/recipetasks.py", line 236, in result
     return getattr(task,result_type)(**kwargs)
   File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 3483, in pass_
     return self._result(TaskResult.pass_, path, score, summary)
   File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 3519, in _result
     session.flush()
[...]
 OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'INSERT INTO recipe_task_result (recipe_task_id, path, result, score, log, start_time) VALUES (%s, %s, %s, %s, %s, %s)' (43856003L, 'MultiDir', 'Pass', None, None, datetime.datetime(2016, 8, 2, 3, 35, 58, 223259))
bkr.server.xmlrpccontroller DEBUG Time: 0:00:51.601433 recipes.tasks.result ('43856003', 'pass_', 'MultiDir', None, None)

Expected results:
Nothing hits lock wait timeouts.

Additional info:
In SHOW PROCESSLIST we saw that there was a DELETE query which had been running for more than 700 seconds. I should have saved a copy of the query but I didn't, which was foolish. I'm pretty sure it was: DELETE FROM recipe_task_result WHERE id IN (SELECT...)

The query was issued by beaker-log-delete, which had been triggered by a cron job. The DELETE statements were eventually succeeding but because they were so slow they were preventing other transactions from acquiring locks on recipe_task_result. Killing beaker-log-delete allowed everything to proceed as normal.

Comment 1 Dan Callaghan 2016-08-02 08:07:12 UTC
We need to identify exactly which query from beaker-log-delete was running so slowly, and find a way to make it commit in a reasonable amount of time (the lock wait timeout is 50 seconds, but the query really needs to finish in much less than that, or we will still face lock contention under production load).
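For anyone reproducing this, the timeout in effect and the blocking statements can be inspected on the MySQL side with standard commands (nothing Beaker-specific here):

```sql
-- The InnoDB lock wait timeout currently in effect (50 s in our case)
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';

-- Long-running statements, including the slow DELETE holding the locks
SHOW FULL PROCESSLIST;

-- TRANSACTIONS section lists transactions currently waiting on locks
SHOW ENGINE INNODB STATUS\G
```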

Comment 2 Dan Callaghan 2016-08-03 07:28:08 UTC
So the problematic part is not hard to find: it's the new DELETE FROM recipe_task_result statement which we added in 23.0 for bug 1322700:

https://git.beaker-project.org/cgit/beaker/diff/Server/bkr/server/model/scheduler.py?id=750432e5b9759e313ab717578f5bf39cb7f18be3

The resulting SQL looks like:

2016-08-03 17:21:56,382 sqlalchemy.engine.base.Engine INFO DELETE FROM recipe_task_result WHERE recipe_task_result.recipe_task_id IN (SELECT recipe_task.id 
FROM recipe_task 
WHERE recipe_task.recipe_id = %s)
2016-08-03 17:21:56,382 sqlalchemy.engine.base.Engine INFO (255141L,)

and reproducing the slowness is trivial against a production db dump. I made one small change:

diff --git a/Server/bkr/server/tools/log_delete.py b/Server/bkr/server/tools/log_delete.py
index 036a313..e671980 100755
--- a/Server/bkr/server/tools/log_delete.py
+++ b/Server/bkr/server/tools/log_delete.py
@@ -165,8 +165,8 @@ def log_delete(print_logs=False, dry=False, limit=None):
                                 pass
                 if print_logs:
                     print path
+            job.delete()
             if not dry:
-                job.delete()
                 session.commit()
                 session.close()
             else:

and then turned on sqlalchemy.engine logging, and ran: beaker-log-delete -v --debug --dry-run

That DELETE statement has been sitting in MySQL process list in "preparing" state for 341 seconds so far and still counting...

Comment 3 Dan Callaghan 2016-08-03 07:35:22 UTC
Converting this DELETE statement to its equivalent SELECT:

SELECT id FROM recipe_task_result WHERE recipe_task_result.recipe_task_id IN (SELECT recipe_task.id FROM recipe_task WHERE recipe_task.recipe_id = 255141);

shows the same performance problem. The issue seems to be the IN (SELECT...) subquery, which MySQL fails to optimize properly. The EXPLAIN output shows it is still scanning the entire recipe_task_result table:

+----+--------------------+--------------------+-----------------+-------------------+----------------+---------+------+-----------+--------------------------+
| id | select_type        | table              | type            | possible_keys     | key            | key_len | ref  | rows      | Extra                    |
+----+--------------------+--------------------+-----------------+-------------------+----------------+---------+------+-----------+--------------------------+
|  1 | PRIMARY            | recipe_task_result | index           | NULL              | recipe_task_id | 5       | NULL | 206681955 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | recipe_task        | unique_subquery | PRIMARY,recipe_id | PRIMARY        | 4       | func |         1 | Using where              |
+----+--------------------+--------------------+-----------------+-------------------+----------------+---------+------+-----------+--------------------------+

The obvious equivalent using a JOIN instead would be:

SELECT recipe_task_result.id FROM recipe_task_result INNER JOIN recipe_task ON recipe_task_result.recipe_task_id = recipe_task.id WHERE recipe_task.recipe_id = 255141;

which does indeed execute instantly, because it's optimized as you would expect:

+----+-------------+--------------------+------+-------------------+----------------+---------+--------------------------------+------+--------------------------+
| id | select_type | table              | type | possible_keys     | key            | key_len | ref                            | rows | Extra                    |
+----+-------------+--------------------+------+-------------------+----------------+---------+--------------------------------+------+--------------------------+
|  1 | SIMPLE      | recipe_task        | ref  | PRIMARY,recipe_id | recipe_id      | 4       | const                          |    3 | Using index              |
|  1 | SIMPLE      | recipe_task_result | ref  | recipe_task_id    | recipe_task_id | 5       | beaker_dcallagh.recipe_task.id |    2 | Using where; Using index |
+----+-------------+--------------------+------+-------------------+----------------+---------+--------------------------------+------+--------------------------+
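The two SELECT forms are row-for-row equivalent; only the execution plan differs, and only on MySQL. A minimal self-contained sketch of the equivalence, using SQLite in place of MySQL and toy tables with made-up ids (so it demonstrates the same rows come back, not the plan difference):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down versions of the real Beaker tables, with invented sample data
conn.executescript("""
    CREATE TABLE recipe_task (id INTEGER PRIMARY KEY, recipe_id INTEGER);
    CREATE TABLE recipe_task_result (id INTEGER PRIMARY KEY, recipe_task_id INTEGER);
    INSERT INTO recipe_task VALUES (1, 255141), (2, 255141), (3, 999999);
    INSERT INTO recipe_task_result VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")

# IN (SELECT ...) form, as emitted by beaker-log-delete in 23.0
in_form = conn.execute("""
    SELECT id FROM recipe_task_result
    WHERE recipe_task_id IN
        (SELECT id FROM recipe_task WHERE recipe_id = 255141)
    ORDER BY id
""").fetchall()

# JOIN form, which MySQL optimizes as expected
join_form = conn.execute("""
    SELECT recipe_task_result.id FROM recipe_task_result
    INNER JOIN recipe_task
        ON recipe_task_result.recipe_task_id = recipe_task.id
    WHERE recipe_task.recipe_id = 255141
    ORDER BY recipe_task_result.id
""").fetchall()

# Same rows either way: only results belonging to recipe 255141
assert in_form == join_form == [(10,), (11,), (12,)]
```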

Comment 4 Dan Callaghan 2016-08-03 07:42:11 UTC
I even remember that we discussed the potential issue with IN (SELECT...) when implementing bug 1322700 originally -- the internet is full of warnings not to do this on MySQL because of the performance implications, and I'm sure I knew that at the time. So I'm not sure why we went ahead with it anyway. I can't find any discussion of it on the original Gerrit review or in Bugzilla.

Regardless, the solution seems to be to refactor the DELETE to use the equivalent DELETE FROM... JOIN:

DELETE recipe_task_result FROM recipe_task_result INNER JOIN recipe_task ON recipe_task_result.recipe_task_id = recipe_task.id WHERE recipe_task.recipe_id = 255141;

This completes instantly against a production db dump, compared with the IN (SELECT...) which runs for many minutes.
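Note the multi-table DELETE ... FROM ... INNER JOIN syntax is MySQL-specific (SQLite and PostgreSQL don't accept it), and the fix is purely an optimizer workaround -- the two forms delete exactly the same rows. A small sketch of the intended end state, again using SQLite and toy tables with made-up ids (where the IN form is fine, since SQLite doesn't share MySQL's subquery problem):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy data: recipe 255141 owns task 1 (results 10, 11); recipe 777777 owns task 2
conn.executescript("""
    CREATE TABLE recipe_task (id INTEGER PRIMARY KEY, recipe_id INTEGER);
    CREATE TABLE recipe_task_result (id INTEGER PRIMARY KEY, recipe_task_id INTEGER);
    INSERT INTO recipe_task VALUES (1, 255141), (2, 777777);
    INSERT INTO recipe_task_result VALUES (10, 1), (11, 1), (12, 2);
""")

# Semantically equivalent to MySQL's
#   DELETE recipe_task_result FROM recipe_task_result
#   INNER JOIN recipe_task ON recipe_task_result.recipe_task_id = recipe_task.id
#   WHERE recipe_task.recipe_id = 255141;
conn.execute("""
    DELETE FROM recipe_task_result
    WHERE recipe_task_id IN
        (SELECT id FROM recipe_task WHERE recipe_id = 255141)
""")

# Only the other recipe's result row survives
survivors = conn.execute(
    "SELECT id FROM recipe_task_result ORDER BY id").fetchall()
assert survivors == [(12,)]
```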

Comment 5 Dan Callaghan 2016-08-03 08:00:12 UTC
http://gerrit.beaker-project.org/5111

This patch produces the more optimal DELETE... JOIN query as described above. Tested against a production db dump with beaker-log-delete --dry-run, it performs admirably.

Comment 8 Dan Callaghan 2016-08-11 01:25:02 UTC
Beaker 23.1 has been released.

