Bug 1362439
Summary: | DELETE ... IN (SELECT...) queries from beaker-log-delete are too slow, cause lock wait timeouts in beakerd | ||
---|---|---|---|
Product: | [Retired] Beaker | Reporter: | Dan Callaghan <dcallagh> |
Component: | general | Assignee: | Dan Callaghan <dcallagh> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | tools-bugs <tools-bugs> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 23 | CC: | dcallagh, dowang, mjia, rjoost |
Target Milestone: | 23.1 | Keywords: | Patch |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2016-08-11 01:25:02 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Dan Callaghan
2016-08-02 08:06:16 UTC
We need to identify exactly which query from beaker-log-delete was running so slowly, and find a way to make it commit in a reasonable amount of time (50 seconds is the lock wait timeout but it really needs to be less than that or we will still face lock contention under production load). So the problematic part is not hard to find, it's the new DELETE recipe_task_result statement which we added in 23.0 for bug 1322700: https://git.beaker-project.org/cgit/beaker/diff/Server/bkr/server/model/scheduler.py?id=750432e5b9759e313ab717578f5bf39cb7f18be3 The resulting SQL looks like: 2016-08-03 17:21:56,382 sqlalchemy.engine.base.Engine INFO DELETE FROM recipe_task_result WHERE recipe_task_result.recipe_task_id IN (SELECT recipe_task.id FROM recipe_task WHERE recipe_task.recipe_id = %s) 2016-08-03 17:21:56,382 sqlalchemy.engine.base.Engine INFO (255141L,) and reproducing the slowness is trivial against a production db dump. I made one small change: diff --git a/Server/bkr/server/tools/log_delete.py b/Server/bkr/server/tools/log_delete.py index 036a313..e671980 100755 --- a/Server/bkr/server/tools/log_delete.py +++ b/Server/bkr/server/tools/log_delete.py @@ -165,8 +165,8 @@ def log_delete(print_logs=False, dry=False, limit=None): pass if print_logs: print path + job.delete() if not dry: - job.delete() session.commit() session.close() else: and then turned on sqlalchemy.engine logging, and ran: beaker-log-delete -v --debug --dry-run That DELETE statement has been sitting in MySQL process list in "preparing" state for 341 seconds so far and still counting... Converting this DELETE statement to its equivalent SELECT: SELECT id FROM recipe_task_result WHERE recipe_task_result.recipe_task_id IN (SELECT recipe_task.id FROM recipe_task WHERE recipe_task.recipe_id = 255141); shows the same performance problem. The issue seems to be the IN (SELECT...) which MySQL fails to optimize properly. The EXPLAIN output shows it is still scanning the entire recipe_task_result table: +----+--------------------+--------------------+-----------------+-------------------+----------------+---------+------+-----------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+--------------------+-----------------+-------------------+----------------+---------+------+-----------+--------------------------+ | 1 | PRIMARY | recipe_task_result | index | NULL | recipe_task_id | 5 | NULL | 206681955 | Using where; Using index | | 2 | DEPENDENT SUBQUERY | recipe_task | unique_subquery | PRIMARY,recipe_id | PRIMARY | 4 | func | 1 | Using where | +----+--------------------+--------------------+-----------------+-------------------+----------------+---------+------+-----------+--------------------------+ The obvious equivalent using a JOIN instead would be: SELECT recipe_task_result.id FROM recipe_task_result INNER JOIN recipe_task ON recipe_task_result.recipe_task_id = recipe_task.id WHERE recipe_task.recipe_id = 255141; which does indeed execute instantly, because it's optimized as you would expect: +----+-------------+--------------------+------+-------------------+----------------+---------+--------------------------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------------------+------+-------------------+----------------+---------+--------------------------------+------+--------------------------+ | 1 | SIMPLE | recipe_task | ref | PRIMARY,recipe_id | recipe_id | 4 | const | 3 | Using index | | 1 | SIMPLE | recipe_task_result | ref | recipe_task_id | recipe_task_id | 5 | beaker_dcallagh.recipe_task.id | 2 | Using where; Using index | +----+-------------+--------------------+------+-------------------+----------------+---------+--------------------------------+------+--------------------------+ I even remember that we discussed the potential issue with the IN (SELECT...) when implementing bug 1322700 originally -- the internet is full of warnings not to do this on MySQL because of its performance implications, and I'm sure I knew that as well. So I'm not sure why we went ahead with it anyway. I can't find any discussion of it on the original Gerrit review or bz. Regardless, the solution seems to be to refactor the DELETE to use the equivalent DELETE FROM... JOIN: DELETE recipe_task_result FROM recipe_task_result INNER JOIN recipe_task ON recipe_task_result.recipe_task_id = recipe_task.id WHERE recipe_task.recipe_id = 255141; This completes instantly against a production db dump, compared with the IN (SELECT...) which runs for many minutes. http://gerrit.beaker-project.org/5111 This patch produces the more optimal DELETE... JOIN query as described above. Tested against a production db dump with beaker-log-delete --dry-run, it performs admirably. Beaker 23.1 has been released. |