Bug 1573081 - beakerd queries to find processed recipe sets and scheduled recipe sets are slow
Summary: beakerd queries to find processed recipe sets and scheduled recipe sets are slow
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Beaker
Classification: Community
Component: scheduler
Version: 25
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: 25.4
Assignee: Dan Callaghan
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-30 07:27 UTC by Dan Callaghan
Modified: 2018-05-24 22:52 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-24 22:52:49 UTC



Description Dan Callaghan 2018-04-30 07:27:53 UTC
Description of problem:
While examining beakerd's query performance for bug 1097861 I noticed that the queries issued at the top of queue_processed_recipesets() and provision_scheduled_recipesets() both take a long time (> 30 seconds) to find candidate recipe sets -- even when there are *no* matching recipe sets.

Version-Release number of selected component (if applicable):
25.2

How reproducible:
Easily, with a production-like database

Steps to Reproduce:
1. Run beakerd (ideally in the foreground with sqlalchemy.engine logging enabled; see the sketch below)
2. Watch the statement timing
*or*
just run the offending queries against a production database snapshot and observe their timing.
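
For step 1, statement logging can be enabled through the standard sqlalchemy.engine logger. A minimal sketch using plain Python logging (beakerd's real logging configuration may differ):

import logging

logging.basicConfig()
# INFO on the sqlalchemy.engine logger prints each SQL statement and its
# parameters as it is issued; DEBUG would also print result rows.
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)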

Actual results:
MySQL [beaker_dcallagh]> SELECT recipe_set.id AS recipe_set_id 
    -> FROM recipe_set INNER JOIN job ON job.id = recipe_set.job_id 
    -> WHERE job.deleted IS NULL AND NOT (EXISTS (SELECT 1 
    -> FROM recipe 
    -> WHERE recipe_set.id = recipe.recipe_set_id AND recipe.status != 'Scheduled'));
Empty set (24.99 sec)
MySQL [beaker_dcallagh]> SELECT recipe_set.id AS recipe_set_id 
    -> FROM recipe_set INNER JOIN job ON job.id = recipe_set.job_id 
    -> WHERE job.deleted IS NULL AND NOT (EXISTS (SELECT 1 
    -> FROM recipe 
    -> WHERE recipe_set.id = recipe.recipe_set_id AND recipe.status != 'Processed')) 
    -> ORDER BY recipe_set.priority DESC, recipe_set.id;
Empty set (27.57 sec)

Expected results:
These queries run at the top of every scheduler iteration, so they need to be efficient to avoid wasting MySQL CPU time.

Additional info:
This is probably unrelated to bug 1097861, which is about lock wait contention within the various inner loops; these queries happen *before* those loops, to find the recipe sets to work on.

Comment 1 Dan Callaghan 2018-04-30 07:29:45 UTC
(In reply to Dan Callaghan from comment #0)
> MySQL [beaker_dcallagh]> SELECT recipe_set.id AS recipe_set_id 
>     -> FROM recipe_set INNER JOIN job ON job.id = recipe_set.job_id 
>     -> WHERE job.deleted IS NULL AND NOT (EXISTS (SELECT 1 
>     -> FROM recipe 
>     -> WHERE recipe_set.id = recipe.recipe_set_id AND recipe.status !=
> 'Scheduled'));
> Empty set (24.99 sec)

This is the query from provision_scheduled_recipesets(). The backwards-looking double negative with NOT EXISTS ... != is actually saying, "give me all recipe sets where all the recipes in the set are Scheduled".

> MySQL [beaker_dcallagh]> SELECT recipe_set.id AS recipe_set_id  FROM
> recipe_set INNER JOIN job ON job.id = recipe_set.job_id  WHERE job.deleted
> IS NULL AND NOT (EXISTS (SELECT 1  FROM recipe  WHERE recipe_set.id =
> recipe.recipe_set_id AND recipe.status != 'Processed')) ORDER BY
> recipe_set.priority DESC, recipe_set.id;
> Empty set (27.57 sec)

This is the query from queue_processed_recipesets(). Its structure is the same as above, "give me all recipe sets where all the recipes in the set are Processed".
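
The NOT (EXISTS (...)) shape of both statements is what SQLAlchemy generates for a negated exists() construct. A hedged reconstruction of how beakerd might build the provision-side query follows; the class names RecipeSet, Recipe, and Job are assumed to match Beaker's model, not copied from the actual source:

from sqlalchemy import exists, not_

# Hypothetical reconstruction: recipe sets whose recipes are all Scheduled,
# expressed backwards as "no recipe in the set has a status other than
# Scheduled". RecipeSet, Recipe, Job, and session are assumed names.
not_scheduled = exists().where(
    (RecipeSet.id == Recipe.recipe_set_id) &
    (Recipe.status != 'Scheduled'))
candidates = session.query(RecipeSet.id)\
    .join(Job, Job.id == RecipeSet.job_id)\
    .filter(Job.deleted == None)\
    .filter(not_(not_scheduled))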

Comment 2 Dan Callaghan 2018-04-30 07:30:46 UTC
Interestingly, even though the EXPLAIN output indicates the condition on job.deleted is problematic (and we could potentially get rid of it), taking it out of the queries actually makes them about *three times slower* even when they are still matching 0 rows.

Comment 3 Dan Callaghan 2018-05-14 04:37:04 UTC
The problem is, as always, the NOT EXISTS clause. The EXPLAIN shows that it is scanning all non-deleted rows in the job table (2 million or so). It doesn't pick the index on recipe.status.

The usual trick to avoid problematic dependent subqueries is to convert them to a join. In this case we are only selecting back recipe_set IDs (not the whole rows), which gives us a lot of leeway to do GROUP BY/DISTINCT hackery.

And indeed, I came up with the following, which works nicely.

The point of the NOT EXISTS is that we want to select only recipe *sets* where all recipes in the set are Processed. We can express that with some COUNT()s and GROUP BY recipe_set.id:

SELECT recipe_set.id
FROM recipe_set
INNER JOIN recipe all_recipes ON recipe_set.id = all_recipes.recipe_set_id
INNER JOIN recipe processed_recipes ON recipe_set.id = processed_recipes.recipe_set_id AND processed_recipes.status = 'Processed'
GROUP BY recipe_set.id
HAVING COUNT(DISTINCT all_recipes.id) = COUNT(DISTINCT processed_recipes.id);

I think this gives us the right answer, and it comes back in 0.25 seconds. It tricks MySQL into using the recipe.status index first (which is efficient because there is always a very small number of Processed recipes).

+------+-------------+-------------------+--------+----------------------+---------------+---------+----------------------------------------+------+----------------------------------------------------+
| id   | select_type | table             | type   | possible_keys        | key           | key_len | ref                                    | rows | Extra                                              |
+------+-------------+-------------------+--------+----------------------+---------------+---------+----------------------------------------+------+----------------------------------------------------+
|    1 | SIMPLE      | processed_recipes | ref    | recipe_set_id,status | status        | 1       | const                                  |    5 | Using index condition; Using where; Using filesort |
|    1 | SIMPLE      | recipe_set        | eq_ref | PRIMARY              | PRIMARY       | 4       | beaker.processed_recipes.recipe_set_id |    1 | Using index                                        |
|    1 | SIMPLE      | all_recipes       | ref    | recipe_set_id        | recipe_set_id | 4       | beaker.processed_recipes.recipe_set_id |    1 | Using index                                        |
+------+-------------+-------------------+--------+----------------------+---------------+---------+----------------------------------------+------+----------------------------------------------------+
3 rows in set (0.26 sec)
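
For reference, the same GROUP BY/HAVING trick rendered through the SQLAlchemy ORM might look roughly like this. This is a sketch only, using the assumed class names from above; the actual change is in the Gerrit review linked in comment 5:

from sqlalchemy import distinct, func
from sqlalchemy.orm import aliased

# Two aliases over the recipe table: one counting all recipes in the set,
# one counting only the Processed ones. A set qualifies when the two
# distinct counts match, i.e. every recipe in it is Processed.
all_recipes = aliased(Recipe)
processed_recipes = aliased(Recipe)
query = session.query(RecipeSet.id)\
    .join(all_recipes, RecipeSet.id == all_recipes.recipe_set_id)\
    .join(processed_recipes,
          (RecipeSet.id == processed_recipes.recipe_set_id) &
          (processed_recipes.status == 'Processed'))\
    .group_by(RecipeSet.id)\
    .having(func.count(distinct(all_recipes.id)) ==
            func.count(distinct(processed_recipes.id)))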

Comment 4 Dan Callaghan 2018-05-14 05:55:37 UTC
We can safely drop the condition on Job.is_deleted as well. It was only added as an optimisation hack (incidentally, in a previous attempt at fixing this same problem) back in bug 1062480.

All the code paths for marking a job as deleted correctly enforce that the job is already in a finished state. And I confirmed in our production database that all deleted jobs (and their recipe sets and recipes) are Completed, Cancelled, or Aborted.

MariaDB [beaker]> select count(*) from job where deleted is not null and status not in ('Completed', 'Aborted', 'Cancelled');
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (2.13 sec)

MariaDB [beaker]> select count(*) from job inner join recipe_set on job.id = recipe_set.job_id where job.deleted is not null and recipe_set.status not in ('Completed', 'Aborted', 'Cancelled');
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (26.47 sec)

MariaDB [beaker]> select count(*) from job inner join recipe_set on job.id = recipe_set.job_id inner join recipe on recipe_set.id = recipe.recipe_set_id where job.deleted is not null and recipe.status not in ('Completed', 'Aborted', 'Cancelled');
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (1 min 8.50 sec)

Comment 5 Dan Callaghan 2018-05-14 06:16:49 UTC
https://gerrit.beaker-project.org/6096

