Bug 1574772 - process_new_recipe() query to SELECT candidate systems sometimes randomly runs very slowly
Summary: process_new_recipe() query to SELECT candidate systems sometimes randomly runs very slowly
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Beaker
Classification: Retired
Component: scheduler
Version: 25
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Dan Callaghan
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-04 02:31 UTC by Dan Callaghan
Modified: 2018-06-11 00:50 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-11 00:50:37 UTC
Embargoed:



Description Dan Callaghan 2018-05-04 02:31:47 UTC
Version-Release number of selected component (if applicable):
25.2

How reproducible:
not at all

Additional info:
In our production Beaker environment today we encountered an issue where the scheduler was taking an inordinately long amount of time in each iteration of process_new_recipe():

May  3 23:06:00 beaker-server beakerd[4063]: bkr.server.tools.beakerd INFO recipe ID 5108221 moved from New to Processed
May  3 23:06:01 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Counting candidate systems for recipe 5108222
May  3 23:06:37 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May  3 23:07:18 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May  3 23:08:00 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May  3 23:08:41 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May  3 23:09:23 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May  3 23:09:37 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Computing all candidate systems for recipe 5108222
May  3 23:10:05 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 37 seconds
May  3 23:10:48 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 38 seconds
May  3 23:11:31 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 37 seconds
May  3 23:12:14 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 37 seconds
May  3 23:12:55 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May  3 23:13:21 beaker-server beakerd[4063]: bkr.server.tools.beakerd INFO recipe ID 5108222 moved from New to Processed

Observe that the "counting" then "computing" steps were taking over 4 minutes, where normally they take less than 1 second.

It is currently not clear which factors cause these SELECT queries to go bad. The key difference is visible in the EXPLAIN output, specifically this row:

BAD:
|  5 | DEPENDENT SUBQUERY | exclude_osmajor                | index_merge | system_id,osmajor_id,arch_id                                                                                      | osmajor_id,arch_id                     | 4,4     | NULL                                  |   20 | Using intersect(osmajor_id,arch_id); Using where; Using index |

GOOD:
|  5 | DEPENDENT SUBQUERY | exclude_osmajor                | ref    | system_id,osmajor_id,arch_id                                                                                      | system_id                              | 4       | beaker.system.id                      |   28 | Using where                                  |

More detailed pastes to follow.
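For anyone experimenting with this: the bad plan is the index_merge intersect, and MySQL/MariaDB expose an optimizer_switch flag to disable that strategy per-session. This is a hedged sketch only — it was not tried as part of this bug, and the full candidate-systems query is elided as in the pastes above:

```sql
-- Disable the index_merge intersection strategy for this session only,
-- which should push the optimizer back to a single-index (ref) plan:
SET SESSION optimizer_switch = 'index_merge_intersection=off';

-- Then inspect the plan the optimizer chooses:
EXPLAIN SELECT ...;  -- (candidate systems query, elided)
```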

Comment 5 Dan Callaghan 2018-05-04 02:55:01 UTC
Assuming that the long-held transaction did cause this problem: Matt Tyson pointed out that MariaDB 10.3 introduced new transaction timeout features specifically to avoid this kind of problem:

https://mariadb.com/kb/en/library/transaction-timeouts/

So that would be another nice thing we can have eventually when we get onto a newer database version.
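For reference, the timeout variables documented at the link above can be set like this (a sketch only — the values are illustrative, not tuned recommendations):

```sql
-- MariaDB 10.3+: kill any connection whose transaction sits idle
-- for longer than the given number of seconds (0 = disabled, the default):
SET GLOBAL idle_transaction_timeout = 60;

-- Narrower variants also exist for write and read-only transactions:
SET GLOBAL idle_write_transaction_timeout = 60;
SET GLOBAL idle_readonly_transaction_timeout = 300;
```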

Comment 6 Dan Callaghan 2018-05-22 01:14:23 UTC
We hit this issue in our production environment again today.

We found a leaked transaction that had been held open for about 2.6 hours, and at 23:08 UTC 21 May we killed that session using the MySQL KILL statement.
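For anyone chasing the same thing: on MySQL 5.5+ / MariaDB, leaked transactions like this can be found via information_schema rather than by eyeballing SHOW ENGINE INNODB STATUS. A sketch (standard InnoDB schema, not taken from this incident):

```sql
-- List transactions open for more than an hour, with the
-- connection id to pass to KILL:
SELECT trx_mysql_thread_id AS connection_id,
       trx_id,
       trx_state,
       trx_started
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL 1 HOUR
ORDER BY trx_started;

-- Then: KILL <connection_id>;
```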

However, by 01:00 UTC the queries had *not* magically fixed themselves. We can see that InnoDB has no large transaction backlog:

Trx id counter 0 2750736108
Purge done for trx's n:o < 0 2750735077 undo n:o < 0 0

but the EXPLAIN for the slow-running SELECTs still shows the badness noted in comment 0, specifically:

|  4 | DEPENDENT SUBQUERY | exclude_osmajor           | index_merge | system_id,osmajor_id,arch_id                                                                                      | arch_id,osmajor_id                     | 4,4     | NULL                                  |   36 | Using intersect(arch_id,osmajor_id); Using where |

After we restarted the entire mysqld process at 01:10, the query magically fixed itself and the EXPLAIN output changed back to the good state.

|  4 | DEPENDENT SUBQUERY | exclude_osmajor           | ref    | system_id,osmajor_id,arch_id                                                                                      | system_id                              | 4       | beaker.system.id                      |   37 | Using where                                  |

Comment 7 Dan Callaghan 2018-05-22 03:14:24 UTC
I figured out one cause of the idle connections holding an active transaction open: the outer loops in beakerd.

https://gerrit.beaker-project.org/6106

The above patch fixes them to release the connection after fetching the list of relevant ids (then the inner loop will BEGIN and COMMIT each individual iteration as expected).

Since we have not yet *proven* that the long-held transactions are the direct cause of the bad query plan (with index_merge), I am not going to consider that patch to be the final solution to this bug. But it should help at least.
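The transaction boundaries the patch above aims for can be sketched as follows (hypothetical table/column names, just to illustrate the shape — the real change is in the beakerd Python code):

```sql
-- Outer loop: fetch the relevant ids in a short transaction and release
-- the connection immediately, instead of holding it (and its snapshot)
-- open for the entire duration of the loop:
BEGIN;
SELECT id FROM recipe WHERE status = 'New';
COMMIT;

-- Inner loop: each individual id gets its own short transaction:
BEGIN;
-- ... process one recipe id ...
COMMIT;
```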

Comment 9 Dan Callaghan 2018-05-22 07:23:39 UTC
This afternoon we have observed that the patch in comment 7 *has* eliminated the long-held active idle transactions that we were seeing. The InnoDB history list length is back to a reasonably small level.

(There is one other source of long active idle transactions, which is beaker-log-delete, which I need to go back and examine again... but there is a simple workaround for it which is to just kill the process and let it run another time.)

However, even with that fix in place, and having restarted mysqld, we are still seeing these queries running slowly (~1 minute) for a large proportion of New recipes. The EXPLAIN output still shows the "bad" query plan with index_merge on the exclude_osversion or exclude_osmajor table.

So that really just means we do not currently understand what is causing MySQL to randomly pick the bad query plan.
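One hedged avenue worth noting (an assumption, not something established in this bug): InnoDB index statistics are sampled, and stale or skewed samples can flip the optimizer between plans. Refreshing them is cheap to try:

```sql
-- Recompute index statistics for the tables involved in the bad plan;
-- the optimizer's row estimates (and hence its plan choice) depend on these:
ANALYZE TABLE exclude_osmajor, exclude_osversion;
```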

Comment 10 Dan Callaghan 2018-05-22 07:29:20 UTC
Incidentally, I compared the same query against our MariaDB 10 replica. Its EXPLAIN output looks like this. In particular, note that it has the index_merge on exclude_osversion, which was a characteristic of the "bad" query plan above, *but* it still manages to execute this "bad" query in 2.5 seconds. Presumably this is because MariaDB 10 is generally smarter at planning and executing its queries.

+------+--------------------+---------------------------+-------------+-------------------------------------------------------------------------------------------------------------------+----------------------+---------+---------------------------------------+-------+-----------------------------------------------------------+
| id   | select_type        | table                     | type        | possible_keys                                                                                                     | key                  | key_len | ref                                   | rows  | Extra                                                     |
+------+--------------------+---------------------------+-------------+-------------------------------------------------------------------------------------------------------------------+----------------------+---------+---------------------------------------+-------+-----------------------------------------------------------+
|    1 | PRIMARY            | system_arch_map_1         | ref         | PRIMARY,system_id,arch_id                                                                                         | arch_id              | 4       | const                                 |  7182 | Using where; Using index; Using temporary; Using filesort |
|    1 | PRIMARY            | system                    | eq_ref      | PRIMARY,owner_id,loan_id,user_id                                                                                  | PRIMARY              | 4       | beaker.system_arch_map_1.system_id    |     1 | Using where                                               |
|    1 | PRIMARY            | <subquery4>               | eq_ref      | distinct_key                                                                                                      | distinct_key         | 4       | func                                  |     1 |                                                           |
|    1 | PRIMARY            | system_access_policy      | eq_ref      | PRIMARY                                                                                                           | PRIMARY              | 4       | beaker.system.active_access_policy_id |     1 | Using where; Using index                                  |
|    1 | PRIMARY            | cpu                       | ref         | system_id                                                                                                         | system_id            | 4       | beaker.system_arch_map_1.system_id    |     1 |                                                           |
|    4 | MATERIALIZED       | device                    | index       | PRIMARY                                                                                                           | device_class_id      | 4       | NULL                                  | 28369 | Using where; Using index                                  |
|    4 | MATERIALIZED       | system_device_map         | ref         | PRIMARY,system_id,device_id                                                                                       | device_id            | 4       | beaker.device.id                      |     7 | Using index                                               |
|    8 | DEPENDENT SUBQUERY | system_pool_map           | ref         | PRIMARY,pool_id                                                                                                   | PRIMARY              | 4       | beaker.system.id                      |     1 | Using index                                               |
|    8 | DEPENDENT SUBQUERY | system_pool               | eq_ref      | PRIMARY                                                                                                           | PRIMARY              | 4       | beaker.system_pool_map.pool_id        |     1 | Using index                                               |
|    7 | MATERIALIZED       | osmajor                   | const       | PRIMARY,osmajor                                                                                                   | osmajor              | 768     | const                                 |     1 | Using index                                               |
|    7 | MATERIALIZED       | osversion                 | const       | PRIMARY,osversion_uix_1,osmajor_id                                                                                | osversion_uix_1      | 772     | const,const                           |     1 | Using index                                               |
|    7 | MATERIALIZED       | exclude_osversion         | index_merge | system_id,arch_id,osversion_id                                                                                    | osversion_id,arch_id | 4,4     | NULL                                  |    18 | Using intersect(osversion_id,arch_id); Using where        |
|    6 | MATERIALIZED       | osmajor                   | const       | PRIMARY,osmajor                                                                                                   | osmajor              | 768     | const                                 |     1 | Using index                                               |
|    6 | MATERIALIZED       | exclude_osmajor           | index_merge | system_id,osmajor_id,arch_id                                                                                      | osmajor_id,arch_id   | 4,4     | NULL                                  |   636 | Using intersect(osmajor_id,arch_id); Using where          |
|    5 | DEPENDENT SUBQUERY | device_class              | const       | PRIMARY,device_class                                                                                              | device_class         | 74      | const                                 |     1 | Using index                                               |
|    3 | MATERIALIZED       | system_access_policy_rule | ALL         | system_access_policy_rule_policy_id_fk,system_access_policy_rule_user_id_fk,system_access_policy_rule_group_id_fk | NULL                 | NULL    | NULL                                  | 66693 | Using where                                               |
|    2 | MATERIALIZED       | system_access_policy_rule | ALL         | system_access_policy_rule_policy_id_fk,system_access_policy_rule_user_id_fk,system_access_policy_rule_group_id_fk | NULL                 | NULL    | NULL                                  | 66693 | Using where                                               |
+------+--------------------+---------------------------+-------------+-------------------------------------------------------------------------------------------------------------------+----------------------+---------+---------------------------------------+-------+-----------------------------------------------------------+

Comment 11 Dan Callaghan 2018-05-30 02:53:06 UTC
(In reply to Dan Callaghan from comment #9)
> (There is one other source of long active idle transactions, which is
> beaker-log-delete, which I need to go back and examine again... but there is
> a simple workaround for it which is to just kill the process and let it run
> another time.)

https://gerrit.beaker-project.org/6114

Comment 12 Roman Joost 2018-06-11 00:50:37 UTC
This bug was filed on the basis that the scheduler would still show performance problems after the jump to MariaDB. Now we're using MariaDB and it's fast. It seems the problems we saw with MySQL and its long-held transactions are no longer a problem on MariaDB.

Closing this as NOTABUG.

