Version-Release number of selected component (if applicable):
25.2

How reproducible:
not at all

Additional info:
In our production Beaker environment today we encountered an issue where the scheduler was taking an inordinately long time in each iteration of process_new_recipe():

May 3 23:06:00 beaker-server beakerd[4063]: bkr.server.tools.beakerd INFO recipe ID 5108221 moved from New to Processed
May 3 23:06:01 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Counting candidate systems for recipe 5108222
May 3 23:06:37 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May 3 23:07:18 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May 3 23:08:00 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May 3 23:08:41 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May 3 23:09:23 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May 3 23:09:37 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Computing all candidate systems for recipe 5108222
May 3 23:10:05 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 37 seconds
May 3 23:10:48 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 38 seconds
May 3 23:11:31 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 37 seconds
May 3 23:12:14 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 37 seconds
May 3 23:12:55 beaker-server beakerd[4063]: bkr.server.tools.beakerd DEBUG Metrics collection took 36 seconds
May 3 23:13:21 beaker-server beakerd[4063]: bkr.server.tools.beakerd INFO recipe ID 5108222 moved from New to Processed

Observe that the "counting" then "computing" steps were taking over 4 minutes, where normally they take less than 1 second.

It is currently not clear what factors cause these SELECT queries to go bad. The key difference is visible in the EXPLAIN output, specifically this row:

BAD:
| 5 | DEPENDENT SUBQUERY | exclude_osmajor | index_merge | system_id,osmajor_id,arch_id | osmajor_id,arch_id | 4,4 | NULL | 20 | Using intersect(osmajor_id,arch_id); Using where; Using index |

GOOD:
| 5 | DEPENDENT SUBQUERY | exclude_osmajor | ref | system_id,osmajor_id,arch_id | system_id | 4 | beaker.system.id | 28 | Using where |

More detailed pastes to follow.
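For reference, a way to test whether the index_merge intersection strategy itself is the culprit (a hedged sketch, not something we have verified in production) is to turn off just that optimizer feature for one session and re-run the EXPLAIN:

```sql
-- Disable only the index_merge intersection strategy for this session.
-- The optimizer_switch flag below exists in MySQL 5.1+ and MariaDB.
SET SESSION optimizer_switch = 'index_merge_intersection=off';

-- Re-run the slow query's EXPLAIN; if the exclude_osmajor row goes back
-- to type=ref on system_id, the intersect plan was the problem.
EXPLAIN SELECT ...;  -- the original scheduler query (elided here)
```

This only changes the plan for the session that sets it, so it is safe to experiment with on a live server.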
If we assume that the long-held transaction did cause this problem, Matt Tyson pointed out that MariaDB 10.3 introduced new transaction timeout features specifically to avoid this kind of problem:

https://mariadb.com/kb/en/library/transaction-timeouts/

So that would be another nice thing to have eventually, once we get onto a newer database version.
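The variables described on that page would be set along these lines (illustrative values; these variables exist only in MariaDB 10.3 and later):

```sql
-- Automatically kill connections that sit idle inside an open
-- transaction for longer than the timeout (in seconds).
SET GLOBAL idle_transaction_timeout = 300;
-- There are also separate variants for write and read-only transactions:
SET GLOBAL idle_write_transaction_timeout = 300;
SET GLOBAL idle_readonly_transaction_timeout = 3600;
```

With something like this in place, a leaked transaction of the kind described above would be reaped by the server instead of holding back purge indefinitely.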
We hit this issue in our production environment again today. We found a leaked transaction that had been held open for about 2.6 hours, and at 23:08 UTC on 21 May we killed that session using the MySQL KILL statement. However, by 01:00 UTC the queries had *not* magically fixed themselves. We can see that InnoDB has no large transaction backlog:

Trx id counter 0 2750736108
Purge done for trx's n:o < 0 2750735077 undo n:o < 0 0

but the EXPLAIN for the slow-running SELECTs still shows the badness noted in comment 0, specifically:

| 4 | DEPENDENT SUBQUERY | exclude_osmajor | index_merge | system_id,osmajor_id,arch_id | arch_id,osmajor_id | 4,4 | NULL | 36 | Using intersect(arch_id,osmajor_id); Using where |

After we restarted the entire mysqld process at 01:10, the query magically fixed itself and the EXPLAIN output changed back to the good state:

| 4 | DEPENDENT SUBQUERY | exclude_osmajor | ref | system_id,osmajor_id,arch_id | system_id | 4 | beaker.system.id | 37 | Using where |
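For the record, a sketch of the kind of query that finds leaked transactions like this one (the information_schema.innodb_trx table exists in MySQL 5.5+; the thread id 12345 below is hypothetical):

```sql
-- List open InnoDB transactions ordered by age; trx_mysql_thread_id is
-- the connection id to pass to KILL.
SELECT trx_mysql_thread_id, trx_started, trx_state,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_seconds
FROM information_schema.innodb_trx
ORDER BY trx_started;

-- Kill the oldest offender found above (example thread id).
KILL 12345;
```

As noted above though, killing the session lets purge catch up but does not by itself make the optimizer abandon the bad plan.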
I figured out one cause of the idle connections which hold an active transaction open: it is the outer loops in beakerd.

https://gerrit.beaker-project.org/6106

The above patch fixes them to release the connection after fetching the list of relevant ids (the inner loop will then BEGIN and COMMIT each individual iteration as expected). Since we have not yet *proven* that the long-held transactions are the direct cause of the bad query plan (with index_merge), I am not going to consider that patch to be the final solution to this bug. But it should help at least.
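In terms of what the database sees, the effect of the patch is roughly the following (table and column names are illustrative, not the actual Beaker schema):

```sql
-- Before the fix: the SELECT below implicitly opened a transaction that
-- then sat idle for the whole duration of the outer loop.  After the
-- fix, the id list is fetched and the transaction ended immediately:
START TRANSACTION;
SELECT id FROM recipe WHERE status = 'New';   -- fetch the work list
COMMIT;                                       -- connection released here

-- ...then each item is processed in its own short transaction, so no
-- transaction is ever held open across the whole pass:
START TRANSACTION;
-- process one recipe id
COMMIT;
```

Keeping every transaction short like this is what stops the InnoDB purge backlog from growing while beakerd iterates.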
This afternoon we have observed that the patch in comment 7 *has* eliminated the long-held active idle transactions that we were seeing. The InnoDB history length is back to a reasonably small level. (There is one other source of long active idle transactions, which is beaker-log-delete, which I need to go back and examine again... but there is a simple workaround for it, which is to just kill the process and let it run another time.)

However, even with that fix in place, and having restarted mysqld, we are still seeing these queries running slowly (~1 minute) for a large proportion of New recipes. The EXPLAIN output still shows the "bad" query plan with index_merge on the exclude_osversion or exclude_osmajor table. So that really just means we do not currently understand what is causing MySQL to randomly pick the bad query plan.
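One thing that might be worth trying the next time the bad plan appears (a speculative suggestion; we have not verified that it helps here) is refreshing the index statistics that the optimizer bases its choice on, since stale or unlucky InnoDB cardinality estimates are a common cause of plan flips:

```sql
-- Re-sample InnoDB index cardinality estimates for the tables that
-- appear in the bad index_merge plan.
ANALYZE TABLE exclude_osmajor;
ANALYZE TABLE exclude_osversion;
```

If the plan reverts to type=ref after an ANALYZE, that would point at statistics drift rather than a query cache or optimizer bug.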
Incidentally, I compared the same query against our MariaDB 10 replica. Its EXPLAIN output looks like this. In particular, note that it has the index_merge on exclude_osversion which was a characteristic of the "bad" query plan above, *but* it still manages to execute this "bad" query in 2.5 seconds. Presumably this is a result of MariaDB 10 being generally smarter at planning and executing its queries.

| id | select_type        | table                     | type        | possible_keys | key | key_len | ref | rows | Extra |
|  1 | PRIMARY            | system_arch_map_1         | ref         | PRIMARY,system_id,arch_id | arch_id | 4 | const | 7182 | Using where; Using index; Using temporary; Using filesort |
|  1 | PRIMARY            | system                    | eq_ref      | PRIMARY,owner_id,loan_id,user_id | PRIMARY | 4 | beaker.system_arch_map_1.system_id | 1 | Using where |
|  1 | PRIMARY            | <subquery4>               | eq_ref      | distinct_key | distinct_key | 4 | func | 1 | |
|  1 | PRIMARY            | system_access_policy      | eq_ref      | PRIMARY | PRIMARY | 4 | beaker.system.active_access_policy_id | 1 | Using where; Using index |
|  1 | PRIMARY            | cpu                       | ref         | system_id | system_id | 4 | beaker.system_arch_map_1.system_id | 1 | |
|  4 | MATERIALIZED       | device                    | index       | PRIMARY | device_class_id | 4 | NULL | 28369 | Using where; Using index |
|  4 | MATERIALIZED       | system_device_map         | ref         | PRIMARY,system_id,device_id | device_id | 4 | beaker.device.id | 7 | Using index |
|  8 | DEPENDENT SUBQUERY | system_pool_map           | ref         | PRIMARY,pool_id | PRIMARY | 4 | beaker.system.id | 1 | Using index |
|  8 | DEPENDENT SUBQUERY | system_pool               | eq_ref      | PRIMARY | PRIMARY | 4 | beaker.system_pool_map.pool_id | 1 | Using index |
|  7 | MATERIALIZED       | osmajor                   | const       | PRIMARY,osmajor | osmajor | 768 | const | 1 | Using index |
|  7 | MATERIALIZED       | osversion                 | const       | PRIMARY,osversion_uix_1,osmajor_id | osversion_uix_1 | 772 | const,const | 1 | Using index |
|  7 | MATERIALIZED       | exclude_osversion         | index_merge | system_id,arch_id,osversion_id | osversion_id,arch_id | 4,4 | NULL | 18 | Using intersect(osversion_id,arch_id); Using where |
|  6 | MATERIALIZED       | osmajor                   | const       | PRIMARY,osmajor | osmajor | 768 | const | 1 | Using index |
|  6 | MATERIALIZED       | exclude_osmajor           | index_merge | system_id,osmajor_id,arch_id | osmajor_id,arch_id | 4,4 | NULL | 636 | Using intersect(osmajor_id,arch_id); Using where |
|  5 | DEPENDENT SUBQUERY | device_class              | const       | PRIMARY,device_class | device_class | 74 | const | 1 | Using index |
|  3 | MATERIALIZED       | system_access_policy_rule | ALL         | system_access_policy_rule_policy_id_fk,system_access_policy_rule_user_id_fk,system_access_policy_rule_group_id_fk | NULL | NULL | NULL | 66693 | Using where |
|  2 | MATERIALIZED       | system_access_policy_rule | ALL         | system_access_policy_rule_policy_id_fk,system_access_policy_rule_user_id_fk,system_access_policy_rule_group_id_fk | NULL | NULL | NULL | 66693 | Using where |
(In reply to Dan Callaghan from comment #9)
> (There is one other source of long active idle transactions, which is
> beaker-log-delete, which I need to go back and examine again... but there
> is a simple workaround for it which is to just kill the process and let it
> run another time.)

https://gerrit.beaker-project.org/6114
This bug was filed on the basis that the scheduler would still show performance problems even after the jump to MariaDB. Now we are using MariaDB and it is fast. It seems the problems we saw with MySQL and its long-held transactions are no longer a problem on MariaDB. Closing this as NOTABUG.