Bug 1574772
| Summary: | process_new_recipe() query to SELECT candidate systems sometimes randomly runs very slowly | | |
| --- | --- | --- | --- |
| Product: | [Retired] Beaker | Reporter: | Dan Callaghan <dcallagh> |
| Component: | scheduler | Assignee: | Dan Callaghan <dcallagh> |
| Status: | CLOSED NOTABUG | QA Contact: | tools-bugs <tools-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 25 | CC: | dcallagh, mtyson, rjoost |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-06-11 00:50:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Dan Callaghan
2018-05-04 02:31:47 UTC
If we assume that the long-held transaction did cause this problem, Matt Tyson pointed out that MariaDB 10.3 introduced some new transaction timeout features specifically to avoid this kind of problem: https://mariadb.com/kb/en/library/transaction-timeouts/ So that would be another nice thing we can have eventually, once we get onto a newer database version.

We hit this issue in our production environment again today. We found a leaked transaction that had been held open for about 2.6 hours, and at 23:08 UTC 21 May we killed that session using the MySQL KILL statement. However, by 01:00 UTC the queries had *not* magically fixed themselves.
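For reference, a minimal sketch of the idle-transaction timeout variables from the MariaDB 10.3 KB page linked above; the values here are only illustrative, this is not something we have deployed:

```sql
-- Illustrative only: available once the server is MariaDB >= 10.3.
-- Kill any transaction that sits idle for more than 10 minutes...
SET GLOBAL idle_transaction_timeout = 600;
-- ...or limit the timeout to idle *write* transactions only.
SET GLOBAL idle_write_transaction_timeout = 600;
```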
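For the record, a leaked transaction like the one above can be located via information_schema.INNODB_TRX before killing its connection. The report only records that a KILL statement was used; the query below is just a sketch and the thread id is a placeholder:

```sql
-- Find transactions that have been open for more than an hour.
SELECT trx_id,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS open_for_secs,
       trx_mysql_thread_id
FROM information_schema.INNODB_TRX
WHERE trx_started < NOW() - INTERVAL 1 HOUR
ORDER BY trx_started;

-- Terminate the offending connection using the thread id from the row above.
KILL 12345;
```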
We see that InnoDB has no large transaction backlog:

    Trx id counter 0 2750736108
    Purge done for trx's n:o < 0 2750735077 undo n:o < 0 0

but the EXPLAIN for the slow-running selects still shows the badness noted in comment 0, specifically:

    | 4 | DEPENDENT SUBQUERY | exclude_osmajor | index_merge | system_id,osmajor_id,arch_id | arch_id,osmajor_id | 4,4 | NULL | 36 | Using intersect(arch_id,osmajor_id); Using where |

After we restarted the entire mysqld process at 01:10, the query magically fixed itself and the EXPLAIN output changed back to the good state:

    | 4 | DEPENDENT SUBQUERY | exclude_osmajor | ref | system_id,osmajor_id,arch_id | system_id | 4 | beaker.system.id | 37 | Using where |

I figured out one cause of the idle connections which are holding an active transaction open: the outer loops in beakerd.

https://gerrit.beaker-project.org/6106

The above patch fixes them to release the connection after fetching the list of relevant ids (the inner loop will then BEGIN and COMMIT each individual iteration as expected). Since we have not yet *proven* that the long-held transactions are the direct cause of the bad query plan (with index_merge), I am not going to consider that patch to be the final solution to this bug. But it should help at least.

This afternoon we have observed that the patch in comment 7 *has* eliminated the long-held active idle transactions that we were seeing. InnoDB history length is back to a reasonably small level. (There is one other source of long active idle transactions, which is beaker-log-delete, which I need to go back and examine again... but there is a simple workaround for it, which is to just kill the process and let it run another time.)

However, even with that fix in place, and having restarted mysqld, we are still seeing these queries running slowly (~1 minute) for a large proportion of New recipes. The EXPLAIN output still shows the "bad" query plan with index_merge on the exclude_osversion or exclude_osmajor table. So that really just means we do not currently understand what is causing MySQL to randomly pick the bad query plan.

Incidentally, I compared the same query against our MariaDB 10 replica. Its EXPLAIN output looks like this. In particular, note that it has the index_merge on exclude_osversion which was a characteristic of the "bad" query plan above, *but* it still manages to execute this "bad" query in 2.5 seconds. Presumably this is a result of MariaDB 10 being generally smarter at planning and executing its queries.

    | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
    | 1 | PRIMARY | system_arch_map_1 | ref | PRIMARY,system_id,arch_id | arch_id | 4 | const | 7182 | Using where; Using index; Using temporary; Using filesort |
    | 1 | PRIMARY | system | eq_ref | PRIMARY,owner_id,loan_id,user_id | PRIMARY | 4 | beaker.system_arch_map_1.system_id | 1 | Using where |
    | 1 | PRIMARY | <subquery4> | eq_ref | distinct_key | distinct_key | 4 | func | 1 | |
    | 1 | PRIMARY | system_access_policy | eq_ref | PRIMARY | PRIMARY | 4 | beaker.system.active_access_policy_id | 1 | Using where; Using index |
    | 1 | PRIMARY | cpu | ref | system_id | system_id | 4 | beaker.system_arch_map_1.system_id | 1 | |
    | 4 | MATERIALIZED | device | index | PRIMARY | device_class_id | 4 | NULL | 28369 | Using where; Using index |
    | 4 | MATERIALIZED | system_device_map | ref | PRIMARY,system_id,device_id | device_id | 4 | beaker.device.id | 7 | Using index |
    | 8 | DEPENDENT SUBQUERY | system_pool_map | ref | PRIMARY,pool_id | PRIMARY | 4 | beaker.system.id | 1 | Using index |
    | 8 | DEPENDENT SUBQUERY | system_pool | eq_ref | PRIMARY | PRIMARY | 4 | beaker.system_pool_map.pool_id | 1 | Using index |
    | 7 | MATERIALIZED | osmajor | const | PRIMARY,osmajor | osmajor | 768 | const | 1 | Using index |
    | 7 | MATERIALIZED | osversion | const | PRIMARY,osversion_uix_1,osmajor_id | osversion_uix_1 | 772 | const,const | 1 | Using index |
    | 7 | MATERIALIZED | exclude_osversion | index_merge | system_id,arch_id,osversion_id | osversion_id,arch_id | 4,4 | NULL | 18 | Using intersect(osversion_id,arch_id); Using where |
    | 6 | MATERIALIZED | osmajor | const | PRIMARY,osmajor | osmajor | 768 | const | 1 | Using index |
    | 6 | MATERIALIZED | exclude_osmajor | index_merge | system_id,osmajor_id,arch_id | osmajor_id,arch_id | 4,4 | NULL | 636 | Using intersect(osmajor_id,arch_id); Using where |
    | 5 | DEPENDENT SUBQUERY | device_class | const | PRIMARY,device_class | device_class | 74 | const | 1 | Using index |
    | 3 | MATERIALIZED | system_access_policy_rule | ALL | system_access_policy_rule_policy_id_fk,system_access_policy_rule_user_id_fk,system_access_policy_rule_group_id_fk | NULL | NULL | NULL | 66693 | Using where |
    | 2 | MATERIALIZED | system_access_policy_rule | ALL | system_access_policy_rule_policy_id_fk,system_access_policy_rule_user_id_fk,system_access_policy_rule_group_id_fk | NULL | NULL | NULL | 66693 | Using where |
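As an aside (not something tried as part of this bug): the "bad" MySQL plan hinges on the optimizer choosing the index_merge intersect strategy on the exclude_* tables, and both MySQL and MariaDB allow that strategy to be disabled per session. That would be a cheap way to test whether the plan choice alone accounts for the slowdown; the snippet below is only an illustrative sketch:

```sql
-- Illustrative diagnostic only: disable index_merge intersection for this session.
SET SESSION optimizer_switch = 'index_merge_intersection=off';
-- Then re-run EXPLAIN on the candidate-systems query and compare the plan
-- and runtime against the "bad" plans shown above.
```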
(In reply to Dan Callaghan from comment #9)
> (There is one other source of long active idle transactions, which is
> beaker-log-delete, which I need to go back and examine again... but there is
> a simple workaround for it which is to just kill the process and let it run
> another time.)

https://gerrit.beaker-project.org/6114

This bug was filed on the basis that the scheduler would still show performance problems even after the jump to MariaDB. Now we're using MariaDB and it's fast. It seems the problems we saw with MySQL and its long-held transactions are not a problem any more on MariaDB. Closing this as NOTABUG.