The update_dirty_jobs thread in beakerd is structured as a polling loop which periodically checks for dirty jobs and then processes any that are found. Checking for dirty jobs means querying the job table for rows where dirty_version != clean_version.

According to the EXPLAIN output, the dirty_version != clean_version condition does use an index, although it is not a constant lookup so MySQL still has to examine many rows. Even so, it runs quickly. The query which actually grabs the dirty job ids:

SELECT job.id AS job_id
FROM job
WHERE job.dirty_version != job.clean_version

does show up in our slow query log (presumably because it is not a constant index lookup), but it averages 0.32 seconds of running time, which is acceptable.

MySQL [beaker_dcallagh]> explain SELECT job.id AS job_id FROM job WHERE job.dirty_version != job.clean_version;
+----+-------------+-------+-------+---------------+----------------------------+---------+------+--------+--------------------------+
| id | select_type | table | type  | possible_keys | key                        | key_len | ref  | rows   | Extra                    |
+----+-------------+-------+-------+---------------+----------------------------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | job   | index | NULL          | ix_job_dirty_clean_version | 32      | NULL | 846116 | Using where; Using index |
+----+-------------+-------+-------+---------------+----------------------------+---------+------+--------+--------------------------+
1 row in set (0.01 sec)

However, beakerd's polling loop actually calls .count() on this query first, to determine whether any dirty job rows exist, and then looks up their ids only if the .count() call returned greater than zero.
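For illustration, here is a minimal sketch of that count-then-fetch pattern, using a hypothetical stripped-down Job model against in-memory SQLite (this is not the actual Beaker source, just an approximation of the shape of the code):

```python
# Sketch of beakerd's count-then-fetch polling pattern (hypothetical Job
# model, not the real Beaker schema or code).
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Job(Base):
    __tablename__ = 'job'
    id = Column(Integer, primary_key=True)
    dirty_version = Column(Integer, nullable=False, default=0)
    clean_version = Column(Integer, nullable=False, default=0)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Job(dirty_version=1, clean_version=0),   # dirty
                     Job(dirty_version=2, clean_version=2)])  # clean
    session.commit()

    dirty = session.query(Job.id).filter(
        Job.dirty_version != Job.clean_version)

    # The redundant pattern: a COUNT query first, then a second query
    # for the ids only if the count was non-zero.
    if dirty.count() > 0:
        ids = [row.id for row in dirty]

    # The simpler alternative: always fetch the ids; an empty list
    # already tells us there is nothing to do.
    ids = [row.id for row in dirty]
```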
The current query generated by .count() is:

SELECT count(*) AS count_1
FROM (SELECT job.id AS job_id, job.dirty_version AS job_dirty_version,
        job.clean_version AS job_clean_version, job.owner_id AS job_owner_id,
        job.submitter_id AS job_submitter_id, job.group_id AS job_group_id,
        job.whiteboard AS job_whiteboard,
        job.retention_tag_id AS job_retention_tag_id,
        job.product_id AS job_product_id, job.result AS job_result,
        job.status AS job_status, job.deleted AS job_deleted,
        job.to_delete AS job_to_delete, job.ttasks AS job_ttasks,
        job.ntasks AS job_ntasks, job.ptasks AS job_ptasks,
        job.wtasks AS job_wtasks, job.ftasks AS job_ftasks,
        job.ktasks AS job_ktasks
      FROM job
      WHERE job.dirty_version != job.clean_version) AS anon_1

It seems that the subquery wrapping defeats MySQL's optimizer and produces a much slower query. The EXPLAIN output indicates that this version does not use any index at all:

MySQL [beaker_dcallagh]> explain SELECT count(*) AS count_1 FROM (SELECT job.id AS job_id, job.dirty_version AS job_dirty_version, job.clean_version AS job_clean_version, job.owner_id AS job_owner_id, job.submitter_id AS job_submitter_id, job.group_id AS job_group_id, job.whiteboard AS job_whiteboard, job.retention_tag_id AS job_retention_tag_id, job.product_id AS job_product_id, job.result AS job_result, job.status AS job_status, job.deleted AS job_deleted, job.to_delete AS job_to_delete, job.ttasks AS job_ttasks, job.ntasks AS job_ntasks, job.ptasks AS job_ptasks, job.wtasks AS job_wtasks, job.ftasks AS job_ftasks, job.ktasks AS job_ktasks FROM job WHERE job.dirty_version != job.clean_version) AS anon_1;
+----+-------------+-------+------+---------------+------+---------+------+--------+------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra                        |
+----+-------------+-------+------+---------------+------+---------+------+--------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL   | Select tables optimized away |
|  2 | DERIVED     | job   | ALL  | NULL          | NULL | NULL    | NULL | 846116 | Using where                  |
+----+-------------+-------+------+---------------+------+---------+------+--------+------------------------------+
2 rows in set (1.52 sec)

Compared with a simpler COUNT(*) using no subquery:

MySQL [beaker_dcallagh]> explain SELECT COUNT(*) FROM job WHERE job.dirty_version != job.clean_version;
+----+-------------+-------+-------+---------------+----------------------------+---------+------+--------+--------------------------+
| id | select_type | table | type  | possible_keys | key                        | key_len | ref  | rows   | Extra                    |
+----+-------------+-------+-------+---------------+----------------------------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | job   | index | NULL          | ix_job_dirty_clean_version | 32      | NULL | 846116 | Using where; Using index |
+----+-------------+-------+-------+---------------+----------------------------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)

The .count() subquery wrapping, and its performance problems on MySQL, is a known issue <http://stackoverflow.com/a/14758255/120202> and one we have actually hit before (e.g. bug 1183239). In this particular case we know the subquery wrapping is superfluous, so we can either force SQLAlchemy to generate the simpler COUNT(*) query, as suggested in the Stack Overflow answer, or, better still, skip the .count() entirely and always just fetch the list of dirty job ids.
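For reference, here is a sketch of the difference in SQLAlchemy terms, again using a hypothetical cut-down Job model on in-memory SQLite (the func.count() form is the approach suggested in the Stack Overflow answer; the real Beaker model has many more columns):

```python
# Query.count() wraps the SELECT in a derived table, whereas selecting
# func.count() directly compiles to a flat COUNT with no subquery.
from sqlalchemy import Column, Integer, create_engine, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Job(Base):
    __tablename__ = 'job'
    id = Column(Integer, primary_key=True)
    dirty_version = Column(Integer, nullable=False, default=0)
    clean_version = Column(Integer, nullable=False, default=0)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Job(dirty_version=1, clean_version=0),   # dirty
                     Job(dirty_version=0, clean_version=0)])  # clean
    session.commit()

    condition = Job.dirty_version != Job.clean_version

    # Emits: SELECT count(*) FROM (SELECT ... FROM job WHERE ...) AS anon_1
    wrapped = session.query(Job).filter(condition).count()

    # Emits: SELECT count(job.id) FROM job WHERE ... -- no subquery,
    # so MySQL can satisfy it from the index.
    flat = session.query(func.count(Job.id)).filter(condition).scalar()
```

Both forms return the same number; only the generated SQL (and hence the MySQL query plan) differs.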
As of https://git.beaker-project.org/cgit/beaker/commit/?id=fa50dca70e604118c25158eeba7abc8ee5084690 we no longer issue this problematic COUNT() query.

However the corresponding SELECT query itself, that is:

SELECT job.id AS job_id
FROM job
WHERE job.dirty_version != job.clean_version

came up in bug 1097861 as something which might be worth optimizing: https://bugzilla.redhat.com/show_bug.cgi?id=1097861#c36

Here is a patch to switch to the "naive" dirty flag approach, as described in https://bugzilla.redhat.com/show_bug.cgi?id=1097861#c41: https://gerrit.beaker-project.org/6121

Although it is not at all clear to me right now whether this change will make things better or worse overall. It might be worth gathering more data before deciding to merge that patch.
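For illustration only, here is a rough sketch of what a "naive" dirty flag schema could look like, assuming a single boolean column that is set when a job changes and cleared once it has been processed. This is my reading of the idea, not the actual Gerrit patch; see the patch itself for the real change:

```python
# Sketch of the "naive" dirty flag approach: one indexed boolean column
# instead of the dirty_version/clean_version pair. Hypothetical model,
# not the real Beaker schema.
from sqlalchemy import Boolean, Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Job(Base):
    __tablename__ = 'job'
    id = Column(Integer, primary_key=True)
    is_dirty = Column(Boolean, nullable=False, default=False, index=True)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Job(is_dirty=True), Job(is_dirty=False)])
    session.commit()

    # The polling query becomes a plain indexed lookup:
    # SELECT job.id FROM job WHERE job.is_dirty
    dirty_ids = [row.id for row in
                 session.query(Job.id).filter(Job.is_dirty)]
```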
(In reply to Dan Callaghan from comment #1)
> although it is not at all clear to me right now whether this change will
> make things better or worse overall. It might be worth seeking more data
> before deciding to merge that patch.

We don't have any real data proving this approach will be better or worse, and I cannot come up with any ideas for how to get it. We don't have any kind of realistic load testing mechanism that would allow us to reproduce production-like load in a production-like environment, to compare the behaviour of the two schema design choices.

But knowing that:

* the current version does occasionally hit the lock wait issues
* the proposed version is a simpler schema, without index pollution
* the corresponding SELECT query is also simpler, and the EXPLAIN looks nicer

... I think we should assume it *probably* won't be any worse, and thus we should roll it out. If it turns out we are wrong, well... we can always go back to the previous approach by reverting this patch.
Beaker 26.0 has been released.