Bug 1224848
Summary: Executed tasks page causes a query which produces a very large temp table, causing MySQL to fill up /tmp

Product: [Retired] Beaker
Component: general
Version: 20
Hardware: Unspecified
OS: Unspecified
Severity: unspecified
Priority: unspecified
Status: CLOSED CURRENTRELEASE
Reporter: Amit Saha <asaha>
Assignee: Jon Orris <jorris>
QA Contact: tools-bugs <tools-bugs>
Docs Contact:
CC: dcallagh, dowang, ebaak, jorris, jtluka, mjia, rjoost
Target Milestone: 24.0
Target Release: ---
Keywords: Patch
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-21 18:49:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 972409
Description
Amit Saha
2015-05-26 02:31:31 UTC
I think what we need to do is to rebuild the whole page using the "pageable JSON collections" API, similar to the pools page. Before pointing this page at Backgrid, we need to convert the Task Library page first.

(In reply to matt jia from comment #1)
> I think what we need to do is to rebuild the whole page by using "pageable
> JSON collections" API similar to pools page.

Longer term that is the right approach, but we need to find a short-term fix to make the query runnable without producing a massive temp table. The problem is the first query issued by @paginate, which tries to count all recipe tasks:

SELECT count(*) AS count_1
FROM (SELECT recipe_task.id AS recipe_task_id,
             recipe_task.recipe_id AS recipe_task_recipe_id,
             recipe_task.name AS recipe_task_name,
             recipe_task.version AS recipe_task_version,
             recipe_task.fetch_url AS recipe_task_fetch_url,
             recipe_task.fetch_subdir AS recipe_task_fetch_subdir,
             recipe_task.task_id AS recipe_task_task_id,
             recipe_task.start_time AS recipe_task_start_time,
             recipe_task.finish_time AS recipe_task_finish_time,
             recipe_task.result AS recipe_task_result,
             recipe_task.status AS recipe_task_status,
             recipe_task.role AS recipe_task_role
      FROM recipe_task
      WHERE EXISTS (SELECT 1 FROM recipe
                    WHERE recipe.id = recipe_task.recipe_id
                      AND EXISTS (SELECT 1 FROM recipe_set
                                  WHERE recipe_set.id = recipe.recipe_set_id
                                    AND EXISTS (SELECT 1 FROM job
                                                WHERE job.id = recipe_set.job_id
                                                  AND job.to_delete IS NULL
                                                  AND job.deleted IS NULL)))) AS anon_1

We only noticed this after the recent hosting migration for Red Hat's Beaker instance, but there is actually no way this could ever have completed in a reasonable amount of time. The real problem is the SELECT count(*) FROM (SELECT ...) construct, which has caused us grief before.
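The shape of the problem can be reproduced in miniature. The sketch below uses a toy SQLite schema (table and column names mirror Beaker's, but the rows and schema are invented for illustration, and SQLite's optimiser behaves differently from MySQL's): counting over the fully-selected derived table and counting the base table directly return the same answer, which is why the materialised temp table is pure overhead.

```python
import sqlite3

# Toy stand-ins for the real Beaker tables; data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY, to_delete TEXT, deleted TEXT);
CREATE TABLE recipe_set (id INTEGER PRIMARY KEY, job_id INTEGER);
CREATE TABLE recipe (id INTEGER PRIMARY KEY, recipe_set_id INTEGER);
CREATE TABLE recipe_task (id INTEGER PRIMARY KEY, recipe_id INTEGER);
INSERT INTO job VALUES (1, NULL, NULL), (2, 'x', 'x');
INSERT INTO recipe_set VALUES (1, 1), (2, 2);
INSERT INTO recipe VALUES (1, 1), (2, 2);
INSERT INTO recipe_task VALUES (1, 1), (2, 1), (3, 2);
""")

FILTER = """
    WHERE EXISTS (SELECT 1 FROM recipe
                  WHERE recipe.id = recipe_task.recipe_id
                    AND EXISTS (SELECT 1 FROM recipe_set
                                WHERE recipe_set.id = recipe.recipe_set_id
                                  AND EXISTS (SELECT 1 FROM job
                                              WHERE job.id = recipe_set.job_id
                                                AND job.to_delete IS NULL
                                                AND job.deleted IS NULL)))
"""

# What TurboGears @paginate effectively emits: COUNT over a derived table
# that selects every column of every matching row.
wrapped = conn.execute(
    "SELECT COUNT(*) FROM (SELECT recipe_task.* FROM recipe_task"
    + FILTER + ") AS anon_1").fetchone()[0]

# The flattened form: COUNT applied directly, no derived table to materialise.
flat = conn.execute(
    "SELECT COUNT(*) FROM recipe_task" + FILTER).fetchone()[0]

# Both count only the two tasks whose job is not (being) deleted.
assert wrapped == flat == 2
```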
It was a change in SQLAlchemy 0.7:

http://docs.sqlalchemy.org/en/latest/changelog/migration_07.html#query-count-simplified-should-work-virtually-always

As noted in the SQLAlchemy changelog, there is a fairly simple workaround available (query for func.count() instead of calling .count()) but, as in bug 1101817, we have no control over the .count() invocation because it happens inside TurboGears @paginate. So there is no simple fix available; our only option is to reimplement the page avoiding TurboGears @paginate (as we have already done with the activity grid, for example).

We did already replace the Executed Tasks tab on the system page with a Backgrid, and were planning to do the same for the task page and the Executed Tasks page for bug 972409. However, we need to be a little more careful -- the current Backgrid implementation does not show any results or logs, only the overall status/result for the whole recipe task. For the Executed Tasks tab on the system page we decided that was sufficient, because the most common use case is just searching through the history of what was run on the system; the user can then click on the T:___ link to see the full results.

However, it seems some users may be relying on the Executed Tasks tab as a way to view detailed results. Piwik stats show quite a few references to URLs like /tasks/executed?recipe_task_id=123, which presents a fast and convenient view of the detailed results for a single task. This is actually nicer than linking to /recipes/123#task456. Not sure if these URLs are being constructed by hand or are linked from somewhere like the job matrix; need to double check.

Our current plan is to redirect URLs with a single recipe_task_id= parameter (like /tasks/executed?recipe_task_id=123) to the recipe page, after it is redesigned for Beaker 22, and then convert the Executed Tasks page to Backgrid in the same way as the Executed Tasks tab on the system page (no task results).
This bug, and the executed tasks tab in general, need some more thought... Ideally we would eliminate it in favour of the job matrix, or the recipe page, or both somehow. But we're not clear on that yet.

The problem is that we can't simply remove the grid, because users are relying on it and there are no better alternatives. Third-party tools are producing links to this grid which are embedded in TCMS test runs. If we come up with better alternatives to the grid, we will most likely see a drop in the usage of this view.

We need to find answers to some questions about user behaviour:

* What do people use the executed tasks grid for, and what goals are they trying to achieve with it?
* Maybe use the Piwik stats to see what it is used for.

(In reply to Dan Callaghan from comment #8)
> This bug, and the executed tasks tab in general, need some more thought...
> Ideally we would eliminate it in favour of the job matrix, or the recipe
> page, or both somehow. But we're not clear on that yet.

As far as the count(*) problem goes, I think we can do something about it. Essentially it needs to compute:

SELECT COUNT(*)
FROM recipe_task
INNER JOIN recipe ON recipe.id = recipe_task.recipe_id
INNER JOIN recipe_set ON recipe_set.id = recipe.recipe_set_id
INNER JOIN job ON job.id = recipe_set.job_id
WHERE job.to_delete IS NULL AND job.deleted IS NULL

which is far simpler than the query in comment 4. I can come close to this query by rearranging the base SQLAlchemy ORM query to:

tasks = RecipeTask.query.join(Recipe, RecipeTask.recipe_id == Recipe.id)\
    .join(RecipeSet, Recipe.recipe_set_id == RecipeSet.id)\
    .join(Job, RecipeSet.job_id == Job.id)\
    .filter(text('recipe_task.recipe_id = recipe.id '
                 'and recipe.recipe_set_id = recipe_set.id '
                 'and recipe_set.job_id = job.id '
                 'and job.to_delete is null and job.deleted is null'))\
    .options(joinedload(RecipeTask.task),
             joinedload(RecipeTask.results).joinedload(RecipeTaskResult.logs))

tasks.count() returns the same number of rows, and in far less time than the existing setup.
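As a sanity check on that rearrangement, here is a small sketch (toy SQLite schema and invented rows, not the real Beaker database or the real ORM query) showing that the flat INNER JOIN count agrees with the nested-EXISTS count the ORM was generating. The equivalence relies on recipe_task.recipe_id referencing a unique recipe.id, so the joins cannot duplicate rows.

```python
import sqlite3

# Toy stand-ins for the real Beaker tables; data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY, to_delete TEXT, deleted TEXT);
CREATE TABLE recipe_set (id INTEGER PRIMARY KEY, job_id INTEGER);
CREATE TABLE recipe (id INTEGER PRIMARY KEY, recipe_set_id INTEGER);
CREATE TABLE recipe_task (id INTEGER PRIMARY KEY, recipe_id INTEGER);
INSERT INTO job VALUES (1, NULL, NULL), (2, 'x', 'x');
INSERT INTO recipe_set VALUES (1, 1), (2, 2);
INSERT INTO recipe VALUES (1, 1), (2, 2);
INSERT INTO recipe_task VALUES (1, 1), (2, 1), (3, 2), (4, 1);
""")

# The nested-EXISTS form the original ORM query produced.
exists_count = conn.execute("""
    SELECT COUNT(*) FROM recipe_task
    WHERE EXISTS (SELECT 1 FROM recipe
                  WHERE recipe.id = recipe_task.recipe_id
                    AND EXISTS (SELECT 1 FROM recipe_set
                                WHERE recipe_set.id = recipe.recipe_set_id
                                  AND EXISTS (SELECT 1 FROM job
                                              WHERE job.id = recipe_set.job_id
                                                AND job.to_delete IS NULL
                                                AND job.deleted IS NULL)))
""").fetchone()[0]

# The flat INNER JOIN form from comment 10. Because each recipe_task has
# exactly one recipe (and so on up the chain), the joins are 1:1 and the
# counts must match.
join_count = conn.execute("""
    SELECT COUNT(*)
    FROM recipe_task
    INNER JOIN recipe ON recipe.id = recipe_task.recipe_id
    INNER JOIN recipe_set ON recipe_set.id = recipe.recipe_set_id
    INNER JOIN job ON job.id = recipe_set.job_id
    WHERE job.to_delete IS NULL AND job.deleted IS NULL
""").fetchone()[0]

assert exists_count == join_count == 3
```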
It consistently cuts 20-30 seconds out of the response time for the web UI.

NOTE: Not tested using query parameters in the UI, so this option may require further changes to the query code that follows, or may be outright impractical.

There are two other issues with query speed:

* The default 'order by recipe_task.id desc' causes temp table and filesort usage. There is no easy way around that, as that's the sane ordering.
* There are a number of N+1 queries caused by accessing attributes of RecipeTask in the web UI. In theory these can be improved by joinedload & contains_eager usage, but then the 'order by' makes the query significantly worse, eating up 20+ gigs before / ran out of room.

A bit of further investigation reveals that a foreign key index is having a significant negative impact on performance: specifically, the recipe_id foreign key on recipe_task => recipe. With this foreign key in place, the fundamental query that fetches results has the following query plan:

id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 30 | Using filesort
1 | PRIMARY | task_1 | eq_ref | PRIMARY | PRIMARY | 4 | anon_1.recipe_task_task_id | 1 | Using where
1 | PRIMARY | recipe_task_result_1 | ref | recipe_task_id | recipe_task_id | 5 | anon_1.recipe_task_id | 2 |
1 | PRIMARY | log_recipe_task_result_1 | ref | recipe_task_result_id_id,recipe_task_result_id | recipe_task_result_id_id | 4 | beaker.recipe_task_result_1.id | 1 | Using where
2 | DERIVED | job | index_merge | PRIMARY,ix_job_deleted,ix_job_to_delete | ix_job_deleted,ix_job_to_delete | 9,9 | NULL | 223678 | Using intersect(ix_job_deleted,ix_job_to_delete); Using where; Using index; Using temporary; Using filesort
2 | DERIVED | recipe_set | ref | PRIMARY,job_id | job_id | 4 | beaker.job.id | 1 | Using index
2 | DERIVED | recipe | ref | PRIMARY,recipe_set_id | recipe_set_id | 4 | beaker.recipe_set.id | 1 | Using index
2 | DERIVED | recipe_task | ref | recipe_id_fk | recipe_id_fk | 4 | beaker.recipe.id | 6 |

Removing the foreign key & index results in the following query plan:

id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 30 | Using filesort
1 | PRIMARY | task_1 | eq_ref | PRIMARY | PRIMARY | 4 | anon_1.recipe_task_task_id | 1 | Using where
1 | PRIMARY | recipe_task_result_1 | ref | recipe_task_id | recipe_task_id | 5 | anon_1.recipe_task_id | 2 |
1 | PRIMARY | log_recipe_task_result_1 | ref | recipe_task_result_id_id,recipe_task_result_id | recipe_task_result_id_id | 4 | beaker.recipe_task_result_1.id | 1 | Using where
2 | DERIVED | recipe_task | index | NULL | PRIMARY | 4 | NULL | 30 |
2 | DERIVED | recipe | eq_ref | PRIMARY,recipe_set_id | PRIMARY | 4 | beaker.recipe_task.recipe_id | 1 |
2 | DERIVED | recipe_set | eq_ref | PRIMARY,job_id | PRIMARY | 4 | beaker.recipe.recipe_set_id | 1 |
2 | DERIVED | job | eq_ref | PRIMARY,ix_job_deleted,ix_job_to_delete | PRIMARY | 4 | beaker.recipe_set.job_id | 1 | Using where

The end result is that the /tasks/executed page takes around 2 minutes to load with the database as is, while removing the foreign key drops it to around 20 seconds. While I'm ordinarily a big fan of referential constraints, that's a huge difference.

Query for reference:

explain SELECT anon_1.recipe_task_id AS anon_1_recipe_task_id, anon_1.recipe_task_recipe_id AS anon_1_recipe_task_recipe_id,
    anon_1.recipe_task_name AS anon_1_recipe_task_name, anon_1.recipe_task_version AS anon_1_recipe_task_version,
    anon_1.recipe_task_fetch_url AS anon_1_recipe_task_fetch_url, anon_1.recipe_task_fetch_subdir AS anon_1_recipe_task_fetch_subdir,
    anon_1.recipe_task_task_id AS anon_1_recipe_task_task_id, anon_1.recipe_task_start_time AS anon_1_recipe_task_start_time,
    anon_1.recipe_task_finish_time AS anon_1_recipe_task_finish_time, anon_1.recipe_task_result AS anon_1_recipe_task_result,
    anon_1.recipe_task_status AS anon_1_recipe_task_status, anon_1.recipe_task_role AS anon_1_recipe_task_role,
    task_1.id AS task_1_id, task_1.name AS task_1_name, task_1.rpm AS task_1_rpm, task_1.path AS task_1_path,
    task_1.description AS task_1_description, task_1.repo AS task_1_repo, task_1.avg_time AS task_1_avg_time,
    task_1.destructive AS task_1_destructive, task_1.nda AS task_1_nda, task_1.creation_date AS task_1_creation_date,
    task_1.update_date AS task_1_update_date, task_1.uploader_id AS task_1_uploader_id, task_1.owner AS task_1_owner,
    task_1.version AS task_1_version, task_1.license AS task_1_license, task_1.priority AS task_1_priority,
    task_1.valid AS task_1_valid,
    log_recipe_task_result_1.id AS log_recipe_task_result_1_id, log_recipe_task_result_1.path AS log_recipe_task_result_1_path,
    log_recipe_task_result_1.filename AS log_recipe_task_result_1_filename,
    log_recipe_task_result_1.start_time AS log_recipe_task_result_1_start_time,
    log_recipe_task_result_1.server AS log_recipe_task_result_1_server,
    log_recipe_task_result_1.basepath AS log_recipe_task_result_1_basepath,
    log_recipe_task_result_1.recipe_task_result_id AS log_recipe_task_result_1_recipe_task_result_id,
    recipe_task_result_1.id AS recipe_task_result_1_id, recipe_task_result_1.recipe_task_id AS recipe_task_result_1_recipe_task_id,
    recipe_task_result_1.path AS recipe_task_result_1_path, recipe_task_result_1.result AS recipe_task_result_1_result,
    recipe_task_result_1.score AS recipe_task_result_1_score, recipe_task_result_1.log AS recipe_task_result_1_log,
    recipe_task_result_1.start_time AS recipe_task_result_1_start_time
FROM (SELECT recipe_task.id AS recipe_task_id, recipe_task.recipe_id AS recipe_task_recipe_id,
        recipe_task.name AS recipe_task_name, recipe_task.version AS recipe_task_version,
        recipe_task.fetch_url AS recipe_task_fetch_url, recipe_task.fetch_subdir AS recipe_task_fetch_subdir,
        recipe_task.task_id AS recipe_task_task_id, recipe_task.start_time AS recipe_task_start_time,
        recipe_task.finish_time AS recipe_task_finish_time, recipe_task.result AS recipe_task_result,
        recipe_task.status AS recipe_task_status, recipe_task.role AS recipe_task_role
    FROM recipe_task
    INNER JOIN recipe ON recipe_task.recipe_id = recipe.id
    INNER JOIN recipe_set ON recipe.recipe_set_id = recipe_set.id
    INNER JOIN job ON recipe_set.job_id = job.id
    WHERE recipe_task.recipe_id = recipe.id AND recipe.recipe_set_id = recipe_set.id
        AND recipe_set.job_id = job.id AND job.to_delete IS NULL AND job.deleted IS NULL
    ORDER BY recipe_task.id DESC
    LIMIT 30) AS anon_1
LEFT OUTER JOIN task AS task_1 ON task_1.id = anon_1.recipe_task_task_id
LEFT OUTER JOIN recipe_task_result AS recipe_task_result_1 ON anon_1.recipe_task_id = recipe_task_result_1.recipe_task_id
LEFT OUTER JOIN log_recipe_task_result AS log_recipe_task_result_1 ON recipe_task_result_1.id = log_recipe_task_result_1.recipe_task_result_id
ORDER BY anon_1.recipe_task_id DESC;

Yeah, the frustrating thing with MySQL is that once queries get beyond a certain complexity, its query planner seems to start making random guesses which are very often wrong. You can tweak the query in various ways to try to trick it into picking the "right" query plan (certainly I have spent a lot of my time doing that for various bugs in the past), but it's usually not a very productive approach, because the slightest tweak further down the track, or even a change in the relative sizes of some tables, can throw everything out of whack again. Usually, if we are hitting some badly optimised query, the best approach is to find ways to simplify it.

I definitely *don't* think that dropping the FK on recipe_task.recipe_id, just to optimise this query, is a good option.

Seeing this, I wonder if we could improve things by getting job.to_delete or job.deleted out of the WHERE clause (if job.deleted is not null, job.to_delete should also be not NULL):

2 | DERIVED | job | index_merge | PRIMARY,ix_job_deleted,ix_job_to_delete | ix_job_deleted,ix_job_to_delete | 9,9 | NULL | 223678 | Using intersect(ix_job_deleted,ix_job_to_delete); Using where; Using index; Using temporary; Using filesort

But there are other issues going on here too. Back when I last looked at this, one of the biggest problems was the SELECT COUNT(*) FROM (SELECT ...) problem as described in comment 4. That's not something we can fix until we get it away from TG @paginate.
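The suggestion above, dropping one of the two NULL checks, rests on the invariant that job.deleted is only ever set when job.to_delete is also set. Under that invariant, filtering on job.to_delete IS NULL alone returns the same rows (the contrapositive: to_delete IS NULL implies deleted IS NULL), which would let MySQL use a single index instead of the index_merge intersect. A hedged sketch with toy SQLite data where the invariant holds (it is an assumption here, not something enforced by a schema constraint):

```python
import sqlite3

# Toy job table; rows respect the assumed invariant:
# deleted IS NOT NULL only when to_delete IS NOT NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (id INTEGER PRIMARY KEY, to_delete TEXT, deleted TEXT);
INSERT INTO job VALUES
    (1, NULL, NULL),          -- live job
    (2, '2015-05-26', NULL),  -- marked for deletion, not yet deleted
    (3, '2015-05-26', '2015-05-27');  -- fully deleted
""")

# Original predicate: both columns checked.
both = conn.execute(
    "SELECT COUNT(*) FROM job "
    "WHERE to_delete IS NULL AND deleted IS NULL").fetchone()[0]

# Simplified predicate: under the invariant, checking to_delete alone
# selects exactly the same rows.
single = conn.execute(
    "SELECT COUNT(*) FROM job WHERE to_delete IS NULL").fetchone()[0]

assert both == single == 1
```

Whether this actually helps depends on the planner; as noted in the follow-up comment, in practice it made no difference to query time here.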
The other suspicious thing is what you noted above in comment 10: the nested EXISTS clauses vs. the more obvious direct join. Ironically, that code was already changed *from* the joins to the EXISTS version for performance reasons:

http://gerrit.beaker-project.org/1267

although looking back at it now, I'm not sure that it was actually necessary or desired as part of that bug fix...

(In reply to Dan Callaghan from comment #12)
> I definitely *don't* think that dropping the FK on recipe_task.recipe_id,
> just to optimise this query, is a good option.

Oh, I agree, though my phrasing may have suggested otherwise. I'm more astonished at how much of a difference it makes, though I shouldn't be, given my experience with MySQL.

> Seeing this, I wonder if we could improve things by getting job.to_delete or
> job.deleted out of the WHERE clause (if job.deleted is not null,
> job.to_delete should also be not NULL):

I did try this, but it doesn't make any difference in query time.

> But there are other issues going on here too. Back when I last looked at
> this one of the biggest problems was the SELECT COUNT(*) FROM (SELECT...)
> problem as described in comment 4. That's not something we can fix until we
> get it away from TG @paginate.

Well, we can't fix the select count(*), but as I noted in comment 10 we may be able to fix the large temp table consequence of it. If the rearranged 'tasks = RecipeTask.query' works properly with the other filter parameters, we can at least eliminate the problem of filling up /tmp and get a modest speed improvement.

> The other suspicious thing is what you noted above in comment 10, the nested
> EXISTS clauses vs. the more obvious direct join. Ironically, that code was
> already changed *from* the joins to the EXISTS version for performance
> reasons:
>
> http://gerrit.beaker-project.org/1267
>
> although looking back at it now, I'm not sure that it was actually necessary
> or desired as part of that bug fix...

Beaker 24.0 has been released.