Bug 1224848

Summary:	Executed tasks page causes a query which produces a very large temp table, causing MySQL to fill up /tmp
Product:	[Retired] Beaker	Reporter:	Amit Saha <asaha>
Component:	general	Assignee:	Jon Orris <jorris>
Status:	CLOSED CURRENTRELEASE	QA Contact:	tools-bugs <tools-bugs>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	20	CC:	dcallagh, dowang, ebaak, jorris, jtluka, mjia, rjoost
Target Milestone:	24.0	Keywords:	Patch
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-02-21 18:49:00 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	972409

Description Amit Saha 2015-05-26 02:31:31 UTC

Description of problem:


During the migration to new hardware and apparently less disk space, it became apparent that calling https://<beaker-server-url>/tasks/executed results in the 500 ISE. In the backend, it's filling up disk space (close to 151G space for the temporary table).

May be the fix for bug 835179 needs to be considered again in this light.


Version-Release number of selected component (if applicable):


How reproducible:



Steps to Reproduce:
1. https://<beaker-server-url>/tasks/executed
2. 
3.

Actual results:


Expected results:


Additional info:

Comment 1 matt jia 2015-05-27 01:29:24 UTC

I think what we need to do is to rebuild the whole page by using “pageable JSON collections” API similar to pools page.

Comment 2 matt jia 2015-06-04 04:16:16 UTC

Before pointing this page to Backgrid, we need to point the Task Library page first.

Comment 3 Dan Callaghan 2015-06-09 01:07:23 UTC

(In reply to matt jia from comment #1)
> I think what we need to do is to rebuild the whole page by using “pageable
> JSON collections” API similar to pools page.

Longer term that is the right approach, but we need to find a short term fix to make the query runnable without producing a massive temp table.

Comment 4 Dan Callaghan 2015-06-10 07:58:58 UTC

The problem is the first query issued by @paginate which tries to count all recipe tasks:

SELECT count(*) AS count_1 
FROM (SELECT recipe_task.id AS recipe_task_id, recipe_task.recipe_id AS recipe_task_recipe_id, recipe_task.name AS recipe_task_name, recipe_task.version AS recipe_task_version, recipe_task.fetch_url AS recipe_task_fetch_url, recipe_task.fetch_subdir AS recipe_task_fetch_subdir, recipe_task.task_id AS recipe_task_task_id, recipe_task.start_time AS recipe_task_start_time, recipe_task.finish_time AS recipe_task_finish_time, recipe_task.result AS recipe_task_result, recipe_task.status AS recipe_task_status, recipe_task.role AS recipe_task_role 
FROM recipe_task 
WHERE EXISTS (SELECT 1 
FROM recipe 
WHERE recipe.id = recipe_task.recipe_id AND (EXISTS (SELECT 1 
FROM recipe_set 
WHERE recipe_set.id = recipe.recipe_set_id AND (EXISTS (SELECT 1 
FROM job 
WHERE job.id = recipe_set.job_id AND job.to_delete IS NULL AND job.deleted IS NULL)))))) AS anon_1

We only noticed this after the recent hosting migration for Red Hat's Beaker instance but there is actually no way this could ever have completed in a reasonable amount of time.

The real problem is the SELECT count(*) FROM (SELECT ...) construct, which has caused us grief before. It was a change in SQLAlchemy 0.7:

http://docs.sqlalchemy.org/en/latest/changelog/migration_07.html#query-count-simplified-should-work-virtually-always

As noted in the SQLAlchemy changelog, there is a fairly simple workaround available (query for func.count() instead of calling .count) but, as in bug 1101817, we have no control over the .count() invocation because it happens inside TurboGears @paginate.

So there is no simple fix available, our only option is to reimplement the page avoiding TurboGears @paginate (as we have already done with the activity grid for example).

Comment 5 Dan Callaghan 2015-06-10 08:04:01 UTC

We did already replace the Executed Tasks tab on the system page with a Backgrid, and were planning to do the same for the task page and the Executed Tasks page for bug 972409. However we need to be a little more careful -- the current Backgrid implementation does not show any results or logs, only the overall status/result for the whole recipe task.

For the Executed Tasks tab on the system page we decided that was sufficient because the most common use case is just searching through the history of what was run on the system. The user can then click on the T:___ link to see the full results.

However it seems like some users may be relying on the Executed Tasks tab as a way to view detailed results. Piwik stats show quite a few references to URLs like /tasks/executed?recipe_task_id=123 which presents quite a fast and convenient view of the detailed results for a single task. This is actually nicer than linking to /recipes/123#task456. Not sure if these URLs are being constructed by hand or are linked to from somewhere like job matrix, need to double check.

Comment 6 Dan Callaghan 2015-07-16 07:16:32 UTC

Our current plan is to redirect URLs with a single recipe_task_id= parameter (like /tasks/executed?recipe_task_id=123) to the recipe page, after it is redesigned for Beaker 22, and then convert the Executed Tasks page to Backgrid in the same way as the Executed Tasks tab on the system page (no task results).

Comment 8 Dan Callaghan 2016-04-05 05:14:49 UTC

This bug, and the executed tasks tab in general, need some more thought... Ideally we would eliminate it in favour of the job matrix, or the recipe page, or both somehow. But we're not clear on that yet.

Comment 9 Roman Joost 2016-10-10 23:43:33 UTC

The problem is that we can't remove the grid, because users are using this since there are not better alternatives. Third party tools is producing links to this grid which are embedded in TCMS test runs.

If would come up with better alternatives to the grid we will most likely see a drop in the usage of this view.

We need to find answers to questions in regards to user behaviour:

* What do people use the executed tasks grid for and what goals do they need to solve by using it
* Maybe use the Piwik stats to see what it is used for

Comment 10 Jon Orris 2017-01-05 23:51:29 UTC

(In reply to Dan Callaghan from comment #8)
> This bug, and the executed tasks tab in general, need some more thought...
> Ideally we would eliminate it in favour of the job matrix, or the recipe
> page, or both somehow. But we're not clear on that yet.

As far as the count(*) problem, I think we can do something about it. Essentially it needs to compute:

SELECT COUNT(*) FROM recipe_task
INNER JOIN recipe ON recipe.id = recipe_task.recipe_id
INNER JOIN recipe_set ON recipe_set.id = recipe.recipe_set_id
INNER JOIN job ON job.id = recipe_set.job_id WHERE (job.to_delete is null and job.deleted is null)

Which is far simpler than in comment 4. I can come close to this query by rearranging
the base sqlalchemy ORM to:

tasks = RecipeTask.query.join(Recipe, RecipeTask.recipe_id==Recipe.id)\
            .join(RecipeSet, Recipe.recipe_set_id==RecipeSet.id)\
            .join(Job, RecipeSet.job_id==Job.id)\
            .filter(text('recipe_task.recipe_id = recipe.id and recipe.recipe_set_id = recipe_set.id and recipe_set.job_id = job.id and job.to_delete is null and job.deleted is null'))\
            .options(joinedload(RecipeTask.task),
                     joinedload(RecipeTask.results).joinedload(RecipeTaskResult.logs))

tasks.count() returns the same # of rows, and in far less time than the existing setup.
It consistently cuts 20-30 seconds out of the response time for the webui.

NOTE: Not tested using query parameters in the UI, so this option may require further changes to the query code that follows, or may be outright impractical.

There are two other issues with query speed:
 * The default 'order by recipe_task.id desc' causes temp table and filesort usage. Not an easy way around that, as that's the sane ordering.
 
 * There are a number of N+1 queries caused by accessing attributes of RecipeTask in the webui. In theory, these can be improved by joinedload & contains_eager usage, but then 'order by' makes the query significantly worse, eating up 20+ gigs before / ran out of room.

Comment 11 Jon Orris 2017-01-08 22:31:44 UTC

A bit of further investigation reveals that a foreign key index is having significant negative impact on performance. Specifically the recipie_id foreign key on recipie_task => recipe. With this foreign key in place, the fundamental query that fetches results has the following query plan:



+------+-------------+--------------------------+-------------+------------------------------------------------+---------------------------------+---------+--------------------------------+--------+-------------------------------------------------------------------------------------------------------------+
| id   | select_type | table                    | type        | possible_keys                                  | key                             | key_len | ref                            | rows   | Extra                                                                                                       |
+------+-------------+--------------------------+-------------+------------------------------------------------+---------------------------------+---------+--------------------------------+--------+-------------------------------------------------------------------------------------------------------------+
|    1 | PRIMARY     | <derived2>               | ALL         | NULL                                           | NULL                            | NULL    | NULL                           |     30 | Using filesort                                                                                              |
|    1 | PRIMARY     | task_1                   | eq_ref      | PRIMARY                                        | PRIMARY                         | 4       | anon_1.recipe_task_task_id     |      1 | Using where                                                                                                 |
|    1 | PRIMARY     | recipe_task_result_1     | ref         | recipe_task_id                                 | recipe_task_id                  | 5       | anon_1.recipe_task_id          |      2 |                                                                                                             |
|    1 | PRIMARY     | log_recipe_task_result_1 | ref         | recipe_task_result_id_id,recipe_task_result_id | recipe_task_result_id_id        | 4       | beaker.recipe_task_result_1.id |      1 | Using where                                                                                                 |
|    2 | DERIVED     | job                      | index_merge | PRIMARY,ix_job_deleted,ix_job_to_delete        | ix_job_deleted,ix_job_to_delete | 9,9     | NULL                           | 223678 | Using intersect(ix_job_deleted,ix_job_to_delete); Using where; Using index; Using temporary; Using filesort |
|    2 | DERIVED     | recipe_set               | ref         | PRIMARY,job_id                                 | job_id                          | 4       | beaker.job.id                  |      1 | Using index                                                                                                 |
|    2 | DERIVED     | recipe                   | ref         | PRIMARY,recipe_set_id                          | recipe_set_id                   | 4       | beaker.recipe_set.id           |      1 | Using index                                                                                                 |
|    2 | DERIVED     | recipe_task              | ref         | recipe_id_fk                                   | recipe_id_fk                    | 4       | beaker.recipe.id               |      6 |                                                                                                             |
+------+-------------+--------------------------+-------------+------------------------------------------------+---------------------------------+---------+--------------------------------+--------+-------------------------------------------------------------------------------------------------------------+


Removing the foreign key & index results in the following query plan:
+------+-------------+--------------------------+--------+------------------------------------------------+--------------------------+---------+--------------------------------+------+----------------+
| id   | select_type | table                    | type   | possible_keys                                  | key                      | key_len | ref                            | rows | Extra          |
+------+-------------+--------------------------+--------+------------------------------------------------+--------------------------+---------+--------------------------------+------+----------------+
|    1 | PRIMARY     | <derived2>               | ALL    | NULL                                           | NULL                     | NULL    | NULL                           |   30 | Using filesort |
|    1 | PRIMARY     | task_1                   | eq_ref | PRIMARY                                        | PRIMARY                  | 4       | anon_1.recipe_task_task_id     |    1 | Using where    |
|    1 | PRIMARY     | recipe_task_result_1     | ref    | recipe_task_id                                 | recipe_task_id           | 5       | anon_1.recipe_task_id          |    2 |                |
|    1 | PRIMARY     | log_recipe_task_result_1 | ref    | recipe_task_result_id_id,recipe_task_result_id | recipe_task_result_id_id | 4       | beaker.recipe_task_result_1.id |    1 | Using where    |
|    2 | DERIVED     | recipe_task              | index  | NULL                                           | PRIMARY                  | 4       | NULL                           |   30 |                |
|    2 | DERIVED     | recipe                   | eq_ref | PRIMARY,recipe_set_id                          | PRIMARY                  | 4       | beaker.recipe_task.recipe_id   |    1 |                |
|    2 | DERIVED     | recipe_set               | eq_ref | PRIMARY,job_id                                 | PRIMARY                  | 4       | beaker.recipe.recipe_set_id    |    1 |                |
|    2 | DERIVED     | job                      | eq_ref | PRIMARY,ix_job_deleted,ix_job_to_delete        | PRIMARY                  | 4       | beaker.recipe_set.job_id       |    1 | Using where    |
+------+-------------+--------------------------+--------+------------------------------------------------+--------------------------+---------+--------------------------------+------+----------------+


The end result is that the /tasks/executed page takes around 2 minutes to load with the database as is, while removing the foreign key drops it to around 20 seconds. While I'm ordinarily a big fan of referential constraints, that's a huge difference.

Query for reference:

explain SELECT anon_1.recipe_task_id AS anon_1_recipe_task_id,
 anon_1.recipe_task_recipe_id AS anon_1_recipe_task_recipe_id,
 anon_1.recipe_task_name AS anon_1_recipe_task_name,
 anon_1.recipe_task_version AS anon_1_recipe_task_version,
 anon_1.recipe_task_fetch_url AS anon_1_recipe_task_fetch_url,
 anon_1.recipe_task_fetch_subdir AS anon_1_recipe_task_fetch_subdir,
 anon_1.recipe_task_task_id AS anon_1_recipe_task_task_id,
 anon_1.recipe_task_start_time AS anon_1_recipe_task_start_time,
 anon_1.recipe_task_finish_time AS anon_1_recipe_task_finish_time,
 anon_1.recipe_task_result AS anon_1_recipe_task_result,
 anon_1.recipe_task_status AS anon_1_recipe_task_status,
 anon_1.recipe_task_role AS anon_1_recipe_task_role,
 task_1.id AS task_1_id,
 task_1.name AS task_1_name,
 task_1.rpm AS task_1_rpm,
 task_1.path AS task_1_path,
 task_1.description AS task_1_description,
 task_1.repo AS task_1_repo,
 task_1.avg_time AS task_1_avg_time,
 task_1.destructive AS task_1_destructive,
 task_1.nda AS task_1_nda,
 task_1.creation_date AS task_1_creation_date,
 task_1.update_date AS task_1_update_date,
 task_1.uploader_id AS task_1_uploader_id,
 task_1.owner AS task_1_owner,
 task_1.version AS task_1_version,
 task_1.license AS task_1_license,
 task_1.priority AS task_1_priority,
 task_1.valid AS task_1_valid,
 log_recipe_task_result_1.id AS log_recipe_task_result_1_id,
 log_recipe_task_result_1.path AS log_recipe_task_result_1_path,
 log_recipe_task_result_1.filename AS log_recipe_task_result_1_filename,
 log_recipe_task_result_1.start_time AS log_recipe_task_result_1_start_time,
 log_recipe_task_result_1.server AS log_recipe_task_result_1_server,
 log_recipe_task_result_1.basepath AS log_recipe_task_result_1_basepath,
 log_recipe_task_result_1.recipe_task_result_id AS log_recipe_task_result_1_recipe_task_result_id,
 recipe_task_result_1.id AS recipe_task_result_1_id,
 recipe_task_result_1.recipe_task_id AS recipe_task_result_1_recipe_task_id,
 recipe_task_result_1.path AS recipe_task_result_1_path,
 recipe_task_result_1.result AS recipe_task_result_1_result,
 recipe_task_result_1.score AS recipe_task_result_1_score,
 recipe_task_result_1.log AS recipe_task_result_1_log,
 recipe_task_result_1.start_time AS recipe_task_result_1_start_time  FROM (SELECT recipe_task.id AS recipe_task_id,
 recipe_task.recipe_id AS recipe_task_recipe_id,
 recipe_task.name AS recipe_task_name,
 recipe_task.version AS recipe_task_version,
 recipe_task.fetch_url AS recipe_task_fetch_url,
 recipe_task.fetch_subdir AS recipe_task_fetch_subdir,
 recipe_task.task_id AS recipe_task_task_id,
 recipe_task.start_time AS recipe_task_start_time,
 recipe_task.finish_time AS recipe_task_finish_time,
 recipe_task.result AS recipe_task_result,
 recipe_task.status AS recipe_task_status,
 recipe_task.role AS recipe_task_role  FROM recipe_task INNER JOIN recipe ON recipe_task.recipe_id = recipe.id INNER JOIN recipe_set ON recipe.recipe_set_id = recipe_set.id INNER JOIN job ON recipe_set.job_id = job.id  WHERE recipe_task.recipe_id = recipe.id and recipe.recipe_set_id = recipe_set.id and recipe_set.job_id = job.id and job.to_delete is null and job.deleted is null ORDER BY recipe_task.id DESC   LIMIT 30) AS anon_1 LEFT OUTER JOIN task AS task_1 ON task_1.id = anon_1.recipe_task_task_id LEFT OUTER JOIN recipe_task_result AS recipe_task_result_1 ON anon_1.recipe_task_id = recipe_task_result_1.recipe_task_id LEFT OUTER JOIN log_recipe_task_result AS log_recipe_task_result_1 ON recipe_task_result_1.id = log_recipe_task_result_1.recipe_task_result_id ORDER BY anon_1.recipe_task_id DESC;

Comment 12 Dan Callaghan 2017-01-09 07:10:52 UTC

Yeah, the frustrating thing with MySQL is that once the queries get beyond a certain complexity its query planner seems to just start making some random guesses which are very often wrong. You can tweak the query in various ways to try and trick it into picking the "right" query plan (certainly I have spent a lot of my time doing that for various bugs in the past) but it's usually not a very productive approach because the slightest tweak further down the track, or even a change in the relative sizes of some tables, can throw everything out of whack again.

Usually if we are hitting some badly optimised query the best approach is to just try to find ways to simplify it.

I definitely *don't* think that dropping the FK on recipe_task.recipe_id, just to optimise this query, is a good option.

Seeing this, I wonder if we could improve things by getting job.to_delete or job.deleted out of the WHERE clause (if job.deleted is not null, job.to_delete should also be not NULL):

But there are other issues going on here too. Back when I last looked at this one of the biggest problems was the SELECT COUNT(*) FROM (SELECT...) problem as described in comment 4. That's not something we can fix until we get it away from TG @paginate.

The other suspicious thing is what you noted above in comment 10, the nested EXISTS clauses vs. the more obvious direct join. Ironically, that code was already changed *from* the joins to the EXISTS version for performance reasons:

http://gerrit.beaker-project.org/1267

although looking back at it now, I'm not sure that it was actually necessary or desired as part of that bug fix...

Comment 13 Jon Orris 2017-01-09 18:09:55 UTC

(In reply to Dan Callaghan from comment #12)
> I definitely *don't* think that dropping the FK on recipe_task.recipe_id,
> just to optimise this query, is a good option.

Oh, I agree, though my phrasing may have suggested otherwise. I'm more astonished at how much of a difference it makes, though I shouldn't be given my experience with MySQL.

 
> Seeing this, I wonder if we could improve things by getting job.to_delete or
> job.deleted out of the WHERE clause (if job.deleted is not null,
> job.to_delete should also be not NULL):

I did try this, but it doesn't make any difference in query time.

> But there are other issues going on here too. Back when I last looked at
> this one of the biggest problems was the SELECT COUNT(*) FROM (SELECT...)
> problem as described in comment 4. That's not something we can fix until we
> get it away from TG @paginate.

Well, we can't fix the select count(*), but as I noted in comment 10 we may be able to fix the large temptable consequence of it. If the rearranged 'tasks = RecipeTask.query' works properly with the other filter parameters, we can at least eliminate the problem of filling up /tmp and get a modest speed improvement.

 
> The other suspicious thing is what you noted above in comment 10, the nested
> EXISTS clauses vs. the more obvious direct join. Ironically, that code was
> already changed *from* the joins to the EXISTS version for performance
> reasons:
> 
> http://gerrit.beaker-project.org/1267
> 
> although looking back at it now, I'm not sure that it was actually necessary
> or desired as part of that bug fix...

Comment 14 Jon Orris 2017-01-10 20:56:31 UTC

https://gerrit.beaker-project.org/#/c/5574/

Comment 20 Dan Callaghan 2017-02-21 18:49:00 UTC

Beaker 24.0 has been released.