Bug 1097094

Summary: [SLOW QUERY] activity grid on system page
Product: [Retired] Beaker Reporter: Raymond Mancy <rmancy>
Component: generalAssignee: Dan Callaghan <dcallagh>
Status: CLOSED CURRENTRELEASE QA Contact: tools-bugs <tools-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 0.15CC: aigao, asaha, dcallagh, ebaak, rmancy, xma
Target Milestone: 0.17.1   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-07-18 08:17:41 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Raymond Mancy 2014-05-13 07:04:15 UTC
Here is output from mysqlsla:


______________________________________________________________________ 008 ___
Count         : 71  (0.71%)
Time          : 1485 s total, 20.915493 s avg, 11 s to 77 s max  (0.86%)
  95% of Time : 1270 s total, 18.955224 s avg, 11 s to 37 s max
Lock Time (s) : 0 total, 0 avg, 0 to 0 max  (0.00%)
  95% of Lock : 0 total, 0 avg, 0 to 0 max
Rows sent     : 13.56k avg, 1.74k to 29.19k max  (51.00%)
Rows examined : 54.23k avg, 6.97k to 116.77k max  (0.02%)
Database      : beaker
Users         : 
	beaker.com 10.16.101.67 : 100.00% (71) of query, 100.00% (10029) of all users

Query abstract:
SELECT activity.id AS activity_id, system_activity.id AS system_activity_id, activity.user_id AS activity_user_id, activity.created AS activity_created, activity.type AS activity_type, activity.field_name AS activity_field_name, activity.service AS activity_service, activity.action AS activity_action, activity.old_value AS activity_old_value, activity.new_value AS activity_new_value, system_activity.system_id AS system_activity_system_id FROM activity INNER JOIN system_activity ON activity.id = system_activity.id WHERE N = system_activity.system_id ORDER BY activity.created DESC, activity.id DESC;

Query sample:
SELECT activity.id AS activity_id, system_activity.id AS system_activity_id, activity.user_id AS activity_user_id, activity.created AS activity_created, activity.type AS activity_type, activity.field_name AS activity_field_name, activity.service AS activity_service, activity.action AS activity_action, activity.old_value AS activity_old_value, activity.new_value AS activity_new_value, system_activity.system_id AS system_activity_system_id 
FROM activity INNER JOIN system_activity ON activity.id = system_activity.id 
WHERE 964 = system_activity.system_id ORDER BY activity.created DESC, activity.id DESC;


For a system that has a lot of activity records, the returned data set could be massive and it seems that this is the main reason for the slowness. We should look at whether or not we need the full result set, and/or page the the results.

Comment 2 Raymond Mancy 2014-05-23 04:59:45 UTC
This query is the default query issued when loading the page. As I can't see it ever making sense to want to see all system activites for all systems, I propose we show no activity records on the initial load of the page. It may even make sense to do this for all of the activity records (and at least keep the behaviour consistent, at least between the activity record pages)

Comment 3 Nick Coghlan 2014-05-23 05:07:12 UTC
Yeah, all of the top level Activity->* pages look like they will have the same problem - "show everything" by default could result in some massive queries.

Probably not an urgent change (how often do folks use those pages over the activity tab on the system page?), but +1 for the general idea of having the Activity pages show something like "Enter search parameters above" rather than default to showing everything.

Comment 4 Dan Callaghan 2014-05-23 06:41:15 UTC
Why is a query for all activity records slower than the same query with a bunch of extra complicated WHERE clauses applied?

In both causes we should be doing LIMIT 0, 20 (or whatever) so it's not like we are ever fetching back a huge result set.

What exactly about the "show every activity" query that we do on first page load is slow? Are we accidentally triggering a temp table because of an ORDER BY? In that case we will have the same problem when extra filtering is applied as well, so we are better off solving the problem in all cases instead of just trying to hide by not showing anything on first page load.

Comment 5 Dan Callaghan 2014-05-23 06:42:00 UTC
Also I think you missed this part of the WHERE clause in comment 0:

WHERE N = system_activity.system_id

which indicates that this is probably the system activity grid on the system page, not the Activity -> Systems page.

Comment 6 Raymond Mancy 2014-05-28 03:28:18 UTC
(In reply to Dan Callaghan from comment #5)
> Also I think you missed this part of the WHERE clause in comment 0:
> 
> WHERE N = system_activity.system_id
> 
> which indicates that this is probably the system activity grid on the system
> page, not the Activity -> Systems page.

In comment 0 I made reference to the problem of loading all the records of a given system. And your're right, this would be the history tab on the system page. I was mistaken in comment 2, as it's a different query.

(In reply to Dan Callaghan from comment #4)
> Why is a query for all activity records slower than the same query with a
> bunch of extra complicated WHERE clauses applied?
> 

Depending on what the WHERE clause is, it could make a massive difference. See my comment at the bottom.

> In both causes we should be doing LIMIT 0, 20 (or whatever) so it's not like
> we are ever fetching back a huge result set.
> 
We are.

The problem is in turbogears.paginate, specifically line 189 where it issues a count() on the query object *before* it has applied the paginate (i.e LIMIT) to the query. I guess this is kind of logical as TG's paginate is not really about helping DB performance, but about displaying data in a certain way.


> What exactly about the "show every activity" query that we do on first page
> load is slow? Are we accidentally triggering a temp table because of an
> ORDER BY? 

Not on the "show every activity" page, but we are on the system page. On the "show every activity page", it's just the count() that is causing us the issue.

> In that case we will have the same problem when extra filtering is
> applied as well, so we are better off solving the problem in all cases
> instead of just trying to hide by not showing anything on first page load.

Following on from my earlier comment, depending on what you are filtering on, it can either completely change the nature of the query, could just make it much faster, or could do nothing (maybe even slower under certain conditions, but I haven't yet seen a search combo on the activity page that does this). 

Filtering on 'System/Name' will be a much more efficient query than say Filtering on the 'Via' field (as well as it being a more efficient query than filtering on nothing). It's a completely different kind of query that it does. Also, having things like 'Via' (i.e the service column) not indexed does not help. However we can't realistically speed up every possibly query. After all, we still allow 'contains' on these fields and they will always be inefficient.


So in summary there are two issues, one is that we are returning all of the activity records for a system on the *system* page, when we probably do not have to return all of these records, or we could just make it faster. This is what this BZ refers to (ignoring comment #2) One possible solution to this is look at our ORDER BY statement, which is currently ordering by activity.created DESC, activity.id DESC which is using filesort() and a temp table. I assume this is deliberate because we are generating our timestamp in python and so there is a potential race between the insert of rows and the timestamp, or am I missing something? If that's an issue we could change to default_server, if it is not we could perhaps just ORDER BY system_activity.id, and that would fix our filesort issues. 


There is also the issue of the system activity page itself (which is what comment #2 is referring to), however that's a separate bug that I've filed here BZ#1101817.

Comment 7 Dan Callaghan 2014-05-28 04:21:39 UTC
If changing it to ORDER BY system_activity.id DESC will eliminate the filesort, I think that is fine. The only reason why we would be sorting by created DESC, id DESC is to handle a theoretical database where ids are not monotonically increasing over time.

I guess the question is, is changing the ORDER BY enough to solve all the possible slow queries issued by the system activity grid?

Comment 8 Raymond Mancy 2014-05-28 06:11:08 UTC
(In reply to Dan Callaghan from comment #7)
> If changing it to ORDER BY system_activity.id DESC will eliminate the
> filesort, I think that is fine. The only reason why we would be sorting by
> created DESC, id DESC is to handle a theoretical database where ids are not
> monotonically increasing over time.
> 

Yeah, I definitely don't think that we have to worry about. I don't think postgresql would be an issue here either.

> I guess the question is, is changing the ORDER BY enough to solve all the
> possible slow queries issued by the system activity grid?

No, I don't think that it will fix anything by itself.

Actually, I think the cheap way out of this is to change our activity query.
Currently, if there are no search arguments for the system activity history
we do the following:

    system.activity[:150]

However, the slice doesn't actually add a LIMIT clause. Nor can I see anywhere in the docs that indicate it should. What I think we should be doing is:

   system.dyn_activity.limit(150)

We should probably also fix our ORDER BY statement, but I think the above would be the biggest bang for our buck.

As far as what happens when we do an actual filtered search via the 'History' tab, I don't think we should worry about that at this stage. As that isn't the real issue here (nor perhaps is it a real issue at all).

Comment 9 Dan Callaghan 2014-05-29 00:18:29 UTC
(In reply to Raymond Mancy from comment #8)
> Currently, if there are no search arguments for the system activity history
> we do the following:
> 
>     system.activity[:150]
> 
> However, the slice doesn't actually add a LIMIT clause. Nor can I see
> anywhere in the docs that indicate it should. What I think we should be
> doing is:
> 
>    system.dyn_activity.limit(150)

Oh wow, nicely spotted! I can't believe we never found this before now. This probably explains why the system page has always been so slow to load...

About slicing queries, I found this old thread:

https://groups.google.com/forum/#!topic/sqlalchemy-devel/C4g4rup-P_c

I'm pretty sure I remember query slicing being documented, at least on earlier SQLAlchemy versions. I guess the feature does/did exist (documented or not) but doesn't actually work the way you would expect. Maybe it was dropped from the docs for that reason.

> We should probably also fix our ORDER BY statement, but I think the above
> would be the biggest bang for our buck.

Yes, we can fix both of those things for this bug.

> As far as what happens when we do an actual filtered search via the
> 'History' tab, I don't think we should worry about that at this stage. As
> that isn't the real issue here (nor perhaps is it a real issue at all).

Yes agreed. I'm sure there are plenty of ways to produce an expensive query from the activity search but they should happen fairly rarely. I would hesitate to add any new indexes on there as well, because the activity table is written to much more often than it is searched, and it's also one of our biggest tables so just adding the index would be very expensive.

Comment 10 Dan Callaghan 2014-05-29 00:21:39 UTC
> >     system.activity[:150]

Oh... I wasn't reading this properly, I just noticed that it's not slicing a query at all, it's slicing the relationship list. So SQLAlchemy is never going to do any LIMIT for this.

I wonder if we should just delete that property entirely -- we should only ever be inserting new activity, or doing a paginated query on the existing activity. Fetching every single activity row is always going to be slow and we should never need to do it.

Comment 11 Raymond Mancy 2014-05-29 02:54:30 UTC
The current relationship between System and Activity orders by two columns, one of them SystemActivity.id. However in the SQL it uses activity_table.id for ordering. These are logically equivalent, but to gain efficiencies in the query we actually need to order by system_activity.id.

Comment 12 Dan Callaghan 2014-07-02 02:18:30 UTC
On Gerrit: http://gerrit.beaker-project.org/3179

Comment 14 Dan Callaghan 2014-07-04 00:06:19 UTC
For reference, this is an example of the problematic query produced when you visit a system page:

SELECT activity.id AS activity_id, system_activity.id AS system_activity_id, activity.user_id AS activity_user_id, activity.created AS activity_created, activity.type AS activity_type, activity.field_name AS activity_field_name, activity.service AS activity_service, activity.action AS activity_action, activity.old_value AS activity_old_value, activity.new_value AS activity_new_value, system_activity.system_id AS system_activity_system_id 
FROM activity INNER JOIN system_activity ON activity.id = system_activity.id 
WHERE 3395 = system_activity.system_id ORDER BY activity.created DESC, activity.id DESC;

Here is the query produced after this patch:

SELECT activity.id AS activity_id, system_activity.id AS system_activity_id, activity.user_id AS activity_user_id, activity.created AS activity_created, activity.type AS activity_type, activity.field_name AS activity_field_name, activity.service AS activity_service, activity.action AS activity_action, activity.old_value AS activity_old_value, activity.new_value AS activity_new_value, system_activity.system_id AS system_activity_system_id 
FROM activity INNER JOIN system_activity ON activity.id = system_activity.id 
WHERE 3395 = system_activity.system_id ORDER BY system_activity.id DESC, activity.created DESC 
 LIMIT 0, 30;

Comment 15 Dan Callaghan 2014-07-04 00:11:09 UTC
However, I just noticed that this patch breaks column sorting in the activity grid, sigh...

Comment 16 Dan Callaghan 2014-07-07 03:26:03 UTC
(In reply to Dan Callaghan from comment #15)

Fix for column sorting: http://gerrit.beaker-project.org/3187

Comment 18 Dan Callaghan 2014-07-08 01:51:36 UTC
(In reply to Dan Callaghan from comment #14)

Here is the updated query produced after this patch. This is on first page load (no filtering or sorting applied to the activity grid). The WHERE and ORDER BY clauses will change if sorting or filtering is applied.

SELECT activity.id AS activity_id, system_activity.id AS system_activity_id, activity.user_id AS activity_user_id, activity.created AS activity_created, activity.type AS acti[480/1872]
 activity.field_name AS activity_field_name, activity.service AS activity_service, activity.action AS activity_action, activity.old_value AS activity_old_value, activity.new_value AS activity_new_value, system_activity.system_id AS system_activity_syste
m_id 
FROM activity INNER JOIN system_activity ON activity.id = system_activity.id 
WHERE 3395 = system_activity.system_id ORDER BY activity.id DESC 
 LIMIT 0, 30

Unfortunately we are back to sorting by activity.id rather than system_activity.id in this case, due to @paginate's default sorting. There doesn't seem to be any way to avoid that. However it doesn't seem to affect performance much now that we have the LIMIT in place.

Comment 19 Dan Callaghan 2014-07-08 01:52:31 UTC
(In reply to Dan Callaghan from comment #18)
> activity.type AS acti[480/1872]

Sorry that was a copy-paste error from tmux, let me try again...

SELECT activity.id AS activity_id, system_activity.id AS system_activity_id, activity.user_id AS activity_user_id, activity.created AS activity_created, activity.type AS activity_type,
 activity.field_name AS activity_field_name, activity.service AS activity_service, activity.action AS activity_action, activity.old_value AS activity_old_value, activity.new_value AS activity_new_value, system_activity.system_id AS system_activity_syste
m_id 
FROM activity INNER JOIN system_activity ON activity.id = system_activity.id 
WHERE 3395 = system_activity.system_id ORDER BY activity.id DESC 
 LIMIT 0, 30;

Comment 20 xuezhi ma 2014-07-08 02:54:13 UTC
(In reply to Dan Callaghan from comment #19)
> (In reply to Dan Callaghan from comment #18)
> > activity.type AS acti[480/1872]
> 
> Sorry that was a copy-paste error from tmux, let me try again...
> 
> SELECT activity.id AS activity_id, system_activity.id AS system_activity_id,
> activity.user_id AS activity_user_id, activity.created AS activity_created,
> activity.type AS activity_type,
>  activity.field_name AS activity_field_name, activity.service AS
> activity_service, activity.action AS activity_action, activity.old_value AS
> activity_old_value, activity.new_value AS activity_new_value,
> system_activity.system_id AS system_activity_syste
> m_id 
> FROM activity INNER JOIN system_activity ON activity.id = system_activity.id 
> WHERE 3395 = system_activity.system_id ORDER BY activity.id DESC 
>  LIMIT 0, 30;
Thanks Dan for providing the query, the query with limit 0, 30 rather than query all records, so change it to verified.

Comment 21 Dan Callaghan 2014-07-18 08:17:41 UTC
Beaker 0.17.1 has been released.