620603 – investigate root cause of "firstResult/maxResults specified with collection fetch; applying in memory!"

Bug 620603 - investigate root cause of "firstResult/maxResults specified with collection fetch; applying in memory!"

Summary: investigate root cause of "firstResult/maxResults specified with collection f...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	Performance
Sub Component:
Version:	JON 2.4.1
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	high
Target Milestone:	ER01
Target Release:	JON 3.2.0
Assignee:	Lukas Krejci
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	jon30-sprint9 rhq-perf
TreeView+	depends on / blocked

Reported:	2010-08-03 00:09 UTC by Joseph Marques
Modified:	2018-11-14 09:49 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type:	Bug
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	749589	0	medium	CLOSED	criteria API - a Criteria w/ an unlimited PageControl, as well as one or more fetches, returns duplicate results	2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Legacy)	65206	0	None	None	None	Never

Internal Links: 749589

Description Joseph Marques 2010-08-03 00:09:03 UTC

Lately, developers have started to notice the following message printed in the server log:

2010-08-02 09:32:47,478 WARN  [org.hibernate.hql.ast.QueryTranslatorImpl] firstResult/maxResults specified with collection fetch; applying in memory!

According to several articles, this could be a performance bottleneck as it would indicate that **much** more data than we think is being fetched across the line, in order to apply the in-memory pagination (firstResult/maxResults) filter.  My suspicions are that we're doing the exact same thing mentioned in [1].  That is, we're trying to pagination a data set where we've also included 'FETCH JOIN' in the JPQL.  From the EJB3 spec:

"The effect of applying setMaxResults or setFirstResult to a query involving fetch joins over collections is undefined" (section 3.6.1 - Query Interface)

So the error we're getting shouldn't come as a surprise if we're doing something that is explicitly undefined by spec.  In this case, hibernate knows how to deal with the scenario (it fetches all the data, and applies the pagination controls in-memory), but that solution could be very expensive if the data set we're dealing with is large.

We should, thus, investigate where these potentially expensive queries are being used, so we can determine whether a fix is warranted, and what the fix should be depending on uses.

-----

[1] - http://java.dzone.com/articles/hibernate-tuning-queries-using

Comment 1 Charles Crouch 2010-08-03 18:33:33 UTC

(1:24:32 PM) ccrouch: joseph42: I wonder if this is related to the issue you were talking about this morning
(1:24:33 PM) ccrouch: https://bugzilla.redhat.com/show_bug.cgi?id=535592
(1:29:04 PM) joseph42: 535592 is definitely bad, but probably not related to 620603.  keep in mind, 603 is a more recent issue, so likely due to moving parts of the back-end over to using the query generation stuff instead of hand-crafting all the JPQL like we used to.

Worth a try :-)

Comment 2 Corey Welton 2010-09-28 12:00:27 UTC

dev task

Comment 7 John Mazzitelli 2011-10-26 19:36:57 UTC

http://java.dzone.com/articles/hibernate-tuning-queries-using?page=0,0

This warning comes out when we have paging set in the hibernate query, but the query also has join fetches. This requires hibernate to ask the DB for all records, and it applies the paging in memory. It sounds like there is an area here where we can eek out some more performance.

Comment 8 Ian Springer 2011-11-22 22:03:47 UTC

[master 2e5c0fe] temporarily turns the log level down from WARN to ERROR for org.hibernate.hql.ast.QueryTranslatorImpl to suppress the warning messages, since the Server log was getting flooded with them.

QE, please verify that the warnings are no longer in the Server log and that there are no other obvious regressions, then flip this back to ON_DEV so I can cherry-pick the commit to the JON3 release branch.

Comment 9 Sunil Kondkar 2011-11-24 11:55:58 UTC

Tested on latest master build#787

( https://hudson.qa.jboss.com/hudson/view/Jon-2.x/job/rhq-master/787 )

Still the warning message is observed in server log:

2011-11-24 15:29:25,214 WARN  [org.hibernate.hql.ast.QueryTranslatorImpl] firstResult/maxResults specified with collection fetch; applying in memory!

It seems the fix is not in the master branch.

Comment 10 Ian Springer 2011-11-28 14:09:08 UTC

Sorry about that. I committed it but forgot to push. It's in now:

http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=commitdiff;h=f1475db

Comment 11 John Mazzitelli 2011-11-28 21:27:43 UTC

> [master 2e5c0fe] temporarily turns the log level down from WARN to ERROR for

that git hash is not valid. this must have been the hash from the local branch??

Comment 12 John Mazzitelli 2011-11-28 21:33:03 UTC

ah.. from comment #10, the hash is f1475db

committed to the release_jon3.x branch : 3f33130

Comment 13 Sunil Kondkar 2011-11-29 12:54:23 UTC

Verified in latest master build#796

The warning messages are not observed in the Server log.

Will verify in next release_jon3.x branch build.

Comment 14 Sunil Kondkar 2011-11-30 09:54:29 UTC

Verified in build#111 in release_jon3.x branch

The warning messages are not observed in the Server log.

Marking as verified.

Comment 15 Ian Springer 2011-11-30 13:24:05 UTC

Suppressing the log messages was just a temporary fix. I'm flipping this back to NEW, so we can look into fixing the underlying issue post JON 3.0.

Comment 16 Jay Shaughnessy 2013-04-30 13:12:48 UTC

We use join fetch in quite a lot of places, but I would expect that Criteria fetches are typically the number one offender.  When requesting optional data via someCriteria.fetchXXX(true) we add a join fetch.  Also, we page by default (200 entries).

We don't always use paging and even when we do it's not always needed.

Do we know is this message is generated whenever there is a query executed with paging and join fetch?  Or only when it's actually performing the extra work (i.e. fetching page 2..N)?

The obvious workaround is to, only in places identified as a real problem, avoid the join fetch by potentially making an additional round trip to gather the optional data.

We need to be cautious here.

Comment 17 Lukas Krejci 2013-05-09 13:18:21 UTC

Hi all,

I've been studying the impact of our use of JOIN FETCH in queries to which we
simulatenously apply limits (captured by
https://bugzilla.redhat.com/show_bug.cgi?id=620603).

There are two of areas where such queries get executed:
1) Any remote API method with a PageControl can potentially cause such query
to get executed (if the query to be run by the API method includes a JOIN
FETCH - this is something I don't have a clear idea about yet - there are 122
places in the codebase that apply ordering and paging using the PageControl
and 70 occurances of "JOIN FETCH" in our domain classes, but obviously not all
of these will result in a possible JOIN FETCH with limit).

2) Any criteria query for criterias that support a fetch of additional
relationships.

The first type is quite hard to determine the impact of, because the vast number of uses of the code that applies the limits to the query that are spread all over the codebase.

The second type is very easy to analyze, because it is concentrated into 2 classes: CriteriaQueryGenerator and CriteriaQueryRunner.

To try and determine the impact of the first type, I implemented a custom QueryGenerator for Hibernate that enhances the default one to log some more detailed data about the query than just the fact that it is filtered in memory (which is what the default impl in Hibernate does). Namely I logged the JPA, SQL and the stacktrace where the DB query is generated from.

With this tool, I went ahead and generated a larg(ish) inventory of 30 agents with some 20000 resources in total. 30 of those resources had alert definitions on them with several thousands fired alerts for each resource. While this dataset is a bit artificial I think it can be used to simulate the problems of some larger-scale RHQ inventories, at least when it comes to query performance.

By "casually clicking" in the UI, trying to do basic tasks like paging through alert history, firing operations, etc, I've found no evidence of occurrences of the first type of potential limited-JOIN-FETCH queries (the reason for that is that the application of limits is co-located with the application of ordering to the queries and it seems that ordering is why that method is called in the majority of cases). On the other hand, even while browsing the UI there were a number criteria queries invoked that featured a limited-JOIN-FETCH query. (For example querying for the 5 most recent alerts on a resource summary page took approx. 2.5s).

This brought me to the conclusion that the criteria queries are the major contributor to our problem.

This is a quite fortunate coincidence because fixing the issue for criteria queries alone is much simpler than trying to fix it globally.

The fix is now in a bug branch https://git.fedorahosted.org/cgit/rhq/rhq.git/log/?h=bug/620603 and I would very much like you to comment review it before merging it into master.

The merit of the fix is the following:

1) Replacing the JOIN FETCH with a manual initialization of the lazily fetched fields.
2) Optimizing the loading of such lazy fields and lazy collections by configuring the "hibernate.default_batch_fetch_size" property in persistence.xml to load the data in batches (if this setting is set to a number N then Hibernate will load the data for a lazy field for N "owning" entities at once instead of 1 by 1. I.e. setting this number to 2 will cut the number of DB roundtrips required load "children" of N "parent" entities in half compared to the trivial case where access to a "children" of a single "parent" entity would require a DB call).

To test the affects of that change, I created a very simple benchmark that basically compared the times required to load collections with differently set limits in the server with or without the patch.

The basis for the benchmark was a ResourceCriteria that tried to fetch all the resources together with their alert definitions, e.g.:

var crit = new ResourceCriteria
crit.fetchAlertDefinitions(true)
crit.setPaging(x, y)

var results = ResourceManager.findResourcesByCriteria(crit)

Without further ado, let me present the results (based on running each test 10 times):

1) Current codebase (without the fix):
No paging: 47417.7
Paging(0, 10): 26225
Paging(0, 50): 25219.8
Paging(0, 200): 24192.1
Paging(0, 500): 24419.2
Paging(0, 2000): 27813.2

2) Fix with batch size 16:
No paging: 52745.2
Paging(0, 10): 83.2
Paging(0, 50): 192.4
Paging(0, 200): 545
Paging(0, 500): 1201
Paging(0, 2000): 4451.7

3) Fix with batch size 32:
No paging: 53401.5
Paging(0, 10): 76
Paging(0, 50): 166.5
Paging(0, 200): 503.8
Paging(0, 500): 1206.4
Paging(0, 2000): 4459

The results show some interesting "features" of the old implementation as well as some interesting facts about the fix:

1) The results of the old implementation are consistent with the behavior of JOIN FETCH with limits as described by Hibernate documentation. The collection is loaded as whole into the memory and only filtered in-memory after that fact. The difference between the "No Paging" and the rest of the results can most probably be atributted to the time required to serialize the huge result set in the "No Paging" case as opposed to the rest of the cases.

2) The times of the results with the increasing page size are much more intuitive here. The speed up over the current codebase is quite impressive. The slowdown in the no-paging scenario is actually not that bad considering the fact that it performs many more DB roundtrips than the current codebase.

3) The improvement over the 2) is within the error margin.

Comment 18 Lukas Krejci 2013-05-14 08:51:27 UTC

commit https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?id=9a3b4d105099320c6c6794ff04e12285c5399667
Author: Lukas Krejci <lkrejci>
Date:   Thu May 9 15:15:02 2013 +0200

    [BZ 620603] - Re-enabling the limited JOIN FETCH reporting
    with a more informative message, asking the users to co-operate in
    identifying the individual uses of the problematic JOIN FETCH queries that
    we potentially missed.

commit https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?id=db62629626fdd9049cf89cfa58dc5b14b2a98ea9
Author: Lukas Krejci <lkrejci>
Date:   Thu May 9 13:11:50 2013 +0200

    [BZ 620603] - Don't use JOIN FETCH in criteria queries with limits.
    
    Use the "manual" initialization of lazy fields and collections in
    that case instead.
    
    Together with the "hibernate.default_batch_fetch_size" setting in the
    hibernate configuration, this offers superior performance over the join
    fetch and in memory filtering of results when the number of rows in the
    "primary" tables times the number of rows in the "join-fetched" table
    is non-trivial.

Comment 19 Larry O'Leary 2013-09-06 14:33:04 UTC

As this is MODIFIED or ON_QA, setting milestone to ER1.

Comment 20 Mike Foley 2013-09-13 22:41:38 UTC

warning messages no longer observed

Note You need to log in before you can comment on or make changes to this bug.