855674 – search for operation schedule: PageList was passed an empty collection but 'totalSize' was 1

Bug 855674 - search for operation schedule: PageList was passed an empty collection but 'totalSize' was 1

Summary: search for operation schedule: PageList was passed an empty collection but 't...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Operations Network
Classification:	JBoss
Component:	CLI, Operations
Sub Component:
Version:	JON 3.1.1,JON 3.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	Alpha-x
Target Release:	JON 3.2.0
Assignee:	Lukas Krejci
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2012-09-09 21:04 UTC by Libor Zoubek
Modified:	2015-11-02 00:43 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2014-01-02 20:33:18 UTC
Type:	Bug
Embargoed:

Attachments	(Terms of Use)
server.log (623.57 KB, application/octet-stream) 2012-09-09 22:40 UTC, Libor Zoubek	no flags	Details
reproducer script (119.71 KB, application/x-javascript) 2013-08-19 13:42 UTC, Libor Zoubek	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	773626	high	CLOSED	Failed to load resource composite data error when refreshing Inventory page (Unavailable Servers, Servers, etc)	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	808548	low	CLOSED	Extremely low latency networks cause DataSource fetching/pagination issues	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	851288	low	CLOSED	Failed to load resource composite data error when refreshing Inventory page (Unavailable Servers, Servers, etc)	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	857032	unspecified	CLOSED	Server returned FAILURE with no error message during inventory listing	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	918906	unspecified	ON_QA	Purging of resource types fails	2022-03-31 04:28:30 UTC

Internal Links: 773626 808548 851288 857032 918906

Description Libor Zoubek 2012-09-09 21:04:34 UTC

Description of problem: I have a CLI script that calls an operation on resource and then it immediately starts querying for operation schedule id to check the result. From time to time my it fails.


Version-Release number of selected component (if applicable):
JON 3.1.1.CR2

How reproducible: 5-10%


Steps to Reproduce:
1. run a CLI script, that invokes "discovery" operation and starts checking operation result
  
Actual results:

This is what I get in CLI:

Exception in thread "main" org.rhq.enterprise.client.script.CLIScriptException: javax.script.ScriptException: sun.org.mozilla.javascript.internal.WrappedException: Wrapped javax.ejb.EJBException: [Warning] java.lang.IllegalArgumentException: PageList was passed an empty collection but 'totalSize' was 1, PageControl[page=0, size=1, sort[startTime DESC]] (<Unknown source>#980) in <Unknown source> at line number 980
	at org.rhq.enterprise.client.commands.ScriptCommand.executeScriptFile(ScriptCommand.java:298)
	at org.rhq.enterprise.client.commands.ScriptCommand.execute(ScriptCommand.java:108)
	at org.rhq.enterprise.client.ClientMain$StartupConfiguration.process(ClientMain.java:141)
	at org.rhq.enterprise.client.ClientMain.main(ClientMain.java:161)
Caused by: javax.script.ScriptException: sun.org.mozilla.javascript.internal.WrappedException: Wrapped javax.ejb.EJBException: [Warning] java.lang.IllegalArgumentException: PageList was passed an empty collection but 'totalSize' was 1, PageControl[page=0, size=1, sort[startTime DESC]]





Expected results: No exception


Additional info:

Comment 1 Libor Zoubek 2012-09-09 22:40:37 UTC

Created attachment 611285 [details]
server.log

Comment 3 Charles Crouch 2012-09-17 19:04:10 UTC

Targeting for consideration in jon312. If we have an automated way to cause this to fail 5-10% of the time, that's hopefully sufficient to give us some insight.

Comment 4 vlad crc 2013-05-24 12:30:40 UTC

I think is related to https://bugzilla.redhat.com/show_bug.cgi?id=773626.

Comment 5 John Mazzitelli 2013-07-03 13:16:53 UTC

Libor - can you attach your script that you used?

Comment 6 Heiko W. Rupp 2013-08-09 15:49:55 UTC

Seems virtually the same as Bug 972756

Comment 7 Libor Zoubek 2013-08-19 13:42:01 UTC

Created attachment 788074 [details]
reproducer script

Here is a repro-script. It does not retrieve operation history, but just asks for resources instead. When using "retrieve operation history" test I was not able to reproduce it even on JON 3.1.2 (we're randomly hitting this issue on this version).

Test works the way that it:
creates a new resource (network interface on as7)
and very often (with 10ms sleep time) asks server for createResourceHistory
and then for new child resource presence => here it fails

That file is large because it contains whole rhqapi.js, important part is at the bottom.

I am able to hit this bug with this script on RHQ 4.8 anytime when I run it around attempt 3-6.

Comment 8 Heiko W. Rupp 2013-08-27 10:36:24 UTC

I can reproduce (after fixing Bug 972756, which seems to be completely independent, even if the symptoms are similar) it with the supplied script, thanks Libor.

I think the underlying issue is a design problem, as the org.rhq.enterprise.server.util.CriteriaQueryRunner#execute method is not transactional and does this:

    results = new PageList<T>(getCollection(), getCount(), pageControl);

which means it first fires a query for the collection and then another one for the count.
When now a resource creation/deletion/... report is coming in after the SQL for getCollection() was fired and before getCount() has started, then the results may differ.

So basically we should either
- use a more clever query to return count and collection in one go
or
- use a transaction, so that the database keeps the state consistent.

This may not be enough for other cases of this "PageList .. " error

--- server log ---

Caused by: java.lang.IllegalArgumentException: PageList was passed an empty collection but 'totalSize' was 1, PageControl[page=0, size=-1]
	at org.rhq.core.domain.util.PageList.<init>(PageList.java:87) [rhq-core-domain-ejb3.jar:4.9.0-SNAPSHOT]
	at org.rhq.enterprise.server.util.CriteriaQueryRunner.execute(CriteriaQueryRunner.java:70) [rhq-server.jar:4.9.0-SNAPSHOT]
	at org.rhq.enterprise.server.resource.ResourceManagerBean.findResourcesByCriteria(ResourceManagerBean.java:2595) [rhq-server.jar:4.9.0-SNAPSHOT]

Comment 9 Lukas Krejci 2013-08-27 13:24:03 UTC

The org.rhq.enterprise.server.util.CriteriaQueryRunner#execute is not transactional in and of itself, because the runner is not an EJB. That said, the runner is called from SLSBs. The SLSB method in question for the repro script is ResourceManagerBean#findResourcesByCriteria, which doesn't specify a transaction attribute and therefore defaults to "REQUIRED". Hence the criteria DOES run in a transaction.

The problem is that our datasource, RHQDS, specifies the transaction isolation level as TRANSACTION_READ_COMMITTED. That isolation level allows for "phantom reads", i.e. a single select can return different number of rows depending on other concurrently committed transactions.

Therefore I cannot see many ways forward with this. I'd argue against upping the transaction isolation, because that has high performance cost and has to be set datasource-wide.

The only other option is to detect this particular error inside the criteria runner and re-run the getCollection() and getCount() until the sizes match or somehow cleverly rewrite the queries such that they return the count and the results within 1 query.

Comment 10 Lukas Krejci 2013-09-02 11:24:34 UTC

commit dc6a1fedea3bfa9efdfde446d1712d353a591778
Author: Lukas Krejci <lkrejci>
Date:   Thu Aug 29 18:34:45 2013 +0200

    [BZ 855674] - Get rid of the dreaded "PageList was passed an empty collection but 'totalSize' was X" exception.
    
    The main part of the fix is just the realization that the "error"
    condition for which the exception was thrown is actually no error but can
    be triggered by the mere fact that we use READ_COMMITTED isolation level.
    The fact that a user can go past the last page of a result is actually also
    not an error and can be triggered by for example a massive change in the
    available data between the user asking for first and second page.
    
    I think it is better to stop treating all such conditions as errors and
    instead provide the user with means of detecting such states.
    
    For that a new method on pagelist was introduced:
    PageList.isConsistent(), which calls out to PageControl.isConsistentWith()
    that check if the data and total size of the page list are consistent with
    the page control.
    
    This in turn is used by the CriteriaQueryRunner to re-attempt the queries
    if it finds the inconsistency (which is most possibly caused by the phantom
    read due to our transaction isolation level).

Comment 11 Lukas Krejci 2013-09-03 16:44:45 UTC

commit e82330bd6af45fe519f2eb85e81ca02a97b1eb04
Author: Lukas Krejci <lkrejci>
Date:   Tue Sep 3 18:42:50 2013 +0200

    [BZ 855674] - Refactored the phantom-read detection out of CriteriaQueryRunner
    
    This was to make it testable but more importantly reusable by ordinary SLSB
    methods (that don't use criteria queries).
    
    Also added unit tests for the phantom-read detection (now placed in
    QueryUtility in server jar) and for PageControl in domain jar.

Comment 12 Libor Zoubek 2013-09-24 13:10:52 UTC

repro script does not hit this issue anymore on JON 3.2.ER1

Note You need to log in before you can comment on or make changes to this bug.