Bug 1034796

Summary: Showing metric schedules for compatible group is very slow
Product: [JBoss] JBoss Operations Network Reporter: Libor Zoubek <lzoubek>
Component: Monitoring -- OtherAssignee: Lukas Krejci <lkrejci>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: high Docs Contact:
Priority: unspecified    
Version: JON 3.2CC: hrupp, jsanda, theute
Target Milestone: CR01   
Target Release: JON 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1035767 (view as bug list) Environment:
Last Closed: 2014-01-02 20:37:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1035767    
Bug Blocks: 1012435    

Description Libor Zoubek 2013-11-26 14:14:14 UTC
Description of problem:

There seems to be a performance regression comparing to JON 3.1.2 in showing metric schedules for compatible group

Version-Release number of selected component (if applicable):
JON 3.2.ER7

How reproducible: always


Steps to Reproduce:
1. create compatible group of your network interfaces (2 on linux by default)
2. go to this group -> Monitoring -> Schedules
3. Hit refresh

Actual results:

on JON 3.2.0.ER7 it takes 4-5 seconds to load 
on JON 3.1.2.GA it takes 1second (almost instant)


Expected results:

refreshing/loading schedules on JON 3.2.0 takes more or less same time as on 3.1.2

Additional info: Setup: 1 server with 1 local agent + storage node

Comment 1 Mike Foley 2013-11-26 14:19:57 UTC
Detected as part of JON 3.2 ER7 Test Day.

Supporting facts:
here is 3.2 http://testday5.bc.jonqe.lab.eng.bos.redhat.com:7080/coregui/#ResourceGroup/10011/Monitoring/Schedules

and here 3.1.2 http://lzoubek4.bc.jonqe.lab.eng.bos.redhat.com:7080/coregui/#ResourceGroup/10001/Monitoring/Schedules

this looks like a serious performance regression in an area where JON 3.2 is supposed to be more performant and scalable.  

adding the JON 3.2 GA blocker (although this still needs to be triaged by the triage team).  preliminarily targetting for CR1.

Comment 2 Heiko W. Rupp 2013-11-26 17:14:56 UTC
Assigning to jsanda - even as we did not triage it yet, so that he can investigate this.

Comment 3 John Sanda 2013-11-26 21:50:07 UTC
The schedules tab doesn't execute any C* queries, only JPQL queries generated by our criteria query framework. The reason for the slow down has to do with changes made for bug 855674. In Libor's server log, there should be warnings that look like,

16:43:56,213 WARN  [org.rhq.enterprise.server.util.CriteriaQueryRunner] (http-/0.0.0.0:7080-7) Could not get consistent results of the paged data and a total count for MeasurementScheduleCriteria[filters: [filterResourceGroupId=10001, filterResourceInventoryStatuses=[COMMITTED]], fetches: [], sorts: []]. After 10 attempts, the collection size is 22, while the count query reports 220 for PageControl[page=0, size=-1, sort[definition.displayName ASC]]. The discrepancy has not cleared up in 3900ms so we're giving up, returning inconsistent results. Note that is most possibly NOT an error. It is likely caused by concurrent database activity that changes the contents of the database that the criteria query is querying.: java.lang.Exception
	at org.rhq.enterprise.server.util.CriteriaQueryRunner.execute(CriteriaQueryRunner.java:82) [rhq-server.jar:4.10.0-SNAPSHOT]
	at org.rhq.enterprise.server.measurement.MeasurementScheduleManagerBean.getMeasurementScheduleCompositesByContext(MeasurementScheduleManagerBean.java:1490) [rhq-server.jar:4.10.0-SNAPSHOT]


Enable debug logging on the class org.rhq.enterprise.server.util.QueryUtility and you will see DEBUG statements like,

6:43:55,187 DEBUG [org.rhq.enterprise.server.util.QueryUtility] (http-/0.0.0.0:7080-7) Possible phantom read detected while running a query. The collection size = 22, count = 220, pageControl = PageControl[page=0, size=-1, sort[definition.displayName ASC]]. Attempt number 9. Will wait for 1000ms.: java.lang.Exception
	at org.rhq.enterprise.server.util.QueryUtility.fetchPagedDataAndCount(QueryUtility.java:311) [rhq-server.jar:4.10.0-SNAPSHOT]
	at org.rhq.enterprise.server.util.CriteriaQueryRunner.execute(CriteriaQueryRunner.java:79) [rhq-server.jar:4.10.0-SNAPSHOT]


jshaughn pointed out that the phantom read detection stuff might be having some problems with the query in question due to its alter projection. 

I am of the opinion that this is not a blocker as the full result set does get returned to the UI, albeit more slowly. I am reassigning to Lukas since he can better assess this.

Comment 4 Lukas Krejci 2013-11-28 12:41:53 UTC
Fixed upstream by BZ 1035767.

Comment 5 Lukas Krejci 2013-11-28 13:14:26 UTC
commit ea707eaf2cb4650b15a04ea1d8c228961807bbac
Author: Lukas Krejci <lkrejci>
Date:   Thu Nov 28 13:39:26 2013 +0100

    [BZ 1035767] - Enable altering count query along with alterProjection
    If CriteriaQueryGeneration.alterProjection changes the projection such
    that the actual number of returned results changes (e.g. using distinct),
    the default count query would no longer match and the returned results
    would be inconsistent with the count found by the count query.
    
    Because alterProjection is essentially free-form JPQL, we need to provide
    a similar means to alter the count query so that the data query and count
    query can be modified to return consistent data.
    (cherry picked from commit 25f768afb5c72e5eb0fa17c6789b3b8aaa895be9)

Comment 6 Simeon Pinder 2013-12-03 23:19:23 UTC
Moving to ON_QA for testing in latest(CR1) brew build.

Comment 7 Libor Zoubek 2013-12-05 14:47:57 UTC
verified on CR1