Bug 1032192

Summary: Improve query performance for alert cache refresh; using n + 1 selects
Product: [Other] RHQ Project Reporter: Elias Ross <genman>
Component: Core Server, PerformanceAssignee: Jay Shaughnessy <jshaughn>
Status: ON_QA --- QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.9CC: hrupp
Target Milestone: GA   
Target Release: RHQ 4.13   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Elias Ross 2013-11-19 17:04:12 UTC
Initially I thought to deploy RHQ Server in a separate datacenter with about 70ms ping time to the Oracle database. However, I found although it works, there are a couple of issues obviously needing addressing:

1. Startup time takes a very, very long time. (25 minutes)

21:38:05,842 INFO  [org.jboss.as] (MSC service thread 1-7) JBAS015899: JBoss EAP 6.1.0.Alpha1 (AS 7.2.0.Alpha1-redhat-4) starting
...
22:03:25,774 INFO  [org.rhq.enterprise.server.core.StartupBean] (pool-7-thread-1) RHQ 4.9.0 (build bb4fa70) Server started.

This includes time to get the plugins. (It seems slow while building the alert cache.)

2. Availability reporting takes about 5-20 seconds per agent.

3. Inventory merge can take about a minute or more:

00:08:24,805 WARN  [org.rhq.enterprise.server.discovery.DiscoveryServerServiceImpl] (http-/0.0.0.0:7080-5) Performance: inventory merge (74448)ms


I'm guessing the following changes would help at least some of this:

* Hibernate changes:
hibernate.jdbc.fetch_size=100
hibernate.jdbc.batch_size=100
hibernate.order_inserts=true
hibernate.order_updates=true

* Use Join Fetch where it makes sense:
select x from xyz
join fetch x.something

* Use sequence number pooling (e.g. http://mcatr.blogspot.com/2010/12/hibernate-id-generation-and-oracle.html)

* Use updates rather than select and edit entities.

Comment 1 Elias Ross 2013-11-20 03:22:29 UTC
Just for reference, the alert fetch stack trace:

        at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:355)
        at org.rhq.core.domain.util.JoinFetchReportingQueryTranslator.list(JoinFetchReportingQueryTranslator.java:84)
        at org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:195)
        at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1246)
        at org.hibernate.internal.QueryImpl.list(QueryImpl.java:101)
        at org.hibernate.ejb.QueryImpl.getResultList(QueryImpl.java:264)
        at org.rhq.enterprise.server.alert.AlertConditionManagerBean.getAlertConditionComposites(AlertConditionManagerBean.java:158)

...

Comment 2 Elias Ross 2013-12-02 20:19:22 UTC
These are settings I use, but not sure if they help that much:

@@ -12,11 +12,14 @@
          <!--
          <property name="hibernate.show_sql" value="true" />
          <property name="hibernate.format_sql" value="true" />
-         <property name="hibernate.use_sql_comments" value="true" /> 
+         <property name="hibernate.use_sql_comments" value="true" />
          -->
          <!--<property name="hibernate.hbm2ddl.auto" value="create-drop" />-->
-         <property name="hibernate.jdbc.batch_size" value="30"/>
+         <property name="hibernate.jdbc.batch_size" value="100"/>
+         <property name="hibernate.jdbc.fetch_size" value="100"/>
          <property name="hibernate.cache.use_second_level_cache" value="false"/>
+         <property name="hibernate.order_inserts"   value="true"/>
+         <property name="hibernate.order_updates"   value="true"/>
 
          <!--<property name="com.intellij.javaee.persistence.datasource" value="rhq-oracle" />-->
          <property name="com.intellij.javaee.persistence.datasource" value="rhq-postgres" />
@@ -26,7 +29,7 @@
          <!-- This value doesn't have any theoretical grounds and is purely based on experiments. Value of 32 gives
               the best compromise between the quick load of queries with low and high limits (i.e. page size in tens vs.
               thousands) in an inventory with 30 agents and >20000 resources. -->
-         <property name="hibernate.default_batch_fetch_size" value="32"/>
+         <property name="hibernate.default_batch_fetch_size" value="300"/>
 
       </properties>
    </persistence-unit>

Comment 3 Heiko W. Rupp 2014-02-05 10:12:32 UTC
There have improvements been made to speed up alert reporting server side 
in master b84dcfa (and other commits before).
See also Bug 1037616

Comment 4 Elias Ross 2014-05-06 16:19:00 UTC
More logs from alert cache loading. It takes about 23 seconds to load 250 cache elements. Might be better to load more at once or something.

16:16:01,014 DEBUG [org.rhq.enterprise.server.alert.AlertConditionManagerBean] (pool-7-thread-1) Found 36068 elements of type 'Resource Availability for Duration', list was size 250
16:16:24,698 DEBUG [org.rhq.enterprise.server.alert.AlertConditionManagerBean] (pool-7-thread-1) Found 36068 elements of type 'Resource Availability for Duration', list was size 250

Comment 5 Elias Ross 2014-05-06 18:27:08 UTC
The bug is each alert composite load runs this query:

11:20:06,258 DEBUG [org.hibernate.SQL] (http-/127.0.0.1:8080-1) select alertcondi0_.ID as col_0_0_, alertdefin1_.ID as col_1_0_, resource2_.ID as col_2_0_, (select availabili3_.AVAILABILITY_TYPE from RHQ_AVAILABILITY availabili3_ inner join RHQ_RESOURCE resource4_ on availabili3_.RESOURCE_ID=resource4_.ID where resource4_.ID=resource2_.ID and (availabili3_.END_TIME is null)) as col_3_0_ from RHQ_ALERT_CONDITION alertcondi0_ inner join RHQ_ALERT_DEFINITION alertdefin1_ on alertcondi0_.ALERT_DEFINITION_ID=alertdefin1_.ID inner join RHQ_RESOURCE resource2_ on alertdefin1_.RESOURCE_ID=resource2_.ID where (alertdefin1_.RECOVERY_ID=0 or alertdefin1_.RECOVERY_ID<>0 and (exists (select alertdefin5_.ID from RHQ_ALERT_DEFINITION alertdefin5_ where alertdefin5_.ID=alertdefin1_.RECOVERY_ID and alertdefin5_.DELETED=false and alertdefin5_.ENABLED=false))) and (resource2_.AGENT_ID=? or ? is null) and alertdefin1_.ENABLED=true and alertdefin1_.DELETED=false and alertcondi0_.TYPE=? order by alertcondi0_.ID

This returns only the alert condition ID, not the full value.

Then for every entry, this query is run:

16:42:11,585 DEBUG [org.hibernate.SQL] (EJB default - 2) select alertcondi0_.ID as ID8_0_, alertcondi0_.ALERT_DEFINITION_ID as ALERT8_8_0_, alertcondi0_.TYPE as TYPE8_0_, alertcondi0_.COMPARATOR as COMPARATOR8_0_, alertcondi0_.MEASUREMENT_DEFINITION_ID as MEASUREM9_8_0_, alertcondi0_.NAME as NAME8_0_, alertcondi0_.OPTION_STATUS as OPTION5_8_0_, alertcondi0_.THRESHOLD as THRESHOLD8_0_, alertcondi0_.TRIGGER_ID as TRIGGER7_8_0_ from RHQ_ALERT_CONDITION alertcondi0_ where alertcondi0_.ID=?
...
16:42:11,674 DEBUG [org.hibernate.SQL] (EJB default - 2) select alertcondi0_.ID as ID8_0_, alertcondi0_.ALERT_DEFINITION_ID as ALERT8_8_0_, alertcondi0_.TYPE as TYPE8_0_, alertcondi0_.COMPARATOR as COMPARATOR8_0_, alertcondi0_.MEASUREMENT_DEFINITION_ID as MEASUREM9_8_0_, alertcondi0_.NAME as NAME8_0_, alertcondi0_.OPTION_STATUS as OPTION5_8_0_, alertcondi0_.THRESHOLD as THRESHOLD8_0_, alertcondi0_.TRIGGER_ID as TRIGGER7_8_0_ from RHQ_ALERT_CONDITION alertcondi0_ where alertcondi0_.ID=?

The solution is probably to force a load of the attribute. Trying to figure this out...

Comment 6 Jay Shaughnessy 2014-08-19 18:06:40 UTC
Merged associated pull request: https://github.com/rhq-project/rhq/pull/65

master commit ab6e5c067e4a81c3caa55b5d9f374bbdd2ed3093
Merge: 2f35925 3713036
Author: jshaughn <jshaughn>
Date:   Tue Aug 19 13:59:58 2014 -0400

    Merge pull request #65 from genman/BZ1032192

    BZ1032192 - AlertCondition composite queries, query each column




Testing Notes:
This is basically an internal change for performance.  Testing is covered by unit testing.