Bug 658491

Summary: For Oracle several call-time metrics received in one batch are assigned to the first resource in the batch
Product: [Other] RHQ Project Reporter: Pavel Kiryukhin <realset>
Component: Core ServerAssignee: Heiko W. Rupp <hrupp>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: high Docs Contact:
Priority: low    
Version: 3.0.0CC: ccrouch, jsanda, mazz
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Windows   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=570360
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 663987 (view as bug list) Environment:
Last Closed: 2013-09-02 03:18:10 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:    
Bug Blocks: 678340    

Description Pavel Kiryukhin 2010-11-30 09:30:57 EST
Description of problem:
For Oracle 11g. Several call-time metrics received in one batch are assigned to the first resource in the batch.

Version-Release number of selected component (if applicable):
Reproduced on 3.0.0 but as I could see from the source should be appliciable for 4.0.0 also.

How reproducible:
Server should receive a batch of call-time metrics from different resources at the same time. 

Steps to Reproduce:
1.Turn on call-time metric collection for different resources. For example 	"Method Invocation Time" for RHQ Server EJB3 session beans. Or use any other resources wich emit call-time metrics
2. Wait for collection
3. Check actual response times at Monitor-Response tab.
  
Actual results:
For some resources not all responses are shown. For other resources some extra responses are shown.

Expected results:
Each response is displayed on Response tab for apropriate resource.

Additional info:
Reproduced with "Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - 64bit Production" and the latest 10 and 11 Oracle JDBC drivers.
The actual problem is with batch insert queries.
Queries for INSERT INTO RHQ_CALLTIME_DATA_VALUE and RHQ_CALLTIME_DATA_VALUE tables have some subqueries with parameters. These parameters doesn't change and have the same values (values of the first record) for whole batch.
Actualy I think it's Oracle bug. But it's very important for RHQ.
I would sugest not to perform batch insert in Oracle at all to avoid this situation.
Comment 2 Heiko W. Rupp 2010-12-14 04:57:52 EST
Pavel,
to get this right: you have e.g. 2 resources with 1 call time schedule each and the data and if they are transmitted in one batch, data for both ends up at e.g. the first resource?
  Heiko
Comment 3 Pavel Kiryukhin 2010-12-14 05:20:26 EST
Heiko,
You understand that right. Both call-time records are displayed on Response time of the first resource (the first in order of the received batch).
Comment 4 Heiko W. Rupp 2010-12-14 07:56:25 EST
Pavel,
I've written a test ( http://git.fedorahosted.org/git/?p=rhq/rhq.git;a=blob;f=modules/enterprise/server/jar/src/test/java/org/rhq/enterprise/server/measurement/test/MeasurementDataManagerTest.java ) to test this from the insert side - do you have a chance to run this in your env?
This works for me  on Postgres, but as you reported the issue on Oracle, it does not prove anything. I will also try to run it against one of our Oracle instances.

Thanks
    Heiko
Comment 5 Pavel Kiryukhin 2010-12-14 08:19:22 EST
Heiko,
As I could see this test should reproduce the problem.
I could try this on my Oracle instance, but I have no expirience running rhq tests. If you could give me guidelines how to run it, and set it up with particular database backend, I would try.
Comment 6 Heiko W. Rupp 2010-12-14 08:33:06 EST
Pavel,
great that the test seems to do what it should.

I think you should have a look at http://rhq-project.org/display/RHQ/Building+RHQ#BuildingRHQ-CheckoutRHQSources that talks about how to check out the sources and build RHQ.

To set up the database you add an entry in ~/.m2/settings.xml a la:

      <profile>
         <id>oracle11</id>
         <activation>
            <activeByDefault>false</activeByDefault>
         </activation>
         <properties>            
            <rhq.test.ds.user-name>rhqadmin</rhq.test.ds.user-name>
            <rhq.test.ds.password>rhqadmin</rhq.test.ds.password>
            <rhq.test.ds.type-mapping>Oracle10g</rhq.test.ds.type-mapping>
            <rhq.test.ds.driver-class>oracle.jdbc.driver.OracleDriver</rhq.test.ds.driver-class>
            <rhq.test.ds.connection-url>jdbc:oracle:thin:@pickeldi:1521:orcl</rhq.test.ds.connection-url>
            <rhq.test.ds.xa-datasource-class>oracle.jdbc.xa.client.OracleXADataSource</rhq.test.ds.xa-datasource-class>
<rhq.test.ds.server-name>172.7.31.8</rhq.test.ds.server-name>
            <rhq.test.ds.port>1521</rhq.test.ds.port>
            <rhq.test.ds.db-name>xe</rhq.test.ds.db-name>
<!--
            <rhq.test.ds.connection-url>jdbc:jon-log:jdbc:oracle:thin:@dev01.qa.atl.jboss.com:1521:qadb01</rhq.test.ds.connection-url>
            <rhq.test.ds.driver-class>org.jboss.on.db.log.LoggerDriver</rhq.test.ds.driver-class>
-->
         </properties>
      </profile>


And then start the test with

$ cd modules/entreprise/server/jar
$ mvn -Poracle11 -Dtest=MeasurementDataManagerTest test

If you have questions, join us in #rhq on irc.freenode.net
Comment 7 Pavel Kiryukhin 2010-12-15 04:24:30 EST
Heiko,
I run your test on my Oracle instance. Test have failed as expected!
The result is
List 2 returned 0 entries, expected was 1
org.rhq.enterprise.server.measurement.test.MeasurementDataManagerTest.bz658491(MeasurementDataManagerTest.java:163) 
After run I have only one record in RHQ_CALLTIME_DATA_KEY with CALL_DESTINATION=/foo and two records in RHQ_CALLTIME_DATA_VALUE.

Regards, Pavel
Comment 9 Heiko W. Rupp 2010-12-17 11:09:01 EST
Investigation shows that bff7953 introduced the issue as direct path inserts ( /*+APPEND */ ) do not allow any "read after write" within the same transaction.
Comment 10 Joseph Marques 2010-12-17 15:05:34 EST
root cause here is insertion of data using a sub-query to the same table using append hints:

INSERT /*+ APPEND */ INTO SomeTable(col...)
SELECT (col...) FROM SomeTable WHERE ...

the first time this is done in a transaction it will succeed, but the second time it happens it will fail because the select is trying to read from a table with direct path inserts.

oddly enough, if things hadn't been batched, it would have produced a very evident trace in the log as soon as the hint was added and deployed.  could the way we handle testing for batch errors be incorrect:

int insertedRowCount = 0;
for (int i = 0; i < results.length; i++) {
  // oracle returns -2 because it can't count updated rows
  if (((results[i] < 0) || (results[i] > 1)) && (results[i] != -2)) {
    throw new MeasurementStorageException(
      "Failed to insert call-time data key rows - result ["
        + results[i] + "] for batch command [" 
        + i + "] is less than 0 or greater than 1.");
  }
  // If Oracle returns -2, just count 1 row
  insertedRowCount += results[i]==-2 ? 1 : results[i]; 
}

lastly, this should have been caught by our unit tests, since we're seeing DataPurgeJobTest.createNewCalltimeData is fail today.
Comment 11 Heiko W. Rupp 2010-12-20 07:00:28 EST
Fixed in 9a7672945dc3c4890b2d491e91a6c36fc633b37a in master.
Comment 12 Corey Welton 2011-01-03 23:27:24 EST
Removing the blocker on the JON241 version of this bug.
Comment 13 John Mazzitelli 2011-07-14 11:23:12 EDT
this has re-appeared. I suspect this might never have been fixed in teh first place. The unit tests in place all pass, but only on Oracle 10g Express Edition. They fail on Oracle 11g Enterprise Edition. I suspect the problem is not in the version, but the edition (Express vs Enterprise).
Comment 14 John Mazzitelli 2011-07-14 12:49:30 EDT
notes:

here's the two sql inserts we issue, first inserts into the calltime data key table, second inserts into calltime data value table. for the tests, we batch insert two rows in each.

note that the second query, I think we can remove the join to the rhq_numbers query - probably just a cut-n-paste mistake - no need to join it like we needed to in the first query

INSERT INTO RHQ_CALLTIME_DATA_KEY(id, schedule_id, call_destination)
   SELECT RHQ_calltime_data_key_id_seq.nextval, ?1, ?2
   FROM RHQ_numbers
   WHERE i = 42 AND NOT EXISTS (SELECT * FROM RHQ_CALLTIME_DATA_KEY WHERE schedule_id = ?3 AND call_destination = ?4)

1,3 = 10005
2,4 = /foo

1,3 = 10006
2,4 = /bar

INSERT /*+ APPEND */ INTO RHQ_CALLTIME_DATA_VALUE(id, key_id, begin_time, end_time, minimum, maximum, total, count)
   SELECT RHQ_calltime_data_value_id_seq.nextval, key.id, ?1, ?2, ?3, ?4, ?5, ?6
   FROM RHQ_numbers num, RHQ_calltime_data_key key
   WHERE num.i = 42 AND key.id = (SELECT id FROM RHQ_CALLTIME_DATA_KEY WHERE schedule_id = ?7 AND call_destination = ?8)

1 = 1310660707888 (begin time) long
2 = 1310660707988 (end time) long
3 = 100.0 (min) double
4 = 100.0 (max) double
5 = 100.0 (total) double
6 = 1 (count) long
7 = 10005 (sched id)
8 = /foo

1 = 1310660707800 (begin time) long
2 = 1310660707900 (end time) long
3 = 200.0 (min) double
4 = 200.0 (max) double
5 = 200.0 (total) double
6 = 1 (count) long
7 = 10006 (sched id)
8 = /bar
Comment 15 John Mazzitelli 2011-07-14 13:55:55 EDT
unsure why this query has that subselect but is appears the parameter being bound is getting fixed for the batch (not getting substituted for each individual query in the batch).

Here's what we are going to use now, the 4 failing tests are passing with this (notice we removed the rhq_number join, don't see a need for that, plus we removed the sub-select and just use "key" entity in where clause):

    private static final String CALLTIME_VALUE_INSERT_STATEMENT = "INSERT /*+ APPEND */ INTO " + DATA_VALUE_TABLE_NAME
        + "(id, key_id, begin_time, end_time, minimum, maximum, total, count) "
        + "SELECT %s, key.id, ?, ?, ?, ?, ?, ? FROM " + DATA_KEY_TABLE_NAME
        + " key WHERE key.schedule_id = ? AND key.call_destination = ?";
Comment 16 John Mazzitelli 2011-07-14 14:09:57 EDT
running tests on jenkins to see if this fixes it. Will move to ON_QA once done.

here's how to test:

1) you MUST be using Oracle 11g ENTERPRISE edition! Oracle XE versions are not good enough - they do not exhibit the bug - must use *Enterprise* Edition

2) import two resources into inventory that support calltime data (you can use two EJBs from under the RHQ Server resource - so just import RHQ Server resource and you'll use two of its Session EJBs) Pick two EJBs that are used alot - like the SubjectManagerBean and the AuthorizationManagerBean)

3) Enable calltime metric for both resources and make sure they run at the same interval. I would set the collection interval to 30s (the lowest possible) to ensure the best possibility that both resources' calltime data will be sent in the same report.

4) If you are using RHQ Server's EJB resources, start clicking aroudn the UI alot, make sure yuo hit pages that require you to be auth'ed to see things - like resource tabs (inventory, measurement, configuration, etc).

5) after several minutes, go check that you have calltime data in both of the resources.

If you see calltime data for both resources, we can assume this bug is fixed.
Comment 17 Mike Foley 2011-10-03 13:16:21 EDT
there is an automated test for this ... and the test is passing.
Comment 18 Heiko W. Rupp 2013-09-02 03:18:10 EDT
Bulk closing of issues that were VERIFIED, had no target release and where the status changed more than a year ago.