Bug 1125439
| Summary: | MeasurementScheduleManagerBean.findSchedulesByCriteria fails due to duplicate baselines for single schedule_id | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Operations Network | Reporter: | Larry O'Leary <loleary> |
| Component: | Monitoring -- Other | Assignee: | Jay Shaughnessy <jshaughn> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Garik Khachikyan <gkhachik> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | JON 3.2.2 | CC: | gkhachik, jshaughn, lcarlon, mkoci, rhatlapa |
| Target Milestone: | ER02 | | |
| Target Release: | JON 3.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Attempting to access resource schedules threw an exception because duplicate baseline rows existed for a single schedule_id. Schedules would not load and JBoss ON was unable to retrieve the metric data. On upgrade, existing duplicate baselines are removed, and a unique index on the baseline table's schedule_id column now prevents duplicate baselines for a schedule. Schedules load as expected. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-12-11 14:04:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Created https://github.com/rhq-project/rhq/pull/105 with potentially useful changes related to this issue.
The multiple baselines come about due to overlapping data purge jobs being run, and within the data purge jobs unfortunate timing can produce overlapping baseline generation. In general stateful quartz jobs don't overlap; that's exactly the point of declaring them stateful. But it is possible in basically one scenario. First, it needs to be an HA environment, which means multiple servers and therefore multiple (clustered) quartz schedulers. Then, it requires a false positive check for a "down" scheduler. Each quartz scheduler intermittently checks in with the DB and updates a check-in time. When it does that it also checks the other schedulers to make sure they have all checked in on schedule. If a scheduler has missed a check-in it is considered down, and the scheduler checking in performs recovery logic on in-progress jobs which it presumes are dead. It:
1) cleans up the in-progress job
2) if the job is set for recovery it is then executed immediately (although all of our jobs are set to recovery=false)
3) it then applies the misfire action set on the trigger.
By default (for a cron trigger) it will immediately launch the job if it has misfired at least once. A "misfire" means it missed a scheduled execution because a previous execution was still running. If it has not misfired it is just rescheduled for the next scheduled execution after the current time.
This is what is happening. A running scheduler is missing its check-in. The job is actually running, but it is considered dead. The recovery ends up executing it again, and then you have overlapping executions.
There are a few reasons this can happen. By far the most prevalent is lack of clock sync among the servers and/or the database. It is critical that clocks be closely synced; quartz allows only a little wiggle room when checking for a missed check-in. The other main reason is some condition that prevents nodes from checking in to the database in a timely manner (at the cluster check-in interval). This can happen if there is too much lag when updating the DB, for whatever reason, potentially on only one server.
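For reference, the check-in mechanism described above is driven by a handful of standard Quartz clustered JobStore properties. Below is a minimal sketch, assuming Quartz's StdSchedulerFactory and JDBC JobStore; the instance name, datasource name, and connection details are placeholder values, not RHQ's actual configuration:

```java
import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class ClusteredSchedulerSketch {
    public static Scheduler createScheduler() throws Exception {
        Properties props = new Properties();
        props.put("org.quartz.scheduler.instanceName", "RHQScheduler"); // shared by all nodes in the cluster
        props.put("org.quartz.scheduler.instanceId", "AUTO");           // unique per node
        props.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        props.put("org.quartz.threadPool.threadCount", "5");
        props.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        props.put("org.quartz.jobStore.isClustered", "true");
        props.put("org.quartz.jobStore.dataSource", "quartzDS");
        // How often each node updates its check-in row and inspects the others.
        // A node that misses a check-in (plus a small tolerance) is presumed dead,
        // which is why clock skew or DB update lag produces false-positive recovery.
        props.put("org.quartz.jobStore.clusterCheckinInterval", "120000"); // ms
        // Placeholder datasource; any JDBC database shared by all nodes works.
        props.put("org.quartz.dataSource.quartzDS.driver", "org.postgresql.Driver");
        props.put("org.quartz.dataSource.quartzDS.URL", "jdbc:postgresql://localhost:5432/rhq");
        props.put("org.quartz.dataSource.quartzDS.user", "rhqadmin");
        props.put("org.quartz.dataSource.quartzDS.password", "rhqadmin");
        return new StdSchedulerFactory(props).getScheduler();
    }
}
```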
Master commit a06323a067fa942bb5100be2845c47f514626c79
Author: Jay Shaughnessy <jshaughn>
Date: Sat Aug 16 14:59:09 2014 -0400
Due to some git trouble this is something of a manually generated squash
commit. The changes are somewhat two-fold. First, there are changes
relevant to the initially reported symptom, which was the existence of
duplicate baselines for a single schedule. The code looked like it was
capable of generating and storing a duplicate, due to the use of Lists (not
Sets) in the APIs, and also the lack of a unique index even though there is
a 1:1 relationship between baseline and schedule. The first set of commits
addressed these issues and also added chunking to avoid large Txs and
Oracle IN-clause limits.
- Make the db index on baseline's scheduleId unique. There should only be
one baseline for a schedule. This will prevent the problem in the BZ,
possibly at the expense of failure when trying to store a duplicate, but
that would perhaps help us pinpoint a problematic code-path. (A sketch of
this step follows the list.)
- Use Sets (or a Map) as opposed to Lists in several places in the local API.
- Simplify, and I think make more efficient, the query for selecting schedules
without a current baseline. Make it return distinct ids (this is a hedge;
I don't think it would return duplicates, but this ensures that constraint).
Replace a left join with a NOT EXISTS sub-select.
- Chunk the storage of baselines to prevent large Txs, don't pass entities
across a Tx boundary, fetch all Schedules in one query, and (maybe) simplify
the entity manager interaction. (A chunking sketch also follows the list.)
- Simplify some of the TransactionAttribute decls
- Mark MeasurementSchedule.baseline as an optional 1-1 relationship, just to
make it explicit on the entity.
- Fix some local SLSB jdoc
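To make the first bullet above concrete, the dedupe-then-constrain step can be sketched as plain JDBC. The table and column names come from the query in the description below, while the index name and the keep-lowest-id policy are illustrative assumptions, not the actual upgrade task:

```java
import java.sql.Connection;
import java.sql.Statement;
import javax.sql.DataSource;

public class BaselineIndexMigrationSketch {
    // Sketch only: the real upgrade runs equivalent statements from RHQ's
    // db-upgrade machinery, not hand-written JDBC like this.
    public static void enforceOneBaselinePerSchedule(DataSource ds) throws Exception {
        try (Connection conn = ds.getConnection();
             Statement st = conn.createStatement()) {
            // 1) Remove duplicates, arbitrarily keeping the lowest-id row per schedule.
            st.executeUpdate(
                "DELETE FROM rhq_measurement_bline" +
                " WHERE id NOT IN (SELECT MIN(id) FROM rhq_measurement_bline" +
                "                  GROUP BY schedule_id)");
            // 2) Enforce the 1:1 baseline-to-schedule relationship going forward; any
            //    later attempt to store a duplicate fails fast with a constraint
            //    violation, pinpointing the problematic code path.
            st.executeUpdate(
                "CREATE UNIQUE INDEX rhq_meas_bline_sid_uq" +
                " ON rhq_measurement_bline (schedule_id)");
        }
    }
}
```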
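Likewise, the chunking change can be sketched as follows; the 1000-id chunk size reflects Oracle's IN-clause limit, and the manager interface is a hypothetical stand-in for the real SLSB API:

```java
import java.util.List;

public class BaselineChunkingSketch {
    private static final int CHUNK_SIZE = 1000; // Oracle allows at most 1000 literals in an IN clause

    // Hypothetical facade; the real bean methods have different names and signatures.
    interface BaselineManager {
        // persists one chunk in its own (REQUIRES_NEW) transaction
        void calculateAndStoreBaselines(List<Integer> scheduleIds);
    }

    static void storeBaselinesChunked(BaselineManager manager, List<Integer> scheduleIds) {
        for (int from = 0; from < scheduleIds.size(); from += CHUNK_SIZE) {
            int to = Math.min(from + CHUNK_SIZE, scheduleIds.size());
            // Each chunk gets its own Tx, so no single Tx grows huge and no managed
            // entities are passed across a Tx boundary.
            manager.calculateAndStoreBaselines(scheduleIds.subList(from, to));
        }
    }
}
```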
Later, the actual root cause was discovered: the baseline job executing
concurrently, with the same schedule occurring in the overlapping jobs. This
scenario was originally thought to be impossible due to the stateful
nature of the quartz job. But in fact it is possible in certain conditions
due to quartz clustered recovery logic (see BZ comments). These changes
minimize the concurrent job possibility:
- increase the quartz clusterCheckinInterval from 30s to 2m. This
decreases the chance of false positives when quartz checks for down
schedulers and should have little or no effect on the recovery of our jobs.
- move the data calculation work from DataPurgeJob to the new DataCalcJob,
leaving DataPurgeJob to deal only with RDB maintenance and letting
the new DataCalcJob only work on baseline and OOB generation. There is
no reason for DataCalc work to have to wait for DB maintenance work.
Also, stagger the jobs by 1/2 hour to lessen concurrent work.
- for the cron jobs, switch the misfire policy from the default (RUN_ONCE)
to DO_NOTHING. So, instead of long-running jobs starting up again
immediately after finishing, let them wait for the next scheduled start
time. This also reduces the chances of overlap if we do in fact get a
false positive recovery situation. (A trigger sketch follows this list.
As an aside, there is a good web page regarding clustered quartz logic:
http://flylib.com/books/en/2.65.1.91/1/)
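For illustration, a trigger with the DO_NOTHING misfire policy looks like this using the Quartz 2.x fluent API (the RHQ code base of this era used an older Quartz, so the actual calls differ; the trigger identity and cron expression here are invented):

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.CronTrigger;
import org.quartz.TriggerBuilder;

public class DataCalcTriggerSketch {
    static CronTrigger buildTrigger() {
        return TriggerBuilder.newTrigger()
            .withIdentity("DataCalcJobTrigger", "rhq")
            .withSchedule(CronScheduleBuilder
                // hourly on the half hour, staggered away from the purge job
                .cronSchedule("0 30 * * * ?")
                // On a misfire, skip the missed execution and wait for the next
                // scheduled fire time, instead of the default behavior of firing
                // once immediately (which invites overlapping executions).
                .withMisfireHandlingInstructionDoNothing())
            .build();
    }
}
```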
Given the work described above it is now very unlikely, though not completely
impossible, that we will have overlapping baseline calculation jobs.
Because we now have a unique index on baseline.schedule we will immediately
detect an attempt to store a duplicate baseline for a schedule. In the
unlikely event that this happens we know it's [very likely] due to an
overlapping job. So if we fail to save baselines because of a constraint
violation we just catch, log a warning, and continue.
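That guard might look roughly like the following; saveChunkInNewTransaction is a hypothetical placeholder for the real persistence call, and production code would inspect the exception chain rather than treating every PersistenceException as a constraint violation:

```java
import java.util.List;
import javax.persistence.PersistenceException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.rhq.core.domain.measurement.MeasurementBaseline;

public class BaselineSaveGuardSketch {
    private static final Log LOG = LogFactory.getLog(BaselineSaveGuardSketch.class);

    static void saveBaselines(List<MeasurementBaseline> chunk) {
        try {
            saveChunkInNewTransaction(chunk);
        } catch (PersistenceException e) {
            // With the unique index in place, a duplicate baseline surfaces as a
            // constraint violation. That almost certainly means an overlapping job
            // already stored a baseline for the same schedule: warn and continue.
            LOG.warn("Failed to save baselines, likely a duplicate from an overlapping job; continuing", e);
        }
    }

    // Hypothetical stand-in for the real SLSB call (REQUIRES_NEW Tx).
    private static void saveChunkInNewTransaction(List<MeasurementBaseline> chunk) {
        // persist the chunk...
    }
}
```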
release/jon3.3.x commit 184f13babbfeacc5bb0e215f2c0145ba9197aba2
Author: Jay Shaughnessy <jshaughn>
Date: Sat Aug 16 14:59:09 2014 -0400
Cherry-pick master: 3e4f9571733ea0c7867acf693efc5fc82110b41a

Moving to ON_QA as available for test with the following brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=381194

@Jay: Hi, I need some info please: would this fix correct the situation carried over from an old release (i.e., if one runs `rhqctl upgrade` on servers already in this state)? (May I suspect not?) Another thing: is there any concrete scenario that gives even a small chance of producing this situation in 3.3? :) Would it be enough to set up an HA pair of servers, plug in some agents, wait for several rounds of purge jobs to run, and trace the database for possible duplicates? So far I understand that the current enhancements prevent this from happening in many respects, so my chances of reproducing it or finding a hole would be close to 0. :) Thanks.

Hi Garik,

Yes, the upgrade will repair the duplicates issue. When the DB is upgraded we execute statements that first remove duplicate baselines, and then apply a unique index that will prevent duplicates in the future. We don't really expect duplicates given the other changes here, but we can't completely rule it out. But now we will handle it if it happens.

Recreation steps are not simple; it was hard *before* the fix, requiring some manual manipulation of the Quartz jobs. I think running whatever HA and HA perf test plans you guys typically run through would be sufficient.

#VERIFIED

Made a performance server with multiple agents and an HA structure. Everything worked fine, and no log entries like those in the description appeared after a day of running.
Description of problem:

Attempting to access any metric or metric schedule for certain resources fails and a "Cannot load schedules" message is displayed. Reviewing the message center reveals:

Root Cause: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline

Detail: java.lang.RuntimeException: [1406822207072] javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline

--- STACK TRACE FOLLOWS ---
[1406822207072] javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at Unknown.RuntimeException_0(Unknown Source)
    at Unknown.instantiate_32(Unknown Source)
    at Unknown.$instantiate_0(Unknown Source)
    at Unknown.$instantiate(Unknown Source)
    at Unknown.$readObject(Unknown Source)
    at Unknown.$onResponseReceived(Unknown Source)
    at Unknown.onResponseReceived_4(Unknown Source)
    at Unknown.$fireOnResponseReceived(Unknown Source)
    at Unknown.onReadyStateChange_0(Unknown Source)
    at Unknown.anonymous(Unknown Source)
    at Unknown.apply(Unknown Source)
    at Unknown.entry0(Unknown Source)

The server log provides a stack trace that links this to the baseline table containing multiple baseline entries for the same schedule_id:

ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-126) JBAS014134: EJB Invocation failed on component MeasurementScheduleManagerBean for method public abstract org.rhq.core.domain.util.PageList org.rhq.enterprise.server.measurement.MeasurementScheduleManagerRemote.findSchedulesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.MeasurementScheduleCriteria): javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInOurTx(CMTTxInterceptor.java:165) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:250) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    ...
    at org.rhq.enterprise.server.measurement.MeasurementScheduleManagerLocal$$$view17.findSchedulesByCriteria(Unknown Source) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    at org.rhq.coregui.server.gwt.MeasurementDataGWTServiceImpl.findMeasurementSchedulesByCriteria(MeasurementDataGWTServiceImpl.java:152)
    ...
Caused by: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1387) [hibernate-entitymanager-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1310) [hibernate-entitymanager-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.hibernate.ejb.QueryImpl.getResultList(QueryImpl.java:273) [hibernate-entitymanager-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.rhq.enterprise.server.util.QueryUtility.fetchPagedDataAndCount(QueryUtility.java:295) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    at org.rhq.enterprise.server.util.CriteriaQueryRunner.execute(CriteriaQueryRunner.java:78) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    at org.rhq.enterprise.server.measurement.MeasurementScheduleManagerBean.findSchedulesByCriteria(MeasurementScheduleManagerBean.java:1580) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    ...
Caused by: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:104) [hibernate-core-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.hibernate.loader.entity.EntityLoader.loadByUniqueKey(EntityLoader.java:161) [hibernate-core-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    ...

SELECT * FROM rhq_measurement_bline WHERE schedule_id=123432

   ID  BL_USER_ENTERED      BL_MIN      BL_MAX     BL_MEAN       BL_COMPUTE_TIME  SCHEDULE_ID
19999                0  1125400000  4329100000  1976400000  28-Jul-2014 19:22:04       123432
20033                0  1125400000  4329100000  1976400000  28-Jul-2014 19:22:04       123432
20052                0  1125400000  4329100000  1976400000  28-Jul-2014 19:22:04       123432

Version-Release number of selected component (if applicable): 3.2.2

How reproducible: Unknown

Actual results: The same baseline entry appears multiple times in the rhq_measurement_bline table.

Expected results: Only one baseline entry exists for each schedule.

Additional info: In the environment where this issue was reported, 42,000 metric schedules are impacted. Some schedule_ids have 2 entries, some have 3, and others have 4. This environment has two storage nodes operating in a cluster. Perhaps storage node read/write timeouts while calculating baselines are causing the same data to be calculated and inserted multiple times?