|Summary:||MeasurementScheduleManagerBean.findSchedulesByCriteria fails due to duplicate baselines for single schedule_id|
|Product:||[JBoss] JBoss Operations Network||Reporter:||Larry O'Leary <loleary>|
|Component:||Monitoring -- Other||Assignee:||Jay Shaughnessy <jshaughn>|
|Status:||CLOSED CURRENTRELEASE||QA Contact:||Garik Khachikyan <gkhachik>|
|Version:||JON 3.2.2||CC:||gkhachik, jshaughn, lcarlon, mkoci, rhatlapa|
|Target Release:||JON 3.3.0|
|Fixed In Version:||Doc Type:||Bug Fix|
Attempting to access resource schedules threw an exception because duplicate baseline rows existed for a single schedule_id. Schedules would not load and JBoss ON was unable to retrieve the metric data. A unique index has been added on the baseline's schedule ID column to detect and prevent duplicate baselines for a schedule. Schedules now load as expected.
|Last Closed:||2014-12-11 14:04:01 UTC||Type:||Bug|
Description Larry O'Leary 2014-07-31 20:45:13 UTC
Description of problem:
Attempting to access any metric or metric schedule for certain resources fails and a "Cannot load schedules" message is displayed. Reviewing the message center reveals:

Root Cause: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
Detail: java.lang.RuntimeException: javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
--- STACK TRACE FOLLOWS ---
javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at Unknown.RuntimeException_0(Unknown Source)
    at Unknown.instantiate_32(Unknown Source)
    at Unknown.$instantiate_0(Unknown Source)
    at Unknown.$instantiate(Unknown Source)
    at Unknown.$readObject(Unknown Source)
    at Unknown.$onResponseReceived(Unknown Source)
    at Unknown.onResponseReceived_4(Unknown Source)
    at Unknown.$fireOnResponseReceived(Unknown Source)
    at Unknown.onReadyStateChange_0(Unknown Source)
    at Unknown.anonymous(Unknown Source)
    at Unknown.apply(Unknown Source)
    at Unknown.entry0(Unknown Source)

The server log provides a stack trace that links this to the baseline table containing multiple baseline entries for the same schedule_id:

ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-126) JBAS014134: EJB Invocation failed on component MeasurementScheduleManagerBean for method public abstract org.rhq.core.domain.util.PageList org.rhq.enterprise.server.measurement.MeasurementScheduleManagerRemote.findSchedulesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.MeasurementScheduleCriteria): javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInOurTx(CMTTxInterceptor.java:165) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:250) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    ...
    at org.rhq.enterprise.server.measurement.MeasurementScheduleManagerLocal$$$view17.findSchedulesByCriteria(Unknown Source) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    at org.rhq.coregui.server.gwt.MeasurementDataGWTServiceImpl.findMeasurementSchedulesByCriteria(MeasurementDataGWTServiceImpl.java:152)
    ...
Caused by: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1387) [hibernate-entitymanager-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1310) [hibernate-entitymanager-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.hibernate.ejb.QueryImpl.getResultList(QueryImpl.java:273) [hibernate-entitymanager-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.rhq.enterprise.server.util.QueryUtility.fetchPagedDataAndCount(QueryUtility.java:295) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    at org.rhq.enterprise.server.util.CriteriaQueryRunner.execute(CriteriaQueryRunner.java:78) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    at org.rhq.enterprise.server.measurement.MeasurementScheduleManagerBean.findSchedulesByCriteria(MeasurementScheduleManagerBean.java:1580) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    ...
Caused by: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:104) [hibernate-core-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.hibernate.loader.entity.EntityLoader.loadByUniqueKey(EntityLoader.java:161) [hibernate-core-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    ...
SELECT * FROM rhq_measurement_bline WHERE schedule_id=123432

ID     BL_USER_ENTERED  BL_MIN      BL_MAX      BL_MEAN     BL_COMPUTE_TIME       SCHEDULE_ID
19999  0                1125400000  4329100000  1976400000  28-Jul-2014 19:22:04  123432
20033  0                1125400000  4329100000  1976400000  28-Jul-2014 19:22:04  123432
20052  0                1125400000  4329100000  1976400000  28-Jul-2014 19:22:04  123432

Version-Release number of selected component (if applicable):
3.2.2

How reproducible:
Unknown

Actual results:
The same baseline entry appears multiple times in the rhq_measurement_bline table.

Expected results:
Only one baseline entry exists for each schedule.

Additional info:
In the environment where this issue was reported, 42,000 metric schedules are actually impacted. Some schedule_ids have 2 entries, some have 3, and others have 4. This environment has two storage nodes operating in a cluster. Perhaps storage node read/write timeouts while calculating baselines cause the same data to be calculated and inserted multiple times?
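To make the reported state concrete, here is a minimal, self-contained sketch (hypothetical helper, not RHQ code) that detects schedule_ids violating the expected 1:1 baseline-to-schedule relationship, given (baseline id -> schedule_id) rows like the query output above:

```java
import java.util.*;
import java.util.stream.*;

public class BaselineAudit {
    // Returns schedule_ids that have more than one baseline row, with their counts.
    public static Map<Integer, Long> duplicateScheduleIds(Map<Integer, Integer> rowToSchedule) {
        return rowToSchedule.values().stream()
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()))
                .entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        // Rows mirroring the query output: three baselines share schedule_id 123432.
        Map<Integer, Integer> rows = new LinkedHashMap<>();
        rows.put(19999, 123432);
        rows.put(20033, 123432);
        rows.put(20052, 123432);
        rows.put(21000, 123500); // a healthy schedule with a single baseline
        System.out.println(duplicateScheduleIds(rows)); // prints {123432=3}
    }
}
```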
Comment 1 Jay Shaughnessy 2014-08-05 16:32:56 UTC
Created https://github.com/rhq-project/rhq/pull/105 with potentially useful changes related to this issue.
Comment 2 Jay Shaughnessy 2014-08-15 02:56:26 UTC
The multiple baselines come about due to overlapping data purge jobs being run; within the data purge jobs, unfortunate timing can produce overlapping baseline generation. In general, stateful Quartz jobs don't overlap; that is exactly the point of declaring them stateful. But it is possible in basically one scenario. First, it needs to be an HA environment, which means multiple servers and therefore multiple (clustered) Quartz schedulers. Then, it requires a false positive check for a "down" scheduler.

Each Quartz scheduler intermittently checks in with the DB and updates a check-in time. When it does that, it also checks the other schedulers to make sure they have all checked in on schedule. If a scheduler has missed a check-in, it is considered down, and the scheduler checking in performs recovery logic on in-progress jobs, which it presumes are dead. It:
1) cleans up the in-progress job
2) if the job is set for recovery, executes it immediately (although all of our jobs are set to recovery=false)
3) applies the misfire action set on the trigger. By default (for a cron trigger) it will immediately launch the job if it has misfired at least once. A "misfire" means it missed a scheduled execution because a previous execution was still running. If it has not misfired, it is just rescheduled for the next scheduled execution after the current time.

This is what is happening. A running scheduler is missing its check-in, and therefore the job is actually running but is considered dead. The recovery ends up executing it again, and then you have overlapping executions.

There are two main reasons this happens. By far the most prevalent is lack of clock sync among the servers and/or database. It is critical that clock sync is close; Quartz allows only a little bit of wiggle room when checking for a missed check-in.

The other main reason is some condition that doesn't allow nodes to check in to the database in a timely manner (at the cluster check-in interval). This can happen if there is too much lag time when updating the DB, for whatever reason, potentially only on one server.
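The check-in test described above can be modeled in a few lines. This is an illustrative sketch only, not actual Quartz source; the tolerance value is an assumption. It shows how an observer clock running ahead of the database clock turns a healthy peer into a false positive:

```java
public class CheckinModel {
    // True if the peer looks dead from this server's point of view: its last
    // recorded check-in is older than the check-in interval plus some tolerance.
    static boolean looksDown(long peerLastCheckinMillis, long nowMillis,
                             long checkinIntervalMillis, long toleranceMillis) {
        return nowMillis - peerLastCheckinMillis > checkinIntervalMillis + toleranceMillis;
    }

    public static void main(String[] args) {
        long interval = 30_000;  // old default clusterCheckinInterval: 30s
        long tolerance = 7_500;  // assumed wiggle room; the real value differs
        long peerCheckin = 100_000;

        // Clocks in sync: peer checked in 20s ago -> healthy.
        System.out.println(looksDown(peerCheckin, 120_000, interval, tolerance)); // false
        // Observer's clock 30s ahead: the same healthy peer now "missed" its check-in.
        System.out.println(looksDown(peerCheckin, 150_000, interval, tolerance)); // true
    }
}
```

Once `looksDown` returns true for a scheduler that is merely slow or skewed, the recovery logic above runs against jobs that are still alive, which is exactly the overlap scenario.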
Comment 3 Jay Shaughnessy 2014-08-17 19:34:53 UTC
Master commit a06323a067fa942bb5100be2845c47f514626c79
Author: Jay Shaughnessy <email@example.com>
Date: Sat Aug 16 14:59:09 2014 -0400

Due to some git trouble this is something of a manually generated squash commit. The changes are somewhat two-fold. First, there are changes relevant to the initially reported symptom, which was the existence of duplicate baselines for a single schedule. The code looked like it was capable of generating and storing a duplicate, due to the use of Lists (not Sets) in the APIs, and also the lack of a unique index even though there is a 1:1 relationship between baseline and schedule. The first set of commits addressed these issues and also added chunking to avoid large Txs and Oracle IN-clause limits.

- Make the db index on baseline's scheduleId unique. There should only be one baseline for a schedule. This will prevent the problem in the BZ, possibly at the expense of failure when trying to store a duplicate, but that would perhaps help us pinpoint a problematic code-path.
- Use Sets (or a Map) as opposed to Lists in several places in the local API.
- Simplify and, I think, make more efficient the query for selecting schedules without a current baseline. Make it return distinct ids (this is a hedge; I don't think it would return duplicates, but this ensures that constraint). Replace a left join with a NOT EXISTS sub-select.
- Chunk the storage of baselines to prevent large Txs, don't pass entities across the Tx boundary, fetch all schedules in one query, and (maybe) simplify the entity manager interaction.
- Simplify some of the TransactionAttribute declarations.
- Mark MeasurementSchedule.baseline as an optional 1-1 relationship, just to make it explicit on the entity.
- Fix some local SLSB jdoc.

Later, the actual root cause was discovered: the baseline job executing concurrently, with the same schedule occurring in the overlapping jobs. This scenario was originally thought to be impossible due to the stateful nature of the quartz job.
But in fact it is possible in certain conditions due to Quartz clustered recovery logic (see BZ comments). These changes minimize the concurrent job possibility:

- Increase the quartz clusterCheckinInterval from 30s to 2m. This decreases the chance of false positives when Quartz checks for down schedulers and should have little or no effect on the recovery of our jobs.
- Move the data calculation work from DataPurgeJob to the new DataCalcJob, leaving DataPurgeJob to deal only with RDB maintenance and letting the new DataCalcJob work only on baseline and OOB generation. There is no reason for DataCalc work to have to wait for DB maintenance work. Also, stagger the jobs by a half hour to lessen concurrent work.
- For the cron jobs, switch the misfire policy from the default (RUN_ONCE) to DO_NOTHING. So, instead of long-running jobs starting up again immediately after finishing, let them wait for the next scheduled start time. This also reduces the chances of overlap if we do in fact get a false positive recovery situation. (As an aside, there is a good web page here regarding clustered Quartz logic: http://flylib.com/books/en/22.214.171.124/1/)

Given the work described above, overlapping baseline calculation jobs are now very unlikely, though not completely ruled out. Because we now have a unique index on baseline.schedule, we will immediately detect an attempt to store a duplicate baseline for a schedule. In the unlikely event that this happens, we know it is [very likely] due to an overlapping job. So if we fail to save baselines because of a constraint violation, we just catch, log a warning, and continue.
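The chunked storage plus catch-warn-continue behavior described above can be sketched as follows. Class, method, and DAO names here are hypothetical, not the actual RHQ code path; the point is one small transaction per chunk, with a unique-constraint failure on one chunk treated as a survivable sign of an overlapping job:

```java
import java.util.*;

public class BaselineStore {
    static final int CHUNK_SIZE = 1000; // also keeps Oracle's 1000-element IN-clause limit in mind

    interface Dao { void insertBaselines(List<Integer> scheduleIds); }

    // Stores baselines in chunks; returns the number of chunks stored successfully.
    static int storeInChunks(List<Integer> scheduleIds, Dao dao) {
        int storedChunks = 0;
        for (int from = 0; from < scheduleIds.size(); from += CHUNK_SIZE) {
            int to = Math.min(from + CHUNK_SIZE, scheduleIds.size());
            try {
                dao.insertBaselines(scheduleIds.subList(from, to)); // one small Tx per chunk
                storedChunks++;
            } catch (RuntimeException constraintViolation) {
                // Likely a duplicate baseline from an overlapping job: warn, keep going.
                System.err.println("WARN: skipping chunk " + from + ".." + to
                        + ": " + constraintViolation.getMessage());
            }
        }
        return storedChunks;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 2500; i++) ids.add(i);
        // A stub DAO that fails on the second chunk, simulating a duplicate baseline.
        Dao dao = chunk -> {
            if (chunk.contains(1500)) throw new RuntimeException("unique index violated");
        };
        System.out.println(storeInChunks(ids, dao)); // prints 2 (of 3 chunks)
    }
}
```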
Comment 4 Jay Shaughnessy 2014-08-19 14:32:44 UTC
release/jon3.3.x commit 184f13babbfeacc5bb0e215f2c0145ba9197aba2
Author: Jay Shaughnessy <firstname.lastname@example.org>
Date: Sat Aug 16 14:59:09 2014 -0400

Cherry-pick master: 3e4f9571733ea0c7867acf693efc5fc82110b41a
Comment 5 Simeon Pinder 2014-09-03 20:31:34 UTC
Moving to ON_QA as available for test with the following brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=381194
Comment 7 Garik Khachikyan 2014-09-23 12:31:29 UTC
@Jay: Hi, I need some info please: will this fix/correct the situation left over from an old release (i.e., if one runs `rhqctl upgrade` on servers in such a situation)? (May I suspect: no?) Another thing: is there any concrete scenario that gives even a small chance of producing such a situation in 3.3? :) Would it be enough to set up an HA environment with 2 servers, plug in some agents, wait for several rounds of purge jobs to run, and trace the database for possible duplicates? As far as I understand, the current enhancements prevent this from happening in many respects, so my chances of reproducing it or finding a hole would be close to 0 :) Thanks.
Comment 8 Jay Shaughnessy 2014-09-23 13:03:27 UTC
Hi Garik, Yes, the upgrade will repair the duplicates issue. When the DB is upgraded we execute statements that first remove duplicate baselines, and then apply a unique index that will prevent duplicates in the future. We don't really expect duplicates given the other changes here, but we can't completely rule them out. Now, though, we will handle it if it happens. Recreation steps are not simple; it was hard to reproduce *before* the fix, requiring some manual manipulation of the Quartz jobs. I think running whatever HA and HA perf test plans you guys typically run through would be sufficient.
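As a hedged illustration of the upgrade-time repair described above (illustrative only, not the actual DB upgrade statements): for each schedule_id, keep a single baseline row, here the one with the lowest id, so the unique index can then be applied safely.

```java
import java.util.*;

public class DedupOnUpgrade {
    // rows maps baselineId -> scheduleId; returns the baselineIds to keep.
    static SortedSet<Integer> survivors(SortedMap<Integer, Integer> rows) {
        Map<Integer, Integer> keepPerSchedule = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> row : rows.entrySet()) {
            // Iterating in ascending baselineId order, the first row per schedule wins.
            keepPerSchedule.putIfAbsent(row.getValue(), row.getKey());
        }
        return new TreeSet<>(keepPerSchedule.values());
    }

    public static void main(String[] args) {
        // The duplicate rows from the bug description plus one healthy schedule.
        SortedMap<Integer, Integer> rows = new TreeMap<>();
        rows.put(19999, 123432);
        rows.put(20033, 123432);
        rows.put(20052, 123432);
        rows.put(21000, 123500);
        System.out.println(survivors(rows)); // prints [19999, 21000]
    }
}
```

Everything not in the survivor set would be deleted before the unique index is created.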
Comment 9 Garik Khachikyan 2014-10-17 11:26:48 UTC
#VERIFIED Set up a performance server with multiple agents in an HA structure. Everything worked fine, and no log messages like those in the description appeared after a day of running.