Bug 1125439
| Summary: | MeasurementScheduleManagerBean.findSchedulesByCriteria fails due to duplicate baselines for single schedule_id | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Operations Network | Reporter: | Larry O'Leary <loleary> |
| Component: | Monitoring -- Other | Assignee: | Jay Shaughnessy <jshaughn> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Garik Khachikyan <gkhachik> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | JON 3.2.2 | CC: | gkhachik, jshaughn, lcarlon, mkoci, rhatlapa |
| Target Milestone: | ER02 | | |
| Target Release: | JON 3.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | Attempting to access resource schedules threw an exception because duplicate baseline rows existed for a single schedule_id. Schedules would not load and JBoss ON was unable to retrieve the metric data. On upgrade, existing duplicate baselines are removed, and a unique index on the baseline table's schedule_id column now prevents duplicate baselines for a schedule. Schedules load as expected. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-12-11 14:04:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Created https://github.com/rhq-project/rhq/pull/105 with potentially useful changes related to this issue.
The multiple baselines come about due to overlapping data purge jobs being run, and within the data purge jobs unfortunate timing can produce overlapping baseline generation. In general stateful quartz jobs don't overlap; that's exactly the point of declaring them stateful. But it is possible in basically one scenario. First, it needs to be an HA environment, which means multiple servers and therefore multiple (clustered) quartz schedulers. Then, it requires a false positive check for a "down" scheduler. Each quartz scheduler intermittently checks in with the DB and updates a check-in time. When it does that it also checks the other schedulers to make sure they have all checked in on schedule. If a scheduler has missed a check-in it is considered down, and the scheduler checking in performs recovery logic on in-progress jobs which it presumes are dead. It:
1) cleans up the in-progress job
2) if the job is set for recovery it is then executed immediately (although all of our jobs are set to recovery=false)
3) it then applies the misfire action set on the trigger.
By default (for a cron trigger) it will immediately launch the job if it has misfired at least once. A "misfire" means it missed a scheduled execution because a previous execution was still running. If it has not misfired it is just rescheduled for the next scheduled execution after the current time.
This is what is happening. A running scheduler is missing its check-in. The job is actually running, but it is considered dead. The recovery ends up executing it again, and then you have overlapping executions.
There are a few reasons this can happen. By far the most prevalent is lack of clock sync among the servers and/or the database. It is critical that clocks be closely synced; quartz allows only a little wiggle room when checking for a missed check-in. The other main reason is some condition that prevents nodes from checking in to the database in a timely manner (at the cluster check-in interval). This can happen if there is too much lag when updating the DB, for whatever reason, potentially on only one server.
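For reference, the check-in mechanism described above is driven by a handful of standard Quartz clustered JobStore properties. Below is a minimal sketch, assuming Quartz's StdSchedulerFactory and JDBC JobStore; the instance name, datasource name, and connection details are placeholder values, not RHQ's actual configuration:

```java
import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class ClusteredSchedulerSketch {
    public static Scheduler createScheduler() throws Exception {
        Properties props = new Properties();
        props.put("org.quartz.scheduler.instanceName", "RHQScheduler"); // shared by all nodes in the cluster
        props.put("org.quartz.scheduler.instanceId", "AUTO");           // unique per node
        props.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        props.put("org.quartz.threadPool.threadCount", "5");
        props.put("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        props.put("org.quartz.jobStore.isClustered", "true");
        props.put("org.quartz.jobStore.dataSource", "quartzDS");
        // How often each node updates its check-in row and inspects the others.
        // A node that misses a check-in (plus a small tolerance) is presumed dead,
        // which is why clock skew or DB update lag produces false-positive recovery.
        props.put("org.quartz.jobStore.clusterCheckinInterval", "120000"); // ms
        // Placeholder datasource; any JDBC database shared by all nodes works.
        props.put("org.quartz.dataSource.quartzDS.driver", "org.postgresql.Driver");
        props.put("org.quartz.dataSource.quartzDS.URL", "jdbc:postgresql://localhost:5432/rhq");
        props.put("org.quartz.dataSource.quartzDS.user", "rhqadmin");
        props.put("org.quartz.dataSource.quartzDS.password", "rhqadmin");
        return new StdSchedulerFactory(props).getScheduler();
    }
}
```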
Master commit a06323a067fa942bb5100be2845c47f514626c79
Author: Jay Shaughnessy <jshaughn>
Date: Sat Aug 16 14:59:09 2014 -0400
Due to some git trouble this is something of a manually generated squash
commit. The changes are somewhat two-fold. First, there are changes
relevant to the initially reported symptom, which was the existence of
duplicate baselines for a single schedule. The code looked like it was
capable of generating and storing a duplicate, due to the use of Lists (not
Sets) in the APIs, and also the lack of a unique index even though there is
a 1:1 relationship between baseline and schedule. The first set of commits
addressed these issues and also added chunking to avoid large Txs and
Oracle IN-clause limits.
- Make the db index on baseline's scheduleId unique. There should only be
one baseline for a schedule. This will prevent the problem in the BZ,
possibly at the expense of failure when trying to store a duplicate, but
that would perhaps help us pinpoint a problematic code-path. (A sketch of
this step follows the list.)
- Use Sets (or a Map) as opposed to Lists in several places in the local API.
- Simplify, and I think make more efficient, the query for selecting schedules
without a current baseline. Make it return distinct ids (this is a hedge;
I don't think it would return duplicates, but this ensures that constraint).
Replace a left join with a NOT EXISTS sub-select.
- Chunk the storage of baselines to prevent large Txs, don't pass entities
across a Tx boundary, fetch all Schedules in one query, and (maybe) simplify
the entity manager interaction. (A chunking sketch also follows the list.)
- Simplify some of the TransactionAttribute decls
- Mark MeasurementSchedule.baseline as an optional 1-1 relationship, just to
make it explicit on the entity.
- Fix some local SLSB jdoc
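To make the first bullet above concrete, the dedupe-then-constrain step can be sketched as plain JDBC. The table and column names come from the query in the description below, while the index name and the keep-lowest-id policy are illustrative assumptions, not the actual upgrade task:

```java
import java.sql.Connection;
import java.sql.Statement;
import javax.sql.DataSource;

public class BaselineIndexMigrationSketch {
    // Sketch only: the real upgrade runs equivalent statements from RHQ's
    // db-upgrade machinery, not hand-written JDBC like this.
    public static void enforceOneBaselinePerSchedule(DataSource ds) throws Exception {
        try (Connection conn = ds.getConnection();
             Statement st = conn.createStatement()) {
            // 1) Remove duplicates, arbitrarily keeping the lowest-id row per schedule.
            st.executeUpdate(
                "DELETE FROM rhq_measurement_bline" +
                " WHERE id NOT IN (SELECT MIN(id) FROM rhq_measurement_bline" +
                "                  GROUP BY schedule_id)");
            // 2) Enforce the 1:1 baseline-to-schedule relationship going forward; any
            //    later attempt to store a duplicate fails fast with a constraint
            //    violation, pinpointing the problematic code path.
            st.executeUpdate(
                "CREATE UNIQUE INDEX rhq_meas_bline_sid_uq" +
                " ON rhq_measurement_bline (schedule_id)");
        }
    }
}
```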
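Likewise, the chunking change can be sketched as follows; the 1000-id chunk size reflects Oracle's IN-clause limit, and the manager interface is a hypothetical stand-in for the real SLSB API:

```java
import java.util.List;

public class BaselineChunkingSketch {
    private static final int CHUNK_SIZE = 1000; // Oracle allows at most 1000 literals in an IN clause

    // Hypothetical facade; the real bean methods have different names and signatures.
    interface BaselineManager {
        // persists one chunk in its own (REQUIRES_NEW) transaction
        void calculateAndStoreBaselines(List<Integer> scheduleIds);
    }

    static void storeBaselinesChunked(BaselineManager manager, List<Integer> scheduleIds) {
        for (int from = 0; from < scheduleIds.size(); from += CHUNK_SIZE) {
            int to = Math.min(from + CHUNK_SIZE, scheduleIds.size());
            // Each chunk gets its own Tx, so no single Tx grows huge and no managed
            // entities are passed across a Tx boundary.
            manager.calculateAndStoreBaselines(scheduleIds.subList(from, to));
        }
    }
}
```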
Later, the actual root cause was discovered: the baseline job executing
concurrently, with the same schedule occurring in the overlapping jobs. This
scenario was originally thought to be impossible due to the stateful
nature of the quartz job. But in fact it is possible in certain conditions
due to quartz clustered recovery logic (see BZ comments). These changes
minimize the concurrent job possibility:
- increase the quartz clusterCheckinInterval from 30s to 2m. This
decreases the chance of false positives when quartz checks for down
schedulers and should have little or no effect on the recovery of our jobs.
- move the data calculation work from DataPurgeJob to the new DataCalcJob,
leaving DataPurgeJob to deal only with RDB maintenance and letting
the new DataCalcJob only work on baseline and OOB generation. There is
no reason for DataCalc work to have to wait for DB maintenance work.
Also, stagger the jobs by 1/2 hour to lessen concurrent work.
- for the cron jobs, switch the misfire policy from the default (RUN_ONCE)
to DO_NOTHING. So, instead of long-running jobs starting up again
immediately after finishing, let them wait for the next scheduled start
time. This also reduces the chances of overlap if we do in fact get a
false positive recovery situation. (A trigger sketch follows this list.
As an aside, there is a good web page regarding clustered quartz logic:
http://flylib.com/books/en/2.65.1.91/1/)
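For illustration, a trigger with the DO_NOTHING misfire policy looks like this using the Quartz 2.x fluent API (the RHQ code base of this era used an older Quartz, so the actual calls differ; the trigger identity and cron expression here are invented):

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.CronTrigger;
import org.quartz.TriggerBuilder;

public class DataCalcTriggerSketch {
    static CronTrigger buildTrigger() {
        return TriggerBuilder.newTrigger()
            .withIdentity("DataCalcJobTrigger", "rhq")
            .withSchedule(CronScheduleBuilder
                // hourly on the half hour, staggered away from the purge job
                .cronSchedule("0 30 * * * ?")
                // On a misfire, skip the missed execution and wait for the next
                // scheduled fire time, instead of the default behavior of firing
                // once immediately (which invites overlapping executions).
                .withMisfireHandlingInstructionDoNothing())
            .build();
    }
}
```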
Given the work described above it is now very unlikely, though not completely
impossible, that we will have overlapping baseline calculation jobs.
Because we now have a unique index on baseline.schedule we will immediately
detect an attempt to store a duplicate baseline for a schedule. In the
unlikely event that this happens we know it's [very likely] due to an
overlapping job. So if we fail to save baselines because of a constraint
violation we just catch, log a warning, and continue.
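That guard might look roughly like the following; saveChunkInNewTransaction is a hypothetical placeholder for the real persistence call, and production code would inspect the exception chain rather than treating every PersistenceException as a constraint violation:

```java
import java.util.List;
import javax.persistence.PersistenceException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.rhq.core.domain.measurement.MeasurementBaseline;

public class BaselineSaveGuardSketch {
    private static final Log LOG = LogFactory.getLog(BaselineSaveGuardSketch.class);

    static void saveBaselines(List<MeasurementBaseline> chunk) {
        try {
            saveChunkInNewTransaction(chunk);
        } catch (PersistenceException e) {
            // With the unique index in place, a duplicate baseline surfaces as a
            // constraint violation. That almost certainly means an overlapping job
            // already stored a baseline for the same schedule: warn and continue.
            LOG.warn("Failed to save baselines, likely a duplicate from an overlapping job; continuing", e);
        }
    }

    // Hypothetical stand-in for the real SLSB call (REQUIRES_NEW Tx).
    private static void saveChunkInNewTransaction(List<MeasurementBaseline> chunk) {
        // persist the chunk...
    }
}
```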
release/jon3.3.x commit 184f13babbfeacc5bb0e215f2c0145ba9197aba2
Author: Jay Shaughnessy <jshaughn>
Date: Sat Aug 16 14:59:09 2014 -0400
Cherry-pick master: 3e4f9571733ea0c7867acf693efc5fc82110b41a

Moving to ON_QA as available for test with the following brew build: https://brewweb.devel.redhat.com//buildinfo?buildID=381194

@Jay: Hi, I need some info please: would this fix correct the situation carried over from an old release (i.e., if one runs `rhqctl upgrade` on servers already in this state)? (May I suspect not?) Another thing: is there any concrete scenario that gives even a small chance of producing this situation in 3.3? :) Would it be enough to set up an HA pair of servers, plug in some agents, wait for several rounds of purge jobs to run, and trace the database for possible duplicates? So far I understand that the current enhancements prevent this from happening in many respects, so my chances of reproducing it or finding a hole would be close to 0. :) Thanks.

Hi Garik,

Yes, the upgrade will repair the duplicates issue. When the DB is upgraded we execute statements that first remove duplicate baselines, and then apply a unique index that will prevent duplicates in the future. We don't really expect duplicates given the other changes here, but we can't completely rule it out. But now we will handle it if it happens.

Recreation steps are not simple; it was hard *before* the fix, requiring some manual manipulation of the Quartz jobs. I think running whatever HA and HA perf test plans you guys typically run through would be sufficient.

#VERIFIED

Made a performance server with multiple agents and an HA structure. Everything worked fine, and no log entries like those in the description appeared after a day of running.
Description of problem:

Attempting to access any metric or metric schedule for certain resources fails and a "Cannot load schedules" message is displayed. Reviewing the message center reveals:

Root Cause: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline

Detail: java.lang.RuntimeException: [1406822207072] javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline

--- STACK TRACE FOLLOWS ---
[1406822207072] javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
-> org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at Unknown.RuntimeException_0(Unknown Source)
    at Unknown.instantiate_32(Unknown Source)
    at Unknown.$instantiate_0(Unknown Source)
    at Unknown.$instantiate(Unknown Source)
    at Unknown.$readObject(Unknown Source)
    at Unknown.$onResponseReceived(Unknown Source)
    at Unknown.onResponseReceived_4(Unknown Source)
    at Unknown.$fireOnResponseReceived(Unknown Source)
    at Unknown.onReadyStateChange_0(Unknown Source)
    at Unknown.anonymous(Unknown Source)
    at Unknown.apply(Unknown Source)
    at Unknown.entry0(Unknown Source)

The server log provides a stack trace that links this to the baseline table containing multiple baseline entries for the same schedule_id:

ERROR [org.jboss.as.ejb3.invocation] (http-/0.0.0.0:7080-126) JBAS014134: EJB Invocation failed on component MeasurementScheduleManagerBean for method public abstract org.rhq.core.domain.util.PageList org.rhq.enterprise.server.measurement.MeasurementScheduleManagerRemote.findSchedulesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.MeasurementScheduleCriteria): javax.ejb.EJBException: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.handleExceptionInOurTx(CMTTxInterceptor.java:165) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInOurTx(CMTTxInterceptor.java:250) [jboss-as-ejb3-7.2.1.Final-redhat-10.jar:7.2.1.Final-redhat-10]
    ...
    at org.rhq.enterprise.server.measurement.MeasurementScheduleManagerLocal$$$view17.findSchedulesByCriteria(Unknown Source) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    at org.rhq.coregui.server.gwt.MeasurementDataGWTServiceImpl.findMeasurementSchedulesByCriteria(MeasurementDataGWTServiceImpl.java:152)
    ...
Caused by: javax.persistence.PersistenceException: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1387) [hibernate-entitymanager-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.hibernate.ejb.AbstractEntityManagerImpl.convert(AbstractEntityManagerImpl.java:1310) [hibernate-entitymanager-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.hibernate.ejb.QueryImpl.getResultList(QueryImpl.java:273) [hibernate-entitymanager-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.rhq.enterprise.server.util.QueryUtility.fetchPagedDataAndCount(QueryUtility.java:295) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    at org.rhq.enterprise.server.util.CriteriaQueryRunner.execute(CriteriaQueryRunner.java:78) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    at org.rhq.enterprise.server.measurement.MeasurementScheduleManagerBean.findSchedulesByCriteria(MeasurementScheduleManagerBean.java:1580) [rhq-server.jar:4.9.0.JON320GA-redhat-2]
    ...
Caused by: org.hibernate.HibernateException: More than one row with the given identifier was found: 123432, for class: org.rhq.core.domain.measurement.MeasurementBaseline
    at org.hibernate.loader.entity.AbstractEntityLoader.load(AbstractEntityLoader.java:104) [hibernate-core-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    at org.hibernate.loader.entity.EntityLoader.loadByUniqueKey(EntityLoader.java:161) [hibernate-core-4.2.0.SP1-redhat-1.jar:4.2.0.SP1-redhat-1]
    ...

SELECT * FROM rhq_measurement_bline WHERE schedule_id=123432

   ID  BL_USER_ENTERED      BL_MIN      BL_MAX     BL_MEAN       BL_COMPUTE_TIME  SCHEDULE_ID
19999                0  1125400000  4329100000  1976400000  28-Jul-2014 19:22:04       123432
20033                0  1125400000  4329100000  1976400000  28-Jul-2014 19:22:04       123432
20052                0  1125400000  4329100000  1976400000  28-Jul-2014 19:22:04       123432

Version-Release number of selected component (if applicable): 3.2.2

How reproducible: Unknown

Actual results: The same baseline entry appears multiple times in the rhq_measurement_bline table.

Expected results: Only one baseline entry exists for each schedule.

Additional info: In the environment where this issue was reported, 42,000 metric schedules are impacted. Some schedule_ids have 2 entries, some have 3, and others have 4. This environment has two storage nodes operating in a cluster. Perhaps storage node read/write timeouts while calculating baselines are causing the same data to be calculated and inserted multiple times?