Bug 1095016

Summary: Quartz stops working if database connection is reset
Product: [JBoss] JBoss Operations Network Reporter: Larry O'Leary <loleary>
Component: Core Server, DatabaseAssignee: Jay Shaughnessy <jshaughn>
Status: CLOSED CURRENTRELEASE QA Contact: Mike Foley <mfoley>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: JON 3.2CC: ahovsepy, ctrianta, fbrychta, jshaughn, mazz, spinder
Target Milestone: DR02Keywords: Regression, TestCaseNeeded
Target Release: JON 3.2.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-07-29 00:17:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
server.log.png none

Description Larry O'Leary 2014-05-07 00:31:15 UTC
Description of problem:
If connection to the database is temporarily interrupted, Quartz stops working and no future jobs are scheduled. 

This seems to impact:
 - database purge job
 - metric compression
 - baseline calculation
 - resource operation execution
 - stale agent detection
 
This issue probably impacts many other functions of the server but the ones listed are the only ones that I have witnessed.

The server log simply repeats the following exception approximately every 15 seconds:

    18:50:52,691 WARN  [org.quartz.impl.jdbcjobstore.AttributeRestoringConnectionInvocationHandler] (QuartzScheduler_RHQScheduler-jboss.example.com1399419974912_ClusterManager) Failed restore connection's original auto commit setting.: java.sql.SQLException: Connection is not associated with a managed connection.org.jboss.jca.adapters.jdbc.jdk6.WrappedConnectionJDK6@11a75d80
        at org.jboss.jca.adapters.jdbc.WrappedConnection.lock(WrappedConnection.java:155)
        at org.jboss.jca.adapters.jdbc.WrappedConnection.setAutoCommit(WrappedConnection.java:707)
        at org.quartz.impl.jdbcjobstore.AttributeRestoringConnectionInvocationHandler.restoreOriginalAtributes(AttributeRestoringConnectionInvocationHandler.java:134) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.cleanupConnection(JobStoreSupport.java:3551) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.doCheckin(JobStoreSupport.java:3221) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.impl.jdbcjobstore.JobStoreSupport$ClusterManager.manage(JobStoreSupport.java:3811) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.impl.jdbcjobstore.JobStoreSupport$ClusterManager.run(JobStoreSupport.java:3847) [quartz-1.6.5.jar:1.6.5]

    18:50:52,935 ERROR [org.quartz.core.QuartzSchedulerThread] (RHQScheduler_QuartzSchedulerThread) releaseTriggerRetryLoop: RuntimeException null: java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy144.rollback(Unknown Source)
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.rollbackConnection(JobStoreSupport.java:3601) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInNonManagedTXLock(JobStoreSupport.java:3764) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInNonManagedTXLock(JobStoreSupport.java:3722) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.impl.jdbcjobstore.JobStoreSupport.releaseAcquiredTrigger(JobStoreSupport.java:2805) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.core.QuartzSchedulerThread.releaseTriggerRetryLoop(QuartzSchedulerThread.java:534) [quartz-1.6.5.jar:1.6.5]
        at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:354) [quartz-1.6.5.jar:1.6.5]
    Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_55]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_55]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_55]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_55]
        at org.quartz.impl.jdbcjobstore.AttributeRestoringConnectionInvocationHandler.invoke(AttributeRestoringConnectionInvocationHandler.java:71) [quartz-1.6.5.jar:1.6.5]
        ... 7 more
    Caused by: org.postgresql.util.PSQLException: This connection has been closed.
        at org.postgresql.jdbc2.AbstractJdbc2Connection.checkClosed(AbstractJdbc2Connection.java:822)
        at org.postgresql.jdbc2.AbstractJdbc2Connection.rollback(AbstractJdbc2Connection.java:839)
        at org.jboss.jca.adapters.jdbc.BaseWrapperManagedConnection.jdbcRollback(BaseWrapperManagedConnection.java:1107)
        at org.jboss.jca.adapters.jdbc.WrappedConnection.rollback(WrappedConnection.java:779)
        ... 12 more


It appears that Quartz never tries to get a clean connection.

Version-Release number of selected component (if applicable):
3.2.0.GA

How reproducible:
Always

Steps to Reproduce:
1.  Start JBoss ON system.
2.  Import agent into inventory.
3.  Verify that all agent resources are reported as UP.
4.  Simulate a short database outage:

        service postgresql stop
        # or: pg_ctl stop -m fast

    After 5 seconds, start the database back up:        
    
        service postgresql start
        # or: pg_ctl start

5.  Invoke a discovery scan operation on a platform resource.


Actual results:
Every 15 seconds the following error is logged in server.log:
    18:50:52,935 ERROR [org.quartz.core.QuartzSchedulerThread] (RHQScheduler_QuartzSchedulerThread) releaseTriggerRetryLoop: RuntimeException null: java.lang.reflect.UndeclaredThrowableException

The discovery scan is never executed.

Expected results:
No error should appear and the system should resume its normal operation.

Additional info:
This is a regression. In JBoss ON 3.1.2 this same scenario works except for what is noted in bug 1094540.

Comment 1 John Mazzitelli 2014-05-07 18:27:43 UTC
for the record, in JON 3.1.2, when I see the DB blip, and quartz reconnect to the DB, i see this message:

14:25:38,931 INFO  [QuartzSchedulerThread] releaseTriggerRetryLoop: connection restored.

If this doesn't work in 3.2, we'll need to find out why that releaseTriggerRetryLoop in the QuartzSchedulerThread doesn't work anymore.

Comment 2 Jay Shaughnessy 2014-05-08 18:46:05 UTC
I've been at this for > 1/2 day trying various combinations of settings in quartz.properties and in the DS declarations in the app server, without any success.

I have to attribute the regression/change of behavior to our move from AS 4.2.3 in JON 3.1.2 to AS7 in JON 3.2.0.  I don't think it's anything we've done.  As of now I don't see a solution other than a server restart.

We are using an old version of Quartz, 1.6.5 (2009).  I'm going to see id we can painlessly upgrade to the last release of the V1 line, 1.8.6 (2012).  The V2 line has significant API changes.  We'll see how much trouble it is to move to 1.8.6, and if we get there, if it causes any trouble...

By the way, Bug 957691 already exists for upgrading Quartz, although it suggests a full upgrade to the most recent 2.x version.

Comment 3 Jay Shaughnessy 2014-05-09 13:57:47 UTC
The upgrade did not solve the issue.  I've posted a message in the Quartz Scheduler forums here:

  http://forums.terracotta.org/forums/posts/list/6565.page

I'm out of ideas, it seems that for whatever reason Quartz continues to use a dead connection either because it doesn't let it go, or because AS7 provides it to them.

So at the moment, a server restart is the only workaround.

Comment 4 Jay Shaughnessy 2014-05-09 14:39:24 UTC
I may have a possible solution but it requires a few packaging changes.  I'll discuss further with mazz to review the feasibility.

Comment 5 Jay Shaughnessy 2014-05-12 19:52:15 UTC
In the end a simple solution was found, just took a while to find it.

master commit a3c6d72a44b982abcb8b918daccd2dd74dc76fac
Author: Jay Shaughnessy <jshaughn>
Date:   Mon May 12 15:44:19 2014 -0400

We create the NoTxRHQDS solely for use by Quartz to interact with its jobs.
The errors resulted from connections not being validated in a timely manner
by AS7/EAP.  It was hard to figure out because we needed to explicitly set
<validation>
  <validate-on-match>true</validate-on-match>
</validation>

What this flag does is a little hard to figure from its description. And
according to the EAP documentation [1] this is supposed to be true by default.
But it isn't.  EAP bug pending creation...

- added the new option to our DMR client used by the installer
- cleaned up quartz.properties a bit, organizing and removing ignored settings
- note that this is not needed on the XA DS

Comment 6 Simeon Pinder 2014-05-22 15:15:39 UTC
Via product triage, determined that this bug is to be included for DR01 target milestone.

Comment 7 Simeon Pinder 2014-05-30 02:43:39 UTC
Moving to ON_QA as available for test in latest cumulative patch build(DR01):
http://jon01.mw.lab.eng.bos.redhat.com:8042/dist/release/jon/3.2.2.GA/5-29-2014/

Comment 8 Filip Brychta 2014-06-02 11:35:37 UTC
The issue is still there on
Version :	
3.2.0.GA Update 02
Build Number :	
055b880:0620403

<validate-on-match> on the NoTxRHQDS was false after installation. I tried to set it to true manually and then the issue goes away.
So a question is, why the patch didn't set <validate-on-match> to true.

Comment 9 Jay Shaughnessy 2014-06-02 14:53:38 UTC
I think I messed up here and treated this BZ as if it were targeted to 3.3 instead of 3.2.2.  Because I see no backport of the fix.

Although, because the fix is an installer fix, a backport is not enough to repair an existing install.  In that case DMR must execute against the JON server.

First, the backport to jon3.2.x:

commit 6f02db638000b52c031ab3402866f1a15279af1f
Author: Jay Shaughnessy <jshaughn>
Date:   Mon Jun 2 10:45:31 2014 -0400

We create the NoTxRHQDS solely for use by Quartz to interact with its jobs.
The errors resulted from connections not being validated in a timely manner
by AS7/EAP.  It was hard to figure out because we needed to explicitly set
    <validation>
      <validate-on-match>true</validate-on-match>
    </validation>
    
What this flag does is a little hard to figure from its description. And
according to the EAP documentation [1] this is supposed to be true by default.
But it isn't.  EAP bug pending creation...
    
- added the new option to our DMR client used by the installer
- cleaned up quartz.properties a bit, organizing and removing ignored settings
- note that this is not needed on the XA DS    
    
    Cherry-Pick master: a3c6d72a44b982abcb8b918daccd2dd74dc76fac



Next, here is the DMR for updating an existing [running] server via the jboss CLI:

/subsystem=datasources/data-source=NoTxRHQDS/:write-attribute(name=validate-on-match,value=true)

Executing the above DMR operation will effectively apply the fix to an existing server.  I'm not sure whether the change takes place immediately, it may require a restart.

Comment 11 Simeon Pinder 2014-06-02 16:13:56 UTC
Yep. Jay and I already discussed. Additional jon.git commit fix: c941596f86ff so that configuration update applied as part of cumulative fix.  Moving to MODIFIED for testing in next build.

Comment 12 Simeon Pinder 2014-06-13 14:34:47 UTC
Moving to ON_QA as available for test in build:
http://jon01.mw.lab.eng.bos.redhat.com:8042/dist/release/jon/3.2.2.GA/6-13-2014_0900/

Comment 13 Filip Brychta 2014-06-16 08:31:47 UTC
Verified on
Version :	
3.2.0.GA Update 02
Build Number :	
dd34f04:ee37628

Comment 14 Armine Hovsepyan 2014-06-16 09:21:22 UTC
Created attachment 909069 [details]
server.log.png

Comment 15 Larry O'Leary 2014-07-29 00:17:20 UTC
This has been verified and released in Red Hat JBoss Operations Network 3.2 Update 02 (3.2.2) available from the Red Hat Customer Portal[1].



[1]: https://access.redhat.com/jbossnetwork/restricted/softwareDetail.html?softwareId=31783