Bug 612754

Summary:	investigate starting communications subsystem before the quartz scheduler
Product:	[Other] RHQ Project	Reporter:	Joseph Marques <jmarques>
Component:	Communications Subsystem	Assignee:	RHQ Project Maintainer <rhq-maint>
Status:	CLOSED WONTFIX	QA Contact:	Corey Welton <cwelton>
Severity:	medium	Docs Contact:
Priority:	low
Version:	3.0.0	CC:	ccrouch, mazz
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-03-24 13:59:11 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Joseph Marques 2010-07-08 22:52:10 UTC

The following exception indicates that the server, as it is starting up, will attempt to execute scheduled jobs prior to the communications layer being ready.  So, if we start quartz after we start the comm layer, we'll ensure that these jobs don't fail unnecessarily.

18:35:35,897 ERROR [ResourceOperationJob] Failed to execute scheduled operation [ResourceOperationSchedule: resource=[Resource[id=11931, type=Linux, key=marques-redhat, name=marques-redhat, parent=<null>, version=Linux 2.6.32.14-127.fc12.x86_64]],job-name=[rhq-resource-11931-735590344-1278628535289], job-group=[rhq-resource-11931], operation-name=[viewProcessList], subject=[org.rhq.core.domain.auth.Subject[id=1,name=admin]], description=[Alert operation for BZ-562816 template]]
java.lang.reflect.UndeclaredThrowableException
	at $Proxy504.invokeOperation(Unknown Source)
	at org.rhq.enterprise.server.operation.ResourceOperationJob.invokeOperationOnResource(ResourceOperationJob.java:183)
	at org.rhq.enterprise.server.operation.ResourceOperationJob.execute(ResourceOperationJob.java:95)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:525)
Caused by: java.lang.Exception: Can not make remoting client invocation due to not being connected to server.
	at org.jboss.remoting.Client.invoke(Client.java:1639)
	at org.jboss.remoting.Client.invoke(Client.java:548)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.rawSend(JBossRemotingRemoteCommunicator.java:514)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutCallbacks(JBossRemotingRemoteCommunicator.java:456)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.sendWithoutInitializeCallback(JBossRemotingRemoteCommunicator.java:475)
	at org.rhq.enterprise.communications.command.client.JBossRemotingRemoteCommunicator.send(JBossRemotingRemoteCommunicator.java:496)
	at org.rhq.enterprise.communications.command.client.AbstractCommandClient.invoke(AbstractCommandClient.java:143)
	at org.rhq.enterprise.communications.command.client.ClientCommandSender.send(ClientCommandSender.java:1087)
	at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.send(ClientCommandSenderTask.java:229)
	at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:107)
	at org.rhq.enterprise.communications.command.client.ClientCommandSenderTask.call(ClientCommandSenderTask.java:55)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)

Comment 1 John Mazzitelli 2010-07-09 01:22:53 UTC

This should not be happening. See comments/code from StartupServlet where comm is started before scheduler:

        // The order here is important!!!
        // IF YOU WANT TO CHANGE THE ORDER YOU MUST GET THE CHANGE PEER-REVIEWED FIRST BEFORE COMMITTING IT!!!
        //
        // If we start the scheduler before the comm layer, what happens if a stored job needs to send a message?
        // But if we start the comm layer before the scheduler, what happens if a message is received that needs
        // a job scheduled for it? I think the former is more likely to happen than the latter
        // (that is, a scheduled job would more likely need to send a message; as opposed to an incoming message
        // causing a job to be scheduled), so that explains the ordering of the comm layer and the scheduler.

...
        initScheduler(); // make sure this is initialized before starting the plugin deployer
...
        startServerCommunicationServices();
        startScheduler();
        scheduleJobs();
...

Comment 2 Joseph Marques 2010-09-24 16:14:48 UTC

but aren't the jobs already scheduled from the previous launch of the server, and we're just rescheduling them (which is a no-op most of the time, unless we've changed the interval or intervalDelay)?  we need to delay the start of quartz itself, not just the scheduling of the jobs.

Comment 3 John Mazzitelli 2010-09-24 16:52:28 UTC

The sequence is this:

1. initScheduler();
2. startServerCommunicationServices();
3. startScheduler();
4. scheduleJobs();

1. this prepares quartz, but jobs will not run yet. the scheduler has been created but it won't start jobs and (unless I'm wrong) it won't even allow you to schedule jobs yet. The only thing I'm not sure about is the scheduling part - but I know the jobs don't start running yet.

2. this starts up our comm layer - after this is done, agents can start sending us messages

3. Now we start the scheduler. if jobs have been scheduled, they start. you can start scheduling things.

4. we re-schedule our core jobs.

(side note: seems like a race condition between 3 and 4 - but I think its ok if you try to reschedule a job that is already running. At least, I don't think we've had any reports of any problems here.)

Now in the description it says:

"as it is starting up, will attempt to execute scheduled jobs prior to the communications layer being ready."

As you can see above, I don't see how this is possible. We only start the comm layer after we initialize quartz, but BEFORE we start it. In other words, the comm layer is started BEFORE quartz is started. So by the time quartz starts firing jobs, the comm layer should be up.

The description of this BZ continues:

"So, if we start quartz after we start the comm layer, we'll ensure that
these jobs don't fail unnecessarily."

Which is exactly what steps 2 and 3 are doing. Thus I don't see how this could be the cause of the exception. What RHQ version did this happen on? I know I split out this quartz init/start (steps 1 and 3) recently. I don't know - but I can't see how this is happening in the latest code since clearly the comm layer is started before quartz starts firing off jobs.

Comment 4 Joseph Marques 2010-09-24 17:27:45 UTC

i think this has to do with the fact that quartz reads the schedules from the database.  just because we're shutting a server down, doesn't mean that the existing schedules are being deleted.  so, when the start is starting up, if the sequence is:

1. initScheduler();
2. startServerCommunicationServices();
3. startScheduler();
4. scheduleJobs();

Then immediately after step 1, but before step 2 completes, the quartz schedule looks in the database and realizes that it already has previous schedules.  in fact, the schedules are so old, that their nextFireTime might equate to *now*, which causes them to fire immediately upon server start.  this firing is what is contending with #2, and I believe why it's possible to see errors here.  it's all based off of the fact that the schedules are persisted constructs (as opposed to in-memory only).

Comment 5 Corey Welton 2010-09-28 12:43:40 UTC

mazz - thoughts?

Comment 6 John Mazzitelli 2010-09-28 13:12:45 UTC

> mazz - thoughts?

I have nothing to add - see previous comments for my thoughts :) OK, that's not 100% true - I'll add the following...

First, if someone alters the startup order of the things in StartupServlet - I reject all culpability and blame :} I (and others) have tried to do that in the past, thinking we were fixing something only to have other problems we didn't anticipate crop up and we had to revert back. My solution, therefore, is to leave it as is and just deal with the fact that a scheduled job may fail immediately at startup and allow the next invocation of the job occur normally. That said, let me say that I still don't believe that is really happening now because I don't believe that just initializing the quartz scheduler starts invoking the jobs. Until I see it, I won't believe it. :) There must be something else going on that triggered the exception. Here's why I think that:

Step 1. initScheduler();

This calls our initQuartzScheduler() which boils down to a quartz call to schedulerFactory.getScheduler() and that only creates the scheduler and initializes it - it doesn't start it. Javadocs for that quartz API says "Returns a handle to the Scheduler produced by this factory." It says nothing about triggering jobs and I looked at the quartz code and i don't see anything about starting jobs or triggering triggers.

Step 3. startScheduler();

This is our API called after we start the comm layer. This startScheduler() boils down to a quartz call to "scheduler.start();" and the quartz javadoc says:

     * Starts the Scheduler's threads that fire Triggers
     * When a scheduler is first created it is in "stand-by" mode, and will not
     * fire triggers.

So here it explicitly states a) this starts to fire triggers and b) when the scheduler is first created (like I did in step 1 above) it will NOT fire triggers. Seems pretty clear to me.

Hence, until I see proof that our step 1 (i.e. initScheduler) actually does cause jobs to trigger I refuse to believe that's causing the jobs to run.

Comment 8 John Mazzitelli 2011-03-24 13:59:11 UTC

won't change the ordering of the startup.