758462 – Quartz jobs fail to notify listeners of completion, leaving job in 'processing' state/hung indefinitely

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Candlepin
Classification:	Community
Component:	candlepin
Sub Component:
Version:	0.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Jesus M. Rodriguez
QA Contact:	John Sefler
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-11-29 20:54 UTC by Amanda Carter
Modified:	2015-05-14 15:22 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-07-17 13:04:36 UTC
Embargoed:

Attachments
ensure job detail isn't null, skip it otherwise. (5.33 KB, patch) 2011-12-01 19:27 UTC, Jesus M. Rodriguez	no flags

Comment 1 Bryan Kearney 2011-11-29 21:13:39 UTC

What is the listener?

Comment 6 Jesus M. Rodriguez 2011-12-01 19:25:13 UTC

This occurs because the listener looks up the next job to run in the cp_job table, then uses the id from that table to ask the Quartz scheduler for the JobDetail. Since the Candlepin nodes are *NOT* running Quartz in clustered mode, the JobDetail is null whenever the job was created on a different node.

I've attached a patch to fix this NullPointerException, but there is still an open issue that is best solved by turning Quartz Clustering on.

Say Node1 and Node2 are two Candlepin nodes pointing to a single database. Also, assume that there is a proxy that distributes requests between these two nodes. Quartz is *NOT* in clustered mode on these nodes, so each node will have its own Quartz scheduler and its own Quartz RAMJobStore. For the sake of argument, let's also assume that requests are split evenly between the two nodes, which will make illustrating the problem a little easier.

So 4 refresh pools job requests come in to the proxy. Request 1 goes to Node1, which creates Job1. Since there are no entries in the cp_job table, Job1 will proceed to run.

Request 2 comes in to Node2, which creates Job2. Because Job1 is already RUNNING, Job2 will be added to the Quartz scheduler and then immediately PAUSED. It will appear in the cp_job table as either CREATED (0) or PENDING (1).

Request 3 comes in to Node1, which creates Job3. Since Job1 is already RUNNING, Job3 will be added to the Quartz scheduler and then immediately PAUSED. It will appear in the cp_job table as either CREATED (0) or PENDING (1).

Finally, request 4 comes in to Node2, which creates Job4. Since Job1 is already RUNNING, Job4 will be added to the Quartz scheduler and then immediately PAUSED. It will appear in the cp_job table as either CREATED (0) or PENDING (1).

So the jobs will look like this:

Node1: Job1 (Running), Job3 (Paused)
Node2: Job2 (Paused), Job4 (Paused)

Here's where it gets tricky. Once Job1 is finished, it will trigger the jobs listener to issue a jobWasExecuted event. In this event we look to see if there are any pending jobs to run. We take the first one and ask the scheduler to unpause it. In the case of Node1, the list will contain Job2, Job3, and Job4. We take the first one (Job2), and tell the Quartz scheduler running on Node1 to unpause Job2. Wait! Job2 isn't on Node1, it is on Node2. (Prior to the patch this would've caused a NullPointerException.) We skip this job, and get the next entry, Job3. We tell the scheduler to unpause Job3. Job3 will now run to completion.
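
The skip-on-null step described above can be sketched in plain Java. This is not the attached patch: the class name, `nextRunnable`, and the map standing in for a node's local RAMJobStore are all hypothetical, and a null lookup in the map plays the role of a null JobDetail.

```java
import java.util.*;

public class PendingJobPicker {
    /**
     * Returns the first pending job id that the local node knows about.
     * localJobs is a hypothetical stand-in for a node's in-memory job store:
     * looking up a job created on another node yields null.
     */
    static Optional<String> nextRunnable(List<String> pendingIds,
                                         Map<String, String> localJobs) {
        for (String id : pendingIds) {
            String detail = localJobs.get(id);
            if (detail == null) {
                // The job lives on another node: skip it rather than dereference null.
                continue;
            }
            return Optional.of(id);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> node1 = new HashMap<>();
        node1.put("Job1", "refresh pools");
        node1.put("Job3", "refresh pools");
        List<String> pending = Arrays.asList("Job2", "Job3", "Job4");
        System.out.println(nextRunnable(pending, node1)); // prints Optional[Job3]
    }
}
```

Skipping keeps the listener alive, but note that nothing in this loop ever tells Node2 about Job2.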

Confused yet? Ok let's see where we are now.

Node1: Job1 (Finished), Job3 (Running)
Node2: Job2 (Paused), Job4 (Paused)

cp_job table will look like this:

id, state
--, ------
Job1, 3
Job2, 0
Job3, 2
Job4, 0

Once Job3 finishes, the listener is triggered again. It goes to cp_job
and finds Job2. It isn't on Node1, so it goes to find Job4. It isn't
on Node1 either. So nothing else to do. The process ends. Node1 finished
running Job1 and Job3. But Job2 and Job4 are still Paused indefinitely
because there is no way to kick them off without adding yet another
process to monitor these jobs.
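
To make the end state concrete, here is a small self-contained simulation of the four-request scenario. It is only a model under stated assumptions: the `Node` class, the `State` enum, and the shared `cpJob` map are hypothetical stand-ins for the per-node schedulers and the cp_job table, not Candlepin or Quartz code.

```java
import java.util.*;

public class UnclusteredSimulation {
    enum State { PENDING, RUNNING, FINISHED }

    // Stand-in for the shared cp_job table; insertion order gives "first pending job".
    static final Map<String, State> cpJob = new LinkedHashMap<>();

    static class Node {
        // Jobs this node's own (unclustered) scheduler knows about.
        final Set<String> localJobs = new HashSet<>();

        void submit(String id) {
            localJobs.add(id);
            boolean busy = cpJob.containsValue(State.RUNNING);
            cpJob.put(id, busy ? State.PENDING : State.RUNNING);
        }

        // Mimics the listener firing after a job on this node completes.
        // Returns the job it unpaused, or null if none was local.
        String finish(String id) {
            cpJob.put(id, State.FINISHED);
            for (Map.Entry<String, State> e : cpJob.entrySet()) {
                if (e.getValue() != State.PENDING) continue;
                if (!localJobs.contains(e.getKey())) continue; // on another node: skip
                e.setValue(State.RUNNING);
                return e.getKey();
            }
            return null;
        }
    }

    static Map<String, State> run() {
        cpJob.clear();
        Node node1 = new Node();
        Node node2 = new Node();
        node1.submit("Job1");
        node2.submit("Job2");
        node1.submit("Job3");
        node2.submit("Job4");
        // Only Node1 ever has a running job, so only its listener fires.
        String next = "Job1";
        while (next != null) {
            next = node1.finish(next);
        }
        return new LinkedHashMap<>(cpJob);
    }

    public static void main(String[] args) {
        // prints {Job1=FINISHED, Job2=PENDING, Job3=FINISHED, Job4=PENDING}
        System.out.println(run());
    }
}
```

Job2 and Job4 never leave PENDING in this model because no completion event ever occurs on Node2.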

How will Quartz clustering help then?

With Quartz clustering turned on, each node in the cluster can see
*ALL* of the jobs in the cluster (even those on other Nodes). So when
Job1 finished, it would've been able to unpause Job2 on Node2.
Same with Job3 unpausing Job4.
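
For reference, a minimal sketch of the Quartz settings that clustering usually involves, built here as a plain `java.util.Properties` object so it runs without Quartz on the classpath. The property keys are standard Quartz configuration names; the data source name `cpDS`, the check-in interval, and the choice of delegate are assumptions for illustration, and how Candlepin would actually expose these settings is not covered in this bug.

```java
import java.util.Properties;

public class ClusteredQuartzProps {
    static Properties clusteredProps() {
        Properties p = new Properties();
        // Each node needs a unique instance id; AUTO lets Quartz generate one.
        p.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        // Clustering requires a JDBC job store instead of the RAMJobStore.
        p.setProperty("org.quartz.jobStore.class",
                "org.quartz.impl.jdbcjobstore.JobStoreTX");
        p.setProperty("org.quartz.jobStore.driverDelegateClass",
                "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
        p.setProperty("org.quartz.jobStore.isClustered", "true");
        // Assumed value: how often (ms) a node checks in with the cluster.
        p.setProperty("org.quartz.jobStore.clusterCheckinInterval", "20000");
        // Hypothetical data source name; its connection settings are omitted here.
        p.setProperty("org.quartz.jobStore.dataSource", "cpDS");
        return p;
    }

    public static void main(String[] args) {
        clusteredProps().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In a real deployment these properties would be handed to Quartz's scheduler factory (or placed in quartz.properties) on every node, with all nodes pointing at the same database.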

Comment 7 Jesus M. Rodriguez 2011-12-01 19:26:16 UTC

So the patch simply fixes the null pointer exception. I don't plan on fixing the notification of Node2 since Quartz clustering is the proper solution.

Comment 8 Jesus M. Rodriguez 2011-12-01 19:27:32 UTC

Created attachment 539369
ensure job detail isn't null, skip it otherwise.

Comment 10 Bryan Kearney 2012-07-17 13:04:36 UTC

Marking all community bugs modified or beyond as closed.
