What is the listener?
This occurs because the listener looks for the next job to run in the cp_job table. It uses the id from that table to ask the Quartz scheduler for the JobDetail. Since the Candlepin nodes are *NOT* running Quartz in clustered mode, the JobDetail is null whenever the job lives on another node. I've attached a patch to fix this NullPointerException, but there is still an open issue that is best solved by turning Quartz clustering on.

Say Node1 and Node2 are two Candlepin nodes pointing to a single database. Also, assume that there is a proxy that distributes requests between these two nodes. Quartz is *NOT* in clustered mode on these nodes, so each node will have its own Quartz scheduler and its own Quartz RAMJobStore. For the sake of argument, let's also assume that requests are split evenly between the two nodes, which will make illustrating the problem a little easier.

So 4 refresh pools job requests come in to the proxy.

Request 1 comes in to Node1, which creates Job1. Since there are no entries in the cp_job table, Job1 will proceed to run.

Request 2 comes in to Node2, which creates Job2. Because Job1 is already RUNNING, Job2 will be added to the Quartz scheduler, then immediately PAUSED. It will appear in the cp_job table as either CREATED (0) or PENDING (1).

Request 3 comes in to Node1, which creates Job3. Since Job1 is already RUNNING, Job3 will be added to the Quartz scheduler, then immediately PAUSED. It will appear in the cp_job table as either CREATED (0) or PENDING (1).

Finally, request 4 comes in to Node2, which creates Job4. Since Job1 is already RUNNING, Job4 will be added to the Quartz scheduler, then immediately PAUSED. It will appear in the cp_job table as either CREATED (0) or PENDING (1).

So the jobs will look like this:

Node1: Job1 (Running), Job3 (Paused)
Node2: Job2 (Paused), Job4 (Paused)

Here's where it gets tricky. Once Job1 is finished, it will trigger the job listener to issue a jobWasExecuted event. In this event we look to see if there are any pending jobs to run.
We take the first one and ask the scheduler to unpause it. In the case of Node1, the list will contain Job2, Job3, and Job4. We take the first one (Job2), and tell the Quartz scheduler running on Node1 to unpause Job2. Wait! Job2 isn't on Node1, it is on Node2. (Prior to the patch this would've caused a NullPointerException.) We skip this job and get the next entry, Job3. We tell the scheduler to unpause Job3. Job3 will now run to completion.

Confused yet? OK, let's see where we are now.

Node1: Job1 (Finished), Job3 (Running)
Node2: Job2 (Paused), Job4 (Paused)

The cp_job table will look like this:

id, state
--, ------
Job1, 3
Job2, 0
Job3, 2
Job4, 0

Once Job3 finishes, the listener is triggered again. It goes to cp_job and finds Job2. It isn't on Node1, so it goes to find Job4. It isn't on Node1 either. So there is nothing else to do, and the process ends. Node1 finished running Job1 and Job3, but Job2 and Job4 stay Paused indefinitely because there is no way to kick them off without adding yet another process to monitor these jobs.

How will Quartz clustering help then? With Quartz clustering turned on, each node in the cluster can see *ALL* of the jobs in the cluster (even those on other nodes). So when Job1 finished, it would've been able to unpause Job2 on Node2. Same with Job3 unpausing Job4.
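To make the walkthrough above easier to replay, here is a rough, self-contained simulation of the scenario. It does not use Quartz or any Candlepin code: the class, method, and field names are all made up for illustration, the cp_job table is modeled as an ordered map, and each node's RAMJobStore is modeled as a plain list. The RUNNING and FINISHED codes are assumed to be 2 and 3, based on the table above. The `lookup` method stands in for asking the scheduler for the JobDetail, and returning null models the case the patch guards against.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JobListenerSimulation {
    // Ordinals mirror the codes in the comment: CREATED (0), PENDING (1);
    // RUNNING (2) and FINISHED (3) are assumed from the cp_job table shown.
    public enum State { CREATED, PENDING, RUNNING, FINISHED }

    /** Hypothetical model of one node with its own in-memory job store. */
    static final class Node {
        final String name;
        final List<String> localJobs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    private final Map<String, State> cpJob = new LinkedHashMap<>(); // shared table
    private final Map<String, Node> owner = new LinkedHashMap<>();  // job id -> node that created it
    private final boolean clustered;

    JobListenerSimulation(boolean clustered) { this.clustered = clustered; }

    /** A request lands on a node: run at once if nothing is running, else park it. */
    void submit(Node node, String jobId) {
        node.localJobs.add(jobId);
        owner.put(jobId, node);
        boolean busy = cpJob.containsValue(State.RUNNING);
        cpJob.put(jobId, busy ? State.PENDING : State.RUNNING);
    }

    /** Stand-in for the JobDetail lookup: null if this node cannot see the job. */
    String lookup(Node node, String jobId) {
        return (clustered || node.localJobs.contains(jobId)) ? jobId : null;
    }

    /** Listener callback: mark the job finished, then unpause the first visible pending job. */
    void jobWasExecuted(Node node, String finishedJob) {
        cpJob.put(finishedJob, State.FINISHED);
        for (Map.Entry<String, State> e : cpJob.entrySet()) {
            if (e.getValue() != State.PENDING) continue;
            if (lookup(node, e.getKey()) == null) continue; // the patch: skip instead of NPE
            e.setValue(State.RUNNING);
            return;
        }
    }

    String runningJob() {
        for (Map.Entry<String, State> e : cpJob.entrySet()) {
            if (e.getValue() == State.RUNNING) return e.getKey();
        }
        return null;
    }

    /** Replays the four requests, alternating between the two nodes. */
    public static Map<String, State> run(boolean clustered) {
        JobListenerSimulation sim = new JobListenerSimulation(clustered);
        Node node1 = new Node("Node1");
        Node node2 = new Node("Node2");
        sim.submit(node1, "Job1");
        sim.submit(node2, "Job2");
        sim.submit(node1, "Job3");
        sim.submit(node2, "Job4");
        String running;
        while ((running = sim.runningJob()) != null) {
            // The listener fires on the node that created the finished job.
            sim.jobWasExecuted(sim.owner.get(running), running);
        }
        return sim.cpJob;
    }

    public static void main(String[] args) {
        System.out.println("unclustered: " + run(false));
        System.out.println("clustered:   " + run(true));
    }
}
```

With `clustered` set to false, Job2 and Job4 are left PENDING, which matches the stuck state described above. With it set to true, every job ends up FINISHED. The clustered run unpauses jobs strictly in table order, which is a simplification of whatever real Quartz clustering would do.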
So the patch simply fixes the null pointer exception. I don't plan on fixing the notification of Node2 since Quartz clustering is the proper solution.
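For reference, this is roughly what turning on Quartz clustering involves on the configuration side. It is only a sketch, not Candlepin's actual configuration: the property keys are standard Quartz settings, but the scheduler name and data source name are placeholders I made up, and the data source connection and thread pool settings a real deployment needs are left out.

```java
import java.util.Properties;

public class ClusteredQuartzConfig {
    /** Builds the subset of Quartz properties that switch on clustering. */
    public static Properties clusteredProperties() {
        Properties p = new Properties();
        // Placeholder name; every node in the cluster must use the same one.
        p.setProperty("org.quartz.scheduler.instanceName", "CandlepinScheduler");
        // AUTO lets Quartz generate a unique id per node.
        p.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        // Clustering needs a JDBC-backed job store instead of RAMJobStore.
        p.setProperty("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        p.setProperty("org.quartz.jobStore.driverDelegateClass",
                "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
        // Placeholder data source name; its connection properties are not shown.
        p.setProperty("org.quartz.jobStore.dataSource", "candlepinDS");
        p.setProperty("org.quartz.jobStore.isClustered", "true");
        return p;
    }

    public static void main(String[] args) {
        // In a real setup these would be handed to Quartz, for example through
        // new StdSchedulerFactory(props); here they are only printed.
        clusteredProperties().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The key point is that all nodes share one JDBC job store, so a job created on Node2 is visible to the scheduler on Node1.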
Created attachment 539369 [details] ensure job detail isn't null, skip it otherwise.
Marking all community bugs modified or beyond as closed.