If the quartz-ha bit is flipped to "ON", this automatically ensures that only one server instance in the cluster ever operates on a given job at a time. This is good for most jobs, but not all of them. Thanks to the work in RHQ-668, we now need the "correct" server to pick up that job for processing. In other words, since the alert data is segmented, we can't have any arbitrary server pick up the out-of-band (OOB) processing work for alert condition checking and OOB matching; we need THE server that submitted the job in-band to do that processing. Moreover, if a server goes down and its alert data gets migrated as the agents repartition themselves to other servers, we need to make sure that the condition checking jobs get executed by the server that CURRENTLY has the data for that segment.
From http://www.opensymphony.com/quartz/wikidocs/TutorialLesson11.html
* the quartz cluster feature enables load balancing and job recovery (failover)
* CAN NOT use a clustered instance and a non-clustered instance against the same quartz tables - very very very very bad
* cluster machine clocks must be synchronized to within 1 second of each other

To enable:
* set the "org.quartz.jobStore.isClustered" property to "true"
* each instance in the cluster should use the same copy of the quartz.properties file
** ok, the thread pool can be different (maybe this is based on the # of cpus/cores [HA configuration property] each RHQ server instance has)
** use the "AUTO" value for the "org.quartz.scheduler.instanceId" property so each node automatically gets a unique instanceId
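The settings listed above can be sketched as a small Java helper that builds the clustering-related properties in memory. This is only an illustration: the class name `ClusteredQuartzProps` and the scheduler name "RHQScheduler" are made up here, and the thread count is passed in because it is the one value the notes say may differ per server. A Properties object like this could, for example, be handed to Quartz's `StdSchedulerFactory(Properties)` constructor, but that wiring is left out so the sketch has no Quartz dependency.

```java
import java.util.Properties;

// Sketch only: collects the clustering-related Quartz settings from the notes above.
// Class name and scheduler name are hypothetical, not taken from the RHQ code base.
final class ClusteredQuartzProps {

    static Properties build(String schedulerName, int threadCount) {
        Properties p = new Properties();
        // All cluster members share the same logical scheduler name.
        p.setProperty("org.quartz.scheduler.instanceName", schedulerName);
        // "AUTO" lets each node generate its own unique instanceId.
        p.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        // Turns on the clustering feature in the JDBC job store.
        p.setProperty("org.quartz.jobStore.isClustered", "true");
        // The thread pool size may differ per server (e.g. based on cpus/cores).
        p.setProperty("org.quartz.threadPool.threadCount", Integer.toString(threadCount));
        return p;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        build("RHQScheduler", cores).list(System.out);
    }
}
```

Everything except the thread count is identical on every node, which matches the "same copy of quartz.properties" rule.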
The only other thing we have to do is make sure we make the repartition job a stateful one. From http://www.opensymphony.com/quartz/api/org/quartz/StatefulJob.html "...The key difference is that (Stateful Job's) associated JobDataMap is re-persisted after every execution of the job, thus preserving state for the next execution. The other difference is that stateful jobs are not allowed to execute concurrently, which means new triggers that occur before the completion of the execute(xx) method will be delayed." By making the job stateful, we can be assured that only one instance of it will be executing at a time, across the entire RHQ server cluster.
From http://forums.opensymphony.com/thread.jspa?messageID=2859&#2859 "The only suggestion I would make is to not use the AUTO instance id feature. Quartz doesn't clean up entries in the scheduler_state table, so the entries can start building up pretty quickly if you take the application up and down a lot. Consider using an identifier that doesn't change frequently but is unique to the instance. The hostname of the machine Quartz is running on might be a workable solution." Maybe we should use the cluster instance id / server name for the quartz "org.quartz.scheduler.instanceId" property too, because we expect server instances to go down and come back up.
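One way the suggestion above might look in code: prefer the configured server name, and fall back to the machine's hostname when none is set. This is a rough sketch, not what RHQ actually does; the class `StableInstanceId` and its `resolve` method are hypothetical names. The resulting string would be used as the value of "org.quartz.scheduler.instanceId" in place of "AUTO".

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch only: picks an instance id that stays the same across restarts.
// Class and method names are hypothetical.
final class StableInstanceId {

    static String resolve(String configuredServerName) {
        // Prefer the explicitly configured server name, if there is one.
        if (configuredServerName != null && !configuredServerName.trim().isEmpty()) {
            return configuredServerName.trim();
        }
        // Otherwise fall back to the hostname, as the forum post suggests.
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            throw new IllegalStateException(
                "no server name configured and hostname could not be resolved", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(args.length > 0 ? args[0] : null));
    }
}
```

Because the id no longer changes on every restart, repeated restarts reuse the same identifier instead of generating a new one each time.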
Using the HA installer I was able to install a 2-node server cloud. I wrote up a quick dummy, non-stateful job and suffixed the cloud instance name to the job name. Using our SchedulerBean, I was able to schedule this job against this specific server instance only, and have verified that it ran on only that server instance. Prior to this, since we had no concept of a server name, the quartz impl would by default round-robin which instance in the cloud ran this job. Now, each server can register server-specific jobs in the startup servlet (just as we do with other jobs) and be assured that only that server instance will run them, managed via quartz.
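The naming trick described above can be sketched as a small helper that suffixes the server name onto a base job name. The class name `ServerSpecificJobs`, the "-" separator, and the example names are assumptions for illustration; the comment does not say what separator or names were actually used.

```java
// Sketch only: derives a per-server job name by suffixing the server name.
// The "-" separator is an assumption; the original comment does not specify one.
final class ServerSpecificJobs {

    static String jobNameFor(String baseJobName, String serverName) {
        if (baseJobName == null || baseJobName.isEmpty()) {
            throw new IllegalArgumentException("baseJobName is required");
        }
        if (serverName == null || serverName.isEmpty()) {
            throw new IllegalArgumentException("serverName is required");
        }
        // Distinct server names yield distinct job names, so each server
        // registers (and runs) its own copy of the job.
        return baseJobName + "-" + serverName;
    }

    public static void main(String[] args) {
        System.out.println(jobNameFor("dummyJob", "server-A"));
        System.out.println(jobNameFor("dummyJob", "server-B"));
    }
}
```

Each server would compute this name at startup and schedule the job under it, so the two nodes in the cloud end up with two separate job entries rather than one shared, round-robined job.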
Interestingly enough, I wasn't completely crazy when I had gone through the quartz docs. Looking at the quartz tables, I *do* see one quartz scheduler for each server in the cloud. The docs made me believe that these scheduler instances would have to be created / manipulated explicitly through the quartz api, but it appears that they are auto-created with a "<hostname><server_start_timestamp>" naming convention. So, the SchedulerBean SLSB automatically manipulates job details for that respective server's scheduler, but if the jobName inside the details is not server-specific, quartz will manage that job in a cluster-oriented manner (round-robin the job between the various schedulers) instead of having each server execute separate instances of it.
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-679