Red Hat Bugzilla – Bug 990990
Process instance takes too long to be completed on the second node of cluster when the first node goes down
Last modified: 2013-08-05 10:49:47 EDT
The scenario is as follows:
- cluster with 2 nodes (running as EAP 6.1 domain)
- process instance has been started on node1
- after a couple of seconds node1 has been shut down
- waiting for node2 to complete the process instance - which takes more than 30 seconds
The same process can be completed in about 6 seconds (the process definition includes a timer) without the simulation of failover.
Here is the process instance log:
1/Aug/13 11:49:10: 22 - Join
1/Aug/13 11:49:10: 23 - increment (ActionNode)
1/Aug/13 11:49:10: 24 - Split
1/Aug/13 11:49:10: 25 - EndNode
1/Aug/13 11:49:09: 18 - Join
1/Aug/13 11:49:09: 19 - increment (ActionNode)
1/Aug/13 11:49:09: 20 - Split
1/Aug/13 11:49:09: 21 - TimerNode
1/Aug/13 11:49:08: 14 - Join
1/Aug/13 11:49:08: 15 - increment (ActionNode)
1/Aug/13 11:49:08: 16 - Split
1/Aug/13 11:49:08: 17 - TimerNode
1/Aug/13 11:48:33: 10 - Join
1/Aug/13 11:48:33: 11 - increment (ActionNode)
1/Aug/13 11:48:33: 12 - Split
1/Aug/13 11:48:33: 13 - TimerNode
1/Aug/13 11:48:32: 9 - TimerNode
1/Aug/13 11:48:32: 6 - Join
1/Aug/13 11:48:32: 7 - increment (ActionNode)
1/Aug/13 11:48:32: 8 - Split
1/Aug/13 11:48:30: 0 - StartNode
1/Aug/13 11:48:30: 1 - init (ActionNode)
1/Aug/13 11:48:30: 2 - Join
1/Aug/13 11:48:30: 3 - increment (ActionNode)
1/Aug/13 11:48:30: 4 - Split
1/Aug/13 11:48:30: 5 - TimerNode
Take a look at the gap between 11:49:08: 17 - TimerNode and 11:48:33: 10 - Join => this is probably the point when node1 went down.
Attaching server logs from both server nodes and of course the process definition.
Created attachment 781526 [details]
Created attachment 781527 [details]
Created attachment 781528 [details]
I forgot to state the version: community 6.0.0.CR1.
First of all, a few comments about the process definition:
1. I'm not sure this is the desired design: the process contains a loop combined with a cycle timer event, which means every entry into the timer node adds another timer, i.e.:
- first entry: a single timer defined, firing every second
- second entry: two timers defined, each firing every second
- third entry: three timers defined, each firing every second
I believe it should use a timer duration event instead of a timer cycle event, which would close the timer node after it expires.
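As a rough sketch, the difference in BPMN2 looks like this (element names per the BPMN2 spec; the actual process XML is in the attachment, so treat the `id` and the exact expression format as illustrative):

```xml
<!-- Timer cycle: re-arms and fires repeatedly, here every second -->
<bpmn2:intermediateCatchEvent id="_timer">
  <bpmn2:timerEventDefinition>
    <bpmn2:timeCycle>R/PT1S</bpmn2:timeCycle>
  </bpmn2:timerEventDefinition>
</bpmn2:intermediateCatchEvent>

<!-- Timer duration: fires once after the delay, then the timer node completes -->
<bpmn2:intermediateCatchEvent id="_timer">
  <bpmn2:timerEventDefinition>
    <bpmn2:timeDuration>PT1S</bpmn2:timeDuration>
  </bpmn2:timerEventDefinition>
</bpmn2:intermediateCatchEvent>
```

With a duration timer the node is removed once it fires, so re-entering the node through the loop does not accumulate live timers.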
2. Specifying a timer interval of 1 second does not really make use of the cluster: it is too frequent, since by default the nodes check for jobs to fire only every 20 seconds. Modifying the process to fire every 30 or 60 seconds gives much better cluster utilization.
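For reference, that check interval comes from the Quartz job store configuration; a minimal clustered quartz.properties sketch might look like this (property names per the Quartz documentation; the 20000 ms value mirrors the 20-second behavior described above, and the job store class may differ in your setup):

```
# JDBC job store with clustering enabled
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreCMT
org.quartz.jobStore.isClustered = true
# How often (in ms) each node checks in, which also bounds how quickly
# failed nodes are detected and their jobs recovered
org.quartz.jobStore.clusterCheckinInterval = 20000
# Each cluster node needs a unique instance id
org.quartz.scheduler.instanceId = AUTO
```

Timers firing well inside the check-in interval will mostly be handled by the node that scheduled them, which is why a 1-second cycle barely exercises the cluster.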
And back to the main issue: I believe the behavior you observe is due to too-frequent firing and the multiplicity of timer nodes. Quartz cluster support is backed by a global cluster lock, meaning only a single node in the cluster can fire a given job. If a node fails to complete execution of a job, it does not release the lock, and Quartz needs to discover the failed cluster node before another cluster member can pick up the failed job. In some cases this failover (including lock discovery and release) can take several seconds.
I don't think we can do much about it; from the Quartz documentation:
"The clustering feature works best for scaling out long-running and/or cpu-intensive jobs (distributing the work-load over multiple nodes). If you need to scale out to support thousands of short-running (e.g 1 second) jobs, consider partitioning the set of jobs by using multiple distinct schedulers (including multiple clustered schedulers for HA). The scheduler makes use of a cluster-wide lock, a pattern that degrades performance as you add more nodes (when going beyond about three nodes - depending upon your database's capabilities, etc.)."
Radek, what are your thoughts on this?
Maciej, thanks for the clarification, closing the issue.