Bug 990990 - Process instance takes too long to be completed on the second node of cluster when the first node goes down
Product: JBoss BPMS Platform 6
Classification: JBoss
Component: Business Central
Severity: high
Assigned To: Maciej Swiderski
Radovan Synek
Reported: 2013-08-01 07:04 EDT by Radovan Synek
Modified: 2013-08-05 10:49 EDT (History)

Doc Type: Bug Fix
Last Closed: 2013-08-05 10:49:47 EDT
Type: Bug

Attachments
process definition (15.75 KB, application/xml)
2013-08-01 07:05 EDT, Radovan Synek
server-one log (326.13 KB, text/x-log)
2013-08-01 07:06 EDT, Radovan Synek
server-two log (168.99 KB, text/x-log)
2013-08-01 07:06 EDT, Radovan Synek

Description Radovan Synek 2013-08-01 07:04:48 EDT
The scenario is as follows:
- a cluster with 2 nodes (running as an EAP 6.1 domain)
- a process instance has been started on node1
- after a couple of seconds, node1 has been shut down
- waiting for node2 to complete the process instance takes more than 30 seconds

The same process completes in about 6 seconds (the process definition includes a timer) when failover is not simulated.

Here is the process instance log:
1/Aug/13 11:49:10: 22 - Join
1/Aug/13 11:49:10: 23 - increment (ActionNode)
1/Aug/13 11:49:10: 24 - Split
1/Aug/13 11:49:10: 25 - EndNode
1/Aug/13 11:49:09: 18 - Join
1/Aug/13 11:49:09: 19 - increment (ActionNode)
1/Aug/13 11:49:09: 20 - Split
1/Aug/13 11:49:09: 21 - TimerNode
1/Aug/13 11:49:08: 14 - Join
1/Aug/13 11:49:08: 15 - increment (ActionNode)
1/Aug/13 11:49:08: 16 - Split
1/Aug/13 11:49:08: 17 - TimerNode
1/Aug/13 11:48:33: 10 - Join
1/Aug/13 11:48:33: 11 - increment (ActionNode)
1/Aug/13 11:48:33: 12 - Split 
1/Aug/13 11:48:33: 13 - TimerNode
1/Aug/13 11:48:32: 9 - TimerNode
1/Aug/13 11:48:32: 6 - Join
1/Aug/13 11:48:32: 7 - increment (ActionNode)
1/Aug/13 11:48:32: 8 - Split
1/Aug/13 11:48:30: 0 - StartNode
1/Aug/13 11:48:30: 1 - init (ActionNode)
1/Aug/13 11:48:30: 2 - Join
1/Aug/13 11:48:30: 3 - increment (ActionNode)
1/Aug/13 11:48:30: 4 - Split
1/Aug/13 11:48:30: 5 - TimerNode

Take a look at the adjacent entries "11:49:08: 17 - TimerNode" and "11:48:33: 10 - Join" => the 35-second gap between them is probably the point when node1 went down.

Attaching server logs from both nodes and, of course, the process definition.
Comment 1 Radovan Synek 2013-08-01 07:05:30 EDT
Created attachment 781526 [details]
process definition
Comment 2 Radovan Synek 2013-08-01 07:06:15 EDT
Created attachment 781527 [details]
server-one log
Comment 3 Radovan Synek 2013-08-01 07:06:39 EDT
Created attachment 781528 [details]
server-two log
Comment 4 Radovan Synek 2013-08-01 10:27:41 EDT
I forgot to state the version: it is community 6.0.0.CR1.
Comment 5 Maciej Swiderski 2013-08-05 09:44:01 EDT
First of all, a few comments about the process definition:
1. I am not sure this is the desired design: the process combines a loop with a cycle timer event, which means every entry into the timer node adds another timer, i.e.:
- first entry: a single timer is defined, firing every second
- second entry: two timers are defined, each firing every second
- third entry: three timers are defined, each firing every second
- etc.

I believe it should use a timer duration event instead of a timer cycle event, which would close the timer node after it expires.
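For illustration, the difference in BPMN2 terms is a `timeCycle` versus a `timeDuration` element inside the timer event definition. This is a minimal sketch (element names are from the BPMN2 spec; the ids and the "1s" expression are illustrative, not taken from the attached process definition):

```xml
<!-- Current design (sketch): a cycle timer that re-fires every second,
     so each re-entry into the node registers another repeating timer -->
<intermediateCatchEvent id="_timerNode">
  <timerEventDefinition>
    <timeCycle xsi:type="tFormalExpression">1s</timeCycle>
  </timerEventDefinition>
</intermediateCatchEvent>

<!-- Suggested alternative (sketch): a duration timer that fires once,
     letting the timer node complete after it expires -->
<intermediateCatchEvent id="_timerNode">
  <timerEventDefinition>
    <timeDuration xsi:type="tFormalExpression">1s</timeDuration>
  </timerEventDefinition>
</intermediateCatchEvent>
```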

2. specifying a timer interval of 1 second does not really make use of the cluster, as it is too frequent: by default, nodes check for jobs to fire only every 20 seconds. Modifying the process to fire every 30 or 60 seconds gives much better cluster utilization.
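The 20-second check mentioned above corresponds to the Quartz cluster check-in interval. A minimal sketch of the relevant settings (property names are from the Quartz configuration reference; the values here are illustrative, not taken from this installation):

```properties
# Sketch: Quartz clustering settings relevant to this bug.
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreCMT
org.quartz.jobStore.isClustered = true
# How often (in ms) each node checks in with the cluster and looks for
# failed nodes / jobs to recover - the 20-second interval mentioned above.
org.quartz.jobStore.clusterCheckinInterval = 20000
```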

And back to the main issue: I believe the behavior you observe is due to the too-frequent firings and the multiplicity of timer nodes. Quartz cluster support is backed by a global cluster lock, meaning only a single node in the cluster can fire a given job. If that node fails to complete execution of the job, it never releases the lock, and Quartz must first discover the failed cluster node before another cluster member can pick up the failed job. In some cases this failover (including lock discovery and release) can take several seconds.

I don't think we can do much about it; from the Quartz documentation:
"The clustering feature works best for scaling out long-running and/or cpu-intensive jobs (distributing the work-load over multiple nodes). If you need to scale out to support thousands of short-running (e.g 1 second) jobs, consider partitioning the set of jobs by using multiple distinct schedulers (including multiple clustered schedulers for HA). The scheduler makes use of a cluster-wide lock, a pattern that degrades performance as you add more nodes (when going beyond about three nodes - depending upon your database's capabilities, etc.)."

Radek, what are your thoughts on this?
Comment 6 Radovan Synek 2013-08-05 10:49:47 EDT
Maciej, thanks for the clarification, closing the issue.
