Bug 990990 - Process instance takes too long to be completed on the second node of cluster when the first node goes down
Product: JBoss BPMS Platform 6
Classification: JBoss
Component: Business Central
Severity: high
Assigned To: Maciej Swiderski
Radovan Synek
Reported: 2013-08-01 07:04 EDT by Radovan Synek
Modified: 2013-08-05 10:49 EDT (History)

Doc Type: Bug Fix
Last Closed: 2013-08-05 10:49:47 EDT
Type: Bug

Attachments
process definition (15.75 KB, application/xml)
2013-08-01 07:05 EDT, Radovan Synek
server-one log (326.13 KB, text/x-log)
2013-08-01 07:06 EDT, Radovan Synek
server-two log (168.99 KB, text/x-log)
2013-08-01 07:06 EDT, Radovan Synek

Description Radovan Synek 2013-08-01 07:04:48 EDT
The scenario is as follows:
- a cluster with 2 nodes (running as an EAP 6.1 domain)
- a process instance has been started on node1
- after a couple of seconds, node1 has been shut down
- waiting for node2 to complete the process instance takes more than 30 seconds

The same process completes in about 6 seconds (the process definition includes a timer) when failover is not simulated.

Here is the process instance log:
1/Aug/13 11:49:10: 22 - Join
1/Aug/13 11:49:10: 23 - increment (ActionNode)
1/Aug/13 11:49:10: 24 - Split
1/Aug/13 11:49:10: 25 - EndNode
1/Aug/13 11:49:09: 18 - Join
1/Aug/13 11:49:09: 19 - increment (ActionNode)
1/Aug/13 11:49:09: 20 - Split
1/Aug/13 11:49:09: 21 - TimerNode
1/Aug/13 11:49:08: 14 - Join
1/Aug/13 11:49:08: 15 - increment (ActionNode)
1/Aug/13 11:49:08: 16 - Split
1/Aug/13 11:49:08: 17 - TimerNode
1/Aug/13 11:48:33: 10 - Join
1/Aug/13 11:48:33: 11 - increment (ActionNode)
1/Aug/13 11:48:33: 12 - Split 
1/Aug/13 11:48:33: 13 - TimerNode
1/Aug/13 11:48:32: 9 - TimerNode
1/Aug/13 11:48:32: 6 - Join
1/Aug/13 11:48:32: 7 - increment (ActionNode)
1/Aug/13 11:48:32: 8 - Split
1/Aug/13 11:48:30: 0 - StartNode
1/Aug/13 11:48:30: 1 - init (ActionNode)
1/Aug/13 11:48:30: 2 - Join
1/Aug/13 11:48:30: 3 - increment (ActionNode)
1/Aug/13 11:48:30: 4 - Split
1/Aug/13 11:48:30: 5 - TimerNode

Take a look at the adjacent entries "11:49:08: 17 - TimerNode" and "11:48:33: 10 - Join" => the 35-second gap between them is probably the point when node1 went down.

Attaching server logs from both nodes and, of course, the process definition.
Comment 1 Radovan Synek 2013-08-01 07:05:30 EDT
Created attachment 781526 [details]
process definition
Comment 2 Radovan Synek 2013-08-01 07:06:15 EDT
Created attachment 781527 [details]
server-one log
Comment 3 Radovan Synek 2013-08-01 07:06:39 EDT
Created attachment 781528 [details]
server-two log
Comment 4 Radovan Synek 2013-08-01 10:27:41 EDT
I forgot to state the version: it is community 6.0.0.CR1.
Comment 5 Maciej Swiderski 2013-08-05 09:44:01 EDT
First of all, a few comments about the process definition:
1. I am not sure this is the desired design: the process combines a loop with a cycle timer event, which means every entry into the timer node adds another timer, i.e.:
- first entry: a single timer is defined, firing every second
- second entry: two timers are defined, each firing every second
- third entry: three timers are defined, each firing every second
- etc.

I believe it should use a timer duration event instead of a timer cycle event, which would close the timer node after it expires.
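For illustration, the difference in BPMN2 terms is a `timeCycle` versus a `timeDuration` element inside the timer event definition. This is a minimal sketch (element names are from the BPMN2 spec; the ids and the "1s" expression are illustrative, not taken from the attached process definition):

```xml
<!-- Current design (sketch): a cycle timer that re-fires every second,
     so each re-entry into the node registers another repeating timer -->
<intermediateCatchEvent id="_timerNode">
  <timerEventDefinition>
    <timeCycle xsi:type="tFormalExpression">1s</timeCycle>
  </timerEventDefinition>
</intermediateCatchEvent>

<!-- Suggested alternative (sketch): a duration timer that fires once,
     letting the timer node complete after it expires -->
<intermediateCatchEvent id="_timerNode">
  <timerEventDefinition>
    <timeDuration xsi:type="tFormalExpression">1s</timeDuration>
  </timerEventDefinition>
</intermediateCatchEvent>
```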

2. specifying a timer interval of 1 second does not really make use of the cluster, as it is too frequent: by default, nodes check for jobs to fire only every 20 seconds. Modifying the process to fire every 30 or 60 seconds gives much better cluster utilization.
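The 20-second check mentioned above corresponds to the Quartz cluster check-in interval. A minimal sketch of the relevant settings (property names are from the Quartz configuration reference; the values here are illustrative, not taken from this installation):

```properties
# Sketch: Quartz clustering settings relevant to this bug.
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreCMT
org.quartz.jobStore.isClustered = true
# How often (in ms) each node checks in with the cluster and looks for
# failed nodes / jobs to recover - the 20-second interval mentioned above.
org.quartz.jobStore.clusterCheckinInterval = 20000
```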

And back to the main issue: I believe the behavior you observe is due to the too-frequent firings and the multiplicity of timer nodes. Quartz cluster support is backed by a global cluster lock, meaning only a single node in the cluster can fire a given job. If that node fails to complete execution of the job, it never releases the lock, and Quartz must first discover the failed cluster node before another cluster member can pick up the failed job. In some cases this failover (including lock discovery and release) can take several seconds.

I don't think we can do much about it; from the Quartz documentation:
"The clustering feature works best for scaling out long-running and/or cpu-intensive jobs (distributing the work-load over multiple nodes). If you need to scale out to support thousands of short-running (e.g 1 second) jobs, consider partitioning the set of jobs by using multiple distinct schedulers (including multiple clustered schedulers for HA). The scheduler makes use of a cluster-wide lock, a pattern that degrades performance as you add more nodes (when going beyond about three nodes - depending upon your database's capabilities, etc.)."

Radek, what are your thoughts on this?
Comment 6 Radovan Synek 2013-08-05 10:49:47 EDT
Maciej, thanks for the clarification, closing the issue.
