Bug 1273109 - BPM cluster: Job executor fails with NPE while creating a deployment unit
Product: JBoss BPMS Platform 6
Classification: JBoss
Component: jBPM Core
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: CR1
Target Release: 6.2.0
Assigned To: Maciej Swiderski
QA Contact: Radovan Synek
Keywords: Regression
Reported: 2015-10-19 11:45 EDT by Radovan Synek
Modified: 2015-11-20 05:29 EST
Doc Type: Bug Fix
Type: Bug

Attachments
excerpt from server log node two (10.56 KB, text/plain), 2015-10-19 11:45 EDT, Radovan Synek
server log node one (690.67 KB, text/plain), 2015-11-09 11:34 EST, Radovan Synek
server log node two (651.50 KB, text/plain), 2015-11-09 11:35 EST, Radovan Synek
maven based reproducer (63.69 KB, application/zip), 2015-11-09 11:37 EST, Radovan Synek

Description Radovan Synek 2015-10-19 11:45:42 EDT
Created attachment 1084450 [details]
excerpt from server log node two

Description of problem:
With a BPM cluster of two nodes in an EAP domain, a "deploy" operation triggered via the Guvnor REST API on the first cluster node fails, and the second cluster node shows an NPE in its server log:
Error during command org.kie.remote.services.rest.async.cmd.DeploymentCmd error message null: java.lang.NullPointerException
	at org.kie.remote.services.rest.async.cmd.DeploymentCmd.execute(DeploymentCmd.java:87) [kie-remote-services-6.3.0.Final-redhat-2.jar:6.3.0.Final-redhat-2]

Version-Release number of selected component (if applicable):

Steps to Reproduce:
1. Set up a BPM cluster with two nodes
2. Clone a repository containing a project into Business Central
3. Deploy the project via the REST API

Actual results:
Comment 1 Radovan Synek 2015-10-20 09:43:59 EDT
Update: the issue probably applies to other Guvnor REST API operations as well. A repository clone operation appeared to fail because the client kept receiving the job status "ACCEPTED", although the job had completed properly (verified in the UI).
The probable reason is that the job was served by a job executor on cluster node two while the client was communicating with cluster node one. In some cases the operation succeeded, likely when the job was served by the same node the client was communicating with.
Comment 2 Maciej Swiderski 2015-10-23 06:39:44 EDT
The problem was caused by the use of a local cache that kept information about jobs. This applies to both deployment and Guvnor (project) related operations. Since all operations are asynchronous and therefore handled by the jBPM executor, they might be executed on any cluster member. Checking an individual node is therefore not guaranteed to work, because the local cache entry might not exist if the job runs on a different node than the one that received the REST request.

The solution was to enhance the use of the executor API so that executed jobs can be queried when they are not found in the local cache.
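
The lookup order described above (local cache first, then a query that every cluster member can answer) can be illustrated with a minimal sketch. This is not the actual jBPM or kie-remote code: the class JobStatusLookup, its methods, and the sharedStore function are hypothetical names, and the function merely stands in for a query against the executor's shared job data.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical sketch: resolves a job's status from a node-local cache first
 * and falls back to a store shared by all cluster members on a cache miss.
 * None of these names come from jBPM or kie-remote.
 */
final class JobStatusLookup {

    // Node-local cache; only holds jobs this node has seen.
    private final Map<String, String> localCache = new ConcurrentHashMap<>();

    // Stand-in (assumption) for a query against the executor's shared job data.
    private final Function<String, Optional<String>> sharedStore;

    JobStatusLookup(Function<String, Optional<String>> sharedStore) {
        this.sharedStore = sharedStore;
    }

    /** Records a status observed on this node. */
    void recordLocally(String jobId, String status) {
        localCache.put(jobId, status);
    }

    /** Returns the status, consulting the shared store when the local cache has no entry. */
    Optional<String> statusOf(String jobId) {
        String cached = localCache.get(jobId);
        if (cached != null) {
            return Optional.of(cached);
        }
        // Cache miss: the job may have run on another cluster member.
        return sharedStore.apply(jobId);
    }
}
```

The sketch deliberately does not copy the fallback result into the local cache, so a status read from the shared store cannot go stale on this node. Whether a non-final status held locally should also trigger a shared lookup is a design choice the sketch leaves open.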

Moreover, the tests will need enhancements, as the current test cases do not check actual cluster behavior, so the outcome of a test might not be properly verified. Different operations may need different test enhancements. Here are some suggestions; feel free to apply any others that might be better choices:

Guvnor operations (e.g. compile project, clone repository)
- make sure that every node in the cluster is checked for the given job id. The job is executed on only one of the nodes (with no guarantee which one), but both nodes should be able to return valid status information about it.

Deployment/undeployment operations
- always check that all nodes within the cluster have the deployment unit either deployed or undeployed. Take into consideration that deployments are synchronized in the background, so it is best to delay the check on all nodes by the synchronization interval. As far as I know it is 1 second in the tests, although it might differ as it is configurable.

With these enhancements we get a proper cluster test. In production the cluster will be used with a load balancer in front of the actual cluster nodes, so REST calls can be routed to any cluster member without our knowledge.
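
As an illustration of the "check all nodes" suggestion for deployments, a test could gather the deployment units each node reports and flag the nodes that disagree with the expected state. The sketch below assumes that this per-node data has already been collected, for example by calling each node directly rather than through the load balancer. ClusterDeploymentCheck, inconsistentNodes, and the node and unit names are hypothetical and not part of the existing test suite.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical test helper; not part of jBPM or its test suite. */
final class ClusterDeploymentCheck {

    /**
     * Returns the names of the nodes whose state disagrees with the expectation.
     *
     * @param deployedByNode node name mapped to the deployment unit ids that node reports
     * @param unitId         deployment unit to look for
     * @param expectDeployed true after a deploy, false after an undeploy
     */
    static List<String> inconsistentNodes(Map<String, Set<String>> deployedByNode,
                                          String unitId, boolean expectDeployed) {
        List<String> offenders = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : deployedByNode.entrySet()) {
            boolean deployed = entry.getValue().contains(unitId);
            if (deployed != expectDeployed) {
                offenders.add(entry.getKey());
            }
        }
        return offenders;
    }
}
```

A test would assert that the returned list is empty after each deploy or undeploy, once the background synchronization has had time to run.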
Comment 3 Radovan Synek 2015-11-09 11:32:10 EST
I have to reassign this issue, as the undeploy operation brought the cluster into an inconsistent state: the first node properly removed the deployment unit, but the second node didn't.

Attaching a standalone reproducer and server logs.
Comment 4 Radovan Synek 2015-11-09 11:34 EST
Created attachment 1091858 [details]
server log node one
Comment 5 Radovan Synek 2015-11-09 11:35 EST
Created attachment 1091859 [details]
server log node two
Comment 6 Radovan Synek 2015-11-09 11:37 EST
Created attachment 1091861 [details]
maven based reproducer
Comment 7 Maciej Swiderski 2015-11-10 10:12:19 EST
One bit was still missing in JobResultManager: retrieving the job request data as well as the job result data, which is needed to properly deal with jobs distributed across the cluster.

Again, I'd like to emphasize that since the synchronization of deployments is done in the background, it is best to let it sync properly between operations, i.e. to add a delay between deploy/undeploy operations to make sure that what is tested has actually happened.
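
One way to let the synchronization settle between operations is sketched below. Note that it swaps the fixed delay mentioned above for polling with a timeout, which is an alternative and not what the comment prescribes. SyncWait and awaitCondition are hypothetical names, and the condition passed in stands in for whatever check the test performs against the nodes.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical test helper; not part of jBPM or its test suite. */
final class SyncWait {

    /**
     * Polls the condition until it holds or the timeout expires.
     * Returns true if the condition held, false on timeout or interruption.
     */
    static boolean awaitCondition(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test could call awaitCondition after a deploy with a condition such as "every node reports the unit as deployed" and only then issue the undeploy. The timeout should be comfortably longer than the configured synchronization interval.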
Comment 8 Radovan Synek 2015-11-20 05:29:40 EST
Verified with BPMS-6.2.0.CR1
