Bug 1273109 - BPM cluster: Job executor fails with NPE while creating a deployment unit
Summary: BPM cluster: Job executor fails with NPE while creating a deployment unit
Keywords:
Status: CLOSED EOL
Alias: None
Product: JBoss BPMS Platform 6
Classification: Retired
Component: jBPM Core
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: CR1
Target Release: 6.2.0
Assignee: Alessandro Lazarotti
QA Contact: Radovan Synek
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-10-19 15:45 UTC by Radovan Synek
Modified: 2020-03-27 20:04 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-27 20:04:34 UTC
Type: Bug
Embargoed:


Attachments
excerpt from server log node two (10.56 KB, text/plain)
2015-10-19 15:45 UTC, Radovan Synek
server log node one (690.67 KB, text/plain)
2015-11-09 16:34 UTC, Radovan Synek
server log node two (651.50 KB, text/plain)
2015-11-09 16:35 UTC, Radovan Synek
maven based reproducer (63.69 KB, application/zip)
2015-11-09 16:37 UTC, Radovan Synek

Description Radovan Synek 2015-10-19 15:45:42 UTC
Created attachment 1084450 [details]
excerpt from server log node two

Description of problem:
In a BPM cluster with two nodes in an EAP domain, a "deploy" operation triggered via the Guvnor REST API on the first cluster node fails, and the second cluster node shows an NPE in the server log:
Error during command org.kie.remote.services.rest.async.cmd.DeploymentCmd error message null: java.lang.NullPointerException
	at org.kie.remote.services.rest.async.cmd.DeploymentCmd.execute(DeploymentCmd.java:87) [kie-remote-services-6.3.0.Final-redhat-2.jar:6.3.0.Final-redhat-2]

Version-Release number of selected component (if applicable):
6.2.0.ER4

Steps to Reproduce:
1. set up a BPM cluster with two nodes
2. clone a repository containing a project into Business Central
3. deploy the project via REST API

Actual results:

Comment 1 Radovan Synek 2015-10-20 13:43:59 UTC
Update: the issue probably applies to other Guvnor REST API operations as well. A repository clone operation appeared to fail because the client kept receiving job status "ACCEPTED", although the job had completed properly (verified in the UI).
The probable reason is that the job had been served by a job executor on cluster node two while the client was communicating with cluster node one. In some cases the operation succeeded, likely when the job had been served by the same node the client was communicating with.

Comment 2 Maciej Swiderski 2015-10-23 10:39:44 UTC
The problem was caused by the use of a local cache that kept information about jobs; this applies to both deployment and Guvnor (project) related operations. Since all operations are asynchronous and therefore run by the jBPM executor, they might be executed on any cluster member. Checking an individual node is therefore not guaranteed to work, as the local cache entry might not exist if the job runs on a different node than the one that received the REST request.

The solution was to extend the use of the executor API so that executed jobs can be queried when they are not found in the local cache.
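
A minimal sketch of the lookup pattern described above, not the code from the linked commits. All names here (`JobStatusLookup`, `SharedJobStore`, `statusOf`) are hypothetical, and the shared store is only a stand-in for whatever the executor API queries; the real jBPM executor API is not reproduced.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical names throughout; this is not the jBPM executor API. */
final class JobStatusLookup {

    /** Stand-in for a job store that every cluster node can query (assumption). */
    interface SharedJobStore {
        Optional<String> findStatus(String jobId);
    }

    // Per-node cache: only filled on the node that actually ran the job.
    private final Map<String, String> localCache = new ConcurrentHashMap<>();
    private final SharedJobStore sharedStore;

    JobStatusLookup(SharedJobStore sharedStore) {
        this.sharedStore = sharedStore;
    }

    /** Called on the node that executed the job. */
    void recordLocally(String jobId, String status) {
        localCache.put(jobId, status);
    }

    /** Local cache first; on a miss, ask the shared store instead of failing. */
    Optional<String> statusOf(String jobId) {
        String cached = localCache.get(jobId);
        if (cached != null) {
            return Optional.of(cached);
        }
        return sharedStore.findStatus(jobId);
    }
}
```

With this shape, a node that never ran the job still answers a status request, because a cache miss falls through to the shared store rather than dereferencing a missing entry.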

jbpm
master:
https://github.com/droolsjbpm/jbpm/commit/bea9921c55f55b7655259d247a752e3db3180fe7

6.3.x:
https://github.com/droolsjbpm/jbpm/commit/3204149a8c5a253b0ddd513bf61975865321e250

droolsjbpm-integration
master:
https://github.com/droolsjbpm/droolsjbpm-integration/commit/66d215802cd3cacce7abaa6ee7ea451d007a1921

6.3.x:
https://github.com/droolsjbpm/droolsjbpm-integration/commit/e51677851f4183f3211d9ae30152aa9d23da687f

guvnor
master:
https://github.com/droolsjbpm/guvnor/commit/5f844e2f7ec81f09cb756b707399a8cf4478a149

6.3.x:
https://github.com/droolsjbpm/guvnor/commit/e57dbbdcabb3861d313e1e3471c5fae96a2fea12

Moreover, the tests will need enhancements, as the current test cases do not check actual cluster behavior, so the outcome of a test might not be properly checked. The enhancement needed may differ depending on the operation executed. These are just suggestions, so feel free to apply any others that might be better choices:

Guvnor operations (e.g. compile project, clone repo, etc.)
- make sure that every node in the cluster is checked for the given job id. Since the job might be executed on any one of the nodes (with no guarantee which), both nodes should be able to return valid status information about the job
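
One possible shape for such a check, as a sketch only. `ClusterJobCheck`, `nodesNotReporting`, and the `statusClient` callback are hypothetical; in a real test the callback would wrap the REST call that fetches a job's status from one node, and the status strings are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical test helper; statusClient stands in for a per-node REST status call. */
final class ClusterJobCheck {

    private ClusterJobCheck() {}

    /**
     * Asks every node for the status of jobId and returns the nodes whose
     * answer differs from expectedStatus. An empty list means all nodes agree.
     */
    static List<String> nodesNotReporting(List<String> nodes, String jobId, String expectedStatus,
                                          BiFunction<String, String, String> statusClient) {
        List<String> disagreeing = new ArrayList<>();
        for (String node : nodes) {
            String actual = statusClient.apply(node, jobId);
            if (!expectedStatus.equals(actual)) {
                disagreeing.add(node);
            }
        }
        return disagreeing;
    }
}
```

A test would assert that the returned list is empty, so a failure names the node that could not report the job.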

deployment/undeployment operations
- always check that all nodes within the cluster have the deployment unit either deployed or undeployed. Take into consideration that deployments are synchronized in the background, so it is best to delay the check on all nodes by the synchronization interval. As far as I know it's 1 second in the tests, although it might differ since it's configurable.
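
A sketch of how a test could wait for the background synchronization. Note that it polls up to a timeout rather than sleeping for one fixed delay as suggested above; that is a substitution, chosen because the interval is configurable. `DeploymentSyncCheck`, `allNodesReach`, and the predicate are hypothetical names; the predicate would wrap whatever call reports a node's deployment state.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical test helper; the predicate stands in for a per-node deployment query. */
final class DeploymentSyncCheck {

    private DeploymentSyncCheck() {}

    /**
     * Polls until every node satisfies nodeInExpectedState or the timeout
     * elapses. Returns true only if all nodes agreed within the timeout.
     */
    static boolean allNodesReach(List<String> nodes, Predicate<String> nodeInExpectedState,
                                 Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (nodes.stream().allMatch(nodeInExpectedState)) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and report the check as not satisfied.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The timeout should be comfortably longer than the configured synchronization interval, so a slow sync is not reported as an inconsistent cluster.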

With these enhancements the tests properly cover the cluster case: in production the cluster will sit behind a load balancer, so REST calls can be routed to any cluster member without our knowledge.

Comment 3 Radovan Synek 2015-11-09 16:32:10 UTC
I have to reassign this issue, as the undeploy operation brought the cluster into an inconsistent state: the first node properly removed the deployment unit, but the second node didn't.

Attaching a standalone reproducer and server logs.

Comment 4 Radovan Synek 2015-11-09 16:34:29 UTC
Created attachment 1091858 [details]
server log node one

Comment 5 Radovan Synek 2015-11-09 16:35:06 UTC
Created attachment 1091859 [details]
server log node two

Comment 6 Radovan Synek 2015-11-09 16:37:53 UTC
Created attachment 1091861 [details]
maven based reproducer

Comment 7 Maciej Swiderski 2015-11-10 15:12:19 UTC
There was one missing bit to handle in JobResultManager: it needs to get the job request data as well as the job result data to deal properly with jobs distributed across the cluster.

droolsjbpm-integration
master:
https://github.com/droolsjbpm/droolsjbpm-integration/commit/711d3be82bf56051be45a68593b6c2139e9fde6c

6.3.x:
https://github.com/droolsjbpm/droolsjbpm-integration/commit/f4c8a18d045bc9cee3451599acdf4e3b7cdafa2d

Again, I'd like to emphasize that since the synchronization of deployments is done in the background, it is best to let it sync properly between operations, i.e. add a delay between deploy/undeploy operations to make sure that what is tested is actually happening.

Comment 8 Radovan Synek 2015-11-20 10:29:40 UTC
Verified with BPMS-6.2.0.CR1

