Red Hat Bugzilla – Bug 1273109
BPM cluster: Job executor fails with NPE while creating a deployment unit
Last modified: 2015-11-20 05:29:40 EST
Created attachment 1084450 [details]
excerpt from server log node two
Description of problem:
In a BPM cluster with two nodes in an EAP domain, a "deploy" operation triggered via the Guvnor REST API on the first cluster node fails, and the second cluster node shows an NPE in its server log:
Error during command org.kie.remote.services.rest.async.cmd.DeploymentCmd error message null: java.lang.NullPointerException
at org.kie.remote.services.rest.async.cmd.DeploymentCmd.execute(DeploymentCmd.java:87) [kie-remote-services-6.3.0.Final-redhat-2.jar:6.3.0.Final-redhat-2]
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. setup a BPM cluster with two nodes
2. clone a repository containing a project into Business Central
3. deploy the project via REST API
Update: the issue probably applies to other Guvnor REST API operations as well. Triggering a repository clone operation failed: the client kept receiving job status "ACCEPTED", although the job had completed properly (verified in the UI).
The probable reason is that the job had been served by the job executor on cluster node two while the client was communicating with cluster node one. In some cases the operation succeeded, likely when the job was served by the same node the client was communicating with.
The problem was caused by the use of a local cache that kept information about jobs; this applies to both deployment and Guvnor (project) related operations. Since all operations are asynchronous and thus run by the jBPM executor, they might be executed on any cluster member. Checking an individual node is therefore not guaranteed to work, as the local cache entry might not exist if the job ran on a different node than the one the REST request came in on.
The solution was to enhance the use of the executor API so that executed jobs can be queried when they are not found in the local cache.
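A minimal sketch of the lookup pattern described above: consult the node-local cache first, and on a miss fall back to the shared executor store that every cluster member can see. All names here (JobStatus, ExecutorJobStore, JobResultLookup) are hypothetical and only illustrate the idea; they are not the actual kie-remote-services API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Possible job states, reduced to the ones relevant for this sketch.
enum JobStatus { ACCEPTED, SUCCESS, FAILURE }

// Stands in for the jBPM executor query API, which is backed by the
// shared database and therefore visible from any cluster node.
interface ExecutorJobStore {
    Optional<JobStatus> findJob(String jobId);
}

class JobResultLookup {
    private final Map<String, JobStatus> localCache = new ConcurrentHashMap<>();
    private final ExecutorJobStore executorStore;

    JobResultLookup(ExecutorJobStore executorStore) {
        this.executorStore = executorStore;
    }

    void cacheLocally(String jobId, JobStatus status) {
        localCache.put(jobId, status);
    }

    // Local cache first; if the job ran on another node, fall back to the
    // executor store instead of wrongly reporting the job as still ACCEPTED.
    JobStatus statusOf(String jobId) {
        JobStatus local = localCache.get(jobId);
        if (local != null) {
            return local;
        }
        return executorStore.findJob(jobId).orElse(JobStatus.ACCEPTED);
    }
}
```

The key design point is that the local cache remains a fast path for jobs executed on the same node, while the cross-node miss case is now answered from shared executor data instead of returning stale status.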
Moreover, enhancements to the tests will be needed, as the current test cases do not check actual cluster behavior, so the outcome of a test might not be properly verified. Depending on the operation executed, different test enhancements may be needed. These are just suggestions, so feel free to apply any others that might be better choices:
Guvnor operations (e.g. compile project, clone repository, etc.):
- make sure that all nodes in the cluster are checked for whether a given job id has been executed. Since the job might be executed on any one of the nodes (with no guarantee which one), both nodes should be capable of returning valid status information about the job
- always check that all nodes within the cluster have the deployment unit either deployed or undeployed. Take into consideration that deployments are synchronized in the background, so it's best to delay the check on all nodes by the amount of time the synchronization takes; as far as I know it's 1 second in the tests, although it might be different as it's configurable
With these enhancements we cover a proper cluster test: in production the cluster will be used with a load balancer in front of the actual cluster nodes, so REST calls can be routed to any cluster member without our knowledge.
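The test suggestions above could be sketched as a small helper that polls every cluster node and retries until all of them report the expected state, absorbing the background deployment-synchronization delay. The class name, retry policy, and supplier-based node checks are illustrative assumptions, not the actual test code.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical cluster-test helper: each Supplier represents a status check
// against one cluster node (e.g. a REST call asking whether a job finished
// or a deployment unit is present).
class ClusterAssert {

    // Returns true only if every node reports the expected state before the
    // retry attempts are exhausted; sleeps between attempts to give the
    // background synchronization (about 1 s in the tests) time to catch up.
    static boolean eventuallyOnAllNodes(List<Supplier<Boolean>> nodeChecks,
                                        int attempts, long delayMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (nodeChecks.stream().allMatch(Supplier::get)) {
                return true;
            }
            Thread.sleep(delayMillis); // allow background sync to complete
        }
        return false;
    }
}
```

Used this way, the test no longer passes just because the one node it happened to query was up to date; it fails unless every cluster member eventually converges.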
I have to reassign this issue, as the undeploy operation brought the cluster into an inconsistent state: the first node properly removed the deployment unit, but the second node didn't.
Attaching a standalone reproducer and server logs.
Created attachment 1091858 [details]
server log node one
Created attachment 1091859 [details]
server log node two
Created attachment 1091861 [details]
maven based reproducer
There was one missing bit to handle in JobResultManager: getting the job request data as well as the job result data, in order to properly deal with cluster-distributed jobs.
Again, I'd like to emphasize that since the synchronization of deployments is done in the background, it's best to let it sync properly between operations: delay between deploy/undeploy operations to make sure that what is tested is actually happening.
Verified with BPMS-6.2.0.CR1