956805 – Failed to start the Infinispan subsystem with cause NullPointerException in domain mode


Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Enterprise Application Platform 6
Classification:	JBoss
Component:	EJB
Sub Component:
Version:	6.0.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	ER1
Target Release:	EAP 6.2.0
Assignee:	David M. Lloyd
QA Contact:	Jan Martiska
Docs Contact:	Russell Dickenson
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-04-25 17:14 UTC by wfink
Modified:	2019-06-20 19:29 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-12-15 16:14:08 UTC
Type:	Bug
Embargoed:


Description wfink 2013-04-25 17:14:23 UTC

Description of problem:

The following NPE occurs often when the server is started:

ERROR [org.jboss.msc.service.fail] (MSC service thread 1-46) MSC000001: Failed to start service jboss.ejb.cache.store.infinispan: org.jboss.msc.service.StartException in service jboss.ejb.cache.store.infinispan: Failed to start service
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1767) [jboss-msc-1.0.2.GA-redhat-2.jar:1.0.2.GA-redhat-2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [rt.jar:1.6.0_17]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [rt.jar:1.6.0_17]
    at java.lang.Thread.run(Thread.java:636) [rt.jar:1.6.0_17]
Caused by: java.lang.NullPointerException
    at org.jboss.as.ejb3.remote.LocalEjbReceiver.addClusterNodes(LocalEjbReceiver.java:423)
    at org.jboss.as.ejb3.remote.LocalEjbReceiver.addClusterNodes(LocalEjbReceiver.java:409)
    at org.jboss.as.ejb3.remote.LocalEjbReceiver.registryAdded(LocalEjbReceiver.java:378)
    at org.jboss.as.clustering.registry.RegistryCollectorService.add(RegistryCollectorService.java:52)
    at org.jboss.as.ejb3.cache.impl.backing.clustering.ClusteredBackingCacheEntryStoreSourceService.start(ClusteredBackingCacheEntryStoreSourceService.java:100)
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1811) [jboss-msc-1.0.2.GA-redhat-2.jar:1.0.2.GA-redhat-2]
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1746) [jboss-msc-1.0.2.GA-redhat-2.jar:1.0.2.GA-redhat-2]
    ... 3 more


The start procedure at the moment is:
- stop the DC (keep the HC running)
- start the DC (while the HC is still running)
  here the NPE is thrown sporadically and the Infinispan service fails to start
- stop the HC
- start the HC, here the problem was not seen

It has not been tested whether the DC works correctly in that state.

The problem looks like a race condition. Many applications are deployed during the restart, and one application may trigger the problem because it uses the most resources and starts more slowly than the others.
If the number of applications is reduced, the error disappears or occurs less often.

Comment 1 Brian Stansberry 2013-05-22 19:43:48 UTC

For this NPE to happen, EJBRemoteConnectorService.getEJBRemoteConnectorSocketBinding() would have to return null.

    SocketBinding getEJBRemoteConnectorSocketBinding() {
        if (this.remotingServer == null) {
            return null;
        }
        return this.remotingServer.getSocketBinding();
    }

So either this.remotingServer would have to be null, or this.remotingServer.getSocketBinding() returns null. The latter is unlikely; it returns a value that is injected via a simple dependency in the RemotingService utility class:

.addDependency(bindingName, SocketBinding.class, streamServerService.getSocketBindingInjector())

The former is a bit more unusual. The value of "remotingServer" is provided to EJBRemoteConnectorService in an odd way. The management operation handler EJB3RemoteServiceAdd adds a dependency:

// add dependency on the remoting server (which allows remoting connector to connect to it)
ejbRemoteConnectorServiceBuilder.addDependency(remotingServerServiceName);

and then in EJBRemoteConnectorService.start() the depended-on service is looked up from MSC:

// get the remoting server (which allows remoting connector to connect to it) service
final ServiceContainer serviceContainer = context.getController().getServiceContainer();
final ServiceController streamServerServiceController = serviceContainer.getRequiredService(this.remotingConnectorServiceName);
final AbstractStreamServerService streamServerService = (AbstractStreamServerService) streamServerServiceController.getService();

I don't understand why this is handled in this convoluted way. Why isn't an InjectedValue used, with EJB3RemoteServiceAdd setting up an injection? 

This unusual way of doing this should still work though.

Comment 2 Brian Stansberry 2013-05-22 21:06:29 UTC

(In reply to Brian Stansberry from comment #1)
>
> I don't understand why this is handled in this convoluted way. Why isn't an
> InjectedValue used, with EJB3RemoteServiceAdd setting up an injection? 
> 

Never mind; I get it. The Service<T> is needed, not the T.
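
The distinction can be illustrated with a simplified model. This is a sketch under assumptions, not the MSC or EAP source: `Svc<T>` and `StreamServerSvc` are hypothetical stand-ins, and the string values are placeholders. It shows why injecting only the value would not be enough when the caller needs state that lives on the service object itself, such as the socket binding.

```java
public class ServiceVersusValueSketch {
    /** Simplified, hypothetical stand-in for a service that produces a value of type T. */
    interface Svc<T> {
        T getValue();
    }

    /** Hypothetical stream server service: its value is a channel, but it also carries extra state. */
    static final class StreamServerSvc implements Svc<String> {
        @Override
        public String getValue() {
            return "channel";
        }

        // Only reachable through the service object, never through the value it produces.
        String getSocketBinding() {
            return "remoting";
        }
    }

    public static void main(String[] args) {
        StreamServerSvc service = new StreamServerSvc();
        // What a plain value injection would hand to the dependent service.
        String injectedValue = service.getValue();
        // The socket binding needs the service itself, hence the lookup of the service.
        System.out.println(injectedValue + " / " + service.getSocketBinding());
    }
}
```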

Comment 6 Joe Wertz 2013-08-07 04:39:23 UTC

Confirmed in 6.1.1ER4

Hit this while trying to dig into a variety of failures associated with org.jboss.as.test.clustering.cluster.ejb3.stateless.RemoteStatelessFailoverTestCase and BZ 921532.

It appears to be a race condition. Like Brian theorized, this.remotingServer == null when this failure occurs.

This is because EJBRemoteConnectorService.start(), which handles the unconventional dependency injection Brian described, hasn't been called when Infinispan attempts to get the remotingServer information. So it fails. The dependency injection itself doesn't fail, and shows up milliseconds later in the logs, but by then it's too late.

Wanted to get this information down somewhere while I'm still digging on how to actually fix it.
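
The ordering problem described in this comment can be reproduced deterministically in a small model. This is a minimal sketch, assuming a connector whose field is set only in start() and a listener that dereferences it without a null check. `Connector`, `Receiver`, and `run` are hypothetical names, not the EAP classes.

```java
import java.util.ArrayList;
import java.util.List;

public class StartOrderSketch {
    /** Hypothetical connector: its binding is unknown until start() runs. */
    static final class Connector {
        private String binding;

        void start() {
            binding = "remoting";
        }

        String getBinding() {
            return binding;
        }
    }

    /** Hypothetical listener that dereferences the binding without a null check. */
    static final class Receiver {
        private final Connector connector;

        Receiver(Connector connector) {
            this.connector = connector;
        }

        int registryAdded() {
            // Throws NullPointerException if the connector has not started yet.
            return connector.getBinding().length();
        }
    }

    /** Runs the listener callback and the connector start in the given order and logs the outcome. */
    static List<String> run(boolean connectorFirst) {
        List<String> log = new ArrayList<>();
        Connector connector = new Connector();
        Receiver receiver = new Receiver(connector);
        if (connectorFirst) {
            connector.start();
        }
        try {
            receiver.registryAdded();
            log.add("registryAdded ok");
        } catch (NullPointerException e) {
            log.add("registryAdded failed with NPE");
        }
        if (!connectorFirst) {
            connector.start(); // the start still happens, just too late for the callback
        }
        log.add("binding now " + connector.getBinding());
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

In the second run the connector does start and the binding becomes available, but only after the callback has already failed, which matches the log ordering reported above.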

Comment 7 Brian Stansberry 2013-08-07 17:02:58 UTC

What I could never figure out is why the EJBRemoteConnectorService.start() wouldn't have been called. You could have a race where it gets called, but too late, if there isn't a proper dependency somewhere. But I couldn't find a code path that would result in a missing dependency.

Comment 10 Joe Wertz 2013-08-15 10:45:28 UTC

Whew. 4-service dependency interaction is fun.

ClusteredBackingCacheEntryStoreSourceService starts, calls through to RegistryCollector, which notifies its listener LocalEjbReceiver, which needs the EJBRemoteConnectorService.remotingServer value.

Technically speaking, the LocalEjbReceiver implementation of the RegistryCollector listener is what requires the EJBRemoteConnectorService.remotingServer value, but right now the LocalEjbReceiver only has a 'ServiceLookupValue' attachment to the EJBRemoteConnectorService. Not a dependency.

However, if that dependency was added, I don't think it would stop the problem. The ClusteredBackingCacheEntryStoreSourceService only depends on the RegistryCollector and a 'ClientMappingRegistry'. Technically, if both of those started, the ClusteredBacking could start without the LocalEjbReceiver or the EJBRemoteConnectorService having started yet.

I think.

But I can't reproduce this anymore, so I'm not entirely sure this is actually the solution.
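
The dependency reasoning in this comment can be checked with a toy model. This is a sketch under assumptions: the map holds only the dependencies named in this comment, the service names are shortened labels, any other dependencies the real services declare are omitted, and `canStart` is a hypothetical helper rather than an MSC API.

```java
import java.util.Map;
import java.util.Set;

public class DependencySketch {
    // Declared dependencies as described in the comment above (other dependencies omitted).
    static final Map<String, Set<String>> DEPS = Map.of(
            "ClusteredBackingStore", Set.of("RegistryCollector", "ClientMappingRegistry"),
            "RegistryCollector", Set.<String>of(),
            "ClientMappingRegistry", Set.<String>of(),
            // Only a lookup of the connector, not a declared dependency on it.
            "LocalEjbReceiver", Set.<String>of(),
            "EJBRemoteConnector", Set.<String>of());

    /** A service may start as soon as every declared dependency has started. */
    static boolean canStart(String service, Set<String> started) {
        return started.containsAll(DEPS.get(service));
    }

    public static void main(String[] args) {
        Set<String> started = Set.of("RegistryCollector", "ClientMappingRegistry");
        // The connector is not in the started set, yet nothing blocks the store.
        System.out.println("store can start: " + canStart("ClusteredBackingStore", started));
        System.out.println("connector started: " + started.contains("EJBRemoteConnector"));
    }
}
```

In this model nothing orders the store after the connector, so the store's start() can trigger the listener callback while the connector is still down.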

Comment 11 Joe Wertz 2013-08-15 12:33:14 UTC

Never mind that last part. Reproduction successful. Also confirmed the theory and the solution.

Cleaning, checking upstream, and submitting pull requests tomorrow.

Comment 16 Jan Martiska 2013-09-18 13:39:05 UTC

Verified in 6.2.0.ER1.

Comment 21 Dimitris Andreadis 2013-10-24 18:28:11 UTC

Assigning jpai EJB issues to david.lloyd. Please re-assign to Cheng or others as needed.

Comment 23 dereed 2019-06-20 19:29:38 UTC

Customer-tested workaround: start EAP without any deployments (or at least with no clustered EJBs deployed), then add the deployments after it has fully started.
