Description of problem:

The following NPE often occurs when the server is started:

ERROR [org.jboss.msc.service.fail] (MSC service thread 1-46) MSC000001: Failed to start service jboss.ejb.cache.store.infinispan: org.jboss.msc.service.StartException in service jboss.ejb.cache.store.infinispan: Failed to start service
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1767) [jboss-msc-1.0.2.GA-redhat-2.jar:1.0.2.GA-redhat-2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [rt.jar:1.6.0_17]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [rt.jar:1.6.0_17]
    at java.lang.Thread.run(Thread.java:636) [rt.jar:1.6.0_17]
Caused by: java.lang.NullPointerException
    at org.jboss.as.ejb3.remote.LocalEjbReceiver.addClusterNodes(LocalEjbReceiver.java:423)
    at org.jboss.as.ejb3.remote.LocalEjbReceiver.addClusterNodes(LocalEjbReceiver.java:409)
    at org.jboss.as.ejb3.remote.LocalEjbReceiver.registryAdded(LocalEjbReceiver.java:378)
    at org.jboss.as.clustering.registry.RegistryCollectorService.add(RegistryCollectorService.java:52)
    at org.jboss.as.ejb3.cache.impl.backing.clustering.ClusteredBackingCacheEntryStoreSourceService.start(ClusteredBackingCacheEntryStoreSourceService.java:100)
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1811) [jboss-msc-1.0.2.GA-redhat-2.jar:1.0.2.GA-redhat-2]
    at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1746) [jboss-msc-1.0.2.GA-redhat-2.jar:1.0.2.GA-redhat-2]
    ... 3 more

The start procedure at the moment is:
- stop the DC (keep the HC running)
- start the DC (while the HC is still running); here the NPE is thrown sporadically and the infinispan service fails to start
- stop the HC
- start the HC; here the problem was not seen

It has not been tested whether the DC works correctly in that state.
It looks like the problem is a race condition. During the restart many applications are deployed; one application might trigger the problem, as it has the highest resource use and starts more slowly. If the number of applications is reduced, the error disappears or happens less often.
For this NPE to happen, EJBRemoteConnectorService.getEJBRemoteConnectorSocketBinding() would have to return null.

    SocketBinding getEJBRemoteConnectorSocketBinding() {
        if (this.remotingServer == null) {
            return null;
        }
        return this.remotingServer.getSocketBinding();
    }

So either this.remotingServer would have to be null, or this.remotingServer.getSocketBinding() returns null.

The latter is unlikely; it returns a value that is injected via a simple dependency in the RemotingService utility class:

    .addDependency(bindingName, SocketBinding.class, streamServerService.getSocketBindingInjector())

The former is a bit more unusual. The value of "remotingServer" is provided to EJBRemoteConnectorService in an odd way. The management operation handler EJB3RemoteServiceAdd adds a dependency:

    // add dependency on the remoting server (which allows remoting connector to connect to it)
    ejbRemoteConnectorServiceBuilder.addDependency(remotingServerServiceName);

and then in EJBRemoteConnectorService.start() the depended-on service is looked up from MSC:

    // get the remoting server (which allows remoting connector to connect to it) service
    final ServiceContainer serviceContainer = context.getController().getServiceContainer();
    final ServiceController streamServerServiceController = serviceContainer.getRequiredService(this.remotingConnectorServiceName);
    final AbstractStreamServerService streamServerService = (AbstractStreamServerService) streamServerServiceController.getService();

I don't understand why this is handled in this convoluted way. Why isn't an InjectedValue used, with EJB3RemoteServiceAdd setting up an injection?

This unusual way of doing this should still work though.
(In reply to Brian Stansberry from comment #1)
> I don't understand why this is handled in this convoluted way. Why isn't an
> InjectedValue used, with EJB3RemoteServiceAdd setting up an injection?

Never mind; I get it. The Service<T> is needed, not the T.
Confirmed in 6.1.1ER4.

Hit this while trying to dig into a variety of failures associated with org.jboss.as.test.clustering.cluster.ejb3.stateless.RemoteStatelessFailoverTestCase and BZ 921532.

It appears to be a race condition. As Brian theorized, this.remotingServer == null when this failure occurs. This is because EJBRemoteConnectorService.start(), which handles the unconventional dependency injection Brian described, hasn't been called yet when infinispan attempts to get the remotingServer information, so the lookup fails. The dependency injection itself doesn't fail, and shows up milliseconds later in the logs, but by then it's too late.

Wanted to get this information down somewhere while I'm still digging into how to actually fix it.
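To make the ordering problem concrete, here is a minimal, self-contained sketch in plain Java. It does not use the MSC or EJB3 classes; ConnectorSketch, StartOrderRace, the string "remoting-socket-binding" and the awaitSocketBinding() method are all hypothetical stand-ins. It only illustrates why a reader that runs before start() observes null, and it is not the fix that was applied for this bug.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for EJBRemoteConnectorService: the field is only set in start(). */
class ConnectorSketch {
    private volatile String remotingServer; // stays null until start() runs
    private final CountDownLatch started = new CountDownLatch(1);

    void start() {
        remotingServer = "remoting-socket-binding"; // placeholder value, an assumption
        started.countDown();
    }

    /** Same shape as the getter quoted in the report: null if start() has not run yet. */
    String getSocketBinding() {
        return remotingServer;
    }

    /** Assumed alternative: wait for start() to finish, or return null after the timeout. */
    String awaitSocketBinding(long timeoutMillis) {
        try {
            return started.await(timeoutMillis, TimeUnit.MILLISECONDS) ? remotingServer : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}

class StartOrderRace {
    /** The listener callback fires first; start() arrives afterwards. */
    static String notifyBeforeStart() {
        ConnectorSketch connector = new ConnectorSketch();
        String seen = connector.getSocketBinding(); // observes null
        connector.start();                          // too late for the reader above
        return seen;
    }

    /** The listener is ordered after start(), which runs on another thread. */
    static String notifyWithOrdering() {
        ConnectorSketch connector = new ConnectorSketch();
        Thread starter = new Thread(connector::start);
        starter.start();
        return connector.awaitSocketBinding(5_000);
    }

    public static void main(String[] args) {
        System.out.println("before start: " + notifyBeforeStart());
        System.out.println("with ordering: " + notifyWithOrdering());
    }
}
```

In the first case the caller would dereference the null and fail, much like LocalEjbReceiver.addClusterNodes() in the stack trace. In MSC the ordering would normally come from a service dependency rather than a latch; the latch is only used here to keep the sketch standalone.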
What I could never figure out is why the EJBRemoteConnectorService.start() wouldn't have been called. You could have a race where it gets called, but too late, if there isn't a proper dependency somewhere. But I couldn't find a code path that would result in a missing dependency.
Whew. 4-service dependency interaction is fun.

ClusteredBackingCacheEntryStoreSourceService starts and calls through to RegistryCollector, which notifies its listener LocalEjbReceiver, which needs the EJBRemoteConnectorService.remotingServer value.

Technically speaking, the LocalEjbReceiver implementation of the RegistryCollector listener is what requires the EJBRemoteConnectorService.remotingServer value, but right now the LocalEjbReceiver only has a 'ServiceLookupValue' attachment to the EJBRemoteConnectorService, not a dependency.

However, if that dependency were added, I don't think it would stop the problem. The ClusteredBackingCacheEntryStoreSourceService only depends on the RegistryCollector and a 'ClientMappingRegistry'. Technically, if both of those started, the ClusteredBacking could start without the LocalEjbReceiver or the EJBRemoteConnectorService having started yet. I think.

But I can't reproduce this anymore, so I'm not entirely sure this is actually the solution.
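The dependency reasoning above can be modelled with a toy graph. This is only a sketch: DependencySketch is a hypothetical class, not part of MSC, and the extra edge from the store service to EJBRemoteConnectorService is an assumption used for illustration. This thread does not say which dependency the eventual fix added.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of "service X may start once all of its dependencies are up". */
class DependencySketch {
    static final String STORE = "ClusteredBackingCacheEntryStoreSourceService";
    static final String CONNECTOR = "EJBRemoteConnectorService";

    private final Map<String, Set<String>> deps = new HashMap<>();

    void addDependency(String service, String dependsOn) {
        deps.computeIfAbsent(service, k -> new HashSet<>()).add(dependsOn);
    }

    /** True if every declared dependency of the service is in the started set. */
    boolean canStart(String service, Set<String> started) {
        return started.containsAll(deps.getOrDefault(service, Collections.emptySet()));
    }

    /** The graph as described in the comment: the store depends on two services only. */
    static DependencySketch asDescribed() {
        DependencySketch g = new DependencySketch();
        g.addDependency(STORE, "RegistryCollector");
        g.addDependency(STORE, "ClientMappingRegistry");
        return g;
    }

    public static void main(String[] args) {
        Set<String> started = new HashSet<>(Arrays.asList("RegistryCollector", "ClientMappingRegistry"));
        DependencySketch g = asDescribed();

        // The connector is not up, yet nothing prevents the store from starting.
        System.out.println("as described: " + g.canStart(STORE, started));

        // Hypothetical extra edge: the store now has to wait for the connector.
        g.addDependency(STORE, CONNECTOR);
        System.out.println("with extra edge: " + g.canStart(STORE, started));
    }
}
```

With only the two declared dependencies, the store is allowed to start while the connector is still down, which is the window in which the listener can see a null remotingServer.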
Never mind that last part. Reproduction successful. Also confirmed the theory and the solution. Cleaning up, checking upstream, and submitting pull requests tomorrow.
Verified in 6.2.0.ER1.
Assigning jpai EJB issues to david.lloyd. Please re-assign to Cheng or others as needed.
Customer tested workaround: start EAP without any deployments (or at least no clustered EJBs deployed), then add the deployments after it's fully started.