Bug 1027738
Summary: | Re-deployment of a clustered application fails due to race condition with Infinispan | |
---|---|---|---
Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | wfink
Component: | Clustering | Assignee: | Paul Ferraro <paul.ferraro>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Jitka Kozana <jkudrnac>
Severity: | unspecified | Docs Contact: | Russell Dickenson <rdickens>
Priority: | unspecified | |
Version: | 6.1.1, 6.2.0 | CC: | brian.stansberry, dandread, dereed, jawilson, jdoyle, jkudrnac, lthon, myarboro
Target Milestone: | CR2 | |
Target Release: | EAP 6.2.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-12-15 16:12:59 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
wfink, 2013-11-07 11:03:13 UTC
When using the 6.2.0.CR1 preview bits, the issue is still present. I'm using the reproducer from WFLY-2458, that is:

1. unzip jboss-eap-6.2.0.CR1.zip twice to create 2 separate EAP dirs, say node1 and node2
2. cp appone.ear node1/standalone/deployments/
3. cp appone.ear node2/standalone/deployments/
4. node1/bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node1
5. node2/bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node2 -Djboss.socket.binding.port-offset=100
6. touch node1/standalone/deployments/appone.ear
7. watch the console log of node1

I still see the error:

```
17:47:43,543 ERROR [org.jboss.msc.service.fail] (ServerService Thread Pool -- 58) MSC000001: Failed to start service jboss.infinispan.ejb.global-component-registry: org.jboss.msc.service.StartException in service jboss.infinispan.ejb.global-component-registry: org.infinispan.manager.EmbeddedCacheManagerStartupException: org.infinispan.CacheException: Unable to invoke method public void org.infinispan.remoting.transport.jgroups.JGroupsTransport.start() on object of type JGroupsTransport
	at org.jboss.as.clustering.msc.AsynchronousService$1.run(AsynchronousService.java:91) [jboss-as-clustering-common-7.3.0.Final-redhat-12.jar:7.3.0.Final-redhat-12]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_45]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_45]
	at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_45]
	at org.jboss.threads.JBossThread.run(JBossThread.java:122) [jboss-threads-2.1.1.Final-redhat-1.jar:2.1.1.Final-redhat-1]
Caused by: org.infinispan.manager.EmbeddedCacheManagerStartupException: org.infinispan.CacheException: Unable to invoke method public void org.infinispan.remoting.transport.jgroups.JGroupsTransport.start() on object of type JGroupsTransport
	at org.infinispan.factories.GlobalComponentRegistry.start(GlobalComponentRegistry.java:247)
	at org.jboss.as.clustering.infinispan.subsystem.GlobalComponentRegistryService.start(GlobalComponentRegistryService.java:57)
	at org.jboss.as.clustering.msc.AsynchronousService$1.run(AsynchronousService.java:86) [jboss-as-clustering-common-7.3.0.Final-redhat-12.jar:7.3.0.Final-redhat-12]
	... 4 more
Caused by: org.infinispan.CacheException: Unable to invoke method public void org.infinispan.remoting.transport.jgroups.JGroupsTransport.start() on object of type JGroupsTransport
	at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
	at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:886)
	at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:657)
	at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:646)
	at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:549)
	at org.infinispan.factories.GlobalComponentRegistry.start(GlobalComponentRegistry.java:225)
	... 6 more
Caused by: org.infinispan.CacheException: Unable to start JGroups Channel
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.startJGroupsChannelIfNeeded(JGroupsTransport.java:209)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.start(JGroupsTransport.java:198)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_45]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_45]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_45]
	at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_45]
	at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:203)
	... 11 more
Caused by: java.lang.IllegalStateException: channel is closed
	at org.jgroups.JChannel.checkClosed(JChannel.java:865)
	at org.jgroups.JChannel._preConnect(JChannel.java:520)
	at org.jgroups.JChannel.connect(JChannel.java:277)
	at org.jgroups.JChannel.connect(JChannel.java:268)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.startJGroupsChannelIfNeeded(JGroupsTransport.java:207)
	... 17 more
```

When I built WildFly master, which already has the fix, it failed too, so this is not EAP-specific. Moving back to assigned.

OK - so the previous fix ensured that the channel exists before Infinispan's global component registry starts; however, this appears to be insufficient. If undeploy doesn't trigger a full restart of the JGroups channel service, Infinispan's restarted cache manager attempts to reuse the channel that was previously closed, hence the exception. The underlying problem is that Infinispan now controls the lifecycle of the channel, while the channel is provided by EAP. Infinispan should only be connecting and disconnecting the channel; the channel should not close until EAP says so. I've opened an upstream issue for this: https://issues.jboss.org/browse/ISPN-3697 Until then, we can trick Infinispan into not closing the channel and instead make that the responsibility of the channel service. I'll have a PR ready momentarily.
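To picture the intended ownership split, here is a minimal, purely illustrative Java sketch; the class and method names below are hypothetical and are not the actual EAP or Infinispan code. The idea is that the server-side channel service creates and ultimately closes the JGroups channel, while the cache manager side is limited to connect/disconnect, so a redeploy that only restarts the cache manager can reconnect a still-open channel instead of tripping over a closed one.

```java
import org.jgroups.JChannel;

// Hypothetical sketch of the ownership model described above: the EAP-side
// channel service owns creation and close(); Infinispan only connects/disconnects.
public class ChannelOwnerService {

    private JChannel channel;

    // Started once by the server; the channel is meant to survive application redeploys.
    public synchronized JChannel start(String stackXml) throws Exception {
        channel = new JChannel(stackXml); // JGroups channel built from a protocol stack definition
        return channel;
    }

    // Only the owning service may close the channel for good.
    public synchronized void stop() {
        if (channel != null) {
            channel.close();
            channel = null;
        }
    }
}

// What the cache manager side should be limited to (again, illustrative only):
class CacheManagerSide {

    void onCacheManagerStart(JChannel channel, String clusterName) throws Exception {
        channel.connect(clusterName);   // join the cluster
    }

    void onCacheManagerStop(JChannel channel) {
        channel.disconnect();           // leave the cluster but keep the channel usable
        // no channel.close() here: a redeployed cache manager must be able to reconnect
    }
}
```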
I should note that this issue has a workaround. Instead of redeploying, the user can undeploy and then deploy again. The brief interval between the two allows the channel service to stop, so the subsequent deploy uses a fresh channel instead of reusing the old, closed one.

(In reply to Paul Ferraro from comment #6)
> I should note that this issue has a workaround.
> Instead of redeploying, the user can undeploy and deploy again.

Often this will not be accepted by the customer, as the desired behaviour is that the action is automatically rolled back if the new deployment fails. Also, the time during which the application is not reachable is longer.
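The failure mode, and the reason the undeploy/deploy workaround helps, can be reproduced in isolation with plain JGroups. The following standalone snippet is only an illustration (it uses the default JGroups protocol stack and a made-up cluster name, not EAP's configuration): a JChannel that has been closed cannot be reconnected, which is the IllegalStateException seen in the trace above, whereas a freshly created channel connects without trouble.

```java
import org.jgroups.JChannel;

public class ClosedChannelDemo {
    public static void main(String[] args) throws Exception {
        JChannel channel = new JChannel();   // default protocol stack
        channel.connect("demo-cluster");
        channel.close();                     // what effectively happens to the old channel on undeploy

        try {
            channel.connect("demo-cluster"); // what a plain redeploy ends up attempting
        } catch (IllegalStateException e) {
            // message is "channel is closed", matching the exception in the trace above
            System.out.println("reconnect on closed channel failed: " + e.getMessage());
        }

        // A fresh channel (what undeploy-then-deploy yields) connects without problems.
        JChannel fresh = new JChannel();
        fresh.connect("demo-cluster");
        fresh.close();
    }
}
```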
I just built the GitHub EAP 6.x branch, which contains Paul's fix from the pull request, and it's no good. When replaying the reproducer I described in comment 5, I no longer get the "java.lang.IllegalStateException: channel is closed" exception. Instead, I see a series of lovely NPEs, looking like this:

```
08:28:58,082 WARN  [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-19,shared=udp) Problems unmarshalling remote command from byte buffer: java.lang.NullPointerException
	at org.infinispan.marshall.jboss.JBossMarshaller$ExternalizerTableProxy.readObject(JBossMarshaller.java:164)
	at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:355)
	at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:213)
	at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:45)
	at org.infinispan.marshall.jboss.AbstractJBossMarshaller.objectFromObjectStream(AbstractJBossMarshaller.java:163)
	...
08:28:58,083 ERROR [org.jgroups.blocks.RequestCorrelator] (OOB-20,shared=udp) failed unmarshalling buffer into return value: java.lang.NullPointerException
	at org.infinispan.marshall.jboss.JBossMarshaller$ExternalizerTableProxy.readObject(JBossMarshaller.java:164)
	at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:355)
	at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:213)
	at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:45)
	at org.infinispan.marshall.jboss.AbstractJBossMarshaller.objectFromObjectStream(AbstractJBossMarshaller.java:163)
	...
```

And there's even this:

```
08:28:57,806 ERROR [org.jboss.msc.service.fail] (ServerService Thread Pool -- 57) MSC000001: Failed to start service jboss.infinispan.ejb.repl: org.jboss.msc.service.StartException in service jboss.infinispan.ejb.repl: org.infinispan.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl
	at org.jboss.as.clustering.msc.AsynchronousService$1.run(AsynchronousService.java:91) [jboss-as-clustering-common-7.3.0.Final-redhat-SNAPSHOT.jar:7.3.0.Final-redhat-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_45]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_45]
	at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_45]
	at org.jboss.threads.JBossThread.run(JBossThread.java:122) [jboss-threads-2.1.1.Final-redhat-1.jar:2.1.1.Final-redhat-1]
Caused by: org.infinispan.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl
	at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
	at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:886)
	at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:657)
	at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:646)
	at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:549)
	...
Caused by: org.infinispan.CacheException: java.lang.NullPointerException
	at org.infinispan.util.Util.rewrapAsCacheException(Util.java:542)
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:186)
	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:515)
	at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinator(LocalTopologyManagerImpl.java:293)
	at org.infinispan.topology.LocalTopologyManagerImpl.join(LocalTopologyManagerImpl.java:107)
	...
Caused by: java.lang.NullPointerException
	at org.infinispan.marshall.jboss.JBossMarshaller$ExternalizerTableProxy.readObject(JBossMarshaller.java:164)
	at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:355)
	at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:213)
	at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:45)
	at org.infinispan.marshall.jboss.AbstractJBossMarshaller.objectFromObjectStream(AbstractJBossMarshaller.java:163)
	...
```

But it's all essentially the same. Note that there are a LOT of these messages; they keep being reported for 1 minute, after which the log shows:

```
08:28:58,090 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) ISPN000080: Disconnecting and closing JGroups Channel
08:28:58,455 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) ISPN000082: Stopping the RpcDispatcher
```

Please take this as preliminary feedback. Once we get an official build, we'll provide an official QA statement.

This is a similar issue, but a separate cause, and it needs to be addressed by upstream. See: https://issues.jboss.org/browse/ISPN-3698

Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-3698 to Coding In Progress

This issue only appears to happen when the redeploy is triggered on multiple servers simultaneously. Unfortunately, triggering redeploy via the CLI does exactly this. As a workaround, a rollout plan can be used to perform the redeploy. In the HA case, this is probably preferable anyway, to avoid loss of service. This PR adds a workaround for ISPN-3698 and should completely fix this issue: https://github.com/jbossas/jboss-eap/pull/697

Paul Ferraro <paul.ferraro> made a comment on jira ISPN-3698: Bumping priority and setting target release flags where the fix is needed.

I built the current EAP "upstream" branch which has the fix, and I can confirm that the issue is indeed fixed. Please take this as preliminary feedback; once we get an official build, we will provide an official QA statement.

Verified with EAP 6.2.0.CR2 preview.