Dist cache can lose data if the primary owner and all backups are lost at the same time. To detect such data loss, they monitor the total sum of entries across all nodes using JMX. If the sum drops by a considerable amount, data loss (a lost segment) is concluded. But they found a case where the sum increases, not decreases, after a failed state transfer. This is probably caused by an extra backup segment that wasn't cleaned up when the last state transfer failed.
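The detection logic described above can be sketched as follows. This is an illustrative sketch, not the customer's actual monitoring code: the class name, the tolerance value, and the way counts are fed in (already collected from JMX) are all assumptions.

```java
// Sketch of the monitoring logic: sum the per-node entry counts (as collected
// via JMX) and flag a suspected segment loss when the total drops by more than
// a tolerance between two polls. The tolerance is illustrative; it is meant to
// absorb normal churn such as removals and expirations.
public class EntryCountMonitor {

    /** Sum the entry counts collected from each node. */
    public static long totalEntries(long[] perNodeCounts) {
        long total = 0;
        for (long c : perNodeCounts) {
            total += c;
        }
        return total;
    }

    /**
     * Data loss is suspected when the new total is smaller than the previous
     * total by more than the allowed tolerance.
     */
    public static boolean dataLossSuspected(long previousTotal, long currentTotal, long tolerance) {
        return previousTotal - currentTotal > tolerance;
    }

    public static void main(String[] args) {
        long before = totalEntries(new long[] {1000, 1000, 1000});
        long after = totalEntries(new long[] {1000, 400, 1000}); // one node lost entries
        System.out.println(dataLossSuspected(before, after, 100));
    }
}
```

Note that a check like this only catches decreases; it would not flag the increasing-sum case described above, which is why that case went unexplained until the logs were examined.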
Osamu, can you confirm if the problem still occurs with 6.4.0.Final? Based on the stack trace, it looks like the IllegalArgumentException was eliminated with the fix for bug 1155606 (ISPN-4852). (Filtered stack trace below)

~~~
2014/07/31:11:16:11,110 WARN [org.infinispan.topology.CacheTopologyControlCommand] (remote-thread-931) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=cache, type=CH_UPDATE, sender=node01, joinInfo=null, topologyId=61, currentCH=DefaultConsistentHash{numSegments=240, numOwners=2, members=[...]}, pendingCH=null, throwable=null, viewId=32}: java.lang.IllegalArgumentException: Node node04 is not a member
	at org.infinispan.distribution.ch.DefaultConsistentHash.getSegmentsForOwner(DefaultConsistentHash.java:87) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:457) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:199) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:39) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:104) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:201) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:152) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_65]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_65]
~~~
Testing in JDG 6.4.0 will take some time because they need to set up a large cluster and the issue is not easily reproducible. Instead, I created Bug 1203121, which focuses on the customer's real use case using getBulk. They will test with the new JDG at the same time, after Bug 1203121 gets fixed.
The ISPN-4852 fix avoids the IllegalArgumentException by replacing the direct call to DefaultConsistentHash.getSegmentsForOwner() with a call to getOwnedSegments().
Created attachment 1004278 [details] another_error.log
Thanks, Dan. The IllegalArgumentException case indeed seems to be fixed. We went through the log files of the failed state transfer again and found another case: OutdatedTopologyException. This is much rarer than the IllegalArgumentException case, but it is on the same critical path of this issue. I attached the full stack trace as another_error.log. Note that the code path after org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate() is different. Is this fixed in JDG 6.4.0 too?

~~~
2015-02-02 17:45:10.461 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-180934) ISPN000136: Execution error: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from AAAA/clustered(s1), see cause for remote stack trace
	at org.infinispan.remoting.transport.AbstractTransport.checkResponse(AbstractTransport.java:41) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	...
	at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:433) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:199) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:39) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:104) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:201) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:152) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_55]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_55]
	at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_55]
Caused by: org.infinispan.statetransfer.OutdatedTopologyException: Cache topology changed while the command was executing: expected 68, got 69
	...
~~~
Osamu, that exception is harmless; it shouldn't really be logged as ERROR. The stack trace does point to another problem, though: the topology update being blocked in the BlockingTaskAwareExecutorService.checkForReadyTasks() call (ISPN-5317). This can delay the rebalance confirmation, making the rebalance longer, and a lot longer if there were many commands waiting for the new topology. The workaround, as always, is to increase the number of threads in the remote-executor thread pool.
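For reference, the workaround could look something like the following embedded-mode configuration fragment. This is a sketch only: it assumes the Infinispan 6.x schema's remote-commands executor element, and the maxThreads value is illustrative; the element and property names should be verified against the exact version in use (in JDG server mode the equivalent pool is configured in the server subsystem instead).

```xml
<infinispan xmlns="urn:infinispan:config:6.0">
   <global>
      <!-- Illustrative sketch: enlarge the pool that executes remote commands
           so topology updates are not stuck behind blocked commands.
           Verify the element/property names for your exact version. -->
      <remoteCommandsExecutor factory="org.infinispan.executors.DefaultExecutorFactory">
         <properties>
            <property name="maxThreads" value="200"/>
         </properties>
      </remoteCommandsExecutor>
   </global>
</infinispan>
```

A larger pool reduces the chance that every remote-executor thread is parked waiting for a topology that cannot arrive because its delivery is itself queued behind them.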
Dan Berindei <dberinde> updated the status of jira ISPN-5268 to Resolved
The IllegalArgumentException appeared because the node had a previous consistent hash, but it was not a member of it (perhaps after a merge). The ISPN-4852 fix avoids the exception by replacing the direct call to DefaultConsistentHash.getSegmentsForOwner() with a call to StateConsumerImpl.getOwnedSegments().
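The shape of that fix can be illustrated with a minimal sketch. These are not the real Infinispan classes: the class, the map-based ownership model, and the empty-set fallback are simplifications meant only to show why an ownership-checking wrapper removes the exception for non-member nodes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the ISPN-4852 pattern: getSegmentsForOwner() treats
// a non-member as a programming error, while an ownership-checking wrapper in
// the getOwnedSegments() style degrades gracefully instead.
public class OwnedSegmentsSketch {

    private final Map<String, Set<Integer>> segmentsByOwner = new HashMap<>();

    public void addOwner(String node, Set<Integer> segments) {
        segmentsByOwner.put(node, segments);
    }

    // Mirrors DefaultConsistentHash.getSegmentsForOwner(): asking about a node
    // that is not a member of the hash throws IllegalArgumentException.
    public Set<Integer> getSegmentsForOwner(String node) {
        Set<Integer> segments = segmentsByOwner.get(node);
        if (segments == null) {
            throw new IllegalArgumentException("Node " + node + " is not a member");
        }
        return segments;
    }

    // Mirrors the spirit of the post-fix getOwnedSegments(): check membership
    // first and fall back to an empty set, so the caller needs no try/catch.
    public Set<Integer> getOwnedSegments(String node) {
        Set<Integer> segments = segmentsByOwner.get(node);
        return segments == null ? Collections.<Integer>emptySet() : segments;
    }
}
```

With this shape, a node that finds itself outside the previous consistent hash (e.g. after a merge) simply owns no segments, instead of aborting the topology update with an exception.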