Bug 1199210

Summary: Backup segments may not be removed after failed state transfer
Product: [JBoss] JBoss Data Grid 6 Reporter: Tristan Tarrant <ttarrant>
Component: InfinispanAssignee: Dan Berindei <dberinde>
Status: CLOSED CURRENTRELEASE QA Contact: Martin Gencur <mgencur>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.3.0CC: dberinde, jdg-bugs, mcimbora, onagano, pzapataf, tkimura
Target Milestone: ER3   
Target Release: 6.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-06-23 12:24:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1214804, 1203121    
Attachments:
Description Flags
another_error.log none

Description Tristan Tarrant 2015-03-05 16:18:26 UTC
Dist cache can lost data if the primary and all backups are lost at the same time.  To detect such data loss, they are monitoring the tatal sum of entries from all nodes using JMX.  If the sum reduces by a considerable ammount, data loss (by lost a segment) is concluded.  But they found a case that the sum increases, not decreases, after failed state transfer.  This is probably caused by an extra backup segment that hasn't been cleaned up when the last state transfer failed.

Comment 3 Dan Berindei 2015-03-17 17:07:19 UTC
Osamu, can you confirm if the problem still occurs with 6.4.0.Final?

Based on the stack trace, it looks like the IllegalArgumentException was eliminated with the fix for bug 1155606 (ISPN-4852). 

(Filtered stack trace below)

2014/07/31:11:16:11,110 WARN  [org.infinispan.topology.CacheTopologyControlCommand] (remote-thread-931) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=cache, type=CH_UPDATE, sender=node01, joinInfo=null, topologyId=61, currentCH=DefaultConsistentHash{numSegments=240, numOwners=2, members=[...]}, pendingCH=null, throwable=null, viewId=32}: java.lang.IllegalArgumentException: Node node04 is not a member
    at org.infinispan.distribution.ch.DefaultConsistentHash.getSegmentsForOwner(DefaultConsistentHash.java:87) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:457) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:199) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:39) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:104) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:201) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:152) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_65]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_65]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_65]

Comment 4 Osamu Nagano 2015-03-18 07:53:06 UTC
Testing in JDG 6.4.0 will take some time because they need to set up a large cluster and this is not easily reproducible.  Instead, I created Bug 1203121 which is focusing on the customer's real use case using getBulk.  They will test at the same time with the new JDG after Bug 1203121 gets fixed.

Comment 6 Dan Berindei 2015-03-19 14:35:32 UTC
The ISPN-4852 fix avoids the IllegalArgumentException by replacingthe direct call to DefaultConsistentHash.getSegmentsForOwner() with a call to getOwnedSegments().

Comment 7 Osamu Nagano 2015-03-20 04:12:41 UTC
Created attachment 1004278 [details]
another_error.log

Comment 8 Osamu Nagano 2015-03-20 04:17:52 UTC
Thanks, Dan.  IllegalArgumentException case seems to be actually fixed.  We went through the log files of failed state transfer again, and found another case, OutdatedTopologyException.  This is very rare than IllegalArgumentException case, but on the same critical path of this issue.  I attached the full stack trace as another_error.log.  Note that the code path after org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate() is different.  Is this fixed in JDG 6.4.0 too?

~~~
2015-02-02 17:45:10.461 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-180934) ISPN000136: Execution error: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from AAAA/clustered(s1), see cause for remote stack trace
	at org.infinispan.remoting.transport.AbstractTransport.checkResponse(AbstractTransport.java:41) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	...
	at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:433) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:199) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:39) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:104) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:201) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:152) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_55]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_55]
	at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_55]
Caused by: org.infinispan.statetransfer.OutdatedTopologyException: Cache topology changed while the command was executing: expected 68, got 69
	...
~~~

Comment 9 Dan Berindei 2015-03-20 15:58:58 UTC
Osamu, that exception is harmless, it shouldn't really be logged as ERROR.

The stack trace does point to another problem with the topology update being blocked in the BlockingTaskAwareExecutorService.checkForReadyTasks() call: ISPN-5317. This can delay the rebalance confirmation, thus making the rebalance longer, and even a lot longer if there were a lot of commands waiting for the new topology. 

The workaround, as always, is to increase the number of thread in the remote-executor thread pool...

Comment 11 JBoss JIRA Server 2015-04-27 10:42:02 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-5268 to Resolved

Comment 12 Dan Berindei 2015-04-27 14:30:18 UTC
The IllegalArgumentException appeared because the node had a previous consistent hash, but it was not a member of it (perhaps after a merge). The ISPN-4852 fix avoids the exception by replacing the direct call to DefaultConsistentHash.getSegmentsForOwner() with a call to StateConsumerImpl.getOwnedSegments().