Bug 1199210 - Backup segments may not be removed after failed state transfer
Summary: Backup segments may not be removed after failed state transfer
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.3.0
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ER3
: 6.5.0
Assignee: Dan Berindei
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks: jdg65-Beta-Blcokers 1203121
TreeView+ depends on / blocked
 
Reported: 2015-03-05 16:18 UTC by Tristan Tarrant
Modified: 2019-04-16 14:40 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-06-23 12:24:19 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
another_error.log (16.18 KB, text/plain)
2015-03-20 04:12 UTC, Osamu Nagano
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-5268 0 Major Resolved Backup segments may not be removed after failed state transfer 2015-08-04 08:15:16 UTC

Description Tristan Tarrant 2015-03-05 16:18:26 UTC
Dist cache can lost data if the primary and all backups are lost at the same time.  To detect such data loss, they are monitoring the tatal sum of entries from all nodes using JMX.  If the sum reduces by a considerable ammount, data loss (by lost a segment) is concluded.  But they found a case that the sum increases, not decreases, after failed state transfer.  This is probably caused by an extra backup segment that hasn't been cleaned up when the last state transfer failed.

Comment 3 Dan Berindei 2015-03-17 17:07:19 UTC
Osamu, can you confirm if the problem still occurs with 6.4.0.Final?

Based on the stack trace, it looks like the IllegalArgumentException was eliminated with the fix for bug 1155606 (ISPN-4852). 

(Filtered stack trace below)

2014/07/31:11:16:11,110 WARN  [org.infinispan.topology.CacheTopologyControlCommand] (remote-thread-931) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=cache, type=CH_UPDATE, sender=node01, joinInfo=null, topologyId=61, currentCH=DefaultConsistentHash{numSegments=240, numOwners=2, members=[...]}, pendingCH=null, throwable=null, viewId=32}: java.lang.IllegalArgumentException: Node node04 is not a member
    at org.infinispan.distribution.ch.DefaultConsistentHash.getSegmentsForOwner(DefaultConsistentHash.java:87) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:457) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:199) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:39) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:104) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:201) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:152) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_65]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_65]
    at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_65]

Comment 4 Osamu Nagano 2015-03-18 07:53:06 UTC
Testing in JDG 6.4.0 will take some time because they need to set up a large cluster and this is not easily reproducible.  Instead, I created Bug 1203121 which is focusing on the customer's real use case using getBulk.  They will test at the same time with the new JDG after Bug 1203121 gets fixed.

Comment 6 Dan Berindei 2015-03-19 14:35:32 UTC
The ISPN-4852 fix avoids the IllegalArgumentException by replacingthe direct call to DefaultConsistentHash.getSegmentsForOwner() with a call to getOwnedSegments().

Comment 7 Osamu Nagano 2015-03-20 04:12:41 UTC
Created attachment 1004278 [details]
another_error.log

Comment 8 Osamu Nagano 2015-03-20 04:17:52 UTC
Thanks, Dan.  IllegalArgumentException case seems to be actually fixed.  We went through the log files of failed state transfer again, and found another case, OutdatedTopologyException.  This is very rare than IllegalArgumentException case, but on the same critical path of this issue.  I attached the full stack trace as another_error.log.  Note that the code path after org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate() is different.  Is this fixed in JDG 6.4.0 too?

~~~
2015-02-02 17:45:10.461 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-180934) ISPN000136: Execution error: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from AAAA/clustered(s1), see cause for remote stack trace
	at org.infinispan.remoting.transport.AbstractTransport.checkResponse(AbstractTransport.java:41) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	...
	at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:433) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:199) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:39) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:104) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:201) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:152) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_55]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_55]
	at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_55]
Caused by: org.infinispan.statetransfer.OutdatedTopologyException: Cache topology changed while the command was executing: expected 68, got 69
	...
~~~

Comment 9 Dan Berindei 2015-03-20 15:58:58 UTC
Osamu, that exception is harmless, it shouldn't really be logged as ERROR.

The stack trace does point to another problem with the topology update being blocked in the BlockingTaskAwareExecutorService.checkForReadyTasks() call: ISPN-5317. This can delay the rebalance confirmation, thus making the rebalance longer, and even a lot longer if there were a lot of commands waiting for the new topology. 

The workaround, as always, is to increase the number of thread in the remote-executor thread pool...

Comment 11 JBoss JIRA Server 2015-04-27 10:42:02 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-5268 to Resolved

Comment 12 Dan Berindei 2015-04-27 14:30:18 UTC
The IllegalArgumentException appeared because the node had a previous consistent hash, but it was not a member of it (perhaps after a merge). The ISPN-4852 fix avoids the exception by replacing the direct call to DefaultConsistentHash.getSegmentsForOwner() with a call to StateConsumerImpl.getOwnedSegments().


Note You need to log in before you can comment on or make changes to this bug.