Dist cache can lose data if the primary owner and all backups are lost at the same time. To detect such data loss, they monitor the total sum of entries across all nodes using JMX. If the sum drops by a considerable amount, data loss (a lost segment) is concluded. But they found a case where the sum increases, not decreases, after a failed state transfer. This is probably caused by an extra backup segment that wasn't cleaned up when the last state transfer failed.
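The detection logic described above can be sketched as follows. This is an illustrative sketch, not the customer's actual monitoring code: the class name, the tolerance value, and the way counts are fed in (already collected from JMX) are all assumptions.

```java
// Sketch of the monitoring logic: sum the per-node entry counts (as collected
// via JMX) and flag a suspected segment loss when the total drops by more than
// a tolerance between two polls. The tolerance is illustrative; it is meant to
// absorb normal churn such as removals and expirations.
public class EntryCountMonitor {

    /** Sum the entry counts collected from each node. */
    public static long totalEntries(long[] perNodeCounts) {
        long total = 0;
        for (long c : perNodeCounts) {
            total += c;
        }
        return total;
    }

    /**
     * Data loss is suspected when the new total is smaller than the previous
     * total by more than the allowed tolerance.
     */
    public static boolean dataLossSuspected(long previousTotal, long currentTotal, long tolerance) {
        return previousTotal - currentTotal > tolerance;
    }

    public static void main(String[] args) {
        long before = totalEntries(new long[] {1000, 1000, 1000});
        long after = totalEntries(new long[] {1000, 400, 1000}); // one node lost entries
        System.out.println(dataLossSuspected(before, after, 100));
    }
}
```

Note that a check like this only catches decreases; it would not flag the increasing-sum case described above, which is why that case went unexplained until the logs were examined.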
Osamu, can you confirm if the problem still occurs with 6.4.0.Final? Based on the stack trace, it looks like the IllegalArgumentException was eliminated with the fix for bug 1155606 (ISPN-4852). (Filtered stack trace below)

~~~
2014/07/31:11:16:11,110 WARN [org.infinispan.topology.CacheTopologyControlCommand] (remote-thread-931) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=cache, type=CH_UPDATE, sender=node01, joinInfo=null, topologyId=61, currentCH=DefaultConsistentHash{numSegments=240, numOwners=2, members=[...]}, pendingCH=null, throwable=null, viewId=32}: java.lang.IllegalArgumentException: Node node04 is not a member
	at org.infinispan.distribution.ch.DefaultConsistentHash.getSegmentsForOwner(DefaultConsistentHash.java:87) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:457) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:199) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:39) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:104) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:201) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:152) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.1.0.Final-redhat-4.jar:6.1.0.Final-redhat-4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_65]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_65]
~~~
Testing in JDG 6.4.0 will take some time because they need to set up a large cluster and the issue is not easily reproducible. Instead, I created Bug 1203121, which focuses on the customer's real use case using getBulk. They will test with the new JDG at the same time, after Bug 1203121 gets fixed.
The ISPN-4852 fix avoids the IllegalArgumentException by replacing the direct call to DefaultConsistentHash.getSegmentsForOwner() with a call to getOwnedSegments().
Created attachment 1004278 [details] another_error.log
Thanks, Dan. The IllegalArgumentException case indeed seems to be fixed. We went through the log files of the failed state transfer again and found another case: OutdatedTopologyException. This is much rarer than the IllegalArgumentException case, but it is on the same critical path of this issue. I attached the full stack trace as another_error.log. Note that the code path after org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate() is different. Is this fixed in JDG 6.4.0 too?

~~~
2015-02-02 17:45:10.461 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (remote-thread-180934) ISPN000136: Execution error: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from AAAA/clustered(s1), see cause for remote stack trace
	at org.infinispan.remoting.transport.AbstractTransport.checkResponse(AbstractTransport.java:41) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	...
	at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:433) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:199) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:39) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:104) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.LocalTopologyManagerImpl.handleConsistentHashUpdate(LocalTopologyManagerImpl.java:201) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:152) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.1.1.Final-redhat-5.jar:6.1.1.Final-redhat-5]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_55]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_55]
	at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_55]
Caused by: org.infinispan.statetransfer.OutdatedTopologyException: Cache topology changed while the command was executing: expected 68, got 69
	...
~~~
Osamu, that exception is harmless; it shouldn't really be logged as ERROR. The stack trace does point to another problem, though: the topology update being blocked in the BlockingTaskAwareExecutorService.checkForReadyTasks() call (ISPN-5317). This can delay the rebalance confirmation, making the rebalance longer, and a lot longer if there were many commands waiting for the new topology. The workaround, as always, is to increase the number of threads in the remote-executor thread pool.
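For reference, the workaround could look something like the following embedded-mode configuration fragment. This is a sketch only: it assumes the Infinispan 6.x schema's remote-commands executor element, and the maxThreads value is illustrative; the element and property names should be verified against the exact version in use (in JDG server mode the equivalent pool is configured in the server subsystem instead).

```xml
<infinispan xmlns="urn:infinispan:config:6.0">
   <global>
      <!-- Illustrative sketch: enlarge the pool that executes remote commands
           so topology updates are not stuck behind blocked commands.
           Verify the element/property names for your exact version. -->
      <remoteCommandsExecutor factory="org.infinispan.executors.DefaultExecutorFactory">
         <properties>
            <property name="maxThreads" value="200"/>
         </properties>
      </remoteCommandsExecutor>
   </global>
</infinispan>
```

A larger pool reduces the chance that every remote-executor thread is parked waiting for a topology that cannot arrive because its delivery is itself queued behind them.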
Dan Berindei <dberinde> updated the status of jira ISPN-5268 to Resolved
The IllegalArgumentException appeared because the node had a previous consistent hash, but it was not a member of it (perhaps after a merge). The ISPN-4852 fix avoids the exception by replacing the direct call to DefaultConsistentHash.getSegmentsForOwner() with a call to StateConsumerImpl.getOwnedSegments().
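The shape of that fix can be illustrated with a minimal sketch. These are not the real Infinispan classes: the class, the map-based ownership model, and the empty-set fallback are simplifications meant only to show why an ownership-checking wrapper removes the exception for non-member nodes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the ISPN-4852 pattern: getSegmentsForOwner() treats
// a non-member as a programming error, while an ownership-checking wrapper in
// the getOwnedSegments() style degrades gracefully instead.
public class OwnedSegmentsSketch {

    private final Map<String, Set<Integer>> segmentsByOwner = new HashMap<>();

    public void addOwner(String node, Set<Integer> segments) {
        segmentsByOwner.put(node, segments);
    }

    // Mirrors DefaultConsistentHash.getSegmentsForOwner(): asking about a node
    // that is not a member of the hash throws IllegalArgumentException.
    public Set<Integer> getSegmentsForOwner(String node) {
        Set<Integer> segments = segmentsByOwner.get(node);
        if (segments == null) {
            throw new IllegalArgumentException("Node " + node + " is not a member");
        }
        return segments;
    }

    // Mirrors the spirit of the post-fix getOwnedSegments(): check membership
    // first and fall back to an empty set, so the caller needs no try/catch.
    public Set<Integer> getOwnedSegments(String node) {
        Set<Integer> segments = segmentsByOwner.get(node);
        return segments == null ? Collections.<Integer>emptySet() : segments;
    }
}
```

With this shape, a node that finds itself outside the previous consistent hash (e.g. after a merge) simply owns no segments, instead of aborting the topology update with an exception.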