Bug 989807 - Data loss when entry forwarding to primary owner and primary owner shutdown
Summary: Data loss when entry forwarding to primary owner and primary owner shutdown
Keywords:
Status: VERIFIED
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: DR3
Target Release: 6.2.0
Assignee: Tristan Tarrant
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks: 989809
 
Reported: 2013-07-30 02:01 UTC by Takayoshi Kimura
Modified: 2022-12-31 23:44 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-3366 0 Critical Resolved Data loss when entry forwarding to primary owner and primary owner shutdown 2016-03-31 05:24:34 UTC

Description Takayoshi Kimura 2013-07-30 02:01:20 UTC
Product ticket for ISPN-3366

Comment 1 JBoss JIRA Server 2013-07-30 14:12:55 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3366

[~nekop] You're right, searching for "Invoked with command .*0x033e0.746872656164" I got ~ 120k occurrences, so it's clearly the common prefix: the JBoss Marshalling info (033e), the length (0.), and then "thread" (746872656164).

I searched in the new logs for the first missing key: {{thread02key72}}, aka {{ByteArray\{size=16, hashCode=1b9657dd, array=0x033e0d74687265616430326b65793732\}}} and I could find it in node1's log, but not in node4's log. There is also a very similar bug for tx caches: ISPN-2510, so I think your analysis is spot on.
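
For reference, decoding that first missing key by hand (illustrative Java snippet; the byte layout is as described above):

{noformat}
// 0x03 0x3e = JBoss Marshalling header, 0x0d = key length (13), then 13 ASCII bytes.
String hex = "033e0d74687265616430326b65793732";
int len = Integer.parseInt(hex.substring(4, 6), 16);      // 0x0d -> 13
StringBuilder key = new StringBuilder();
for (int i = 6; i < 6 + len * 2; i += 2) {
    key.append((char) Integer.parseInt(hex.substring(i, i + 2), 16));
}
System.out.println(key);                                  // prints "thread02key72"
{noformat}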

Ideally we should retry on the originator if the primary owner is no longer running the cache, so I'll try to implement that. If it proves too difficult, I'll only throw an exception in the first phase.

Comment 2 JBoss JIRA Server 2013-07-30 20:54:05 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3366

A possible workaround would be to set {{transport.strictPeerToPeer = true}} in the global configuration, which will cause node4 to return a {{NamedCacheNotFoundException}} instead of {{null}} during shutdown. The HotRod client should then also receive an exception. 

But this would only work if the cluster is symmetric - i.e. all the nodes have exactly the same caches running. Otherwise, for replicated caches, any operation would result in a {{NamedCacheNotFoundException}}.
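
In library mode, a minimal sketch of this workaround (assuming the Infinispan 5.2 fluent configuration API; untested here):

{noformat}
import org.infinispan.configuration.global.GlobalConfiguration;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

// strictPeerToPeer(true): a node replies with NamedCacheNotFoundException
// instead of null when the requested cache is not running on it.
GlobalConfiguration global = new GlobalConfigurationBuilder()
      .transport().defaultTransport().strictPeerToPeer(true)
      .build();
EmbeddedCacheManager manager = new DefaultCacheManager(global);
{noformat}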

Comment 3 JBoss JIRA Server 2013-07-31 00:17:07 UTC
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366

Thanks Dan, I'll try the workaround.

Comment 4 JBoss JIRA Server 2013-07-31 00:42:22 UTC
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366

Looks like we don't have a strictPeerToPeer configuration option in Infinispan Server 5.2.x.

Comment 5 JBoss JIRA Server 2013-07-31 12:05:40 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3366

Right, {{strictPeerToPeer}} is only available in the standalone Infinispan configuration, and not in the Infinispan Server configuration.

Comment 7 JBoss JIRA Server 2013-08-01 09:15:59 UTC
Tristan Tarrant <ttarrant> made a comment on jira ISPN-3366

[~nekop], you can get strictPeerToPeer working by modifying the infinispan-defaults.xml within the infinispan subsystem jar (org.jboss.as.clustering.infinispan).

Comment 9 JBoss JIRA Server 2013-08-01 14:12:45 UTC
Radim Vansa <rvansa> made a comment on jira ISPN-3366

I have a test case in library mode (ispn 5.2.4) showing this bug: 4 nodes up, 3 of them loading data (100000 entries in total) via normal PUT operations. During the loading, one of the nodes is restarted. If a put operation on one of the loading nodes gets an exception, the put is repeated, but even with that, in the end a few keys return null on one of the nodes.
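
Roughly, the loading loop looks like this (illustrative Java; {{cache}}, {{threadId}} and {{keysPerThread}} are hypothetical names, not the actual test code):

{noformat}
// Each of the 3 loading nodes runs a loop like this; a put that fails while
// the restarted node is down is simply repeated until it succeeds.
for (int i = 0; i < keysPerThread; i++) {
    String key = "thread" + threadId + "key" + i;
    while (true) {
        try {
            cache.put(key, "value" + i);
            break;
        } catch (Exception e) {
            // node restart in progress; repeat the put
        }
    }
}
// Despite the retries, a few keys end up returning null on one of the nodes.
{noformat}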

[~dan.berindei]: please let me know when you have a fix for this issue, and I'll try to test it.

Comment 10 JBoss JIRA Server 2013-08-01 15:02:51 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3366

The fix (a sketch of the retry path follows below):
* Always reply with a CacheNotFoundResponse when the requested cache is not running.
* Throw a RemoteException(NamedCacheNotFoundException) from RpcManagerImpl if strictPeerToPeer = true.
* Throw an OutdatedTopologyException from NonTxConcurrentDistributionInterceptor if strictPeerToPeer = false.
* Catch the OutdatedTopologyException in StateTransferInterceptor and retry the operation.
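
A rough sketch of that retry path (simplified Java with hypothetical method names; not the actual interceptor code):

{noformat}
// invokeNextInterceptor and stateTransferManager are stand-ins for the real
// interceptor plumbing.
Object visitWriteCommand(InvocationContext ctx, WriteCommand command) throws Throwable {
    while (true) {
        try {
            return invokeNextInterceptor(ctx, command);
        } catch (OutdatedTopologyException e) {
            // The primary owner was not running the cache (or the topology
            // changed mid-flight): pick up the new topology id and retry.
            command.setTopologyId(stateTransferManager.getCacheTopology().getTopologyId());
        }
    }
}
{noformat}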

Comment 14 JBoss JIRA Server 2013-08-02 02:41:16 UTC
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366

Tested with 5.2.8-SNAPSHOT and data loss still happened. (The logs say 5.2.6-SNAPSHOT because Version.java was not updated.)

3 entries are missing; the first one is hash(thread10key59)=4d12e1a9.

Attached ISPN-3366-full-logs-3rd.zip; this time the total is 4000 entries, 1/10 of the previous size. It also contains JGroups TRACE logging.

Comment 15 JBoss JIRA Server 2013-08-02 02:42:00 UTC
Takayoshi Kimura <tkimura> updated the status of jira ISPN-3366 to Reopened

Comment 16 JBoss JIRA Server 2013-08-02 02:42:00 UTC
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366

Reopening, we still see data loss.

Comment 17 JBoss JIRA Server 2013-08-02 04:53:39 UTC
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366

Note that the probability of data loss is lower now, around 10%: I reproduced it 2 times out of 20 runs. Previously it was over 50%.

Comment 18 JBoss JIRA Server 2013-08-02 05:52:21 UTC
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366

Looks like the symptom is different now: the command retries (which fixes the original problem) but the entry is not committed on the last try; there is no "About to commit entry" log. Probably we need to tweak EntryWrappingInterceptor as well?

Comment 19 JBoss JIRA Server 2013-08-02 07:18:04 UTC
Radim Vansa <rvansa> made a comment on jira ISPN-3366

Confirmed, with the latest 5.2.x some entries are still lost.

Comment 20 JBoss JIRA Server 2013-08-05 14:41:30 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3366

Looks like the remaining data loss has a slightly different cause: a node can become primary owner while a command is executing. 

So we could have {{isPrimaryOwner(k) == false}} in EntryWrappingInterceptor, and {{isPrimaryOwner(k) == true}} by the time we enter NonTx[Concurrent]DistributionInterceptor:

{noformat}
11:23:12,876 TRACE [NonTransactionalLockingInterceptor] (HotRodServerWorker-9) Are (node2/clustered(s1)) we the lock owners for key 'ByteArrayKey{data=ByteArray{size=16, hashCode=4d12e1a9, array=0x033e0d74687265616431306b65793539}}'? false
11:23:12,876 TRACE [EntryWrappingInterceptor] (HotRodServerWorker-9) Wrapping entry 'ByteArrayKey{data=ByteArray{size=16, hashCode=4d12e1a9, array=0x033e0d74687265616431306b65793539}}'? false
11:23:12,882 TRACE [NonTxDistributionInterceptor] (HotRodServerWorker-9) Not doing a remote get for key ByteArrayKey{data=ByteArray{size=16, hashCode=4d12e1a9, array=0x033e0d74687265616431306b65793539}} since entry is not affected by rehash or is already in data container. We are node2/clustered(s1), owners are [node4/clustered(s1), node2/clustered(s1)]
11:23:12,886 TRACE [NonTxConcurrentDistributionInterceptor] (HotRodServerWorker-9) I'm the primary owner, sending the command to all ([node2/clustered(s1)]) the recipients in order to be applied.
{noformat}

Because the key is not wrapped on the originator, it's not written to the data container, but the command isn't looped back to the originator from the primary owner either.

A possible solution would be to check the topology id of the command against the current topology id before sending it, and throw an OutdatedTopologyException to cause StateTransferInterceptor to retry the command.
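
A minimal sketch of the proposed check (illustrative Java; field and method names are hypothetical):

{noformat}
// Before the originator sends the command, compare the topology id the
// command was created with against the current one; a mismatch triggers a
// retry via StateTransferInterceptor.
void checkTopologyBeforeSend(WriteCommand command) {
    int currentTopologyId = stateTransferManager.getCacheTopology().getTopologyId();
    if (command.getTopologyId() != currentTopologyId) {
        throw new OutdatedTopologyException("Topology changed from " +
                command.getTopologyId() + " to " + currentTopologyId);
    }
}
{noformat}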

Comment 21 JBoss JIRA Server 2013-08-06 17:46:16 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3366

Checking the topology id before invoking the command on the primary owner wouldn't be enough: the primary owner may have already installed a new topology in which it is no longer the primary owner, and it may not forward the command to the other owners (or it may not do anything, if it's no longer an owner). This could very well be the cause of issue ISPN-3357...

I had another idea: let the command execute and then check in StateTransferInterceptor whether the topology id changed. If so, we would retry the command on the new primary owner. The problem is that the check yields too many false positives, and if we retry a put command after it was successfully executed then we won't get the correct return value. This matters even more for conditional commands (see the example below).
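
To illustrate the conditional-command problem (hypothetical example, not from the logs; the two calls model the two internal attempts of a single client operation):

{noformat}
// Suppose the first attempt succeeds remotely, but the topology check flags a
// change, so the command is retried:
Object v1 = cache.putIfAbsent("k", "v");   // 1st attempt: key absent -> null
Object v2 = cache.putIfAbsent("k", "v");   // retry: key now exists -> "v"
// The originator only sees the retry's return value "v" and wrongly
// concludes that its own put failed.
{noformat}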

Comment 22 JBoss JIRA Server 2013-08-07 22:20:48 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3366

[~tkimura], could you try to build again from branch https://github.com/danberindei/infinispan/tree/t_3366_m and see if you can still reproduce the issue?

The fix there is still incomplete in that it breaks some putForExternalRead tests, but it should fix the data loss as it retries the command when the topology changes on the primary owner. BTW, this is the approach that I said wouldn't work in the previous comment - I realized the topology id is set on the originator, so checking for changes on the primary owner should work. The challenge now is to limit the cases where we retry the command, because putForExternalRead is asynchronous and retrying doesn't really work in that case.

Comment 24 JBoss JIRA Server 2013-08-09 01:58:00 UTC
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366

Tested with https://github.com/danberindei/infinispan/tree/t_3366_52 5.2.8-SNAPSHOT.

Ran the test 20 times and found 1 missing backup entry, but that is not complete data loss.

hash(thread17key76)=5904ce3d

* hotrod -> node3
* node3 -> node1
* node1 removed this entry due to rebalance

See ISPN-3366-full-logs-4th.zip.

Comment 25 JBoss JIRA Server 2013-08-09 05:35:45 UTC
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366

Tested 40 times: no complete data loss, but a missing backup entry 4 times.

If the missing backup issue is not related to the ISPN-3366 fix, we can close this issue and create another one for the missing backup issue.

Comment 26 JBoss JIRA Server 2013-08-09 16:20:37 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3366

The missing backup entry is probably related to my fix: if state transfer starts just before we issue a put command, and we send the entries to the new owner just before we commit the entry in the put command, that new owner won't have a backup copy of the key.

State transfer used to automatically forward commands to all the new owners if the topology changed during the execution of the command, but I tried to remove it because the forwarded commands were executed without holding a lock on the primary, so they could lead to inconsistencies. 

I'm still trying to come up with a better solution - perhaps sending the command to the backup owners only after the entries are committed on the primary owner will work. But in the meantime I'll just re-add the forwarding in the state transfer interceptor, as locking is not reliable in non-tx caches during state transfer anyway (two nodes could both think they are the primary owner for a key at the same time).
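
A rough sketch of the re-added forwarding (simplified Java; names like stateTransferManager, rpcManager and newOwners are illustrative, not the actual code):

{noformat}
// After the command runs, if the topology changed in the meantime, forward it
// to the owners in the new topology so a new backup owner that received state
// before the commit still gets the entry.
Object handleNonTxWrite(InvocationContext ctx, WriteCommand command) throws Throwable {
    int startTopologyId = command.getTopologyId();
    Object result = invokeNextInterceptor(ctx, command);
    int endTopologyId = stateTransferManager.getCacheTopology().getTopologyId();
    if (endTopologyId != startTopologyId) {
        // Forwarded commands run without the primary's lock; tolerated because
        // non-tx locking is unreliable during state transfer anyway.
        command.setTopologyId(endTopologyId);
        rpcManager.invokeRemotely(newOwners(command), command, rpcOptions);
    }
    return result;
}
{noformat}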

Comment 27 JBoss JIRA Server 2013-08-12 11:38:46 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-3366

I integrated the fix both in master and in 5.2.x.

Regular put commands should be ok now, but putIfAbsent commands may still end up with insufficient owners in certain situations.

Comment 28 JBoss JIRA Server 2013-08-13 03:30:33 UTC
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366

Verified the fix with 5.2.x branch.

For putIfAbsent commands, is that something we can fix within ISPN-3357?

Comment 29 JBoss JIRA Server 2013-08-14 11:37:38 UTC
Radim Vansa <rvansa> made a comment on jira ISPN-3366

I have verified the fix with 5.2.x as well (for 6.0.x, JGRP-1675 is blocking me from doing so).

Comment 30 JBoss JIRA Server 2013-08-15 10:47:45 UTC
Radim Vansa <rvansa> made a comment on jira ISPN-3366

It works in 6.0.x as well.

