Product ticket for ISPN-3366
Dan Berindei <dberinde> made a comment on jira ISPN-3366 [~nekop] You're right, searching for "Invoked with command .*0x033e0.746872656164" I got ~ 120k occurrences, so it's clearly the common prefix: the JBoss Marshalling info (033e), the length (0.), and then "thread" (746872656164). I searched in the new logs for the first missing key: {{thread02key72}}, aka {{ByteArray\{size=16, hashCode=1b9657dd, array=0x033e0d74687265616430326b65793732\}}} and I could find it in node1's log, but not in node4's log. There is also a very similar bug for tx caches: ISPN-2510, so I think your analysis is spot on. Ideally we should retry on the originator if the primary owner is no longer running the cache, so I'll try to implement that. If it proves too difficult, I'll only throw an exception in the first phase.
Dan Berindei <dberinde> made a comment on jira ISPN-3366 A possible workaround would be to set {{transport.strictPeerToPeer = true}} in the global configuration, which will cause node4 to return a {{NamedCacheNotFoundException}} instead of {{null}} during shutdown. The HotRod client should then also receive an exception. But this would only work if the cluster is symmetric - i.e. all the nodes have exactly the same caches running. Otherwise, for replicated caches, any operation would result in a {{NamedCacheNotFoundException}}.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366 Thanks Dan, I'll try the workaround.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366 Looks like we don't have strictPeerToPeer configuration in Infinispan Server 5.2.x.
Dan Berindei <dberinde> made a comment on jira ISPN-3366 Right, {{strictPeerToPeer}} is only available in the standalone Infinispan configuration, and not in the Infinispan Server configuration.
Tristan Tarrant <ttarrant> made a comment on jira ISPN-3366 [~nekop], you can get strictPeerToPeer working by modifying the infinispan-defaults.xml within the infinispan subsystem jar (org.jboss.as.clustering.infinispan)
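For illustration, the change Tristan describes amounts to flipping one attribute on the global transport element. The exact element layout and namespace depend on the schema version bundled with the subsystem, so treat this as a sketch rather than a verbatim excerpt of infinispan-defaults.xml:

```xml
<infinispan xmlns="urn:infinispan:config:5.2">
   <global>
      <!-- strictPeerToPeer makes a node that is not running the requested cache
           reply with NamedCacheNotFoundException instead of a silent null -->
      <transport clusterName="clustered" strictPeerToPeer="true"/>
   </global>
</infinispan>
```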
Radim Vansa <rvansa> made a comment on jira ISPN-3366 I have a test case in library mode (ISPN 5.2.4) showing this bug: 4 nodes up, 3 of them loading data (100000 entries in total) via normal PUT operations. During the loading, one of the nodes is restarted. If a put operation on one of the loading nodes gets an exception, the put is repeated, but even so, in the end I end up with a few keys returning null on one of the nodes. [~dan.berindei]: please let me know when you have a fix for this issue, and I'll try to test it.
Dan Berindei <dberinde> made a comment on jira ISPN-3366
* Always reply with a CacheNotFoundResponse when the requested cache is not running.
* Throw a RemoteException(NamedCacheNotFoundException) from RpcManagerImpl if strictPeerToPeer = true.
* Throw an OutdatedTopologyException from NonTxConcurrentDistributionInterceptor if strictPeerToPeer = false.
* Catch OutdatedTopologyException in StateTransferInterceptor and retry the operation.
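The retry flow described above can be sketched as a minimal single-threaded simulation in plain Java. The class and enum names mirror Infinispan's but are invented stand-ins for illustration; this is not the actual interceptor code:

```java
import java.util.Iterator;
import java.util.List;

public class RetryOnCacheNotFoundDemo {
    // Stand-ins for Infinispan's response/exception types (hypothetical, for illustration).
    enum Response { SUCCESS, CACHE_NOT_FOUND }

    static class NamedCacheNotFoundException extends RuntimeException {}
    static class OutdatedTopologyException extends RuntimeException {}

    /**
     * Mimics the non-tx write path: a CacheNotFoundResponse from the primary owner
     * either aborts the command (strictPeerToPeer = true) or triggers a retry
     * against the next installed topology (strictPeerToPeer = false), the way
     * StateTransferInterceptor would catch and retry.
     */
    static String invokeWithRetry(Iterator<Response> primaryOwnerResponses,
                                  boolean strictPeerToPeer) {
        while (true) {
            Response r = primaryOwnerResponses.next();
            if (r == Response.SUCCESS) {
                return "committed";
            }
            // The requested cache is not running on the node we contacted.
            if (strictPeerToPeer) {
                throw new NamedCacheNotFoundException();   // RpcManagerImpl behaviour
            }
            try {
                throw new OutdatedTopologyException();     // distribution interceptor behaviour
            } catch (OutdatedTopologyException e) {
                // StateTransferInterceptor catches it and loops to retry
                // with the newly installed topology.
            }
        }
    }

    public static void main(String[] args) {
        // First attempt hits a node that stopped its cache; the retry succeeds.
        List<Response> responses = List.of(Response.CACHE_NOT_FOUND, Response.SUCCESS);
        System.out.println(invokeWithRetry(responses.iterator(), false)); // prints "committed"
    }
}
```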
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366 Tested with 5.2.8-SNAPSHOT and data loss happened. (In the logs it says 5.2.6-SNAPSHOT because Version.java was not updated.) 3 entries are missing; the first one is hash(thread10key59)=4d12e1a9. Attached ISPN-3366-full-logs-3rd.zip; this time there are 4000 entries in total, 1/10 of the previous size. It also contains JGroups TRACE logging.
Takayoshi Kimura <tkimura> updated the status of jira ISPN-3366 to Reopened
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366 Reopening: we still see data loss.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366 Note that the probability of data loss is now lower, around 10%: I reproduced it 2 times out of 20 runs. Previously it was over 50%.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366 Looks like the symptom is different now: the command retries (this fixes the original problem) but the entry is not committed on the last try. There is no "About to commit entry" log. Probably we need to tweak EntryWrappingInterceptor as well?
Radim Vansa <rvansa> made a comment on jira ISPN-3366 Confirmed, with latest 5.2.x some entry is still lost.
Dan Berindei <dberinde> made a comment on jira ISPN-3366 Looks like the remaining data loss has a slightly different cause: a node can become primary owner while a command is executing. So we could have {{isPrimaryOwner(k) == false}} in EntryWrappingInterceptor, and {{isPrimaryOwner(k) == true}} by the time we enter NonTx[Concurrent]DistributionInterceptor:
{noformat}
11:23:12,876 TRACE [NonTransactionalLockingInterceptor] (HotRodServerWorker-9) Are (node2/clustered(s1)) we the lock owners for key 'ByteArrayKey{data=ByteArray{size=16, hashCode=4d12e1a9, array=0x033e0d74687265616431306b65793539}}'? false
11:23:12,876 TRACE [EntryWrappingInterceptor] (HotRodServerWorker-9) Wrapping entry 'ByteArrayKey{data=ByteArray{size=16, hashCode=4d12e1a9, array=0x033e0d74687265616431306b65793539}}'? false
11:23:12,882 TRACE [NonTxDistributionInterceptor] (HotRodServerWorker-9) Not doing a remote get for key ByteArrayKey{data=ByteArray{size=16, hashCode=4d12e1a9, array=0x033e0d74687265616431306b65793539}} since entry is not affected by rehash or is already in data container. We are node2/clustered(s1), owners are [node4/clustered(s1), node2/clustered(s1)]
11:23:12,886 TRACE [NonTxConcurrentDistributionInterceptor] (HotRodServerWorker-9) I'm the primary owner, sending the command to all ([node2/clustered(s1)]) the recipients in order to be applied.
{noformat}
Because the key is not wrapped on the originator, it's not written to the data container, but the command isn't looped back to the originator from the primary owner either. A possible solution would be to check the topology id of the command against the current topology id before sending it, and throw an OutdatedTopologyException to cause StateTransferInterceptor to retry the command.
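The proposed topology-id guard can be sketched as a plain-Java simulation. All names here are invented stand-ins, not Infinispan's actual interceptor API:

```java
public class TopologyCheckDemo {
    static class OutdatedTopologyException extends RuntimeException {}

    // Simplified stand-in for a node's view of the distribution layer.
    static class Node {
        int currentTopologyId;
        Node(int id) { currentTopologyId = id; }

        /**
         * Before shipping a write to the owners, compare the topology id the
         * command was created with against the one currently installed. If a
         * rebalance happened in between, ownership may have changed under us,
         * so throw and let the caller retry with fresh routing information.
         */
        void sendToOwners(int commandTopologyId) {
            if (commandTopologyId != currentTopologyId) {
                throw new OutdatedTopologyException();
            }
            // ... pick owners from the current consistent hash and send ...
        }
    }

    static String putWithRetry(Node node, int commandTopologyId) {
        while (true) {
            try {
                node.sendToOwners(commandTopologyId);
                return "sent@" + commandTopologyId;
            } catch (OutdatedTopologyException e) {
                // Retry with the topology installed in the meantime.
                commandTopologyId = node.currentTopologyId;
            }
        }
    }

    public static void main(String[] args) {
        Node node = new Node(7);                   // topology advanced from 6 to 7 mid-command
        System.out.println(putWithRetry(node, 6)); // prints "sent@7"
    }
}
```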
Dan Berindei <dberinde> made a comment on jira ISPN-3366 Checking the topology id before invoking the command on the primary owner wouldn't be enough: the primary owner may have already installed a new topology in which it is no longer the primary owner, and it may not forward the command to the other owners (or it may not do anything, if it's no longer an owner). This could very well be the cause of issue ISPN-3357... I had another idea, to let the command execute and then check in StateTransferInterceptor whether the topology id changed. If so, we would retry the command on the new primary owner. The problem is that the check yields too many false positives, and if we retry a put command after it was successfully executed then we won't get the correct return value. This is even more important for conditional commands.
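The return-value problem with blind retries is easy to demonstrate with an ordinary map standing in for the cache (plain Java; this is an illustration of the semantics, not Infinispan code):

```java
import java.util.HashMap;
import java.util.Map;

public class RetryReturnValueDemo {
    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();

        // First attempt succeeds on the owner: put returns the previous value, null.
        String first = cache.put("k", "v1");
        System.out.println(first);          // prints "null"

        // If the topology check false-positives and we blindly retry the same put,
        // the retry observes its own earlier write and returns "v1" instead of null.
        String retried = cache.put("k", "v1");
        System.out.println(retried);        // prints "v1"

        // For conditional commands the damage is worse: a retried putIfAbsent
        // reports failure even though the original operation actually succeeded.
        String conditional = cache.putIfAbsent("k", "v2");
        System.out.println(conditional);    // prints "v1", i.e. "already present"
    }
}
```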
Dan Berindei <dberinde> made a comment on jira ISPN-3366 [~tkimura], could you try to build again from branch https://github.com/danberindei/infinispan/tree/t_3366_m and see if you can still reproduce the issue? The fix there is still incomplete in that it breaks some putForExternalRead tests, but it should fix the data loss as it retries the command when the topology changes on the primary owner. BTW, this is the approach that I said wouldn't work in the previous comment - I realized the topology id is set on the originator, so checking for changes on the primary owner should work. The challenge now is to limit the cases where we retry the command, because putForExternalRead is asynchronous and retrying doesn't really work in that case.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366 Tested with https://github.com/danberindei/infinispan/tree/t_3366_52 5.2.8-SNAPSHOT. Ran the test 20 times and found 1 missing backup entry, but it's not a complete data loss. hash(thread17key76)=5904ce3d
* hotrod -> node3
* node3 -> node1
* node1 removed this entry due to rebalance
See ISPN-3366-full-logs-4th.zip.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366 Tested 40 times: no complete data loss, but a missing backup entry 4 times. If the missing backup issue is not related to the ISPN-3366 fix, we can close this issue and create another one for the missing backup issue.
Dan Berindei <dberinde> made a comment on jira ISPN-3366 The missing backup entry is probably related to my fix: if state transfer starts just before we issue a put command, and we send the entries to the new owner just before we commit the entry in the put command, that new owner won't have a backup copy of the key. State transfer used to automatically forward commands to all the new owners if the topology changed during the execution of the command, but I tried to remove it because the forwarded commands were executed without holding a lock on the primary, so they could lead to inconsistencies. I'm still trying to come up with a better solution - perhaps sending the command to the backup owners only after the entries are committed on the primary owner will work. But in the meantime I'll just re-add the forwarding in the state transfer interceptor, as locking is not reliable in non-tx caches during state transfer anyway (two nodes could both think they are the primary owner for a key at the same time).
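The "send to the backups only after the entry is committed on the primary" idea that Dan floats above can be sketched as a single-threaded toy simulation. All names are invented and the real system is concurrent, so this only shows why reading the owner list after the primary commit lets an owner added mid-command still receive the write:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CommitThenReplicateDemo {
    // Each inner map stands in for one node's data container.
    static Map<String, Map<String, String>> nodes = new HashMap<>();
    // Backup owners per the currently installed topology (may change mid-command).
    static List<String> backupOwners = new ArrayList<>();

    static void put(String primary, String key, String value) {
        // 1. Commit on the primary owner first.
        nodes.get(primary).put(key, value);
        // 2. Only then replicate, reading the owner list *after* the commit,
        //    so an owner added by a concurrent rebalance still gets the write.
        for (String backup : backupOwners) {
            nodes.get(backup).put(key, value);
        }
    }

    public static void main(String[] args) {
        for (String n : List.of("node1", "node2", "node4")) nodes.put(n, new HashMap<>());
        backupOwners.add("node2");
        // Simulate a rebalance adding node4 as a backup while the put is in flight.
        backupOwners.add("node4");
        put("node1", "thread17key76", "v");
        System.out.println(nodes.get("node4").containsKey("thread17key76")); // prints "true"
    }
}
```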
Dan Berindei <dberinde> made a comment on jira ISPN-3366 I integrated the fix both in master and in 5.2.x. Regular put commands should be ok now, but putIfAbsent commands may still end up with insufficient owners in certain situations.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3366 Verified the fix with 5.2.x branch. For putIfAbsent commands, is it something we can fix within ISPN-3357?
Radim Vansa <rvansa> made a comment on jira ISPN-3366 I have verified the fix with 5.2.x as well (for 6.0.x, JGRP-1675 is blocking me from doing so).
Radim Vansa <rvansa> made a comment on jira ISPN-3366 It works in 6.0.x as well.