In non-tx cache, when conditional command (e.g. ReplaceCommand) fails on primary owner because the old value is not equal to current value, the NonTxDistributionInterceptor.visitReplaceCommand returns false but the entry is committed in EntryWrappingInterceptor anyway. Speaking about EntryWrappingInterceptor.invokeNextAndApplyChanges: NonTxDistributionInterceptor does not mark the command as unsuccessful if it fails on the primary owner, therefore, checking for command.isSuccessful() in order to retry the command upon topology change may seem insufficient.
Dan Berindei <dberinde> made a comment on jira ISPN-3603 I think the problem is that {{EntryWrappingInterceptor}} doesn't wrap the entry on the originator, but then {{NonTxDistributionInterceptor}} retrieves the key from another owner and wraps the entry itself (because state transfer is in progress and the originator is in the process of receiving the entry via state transfer). In fact, I think {{NonTxDistributionInterceptor.remoteGetBeforeWrite()}} is not needed at all. The remote get was required in order to check the previous value (or return it, for non-conditional commands). But in non-tx caches, only the return value on the primary owner matters, and the primary owner always has the value. So it should be safe to remove {{remoteGetBeforeWrite()}} completely.
Pedro Ruivo <pedroruivo2> updated the status of jira ISPN-3603 to Coding In Progress
Pedro Ruivo <pedroruivo2> made a comment on jira ISPN-3603 [~rvansa] can you provide me some logs? I've spoken with [~dan.berindei] and I agree that the remoteGetBeforeWrite() can be remove safely but the entry should only be committed if it is marked as changed. However, it is only marked if the command succeeds. Do you know if concurrent updates are happening at the same time (or send me the logs to look at it)?
Radim Vansa <rvansa> made a comment on jira ISPN-3603 Yes, there was a concurrent update which caused one replace to fail on the primary node. The entry was changed by the ReplaceCommand.perform - after the entry got wrapped, the command could be successfully execute on the node (the value was changed on primary node between the remote get and replicating the replace command there) and so it got changed in the command's context - and then it was committed.
Radim Vansa <rvansa> made a comment on jira ISPN-3603 By the way, we probably can't just remove the remoteGetBeforeWrite command if we're on the primary owner and the key is affected by rehash - otherwise we could execute the command thinking that the key is null (or some outdated value) although it just did not made it to this node yet via ST.
Mircea Markus <mmarkus> made a comment on jira ISPN-3603 [~rvansa] this doesn't look like something introduced in 6.0.0 but something existing since 5.2.x, can you please confirm?
Radim Vansa <rvansa> made a comment on jira ISPN-3603 Well, I think my resilience tests would show more "problematic situations" that are now fixed in 6.0 - looking for this particular issue would be pretty complicated. Unless you want all fixes to be backported to 5.2.x. I don't have any specific test-case, just stress the cluster until it gets broken and then spend quite a long time digging into the trace logs for each instance of some problem.
Dan Berindei <dberinde> made a comment on jira ISPN-3603 [~rvansa], the primary owner in the write CH is always the primary owner in the current CH as well, so it must already have the entry.
Radim Vansa <rvansa> made a comment on jira ISPN-3603 [~dan.berindei]: If the primary owner crashes, is primary ownership always transfered to one of old backups, or can it be transfered to a new node?
Fixed in ER3.1