Product ticket for ISPN-3357
Radim Vansa <rvansa> made a comment on jira ISPN-3357 I have written down a test in library mode exhibiting this behaviour. Please, let me know as fix is ready, I'll try to check that.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3357 It's rare but this issue also happens with node leave rebalance so I've updated the title accordingly. I've attached logs from my test - 8 nodes dist cluster, a node leave with kill -9, 6 backups are missing.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3357 It's rare but this issue also happens with node leave rebalance so I've updated the title accordingly. I've attached logs from my test - 8 nodes dist cluster, a node leave with kill -9, 6 backups are missing. The event sequence is basically same as node join rebalance.
Dan Berindei <dberinde> made a comment on jira ISPN-3357 It makes sense that this would happen during leave as well, as a leave is handled by first updating the CH to remove the leaver, and then adding a new owner just like during joins. The only difference is that the owners in the pending CH are almost always a superset of the owners in the current CH, so the write CH is the same as the pending CH and the final CH update doesn't actually change anything. The write commands already have an {{ignorePreviousValue}} flag that is used in tx caches to skip the remote get and the value check (ISPN-3235). We can use the same flag in non-tx caches to make the command non-conditional on backup nodes. There is a complication, though: say there is a state transfer in progress, the owners of k in the current CH are [A, B], and the owners in the pending CH are [C, B] (so the write CH owners are [A, B, C]). If A initiates a putIfAbsent(k, v) operation and C rejects the command because the topology changed (so the owners are now [C, B]), but B writes the value, then A will retry the command, C will do the previous value check, and the operation will fail - leaving B without an update. So we have to ignore the previous value on C, but only if it's the same as the one we're trying to set. Otherwise we could overwrite a value set by another operation, and we wouldn't know whether to return {{null}} or the value existing on C. That means that the values need to implement {{equals()}} for this to work - but this shouldn't be a problem with HotRod clients. One more possible source of inconsistencies is that if the originator notices a topology change at the end of the operation (in StateTransferInterceptor), it forwards the command again to all the owners. This happens without a lock, so it could overlap with another operation and leave a different value on each node. I plan to replace this with another topology in EntryWrappingInterceptor, just before applying the update, to make sure that if state transfer missed the update, then the command will be retried. Finally, there is another case where we break consistency that I wanted to mention: if the primary owner dies, and there are multiple backup owners (either because numOwners >= 3 or because there is a join in progress), only one of the backups may execute the command. The originator would report a failure, but the cache would still be inconsistent, and retrying the putIfAbsent call wouldn't fix it. The inconsistency can also appear if a command is rejected only on some of the backup owners, and the originator dies before retrying. [~tkimura], I know you mentioned that you ignored the keys for which ISPN reported an error in your tests, I just wanted to make sure that this is acceptable.
Dan Berindei <dberinde> made a comment on jira ISPN-3357 It makes sense that this would happen during leave as well, as a leave is handled by first updating the CH to remove the leaver, and then adding a new owner just like during joins. The only difference is that the owners in the pending CH are almost always a superset of the owners in the current CH, so the write CH is the same as the pending CH and the final CH update doesn't actually change anything. The write commands already have an {{ignorePreviousValue}} flag that is used in tx caches to skip the remote get and the value check (ISPN-3235). We can use the same flag in non-tx caches to make the command non-conditional on backup nodes. There is a complication, though: say there is a state transfer in progress, the owners of k in the current CH are [A, B], and the owners in the pending CH are [C, B] (so the write CH owners are [A, B, C]). If A initiates a putIfAbsent(k, v) operation and B rejects the command because the topology changed (so the owners are now [C, B]), but C writes the value, then A will retry the command, C will do the previous value check, and the operation will fail - leaving B without an update. So we have to ignore the previous value on C, but only if it's the same as the one we're trying to set. Otherwise we could overwrite a value set by another operation, and we wouldn't know whether to return {{null}} or the value existing on C. That means that the values need to implement {{equals()}} for this to work - but this shouldn't be a problem with HotRod clients. One more possible source of inconsistencies is that if the originator notices a topology change at the end of the operation (in StateTransferInterceptor), it forwards the command again to all the owners. This happens without a lock, so it could overlap with another operation and leave a different value on each node. I plan to replace this with another topology in EntryWrappingInterceptor, just before applying the update, to make sure that if state transfer missed the update, then the command will be retried. Finally, there is another case where we break consistency that I wanted to mention: if the primary owner dies, and there are multiple backup owners (either because numOwners >= 3 or because there is a join in progress), only one of the backups may execute the command. The originator would report a failure, but the cache would still be inconsistent, and retrying the putIfAbsent call wouldn't fix it. The inconsistency can also appear if a command is rejected only on some of the backup owners, and the originator dies before retrying. [~tkimura], I know you mentioned that you ignored the keys for which ISPN reported an error in your tests, I just wanted to make sure that this is acceptable.
Takayoshi Kimura <tkimura> made a comment on jira ISPN-3357 Yes the "originator dies" scenario is acceptable.
Dan Berindei <dberinde> made a comment on jira ISPN-3357 For now we are completely ignoring the previous value on the primary owner when retrying, not just if the current value matches the final value of the command.
It turns out that this bug will be fixed in ER1 (after integrating https://github.com/infinispan/infinispan/commit/f9982dea28). So moving the target milestone.
removed doc text flag.