1016499 – Conditional command is committed when it fails on primary owner

Bug 1016499 - Conditional command is committed when it fails on primary owner

Summary: Conditional command is committed when it fails on primary owner

Keywords:
Status:	VERIFIED
Alias:	None
Product:	JBoss Data Grid 6
Classification:	JBoss
Component:	Infinispan
Sub Component:
Version:	6.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ER3
Target Release:	6.2.0
Assignee:	Tristan Tarrant
QA Contact:	Martin Gencur
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1010419
TreeView+	depends on / blocked

Reported:	2013-10-08 09:32 UTC by Radim Vansa
Modified:	2022-05-31 22:24 UTC (History)
CC List:	1 user (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Type:	Bug
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	ISPN-3603	0	Critical	Resolved	Conditional command is committed on originator when it fails on primary owner	2013-10-21 11:01:21 UTC

Description Radim Vansa 2013-10-08 09:32:27 UTC

In non-tx cache, when conditional command (e.g. ReplaceCommand) fails on primary owner because the old value is not equal to current value, the NonTxDistributionInterceptor.visitReplaceCommand returns false but the entry is committed in EntryWrappingInterceptor anyway.

Speaking about EntryWrappingInterceptor.invokeNextAndApplyChanges: NonTxDistributionInterceptor does not mark the command as unsuccessful if it fails on the primary owner, therefore, checking for command.isSuccessful() in order to retry the command upon topology change may seem insufficient.

Comment 1 JBoss JIRA Server 2013-10-08 15:15:40 UTC

Dan Berindei <dberinde> made a comment on jira ISPN-3603

I think the problem is that {{EntryWrappingInterceptor}} doesn't wrap the entry on the originator, but then {{NonTxDistributionInterceptor}} retrieves the key from another owner and wraps the entry itself (because state transfer is in progress and the originator is in the process of receiving the entry via state transfer).

In fact, I think {{NonTxDistributionInterceptor.remoteGetBeforeWrite()}} is not needed at all. The remote get was required in order to check the previous value (or return it, for non-conditional commands). But in non-tx caches, only the return value on the primary owner matters, and the primary owner always has the value. So it should be safe to remove {{remoteGetBeforeWrite()}} completely.

Comment 2 JBoss JIRA Server 2013-10-08 18:25:29 UTC

Pedro Ruivo <pedroruivo2> updated the status of jira ISPN-3603 to Coding In Progress

Comment 3 JBoss JIRA Server 2013-10-08 18:57:55 UTC

Pedro Ruivo <pedroruivo2> made a comment on jira ISPN-3603

[~rvansa] can you provide me some logs? 
I've spoken with [~dan.berindei] and I agree that the remoteGetBeforeWrite() can be remove safely but the entry should only be committed if it is marked as changed. However, it is only marked if the command succeeds. 
Do you know if concurrent updates are happening at the same time (or send me the logs to look at it)?

Comment 4 JBoss JIRA Server 2013-10-09 07:02:09 UTC

Radim Vansa <rvansa> made a comment on jira ISPN-3603

Yes, there was a concurrent update which caused one replace to fail on the primary node.
The entry was changed by the ReplaceCommand.perform - after the entry got wrapped, the command could be successfully execute on the node (the value was changed on primary node between the remote get and replicating the replace command there) and so it got changed in the command's context - and then it was committed.

Comment 5 JBoss JIRA Server 2013-10-09 07:10:29 UTC

Radim Vansa <rvansa> made a comment on jira ISPN-3603

By the way, we probably can't just remove the remoteGetBeforeWrite command if we're on the primary owner and the key is affected by rehash - otherwise we could execute the command thinking that the key is null (or some outdated value) although it just did not made it to this node yet via ST.

Comment 6 JBoss JIRA Server 2013-10-09 08:38:46 UTC

Mircea Markus <mmarkus> made a comment on jira ISPN-3603

[~rvansa] this doesn't look like something introduced in 6.0.0 but something existing since 5.2.x, can you please confirm?

Comment 7 JBoss JIRA Server 2013-10-09 09:08:14 UTC

Radim Vansa <rvansa> made a comment on jira ISPN-3603

Well, I think my resilience tests would show more "problematic situations" that are now fixed in 6.0 - looking for this particular issue would be pretty complicated. Unless you want all fixes to be backported to 5.2.x. I don't have any specific test-case, just stress the cluster until it gets broken and then spend quite a long time digging into the trace logs for each instance of some problem.

Comment 8 JBoss JIRA Server 2013-10-09 13:40:35 UTC

Dan Berindei <dberinde> made a comment on jira ISPN-3603

[~rvansa], the primary owner in the write CH is always the primary owner in the current CH as well, so it must already have the entry.

Comment 9 JBoss JIRA Server 2013-10-10 06:23:38 UTC

Radim Vansa <rvansa> made a comment on jira ISPN-3603

[~dan.berindei]: If the primary owner crashes, is primary ownership always transfered to one of old backups, or can it be transfered to a new node?

Comment 11 Tristan Tarrant 2013-10-18 11:19:20 UTC

Fixed in ER3.1

Note You need to log in before you can comment on or make changes to this bug.