Bug 1110647 - ReplaceIfUnmodified is broken under concurrent execution
Summary: ReplaceIfUnmodified is broken under concurrent execution
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: CR2
Target Release: 6.3.0
Assignee: Galder Zamarreño
QA Contact: Martin Gencur
URL:
Whiteboard:
Depends On:
Blocks: 1113299 1104639
 
Reported: 2014-06-18 07:03 UTC by Osamu Nagano
Modified: 2018-12-06 16:54 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1113299
Environment:
Last Closed: 2015-01-26 14:05:41 UTC
Type: Bug
Embargoed:


Attachments
test-replace-with-version.zip (7.72 KB, application/zip)
2014-06-20 06:58 UTC, Osamu Nagano


Links
Red Hat Issue Tracker ISPN-4424 (Major, Resolved): getCacheEntry is not safe. Last Updated: 2018-08-29 01:11:25 UTC

Description Osamu Nagano 2014-06-18 07:03:55 UTC
Description of problem:
Versioned updates with a multi-threaded Hot Rod client result in inconsistency.  Some replaceWithVersion calls return true, ignoring a version update executed in another thread.  This happens with a replicated cache and doesn't happen with distributed and local caches.


Version-Release number of selected component (if applicable):
This happens in JDG 6.3.0 ER7 too.
The customer wants the fix before the JDG 6.3 release.


How reproducible:
Happens rarely, under heavy concurrency.


Steps to Reproduce:
I'll attach a test client.

Comment 3 Mircea Markus 2014-06-18 14:41:25 UTC
Similar issues in community:
https://issues.jboss.org/browse/ISPN-4286
https://issues.jboss.org/browse/ISPN-2956
Would be good to have some logs to narrow down the problem.

Comment 4 Mircea Markus 2014-06-18 17:11:26 UTC
When the cache is non-transactional and there are topology changes, conditional operations (e.g. ReplaceIfUnmodified) cannot work reliably. Please refer to the discussions here [1] and here [2] for more details.
Adding to that, for non-transactional caches it is not technically possible to achieve this consistency requirement either, as it boils down to updating multiple nodes with the same value; this can only be achieved consistently with 2PC and transactions.
If a topology change is happening, the user should consider using a transactional cache on the server, as suggested in [1]. If there are no topology changes, this might just be a bug, and we should be able to respond further once we get the logs.

[1] http://lists.jboss.org/pipermail/infinispan-dev/2014-June/015079.html
[2] https://issues.jboss.org/browse/ISPN-2956
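
For illustration only, a minimal sketch of what "transactional cache with pessimistic locking" means in configuration terms, using the embedded Infinispan ConfigurationBuilder API (the programmatic form is my assumption; on a JDG server the same thing would be expressed in the cache's XML <transaction> element):

~~~
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.transaction.LockingMode;
import org.infinispan.transaction.TransactionMode;

public class TxCacheConfigSketch {
    // Programmatic equivalent of a transactional, pessimistically locked
    // distributed cache. With pessimistic locking, a conditional write
    // holds the key lock on the primary owner, so the version check and
    // the update happen as one unit across owners.
    public static Configuration pessimisticDistCache() {
        return new ConfigurationBuilder()
                .clustering().cacheMode(CacheMode.DIST_SYNC)
                .transaction()
                    .transactionMode(TransactionMode.TRANSACTIONAL)
                    .lockingMode(LockingMode.PESSIMISTIC)
                .build();
    }
}
~~~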

Comment 9 Osamu Nagano 2014-06-20 06:58:22 UTC
Created attachment 910646 [details]
test-replace-with-version.zip

Comment 10 Osamu Nagano 2014-06-20 06:59:56 UTC
REPRODUCING STEP

1. Configure a 2-node cluster with the following cache, "distcache".
Let's focus on JDG 6.3.0.ER7.  The same happens with 6.2.1.GA and 6.1.0.GA.

~~~
  <distributed-cache name="distcache" mode="SYNC" segments="160" owners="2" start="EAGER" remote-timeout="180000" l1-lifespan="0">
      <locking isolation="READ_COMMITTED" acquire-timeout="60000"/>
      <state-transfer timeout="1800000"/>
  </distributed-cache>
~~~

2. Unzip the attachment, test-replace-with-version.zip, then build and run the test client.

~~~
mvn clean compile dependency:copy-dependencies
java -ea -cp 'target/dependency/*:target/classes' \
        -Djava.util.logging.manager=org.jboss.logmanager.LogManager \
        -Dlogging.configuration=file:logging.properties \
        com.example.PutFromNull 'node1:11222;node2:11222' distcache HOGEKEY 40 500
~~~

This increments a counter with key "HOGEKEY" 500 times in each of 40 threads, 20000 increments in total.  Please see the source code in src/main/java/com/example/PutFromNull.java; its incrementing part looks like this.

~~~
    private void incrementCache() throws Exception {
        while (true) {
            VersionedValue<Long> versioned = cache.getVersioned(key);
            if (versioned == null) {
                // Key does not exist yet; try to create it atomically.
                if (cache.withFlags(Flag.FORCE_RETURN_VALUE).putIfAbsent(key, 1L) == null) {
                    logger.info("I'm the first putter.");
                    return;
                }
                // Another thread created the key first; retry.
            } else {
                long val = versioned.getValue() + 1;
                // Conditional replace: succeeds only if the version is unchanged.
                if (cache.replaceWithVersion(key, val, versioned.getVersion())) {
                    // Optimistically updated, no more retries.
                    return;
                }
                // Version changed under us; retry with a fresh read.
            }
        }
    }
~~~
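
As an aside, this retry loop is only correct if replaceWithVersion behaves as an atomic compare-and-swap on the entry's version. A minimal single-JVM analogue of that contract (an illustration under that assumption, not Infinispan code):

~~~
import java.util.concurrent.atomic.AtomicReference;

// Single-JVM analogue of the replaceWithVersion contract: the version
// check and the value swap must happen as one atomic step, otherwise
// two concurrent writers can both "win" against the same version.
final class VersionedBox<V> {
    record Versioned<V>(V value, long version) {}

    private final AtomicReference<Versioned<V>> ref;

    VersionedBox(V initial) {
        this.ref = new AtomicReference<>(new Versioned<>(initial, 1L));
    }

    Versioned<V> getVersioned() {
        return ref.get();
    }

    boolean replaceWithVersion(V newValue, long expectedVersion) {
        Versioned<V> current = ref.get();
        if (current.version() != expectedVersion) {
            return false; // someone updated since our read
        }
        // compareAndSet rules out any interleaved write between the
        // version check above and the swap here.
        return ref.compareAndSet(current, new Versioned<>(newValue, expectedVersion + 1));
    }
}
~~~

The loop in PutFromNull drives exactly this pattern over Hot Rod; the bug report is about the server-side equivalent of this atomicity being violated.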


Actual Result:
~~~
15:37:44,052 INFO  [org.infinispan.client.hotrod.impl.protocol.Codec20] (main) ISPN004006: espresso/10.64.200.6:11322 sent new topology view (id=2) containing 2 addresses: [/10.64.200.6:11322, /10.64.200.6:11422]
15:37:44,130 INFO  [org.infinispan.client.hotrod.RemoteCacheManager] (main) ISPN004021: Infinispan version: 6.1.0.ER7-redhat-1
15:37:44,160 INFO  [com.example.PutFromNull] (main) HOGEKEY is null now.
15:37:44,161 INFO  [com.example.PutFromNull] (pool-1-thread-2) Start to count.
...
15:37:44,166 INFO  [com.example.PutFromNull] (pool-1-thread-4) I'm the first putter.
...
15:37:58,997 INFO  [com.example.PutFromNull] (pool-1-thread-6) Counted 500
15:37:58,999 INFO  [com.example.PutFromNull] (main) Expected: 20000, Actual: 19893 in 14.837 secs.
Exception in thread "main" java.lang.AssertionError
        at com.example.PutFromNull.main(PutFromNull.java:89)
~~~


Expected Result:
Expected and Actual are the same, 20000.


Additional Info:
There is another program, com.example.PutFromZero, which starts from a state where the key already exists.  Its usage is the same as PutFromNull and its result is similar.  The only difference is the incrementing part, as follows.

~~~
    private void incrementCache() throws Exception {
        while (true) {
            VersionedValue<Long> versioned = cache.getVersioned(key);
            long val = versioned.getValue() + 1;
            if (cache.replaceWithVersion(key, val, versioned.getVersion())) {
                // Optimistically updated, no more retries.
                return;
            }
        }
    }
~~~

Comment 15 Galder Zamarreño 2014-06-20 12:51:47 UTC
The patch might be a bit heavy-handed and would most likely degrade performance badly. Thanks, Takayoshi, for the effort.

I've downloaded the test (thanks Osamu for creating it!) and replicated the issue with the latest released version (7.0 alpha4). I'm working on finding a better solution.

Comment 16 Mircea Markus 2014-06-23 14:34:02 UTC
We've opened ISPN-4424, which is up to date with the problem analysis.

Comment 18 Galder Zamarreño 2014-06-24 14:35:15 UTC
I've found a less disruptive solution for this issue; the branch containing the fix can be found at https://github.com/galderz/infinispan/tree/t_4424.

@Takayoshi, @Osamu, it'd be good if you could give it a go. One important note: for the fix to work, the cache must be configured to be transactional with pessimistic locking.

I've started a community development thread with the inner details of the fix, the implications and future improvements: http://lists.jboss.org/pipermail/infinispan-dev/2014-June/015109.html

Comment 19 Galder Zamarreño 2014-06-24 14:45:12 UTC
Btw, @Takayoshi, @Osamu, what do you think if I add the concurrency test you created to the Java Hot Rod client testsuite? It'd be great to have such a test in our testsuite. Is that OK?

Comment 20 dereed 2014-06-24 16:03:55 UTC
(In reply to Galder Zamarreño from comment #18)
> One important note: for the fix to work, the cache must be configured to
> be transactional with pessimistic locking.

I just want to point out that this has larger implications, as we currently 
don't support configuring the cache for anything but <transaction mode="NONE" />.
(See the JDG 6.3 Administration and Configuration Guide section 26.2.2).

So this will require doc updates, and adding full QA testing for transactional JDG server caches.

Comment 22 Mircea Markus 2014-06-24 16:58:02 UTC
Osamu, Takayoshi: it would be great if you could validate the stop-gap solution we're considering: https://github.com/galderz/infinispan/tree/t_4424

Comment 23 Takayoshi Kimura 2014-06-25 00:29:02 UTC
(In reply to Galder Zamarreño from comment #19)
> Btw, @Takayoshi, @Osamu, what do you think if I add the concurrency test you
> created to the Java Hot Rod client testsuite? It'd be great to have such a
> test in our testsuite. Is that OK?

Yes of course, that would be great.

I'm building t_4424 branch now.

Comment 24 Takayoshi Kimura 2014-06-25 00:54:45 UTC
We still see a gap in the t_4424 build with PESSIMISTIC tx. The client runs 40 threads with 10000 loops per thread, so 400k replaceWithVersion attempts in total. (A run fails when replace_count, the number of replaceWithVersion calls that returned true, exceeds value, the final counter: some "successful" replaces were lost.)

It looks like the success rate almost doubles with the write lock.

* ER7, "default" dist

value=42234
replace_count=42251
Failed

* ER7, "default" dist, tx=NON_XA PESSIMISTIC

value=42918
replace_count=42948
Failed

* ER7 synchronized patch, "default" dist

value=43203
replace_count=43203
Success

* 7.0.0-SNAPSHOT-t_4424, "default" dist

value=43968
replace_count=43982
Failed

* 7.0.0-SNAPSHOT-t_4424, tx=NON_XA PESSIMISTIC

value=72691
replace_count=72692
Failed

value=72521
replace_count=72522
Failed

value=71952
replace_count=71952
Success

value=72704
replace_count=72705
Failed

value=73810
replace_count=73812
Failed

Comment 25 Takayoshi Kimura 2014-06-25 01:46:54 UTC
Below is the configuration for the "default" and "pess" caches I tested, in clustered.xml. The "default" cache is the stock one; "pess" configures <transaction> with NON_XA PESSIMISTIC.

<distributed-cache name="default" mode="SYNC" segments="20" owners="2" remote-timeout="30000" start="EAGER">
  <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
  <transaction mode="NONE"/>
</distributed-cache>
<distributed-cache name="pess" mode="SYNC" segments="20" owners="2" remote-timeout="30000" start="EAGER">
  <locking isolation="READ_COMMITTED" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
  <transaction mode="NON_XA" locking="PESSIMISTIC"/>
</distributed-cache>

Comment 28 Osamu Nagano 2014-06-25 09:02:33 UTC
I've also tested on a different machine against the "t_4424" build and the "pess" cache (as in comment #25).

test-replace-with-version.zip (@Galder, of course you can use the test program) works well in most cases, but Takayoshi reproduced the issue with it.  I'll also try to reproduce it and will collect a trace log.

concurrent-replace.rb fails easily, as reported in comment #24.

Comment 30 Galder Zamarreño 2014-06-25 11:52:16 UTC
I've built a brand new fix partially based on the attached diff, but addressing the actual issue in a more fine-grained fashion (thx Pedro!). Essentially, the problem comes down to incorrect publishing of the cached entry via the `getCacheEntry` method. In the new solution, we've synchronized the updating of a cached entry and its retrieval. This solution does not force transactions. The result can be found here: https://github.com/galderz/infinispan/tree/t_4424v2
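
To make the failure mode concrete, here is a hypothetical sketch (illustrative names, not the actual Infinispan internals): if the entry's value and version are written as two separate stores while getCacheEntry reads them unsynchronized, a reader can observe a fresh value paired with a stale version, and a later replaceWithVersion can then succeed against a version that has already been superseded. Synchronizing both the update and the retrieval prevents the torn read:

~~~
// Hypothetical sketch of the race described above; names are
// illustrative, not the real Infinispan internals.
final class EntryHolder {
    record Snapshot(Object value, long version) {}

    private Object value;
    private long version;

    EntryHolder(Object initialValue) {
        this.value = initialValue;
        this.version = 1L;
    }

    // The updater writes value and version as two separate stores.
    // Without mutual exclusion, getCacheEntry() could run between the
    // two stores and return a new value paired with the stale version.
    synchronized void update(Object newValue, long newVersion) {
        this.value = newValue;
        this.version = newVersion;
    }

    // Synchronized retrieval guarantees the (value, version) pair is
    // consistent, which is what the replaceWithVersion check relies on.
    synchronized Snapshot getCacheEntry() {
        return new Snapshot(value, version);
    }
}
~~~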

Alongside the fix, I've added a unit test that passes in my local environment. I've also gone and tried the fix in the virtual environment that Takayoshi created (thx for building that up!), and it runs fine there too:

[root@jdgtest bz1110647]# ./run-test-client.sh 
Using cache: 
value=129924
replace_count=129924
Success

[root@jdgtest bz1110647]# ./run-test-client.sh 
Using cache: 
value=99403
replace_count=99403
Success

[root@jdgtest bz1110647]# ./run-test-client.sh 
Using cache: 
value=126387
replace_count=126387
Success

[root@jdgtest bz1110647]# ./run-test-client.sh 
Using cache: 
value=124795
replace_count=124795
Success

The server with the new patch can be found in the ~/opt/4424v2 folder. If there are no further issues from your side, I'll send a PR for this fix.

Comment 31 JBoss JIRA Server 2014-06-25 20:06:37 UTC
William Burns <wburns> updated the status of jira ISPN-4424 to Resolved

Comment 32 Takayoshi Kimura 2014-06-26 00:50:32 UTC
The t_4424v2 build is working, but there is no success with CR1.

* Upstream t_4424v2, default cache

Ran the test 5 times; all succeeded.

* JDG630CR1, default cache

value=56177
replace_count=56202
Failed

value=53952
replace_count=53979
Failed

* JDG630CR1, pess cache

Using cache: pess
value=117515
replace_count=117564
Failed

Using cache: pess
value=105454
replace_count=105495
Failed

Comment 34 Galder Zamarreño 2014-06-26 07:34:11 UTC
@Osamu, for these tests, did you set the org.infinispan log level to ERROR?

Comment 35 Osamu Nagano 2014-06-26 07:54:13 UTC
@Galder, no, it was INFO as usual.  Currently, all ReplaceIfUnmodifiedRequest users against the default cache will leave tons of log messages in server.log.  I think this behaviour needs to change.

Comment 36 Martin Gencur 2014-07-08 10:46:59 UTC
Executed ReplaceWithVersionConcurrencyTest plus all tests from the client/hotrod-client module to verify this. Tests passed.

