Description of problem: A versioned update with a multi-threaded Hot Rod client results in inconsistency. Some replaceWithVersion calls return true, ignoring a version update executed in another thread. This happens with a replicated cache and doesn't happen with distributed and local caches. Version-Release number of selected component (if applicable): This happens in JDG 6.3.0 ER7 too. The customer wants the fix before the JDG 6.3 release. How reproducible: Happens rarely, under heavy concurrency. Steps to Reproduce: I'll attach a test client.
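For reference, the contract at stake: replaceWithVersion must return true only when the entry's version still matches the one the caller read, so two callers holding the same version can never both succeed. Below is a minimal local model of that contract, for illustration only; it is a plain CAS analogue, not Infinispan code, and all class and field names are made up.

```java
import java.util.concurrent.atomic.AtomicReference;

// Local model of a versioned entry and the replaceWithVersion contract
// (a sketch for illustration; not the Infinispan implementation).
public class VersionedReplaceModel {
    static final class Entry {
        final long value, version;
        Entry(long value, long version) { this.value = value; this.version = version; }
    }

    final AtomicReference<Entry> ref = new AtomicReference<>(new Entry(0L, 1L));

    Entry getVersioned() { return ref.get(); }

    // True only if no concurrent update happened since expectedVersion was read.
    boolean replaceWithVersion(long newValue, long expectedVersion) {
        Entry cur = ref.get();
        return cur.version == expectedVersion
                && ref.compareAndSet(cur, new Entry(newValue, cur.version + 1));
    }

    public static void main(String[] args) {
        VersionedReplaceModel cache = new VersionedReplaceModel();
        Entry a = cache.getVersioned();   // both "threads" read version 1
        Entry b = cache.getVersioned();
        boolean first = cache.replaceWithVersion(a.value + 1, a.version);
        boolean second = cache.replaceWithVersion(b.value + 1, b.version);
        // The bug reported here is that, under load, both calls can return true.
        System.out.println("first=" + first + " second=" + second); // first=true second=false
    }
}
```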
Similar issues in community: https://issues.jboss.org/browse/ISPN-4286 https://issues.jboss.org/browse/ISPN-2956 Would be good to have some logs to narrow down the problem.
When the cache is non-transactional and there are topology changes, conditional operations (e.g. ReplaceIfUnmodified) cannot work reliably. Please refer to the discussion here [1] and here [2] for more details. Adding to that, for non-transactional caches it is not technically possible to achieve this consistency requirement either, as it boils down to updating multiple nodes with the same value - this can only be achieved consistently with 2PC and transactions. If there's a topology change happening, then the user should consider using a transactional cache on the server, as suggested in [1]. If there are no topology changes, this might be just a bug and we should be able to respond more once we get the logs. [1] http://lists.jboss.org/pipermail/infinispan-dev/2014-June/015079.html [2] https://issues.jboss.org/browse/ISPN-2956
Created attachment 910646 [details] test-replace-with-version.zip
REPRODUCING STEPS

1. Configure a 2 node cluster with the following cache, "distcache". Let's focus on JDG 6.3.0.ER7. The same happens with 6.2.1.GA and 6.1.0.GA.

~~~
<distributed-cache name="distcache" mode="SYNC" segments="160" owners="2"
                   start="EAGER" remote-timeout="180000" l1-lifespan="0">
    <locking isolation="READ_COMMITTED" acquire-timeout="60000"/>
    <state-transfer timeout="1800000"/>
</distributed-cache>
~~~

2. Unzip the attachment, test-replace-with-version.zip, then build and run the test client.

~~~
mvn clean compile dependency:copy-dependencies
java -ea -cp 'target/dependency/*:target/classes' \
    -Djava.util.logging.manager=org.jboss.logmanager.LogManager \
    -Dlogging.configuration=file:logging.properties \
    com.example.PutFromNull 'node1:11222;node2:11222' distcache HOGEKEY 40 500
~~~

This increments a counter with key "HOGEKEY" up to 500 in each of 40 threads, 20000 in total. Please find the source code in src/main/java/com/example/PutFromNull.java. Its incrementing part is like this:

~~~
private void incrementCache() throws Exception {
    while (true) {
        VersionedValue<Long> versioned = cache.getVersioned(key);
        if (versioned == null) {
            if (cache.withFlags(Flag.FORCE_RETURN_VALUE).putIfAbsent(key, 1L) == null) {
                logger.info("I'm the first putter.");
                return;
            }
        } else {
            long val = versioned.getValue() + 1;
            if (cache.replaceWithVersion(key, val, versioned.getVersion())) {
                // Optimistically updated, no more retries.
                return;
            }
        }
    }
}
~~~

Actual Result:

~~~
15:37:44,052 INFO  [org.infinispan.client.hotrod.impl.protocol.Codec20] (main) ISPN004006: espresso/10.64.200.6:11322 sent new topology view (id=2) containing 2 addresses: [/10.64.200.6:11322, /10.64.200.6:11422]
15:37:44,130 INFO  [org.infinispan.client.hotrod.RemoteCacheManager] (main) ISPN004021: Infinispan version: 6.1.0.ER7-redhat-1
15:37:44,160 INFO  [com.example.PutFromNull] (main) HOGEKEY is null now.
15:37:44,161 INFO  [com.example.PutFromNull] (pool-1-thread-2) Start to count.
...
15:37:44,166 INFO  [com.example.PutFromNull] (pool-1-thread-4) I'm the first putter.
...
15:37:58,997 INFO  [com.example.PutFromNull] (pool-1-thread-6) Counted 500
15:37:58,999 INFO  [com.example.PutFromNull] (main) Expected: 20000, Actual: 19893 in 14.837 secs.
Exception in thread "main" java.lang.AssertionError
        at com.example.PutFromNull.main(PutFromNull.java:89)
~~~

Expected Result: Expected and Actual are the same, 20000.

Additional Info: There is another program, com.example.PutFromZero, which starts from a state where the key already exists. Its usage is the same as PutFromNull's and its result is similar. The only difference is the incrementing part, as follows:

~~~
private void incrementCache() throws Exception {
    while (true) {
        VersionedValue<Long> versioned = cache.getVersioned(key);
        long val = versioned.getValue() + 1;
        if (cache.replaceWithVersion(key, val, versioned.getVersion())) {
            // Optimistically updated, no more retries.
            return;
        }
    }
}
~~~
The patch might be a bit heavy-handed and would most likely affect performance badly. Thanks Takayoshi for the effort. I've downloaded the test (thanks Osamu for creating it!) and replicated the issue with the latest released version (7.0 alpha4). I'm working on trying to find a better solution.
We've opened ISPN-4424, which is up to date with the problem analysis.
I've found a less disruptive solution for this issue, and the branch containing the fix can be found at https://github.com/galderz/infinispan/tree/t_4424. @Takayoshi, @Osamu, it'd be good for you to give it a go. One important note: for the fix to work, the cache must be configured to be transactional with pessimistic locking. I've started a community development thread with the inner details of the fix, its implications and future improvements: http://lists.jboss.org/pipermail/infinispan-dev/2014-June/015109.html
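For clarity, a server cache configured as the fix requires (transactional with pessimistic locking) would look roughly like this; this is a sketch, and the cache name and attribute values here are illustrative, not prescriptive:

~~~
<distributed-cache name="distcache" mode="SYNC" owners="2" start="EAGER">
    <locking isolation="READ_COMMITTED" acquire-timeout="60000"/>
    <transaction mode="NON_XA" locking="PESSIMISTIC"/>
</distributed-cache>
~~~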
Btw, @Takayoshi, @Osamu, what do you think if I add the concurrency test you created to the Java Hot Rod client testsuite? It'd be great to have such a test in our testsuite. Is that OK?
(In reply to Galder Zamarreño from comment #18) > One important note, for the fix to work, the cache must be configured to > be transactional and pessimistic locking. I just want to point out that this has larger implications, as we currently don't support configuring the cache for anything but <transaction mode="NONE" />. (See the JDG 6.3 Administration and Configuration Guide section 26.2.2). So this will require doc updates, and adding full QA testing for transactional JDG server caches.
Osamu, Takayoshi: would be great if you could validate the stop gap solution we're considering: https://github.com/galderz/infinispan/tree/t_4424
(In reply to Galder Zamarreño from comment #19) > Btw, @Takayoshi, @Osamu, what do you think if I add the concurrency test you > created to the Java Hot Rod client testsuite? It'd be great to have such a > test in our testsuite. Is that Ok? Yes, of course, that would be great. I'm building the t_4424 branch now.
We still see a gap in the t_4424 build with PESSIMISTIC tx. The client is 40 threads and 10000 loops per thread, so 400k replaceWithVersion tries in total. Looks like the success rate almost doubled with the write lock.

* ER7, "default" dist
  value=42234 replace_count=42251 Failed
* ER7, "default" dist, tx=NON_XA PESSIMISTIC
  value=42918 replace_count=42948 Failed
* ER7 synchronized patch, "default" dist
  value=43203 replace_count=43203 Success
* 7.0.0-SNAPSHOT-t_4424, "default" dist
  value=43968 replace_count=43982 Failed
* 7.0.0-SNAPSHOT-t_4424, tx=NON_XA PESSIMISTIC
  value=72691 replace_count=72692 Failed
  value=72521 replace_count=72522 Failed
  value=71952 replace_count=71952 Success
  value=72704 replace_count=72705 Failed
  value=73810 replace_count=73812 Failed
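The value/replace_count pairs above compare the final counter value with the number of replaceWithVersion calls that returned true; with correct conditional-replace semantics the two must be equal. The following is a local sketch of that invariant using a plain CAS analogue, for illustration only; it is not Hot Rod client code, and the class and method names are made up.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Local sketch of the value/replace_count invariant checked by the test client:
// every successful versioned replace must account for exactly one increment.
public class ReplaceCountInvariant {
    static final class Entry {
        final long value, version;
        Entry(long value, long version) { this.value = value; this.version = version; }
    }

    // Runs `threads` workers, each incrementing the counter `loops` times via an
    // optimistic retry loop; returns {final value, successful replace count}.
    static long[] run(int threads, int loops) throws InterruptedException {
        AtomicReference<Entry> entry = new AtomicReference<>(new Entry(0L, 0L));
        AtomicLong replaceCount = new AtomicLong();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < loops; j++) {
                    while (true) { // retry until our replace wins
                        Entry e = entry.get();
                        if (entry.compareAndSet(e, new Entry(e.value + 1, e.version + 1))) {
                            replaceCount.incrementAndGet();
                            break;
                        }
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return new long[] { entry.get().value, replaceCount.get() };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] r = run(40, 100);
        System.out.println("value=" + r[0] + " replace_count=" + r[1]
                + (r[0] == r[1] ? " Success" : " Failed"));
    }
}
```

With a correct CAS, value and replace_count always match; the Failed rows above show the server-side replace reporting more successes than increments that actually stuck.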
Below is the config for the "default" and "pess" caches I tested, in clustered.xml. The "default" is the default one, and "pess" is configured with a NON_XA PESSIMISTIC <transaction>.

~~~
<distributed-cache name="default" mode="SYNC" segments="20" owners="2"
                   remote-timeout="30000" start="EAGER">
    <locking isolation="READ_COMMITTED" acquire-timeout="30000"
             concurrency-level="1000" striping="false"/>
    <transaction mode="NONE"/>
</distributed-cache>
<distributed-cache name="pess" mode="SYNC" segments="20" owners="2"
                   remote-timeout="30000" start="EAGER">
    <locking isolation="READ_COMMITTED" acquire-timeout="30000"
             concurrency-level="1000" striping="false"/>
    <transaction mode="NON_XA" locking="PESSIMISTIC"/>
</distributed-cache>
~~~
I've also tested on a different machine against the "t_4424" build and the "pess" cache (as in comment #25). test-replace-with-version.zip (@Galder, of course you can use the test program) works well in most cases, but Takayoshi reproduced the issue with it. I'll also try to reproduce it and will collect a trace log. concurrent-replace.rb fails easily, as reported in comment #24.
I've built a brand new fix, partially based on the attached diff but addressing the actual issue in a more fine-grained fashion (thx Pedro!). Essentially, the problem comes down to incorrect publishing of the cached entry via the `getCacheEntry` method. In the new solution, we've synchronized the updating of a cached entry and its retrieval. This solution does not force transactions. The result can be found here: https://github.com/galderz/infinispan/tree/t_4424v2 Alongside the fix, I've added a unit test that passes in my local environment. I've also gone and tried the fix in the virtual environment that Takayoshi created (thx for building that up!) and it runs fine there too:

~~~
[root@jdgtest bz1110647]# ./run-test-client.sh
Using cache:
value=129924 replace_count=129924 Success
[root@jdgtest bz1110647]# ./run-test-client.sh
Using cache:
value=99403 replace_count=99403 Success
[root@jdgtest bz1110647]# ./run-test-client.sh
Using cache:
value=126387 replace_count=126387 Success
[root@jdgtest bz1110647]# ./run-test-client.sh
Using cache:
value=124795 replace_count=124795 Success
~~~

The server with the new patch can be found in the ~/opt/4424v2 folder. If there are no further issues from your side, I'll send a PR for this fix.
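The idea behind the v2 fix, as described above, is that the update of a cached entry and its retrieval must be synchronized so a reader never observes a value paired with a stale version. Below is a rough sketch of that publishing discipline using a hypothetical holder class; it is not the actual Infinispan code, and the class and method names are invented for illustration.

```java
// Hypothetical holder illustrating atomic publication of a (value, version)
// pair: readers can never see a new value together with an old version.
public class SynchronizedEntryHolder {
    private long value;
    private long version;

    // Value and version change together under the same lock...
    public synchronized void update(long newValue, long newVersion) {
        this.value = newValue;
        this.version = newVersion;
    }

    // ...and are read together under that lock, so the pair is always consistent.
    public synchronized long[] getCacheEntry() {
        return new long[] { value, version };
    }
}
```

Without this pairing, a reader can fetch a freshly written value alongside the previous version, which is exactly the window that lets a stale replaceWithVersion succeed.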
William Burns <wburns> updated the status of jira ISPN-4424 to Resolved
4424v2 is working, but no success with CR1.

* Upstream 4424v2, default cache
  Ran the test 5 times; all success.
* JDG630CR1, default cache
  value=56177 replace_count=56202 Failed
  value=53952 replace_count=53979 Failed
* JDG630CR1, pess cache
  Using cache: pess
  value=117515 replace_count=117564 Failed
  Using cache: pess
  value=105454 replace_count=105495 Failed
@Osamu, for these tests, did you set the org.infinispan log level to ERROR?
@Galder, no, it was INFO as usual. Currently, all ReplaceIfUnmodifiedRequest users of the default cache leave tons of log messages in server.log. I think this behaviour needs to change.
Executed ReplaceWithVersionConcurrencyTest plus all tests from the client/hotrod-client module to verify this. All tests passed.