Description of problem:

EAP 6.2.0.ER3, dist cache only, EJB remote invocations scenario. We're getting stale session data in failover tests; it looks like this:

2013/09/26 08:31:10:031 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - not-supported: Sessions: 2000, active: 0, samples: 0, throughput: 0.0 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 0 ms, max: 0 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 0 (0%)
2013/09/26 08:31:15:749 EDT [WARN ][Runner - 424] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Synced next serial to 41 from current 41.
2013/09/26 08:31:15:749 EDT [WARN ][Runner - 424] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data: <org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 41, received: 40, runner: 424.>
org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 41, received: 40, runner: 424.
	at org.jboss.smartfrog.clustering.ejb3.StatefulSBProcessorFactoryImpl$EJB3RequestProcessor.processRequest(StatefulSBProcessorFactoryImpl.java:114)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)
2013/09/26 08:31:15:750 EDT [WARN ][Runner - 424] SFCORE_LOG - Error sampling data: <org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 41, received: 40, runner: 424.>
org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 41, received: 40, runner: 424.
	at org.jboss.smartfrog.clustering.ejb3.StatefulSBProcessorFactoryImpl$EJB3RequestProcessor.processRequest(StatefulSBProcessorFactoryImpl.java:114)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)

These errors seem to come a long time after a failover occurred (~50 seconds), and the received data is always older by 1.

Some links (look for perf17):

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-jvmkill-dist-async/30
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-jvmkill-dist-sync/50
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-shutdown-dist-async/25
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-shutdown-dist-sync/28
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-shutdown-dist-sync-3owners/20/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-shutdown-dist-sync-tcpStack/22/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/eap-6x-failover-ejb-ejbremote-undeploy-dist-async/24
Still an issue, moving to 6.4.
I am reproducing the issue regularly with dist cache, 3+ nodes. The failure pattern is always:

1) update a SFSB
2) a node is shut down
3) update the SFSB again
4) the node is restarted
5) the SFSB is updated on an old node
6) the SFSB is updated on a new node, but the previous update is lost

Notes:
* the failure never happens with repl cache
* there are no session updates while a node is stopping/rejoining the cluster

Test log fragment:

10:22:52,783 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-8) Counter 5 incremented to 38 on node node-1
10:22:52,989 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-8) Counter 5 incremented to 39 on node node-1
10:22:53,197 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-8) Counter 5 incremented to 40 on node node-1
10:22:53,381 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (main) Starting container 0.
10:23:01,106 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (main) Container 0 started.
10:23:06,229 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-64) Counter 5 incremented to 41 on node node-1
10:23:06,811 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-64) Counter 5 incremented to 41 on node node-0
10:23:07,486 ERROR [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-64) Unexpected count from remote counter 5. Expected: 42 received: 41 from node: node-0.
10:23:07,560 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-64) Counter 5 incremented to 43 on node node-0

Testcase: https://github.com/dpospisil/jboss-eap/blob/10192e7ef39c2bd1acea438a3073094b7ebc1830/testsuite/integration/clust/src/test/java/org/jboss/as/test/clustering/cluster/ejb3/stateful/remote/failover/RemoteEJBClientStatefulBeanFailoverStressTestCase.java
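For readers without access to the testcase, a minimal sketch of the client-side loop it performs (bean/module names and the remote view are illustrative assumptions, not the real testcase code):

// Hypothetical, simplified client-side loop mirroring the reproduction steps above.
// Names are illustrative; the real logic lives in the testcase linked above.
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;

public class CounterFailoverClient {

    // Illustrative remote view of the stateful counter bean.
    public interface Counter {
        int increment();
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(Context.URL_PKG_PREFIXES, "org.jboss.ejb.client.naming");
        Context ctx = new InitialContext(props);

        // Stateful bean proxy over the JBoss EJB client API (app/module names assumed).
        Counter counter = (Counter) ctx.lookup(
                "ejb:/failover-test//CounterBean!" + Counter.class.getName() + "?stateful");

        int expected = 0;
        while (true) {
            // Nodes are shut down and restarted externally while this loop runs.
            int actual = counter.increment();
            expected++;
            if (actual != expected) {
                // The failure mode from the log: the counter regressed by one.
                throw new IllegalStateException(
                        "Stale session data: expected " + expected + ", got " + actual);
            }
        }
    }
}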
Seems that I have finally found the root cause of the issue. org.jboss.as.ejb3.cache.spi.BackingCache seems to rely on the underlying cache.get() returning the same object when called multiple times. Even BackingCache.release(K key) has to do a get first, as it takes only the key. However, in the case of a distributed Infinispan cache used as a store via InfinispanBackingCacheEntryStore, this does not hold when the call happens on a node which does not own the specific data block: Infinispan performs a remote get and receives a fresh copy from the remote node instead of the previously returned instance. In such a case, BackingCache.release() fails because it first fetches a stale copy from the cache and then writes it back.
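To make the suspected pattern concrete, a rough sketch (type and method names are illustrative only, not the actual InfinispanBackingCacheEntryStore sources):

// Rough sketch of the suspected get-then-put pattern in release(); illustrative
// names, not the actual EAP 6 code.
import org.infinispan.Cache;

class BackingStoreSketch<K, V> {

    private final Cache<K, V> cache;   // distributed (dist) Infinispan cache

    BackingStoreSketch(Cache<K, V> cache) {
        this.cache = cache;
    }

    void release(K key) {
        // On a node that does not own the key, get() triggers a remote fetch. Because
        // this read runs outside the batch of the initial get(), it is not guaranteed
        // to return the same object - it can be an older copy of the entry.
        V entry = cache.get(key);

        // Writing that copy back can silently overwrite a newer update - matching the
        // "always older by 1" stale session data seen in the failover tests.
        cache.put(key, entry);
    }
}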
@dpospisi I don't think your assumption is correct. In order to support REPEATABLE_READ semantics within a batch, Cache.get() *must* retain a reference to the data that was initially fetched remotely. However, maybe the issue is with the transaction boundaries being in the wrong place, i.e. the initial Cache.get() must reside within the context of the same batch used in release().
After refreshing my memory with the old implementation, I recall that the initial cache.get() and the subsequent get()/put() in release() happen in separate batches. These should really happen within the same batch - which is how the new implementation (in EAP7) works. To work around this in EAP6, it should be sufficient to enable the L1 cache (by setting a non-zero l1-lifespan).
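For illustration, the shape of the same-batch fix looks roughly like this (a sketch assuming Infinispan's invocation batching and REPEATABLE_READ are enabled on the cache; this is not the actual EAP7 implementation):

// Sketch of the fixed shape: the initial read and the write-back share one batch, so
// REPEATABLE_READ guarantees release() sees the same entry the invocation worked on.
// Assumes invocationBatching is enabled; not the actual EAP7 sources.
import org.infinispan.Cache;

class BatchedBackingStoreSketch<K, V> {

    private final Cache<K, V> cache;

    BatchedBackingStoreSketch(Cache<K, V> cache) {
        this.cache = cache;
    }

    void invokeAndRelease(K key) {
        boolean success = false;
        cache.startBatch();
        try {
            // With REPEATABLE_READ, this read is pinned for the rest of the batch.
            V entry = cache.get(key);

            // ... the bean invocation mutates 'entry' here ...

            // The write-back operates on the pinned entry, not a fresh remote copy.
            cache.put(key, entry);
            success = true;
        } finally {
            cache.endBatch(success);   // commit on success, roll back otherwise
        }
    }
}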
Yes, that was what I tried as well. However, there seems to be another issue on top of this one which prevents enabling the L1 cache: setting the l1-lifespan attribute to a non-zero value just sets the underlying cache config parameter, but does not explicitly enable L1, which is disabled by default. I'll double-check and create a separate BZ in that case. A quick fix shows that enabling L1 does indeed help and could be used as a workaround for this issue.
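For reference, the distinction at the Infinispan level looks roughly like this (fluent API of the Infinispan 5.x line shipped with EAP 6; the lifespan value is an arbitrary example):

// Illustrative sketch: a non-zero L1 lifespan alone (what l1-lifespan sets) is not
// enough per the comment above - L1 must also be enabled explicitly.
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class L1ConfigSketch {

    public static Configuration distWithL1Enabled() {
        return new ConfigurationBuilder()
                .clustering()
                    .cacheMode(CacheMode.DIST_SYNC)
                    .l1()
                        .enable()            // explicit enable; L1 is off by default
                        .lifespan(600000L)   // non-zero L1 lifespan in ms (example value)
                .build();
    }
}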
I have created BZ1218148. If there are no objections, I will close this one as WONTFIX with a known workaround once it is resolved.
pm_ack- Triage 2017-Jan-30 -> close due to lack of support case.