Description of problem:

EAP 6.2.0.ER3, dist cache only, EJB remote invocations scenario. We're getting stale session data in failover tests; it looks like this:

2013/09/26 08:31:10:031 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - not-supported: Sessions: 2000, active: 0, samples: 0, throughput: 0.0 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 0 ms, max: 0 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 0 (0%)
2013/09/26 08:31:15:749 EDT [WARN ][Runner - 424] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Synced next serial to 41 from current 41.
2013/09/26 08:31:15:749 EDT [WARN ][Runner - 424] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data: <org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 41, received: 40, runner: 424.>
org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 41, received: 40, runner: 424.
	at org.jboss.smartfrog.clustering.ejb3.StatefulSBProcessorFactoryImpl$EJB3RequestProcessor.processRequest(StatefulSBProcessorFactoryImpl.java:114)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)
2013/09/26 08:31:15:750 EDT [WARN ][Runner - 424] SFCORE_LOG - Error sampling data: <org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 41, received: 40, runner: 424.>
org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 41, received: 40, runner: 424.
	at org.jboss.smartfrog.clustering.ejb3.StatefulSBProcessorFactoryImpl$EJB3RequestProcessor.processRequest(StatefulSBProcessorFactoryImpl.java:114)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)

These errors seem to come a long time after a failover occurred (~50 seconds), and the received data is always older by 1.

Some links (look for perf17):

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-jvmkill-dist-async/30
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-jvmkill-dist-sync/50
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-shutdown-dist-async/25
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-shutdown-dist-sync/28
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-shutdown-dist-sync-3owners/20/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-shutdown-dist-sync-tcpStack/22/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/eap-6x-failover-ejb-ejbremote-undeploy-dist-async/24
Still an issue, moving to 6.4.
I am reproducing the issue regularly with dist cache, 3+ nodes. The failure pattern is always:

1) update a SFSB
2) a node is shut down
3) update the SFSB again
4) the node is restarted
5) the SFSB is updated on an old node
6) the SFSB is updated on a new node, but the previous update is lost

Notes:
* the failure never happens with repl cache
* there are no session updates while a node is stopping/rejoining the cluster

Test log fragment:

10:22:52,783 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-8) Counter 5 incremented to 38 on node node-1
10:22:52,989 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-8) Counter 5 incremented to 39 on node node-1
10:22:53,197 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-8) Counter 5 incremented to 40 on node node-1
10:22:53,381 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (main) Starting container 0.
10:23:01,106 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (main) Container 0 started.
10:23:06,229 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-64) Counter 5 incremented to 41 on node node-1
10:23:06,811 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-64) Counter 5 incremented to 41 on node node-0
10:23:07,486 ERROR [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-64) Unexpected count from remote counter 5. Expected: 42 received: 41 from node: node-0.
10:23:07,560 INFO [org.jboss.as.test.clustering.cluster.ejb3.stateful.remote.failover.RemoteEJBClientStatefulBeanFailoverStressTestCase] (pool-1-thread-64) Counter 5 incremented to 43 on node node-0

Testcase: https://github.com/dpospisil/jboss-eap/blob/10192e7ef39c2bd1acea438a3073094b7ebc1830/testsuite/integration/clust/src/test/java/org/jboss/as/test/clustering/cluster/ejb3/stateful/remote/failover/RemoteEJBClientStatefulBeanFailoverStressTestCase.java
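For readers without access to the testcase, a minimal sketch of the client-side loop it performs (bean/module names and the remote view are illustrative assumptions, not the real testcase code):

// Hypothetical, simplified client-side loop mirroring the reproduction steps above.
// Names are illustrative; the real logic lives in the testcase linked above.
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;

public class CounterFailoverClient {

    // Illustrative remote view of the stateful counter bean.
    public interface Counter {
        int increment();
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(Context.URL_PKG_PREFIXES, "org.jboss.ejb.client.naming");
        Context ctx = new InitialContext(props);

        // Stateful bean proxy over the JBoss EJB client API (app/module names assumed).
        Counter counter = (Counter) ctx.lookup(
                "ejb:/failover-test//CounterBean!" + Counter.class.getName() + "?stateful");

        int expected = 0;
        while (true) {
            // Nodes are shut down and restarted externally while this loop runs.
            int actual = counter.increment();
            expected++;
            if (actual != expected) {
                // The failure mode from the log: the counter regressed by one.
                throw new IllegalStateException(
                        "Stale session data: expected " + expected + ", got " + actual);
            }
        }
    }
}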
Seems that I have finally found the root cause of the issue. org.jboss.as.ejb3.cache.spi.BackingCache seems to rely on the underlying cache.get() returning the same object when called multiple times. Even BackingCache.release(K key) has to do a get first, as it takes only the key. However, in the case of a distributed Infinispan cache used as a store via InfinispanBackingCacheEntryStore, this does not hold when the call happens on a node which does not own the specific data block: Infinispan performs a remote get and receives a fresh copy from the remote node instead of the previously returned instance. In such a case, BackingCache.release() fails because it first fetches a stale copy from the cache and then writes it back.
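To make the suspected pattern concrete, a rough sketch (type and method names are illustrative only, not the actual InfinispanBackingCacheEntryStore sources):

// Rough sketch of the suspected get-then-put pattern in release(); illustrative
// names, not the actual EAP 6 code.
import org.infinispan.Cache;

class BackingStoreSketch<K, V> {

    private final Cache<K, V> cache;   // distributed (dist) Infinispan cache

    BackingStoreSketch(Cache<K, V> cache) {
        this.cache = cache;
    }

    void release(K key) {
        // On a node that does not own the key, get() triggers a remote fetch. Because
        // this read runs outside the batch of the initial get(), it is not guaranteed
        // to return the same object - it can be an older copy of the entry.
        V entry = cache.get(key);

        // Writing that copy back can silently overwrite a newer update - matching the
        // "always older by 1" stale session data seen in the failover tests.
        cache.put(key, entry);
    }
}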
@dpospisi I don't think your assumption is correct. In order to support REPEATABLE_READ semantics within a batch, Cache.get() *must* retain a reference to the data that was initially fetched remotely. However, maybe the issue is with the transaction boundaries being in the wrong place, i.e. the initial Cache.get() must reside within the context of the same batch used in release().
After refreshing my memory with the old implementation, I recall that the initial cache.get() and the subsequent get()/put() in release() happen in separate batches. These should really happen within the same batch - which is how the new implementation (in EAP7) works. To work around this in EAP6, it should be sufficient to enable the L1 cache (by setting a non-zero l1-lifespan).
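For illustration, the shape of the same-batch fix looks roughly like this (a sketch assuming Infinispan's invocation batching and REPEATABLE_READ are enabled on the cache; this is not the actual EAP7 implementation):

// Sketch of the fixed shape: the initial read and the write-back share one batch, so
// REPEATABLE_READ guarantees release() sees the same entry the invocation worked on.
// Assumes invocationBatching is enabled; not the actual EAP7 sources.
import org.infinispan.Cache;

class BatchedBackingStoreSketch<K, V> {

    private final Cache<K, V> cache;

    BatchedBackingStoreSketch(Cache<K, V> cache) {
        this.cache = cache;
    }

    void invokeAndRelease(K key) {
        boolean success = false;
        cache.startBatch();
        try {
            // With REPEATABLE_READ, this read is pinned for the rest of the batch.
            V entry = cache.get(key);

            // ... the bean invocation mutates 'entry' here ...

            // The write-back operates on the pinned entry, not a fresh remote copy.
            cache.put(key, entry);
            success = true;
        } finally {
            cache.endBatch(success);   // commit on success, roll back otherwise
        }
    }
}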
Yes, that was what I tried as well. However, there seems to be another issue on top of this one which prevents enabling the L1 cache: setting the l1-lifespan attribute to a non-zero value just sets the underlying cache config parameter, but does not explicitly enable L1, which is disabled by default. I'll double-check and create a separate BZ in that case. A quick fix shows that enabling L1 does indeed help and could be used as a workaround for this issue.
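For reference, the distinction at the Infinispan level looks roughly like this (fluent API of the Infinispan 5.x line shipped with EAP 6; the lifespan value is an arbitrary example):

// Illustrative sketch: a non-zero L1 lifespan alone (what l1-lifespan sets) is not
// enough per the comment above - L1 must also be enabled explicitly.
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class L1ConfigSketch {

    public static Configuration distWithL1Enabled() {
        return new ConfigurationBuilder()
                .clustering()
                    .cacheMode(CacheMode.DIST_SYNC)
                    .l1()
                        .enable()            // explicit enable; L1 is off by default
                        .lifespan(600000L)   // non-zero L1 lifespan in ms (example value)
                .build();
    }
}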
I have created BZ1218148. If there are no objections, I will close this one as WONTFIX with a known workaround once it is resolved.
pm_ack- Triage 2017-Jan-30 -> close due to lack of support case.