Bug 901164 (JBPAPP6-1281) - Servlet @Inject-ing SFSB timeouts/receives stale data/sessions are lost after failover
Summary: Servlet @Inject-ing SFSB timeouts/receives stale data/sessions are lost after failover
Keywords:
Status: CLOSED EOL
Alias: JBPAPP6-1281
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Clustering
Version: 6.0.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: EAP 6.4.0
Assignee: Paul Ferraro
QA Contact: Michal Vinkler
URL: http://jira.jboss.org/jira/browse/JBP...
Whiteboard:
Depends On: 959495 1149197
Blocks:
 
Reported: 2012-11-05 16:26 UTC by Richard Janík
Modified: 2019-08-19 12:49 UTC
CC: 15 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
CCWR from pferraro Cause: Consequence: Fix: Result:
Clone Of:
Environment:
Last Closed: 2019-08-19 12:49:21 UTC
Type: Bug
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker JBPAPP6-1281 0 Major Closed Stale session data received with ejb servlet using DIST on undeploy 2017-11-08 20:30:22 UTC

Description Richard Janík 2012-11-05 16:26:12 UTC
project_key: JBPAPP6

There were many RequestProcessingExceptions seen in EAP 6.0.1.ER3:
{code}

2012/10/25 11:48:02:723 EDT [WARN ][Runner - 1404] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data:  <org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 35, received 34, Runner: 1404>
        org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 35, received 34, Runner: 1404
	at org.jboss.smartfrog.loaddriver.http.AbstractSerialNumberValidatorFactoryImpl$SerialNumberValidator.processRequest(AbstractSerialNumberValidatorFactoryImpl.java:125)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)

2012/10/25 11:48:02:723 EDT [WARN ][Runner - 1404] SFCORE_LOG - Error sampling data:  <org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 35, received 34, Runner: 1404>
        org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 35, received 34, Runner: 1404
	at org.jboss.smartfrog.loaddriver.http.AbstractSerialNumberValidatorFactoryImpl$SerialNumberValidator.processRequest(AbstractSerialNumberValidatorFactoryImpl.java:125)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)
{code}

These errors occur on undeploy with the DIST cache mode, in both SYNC and ASYNC configurations, with the ejb servlet (not with remote invocations). After a node failure, some sessions appear not to be migrated successfully:

before failover (active sessions = 2000):
{code}
2012/10/25 11:45:51:745 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Total: Sessions: 2000, active: 2000, samples: 4967, throughput: 496.7 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 24 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 4967 (100%)
2012/10/25 11:45:51:745 EDT [DEBUG][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Updated totals: Sessions: 0, active: 20000, samples: 66706, throughput: 6,669.5 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 297 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 66706 (100%)
2012/10/25 11:45:51:745 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf18: Sessions: 2000, active: 500, samples: 1239, throughput: 123.9 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 17 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1239 (100%)
2012/10/25 11:45:51:745 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf19: Sessions: 2000, active: 500, samples: 1242, throughput: 124.2 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 24 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1242 (100%)
2012/10/25 11:45:51:746 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf20: Sessions: 2000, active: 500, samples: 1243, throughput: 124.3 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 17 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1243 (100%)
2012/10/25 11:45:51:746 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf21: Sessions: 2000, active: 500, samples: 1243, throughput: 124.3 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 15 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1243 (100%)
2012/10/25 11:46:01:729 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Failing node 0 (perf18)
{code}

after failover (active sessions = 1667):
{code}
2012/10/25 11:46:21:749 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Total: Sessions: 2000, active: 1667, samples: 5019, throughput: 501.8 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 25 ms, max: 692 ms, sampling errors: 829, unhealthy samples: 0, valid samples: 4190 (83%)
2012/10/25 11:46:21:750 EDT [DEBUG][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Updated totals: Sessions: 0, active: 25667, samples: 81725, throughput: 8,171.1 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 2 ms, max: 692 ms, sampling errors: 1162, unhealthy samples: 0, valid samples: 80563 (98%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - UNKNOWN: Sessions: 2000, active: 0, samples: 0, throughput: 0.0 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 0 ms, max: 0 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 0 (0%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf18: Sessions: 2000, active: 0, samples: 0, throughput: 0.0 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 0 ms, max: 0 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 0 (0%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf19: Sessions: 2000, active: 667, samples: 1679, throughput: 167.9 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 24 ms, max: 684 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1679 (100%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf20: Sessions: 2000, active: 500, samples: 1256, throughput: 125.6 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 26 ms, max: 688 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1256 (100%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf21: Sessions: 2000, active: 500, samples: 1255, throughput: 125.5 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 25 ms, max: 692 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1255 (100%)
{code}

The number later rises to 1833 (in this case), and more sessions may be recovered when another node fails and the load is redistributed. Runners whose sessions were owned by the failing node detect the failover successfully, but from then on they repeatedly receive stale session data.

Valid samples constitute 85-95% of all samples taken, as you can see in failover.txt (linked).
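
For context, here is a minimal sketch of the serial-number check behind the "Stale session data received" errors above. The class and method names are illustrative, not the actual load-driver code (the real logic lives in AbstractSerialNumberValidatorFactoryImpl, visible in the stack traces):
{code}
// Hedged sketch of the load driver's per-session serial check (illustrative only).
public final class SerialNumberCheck {

    /**
     * @param expected the client's own request count for this session
     * @param received the counter value returned by the SFSB behind the ejb servlet
     */
    public static void validate(int expected, int received) {
        if (received != expected) {
            // After failover this fires repeatedly for sessions whose new owner holds
            // an older copy of the state, e.g. "Expected 35, received 34".
            throw new IllegalStateException("Stale session data received. Expected "
                    + expected + ", received " + received);
        }
    }
}
{code}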

builds (failover.txt artifact is here):
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-ejb-ejbservlet-undeploy-dist-async/7/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-ejb-ejbservlet-undeploy-dist-sync/7/

perf17 (client side, with exceptions):
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-ejb-ejbservlet-undeploy-dist-async/7/console-perf17/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-ejb-ejbservlet-undeploy-dist-sync/7/console-perf17/

Possibly linked to JBPAPP-9086?

Comment 1 Anne-Louise Tangring 2012-11-13 20:53:14 UTC
Docs QE Status: Removed: NEW 


Comment 3 Ladislav Thon 2013-03-28 14:36:22 UTC
This is marked as a blocker because it means that when failure happens, sessions are lost.

(If it's REPL, all sessions come back when the failure is recovered. If it's DIST, some sessions are lost forever.)

Comment 4 Paul Ferraro 2013-04-04 11:37:08 UTC
Since DIST is not the default configuration, I suggest that this issue not be flagged as a blocker.

Comment 5 Jitka Kozana 2013-04-05 07:02:05 UTC
Please see comment #2: this issue is no longer limited to the DIST cache and application undeploy. It was seen with the REPL cache and other failure types as well.

This BZ name contains DIST because the issue was originally seen only with the DIST cache. I will rename this BZ so that the name better reflects the current situation.

I suggest the blocker stays.

Comment 6 Paul Ferraro 2013-04-18 16:01:17 UTC
It is expected behavior that failures using REPL_ASYNC can result in stale session data on the failover nodes. If a use case has *zero* tolerance for stale session data upon failover, then REPL_SYNC should be used instead. As far as I can tell, this issue does not affect REPL_SYNC, correct?

The comment above applies to DIST_ASYNC as well.

As far as lost sessions seen with DIST_SYNC - was this issue reproduced using ER4?

Comment 7 Jitka Kozana 2013-04-19 08:46:07 UTC
Yes, it was seen again in the ER4 runs, with DIST SYNC, and even during graceful shutdown. Here is the link to the job:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-dist-sync/49/

Comment 8 Paul Ferraro 2013-04-19 12:38:46 UTC
So, to summarize, the issue is with DIST_SYNC only (since we can expect stale sessions on failover using REPL_ASYNC and DIST_ASYNC), and given that this is not our default mode, it should not be a blocker. I suspect the issue is due to a race condition between the invalidation of locally stored sessions on view change and the rebalancing of the distributed cache.

FYI - I've filed an upstream jira to more gracefully handle clean shutdown/undeploy for ASYNC modes so that there are no stale sessions - since these scenarios are not exception conditions (as opposed to jvmkill).  The clean shutdown logic that was added back in EAP5 is not sufficient to prevent stale sessions on clean shutdown/undeploy.  This will be done as part of a larger effort to redesign web session clustering entirely.
https://issues.jboss.org/browse/AS7-6947

Comment 9 Jitka Kozana 2013-04-19 13:55:52 UTC
I see my last comment #7 was not clear. 
Let me rephrase and sum up. 

During ER4 testing, we saw this error in many scenarios; links to some of them are below ([1], [2], [3], [4]). We are aware that a small amount of stale data can be seen in the jvmkill scenario, so I have selected links to shutdown scenarios.

We saw this issue in _all_ cache setups we tested: REPL_ASYNC, REPL_SYNC, DIST_ASYNC, DIST_SYNC.

Yes, the DIST cache is not the default setting, but it is a supported cache configuration, is it not?

Moreover, with REPL_SYNC (see [1]), no data should be lost. Therefore, the blocker stays.

All these links: httpsession replication, failuretype: graceful shutdown
[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-sync/81/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-dist-sync/49/
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-async/66/
[4] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-dist-async/48/

Comment 10 Paul Ferraro 2013-04-23 15:55:08 UTC
Question about the REPL_SYNC test...
When the server is shut down, there can still be failed requests (the web subsystem does not yet correctly implement clean shutdown); we're not considering these when identifying stale session data on failover, correct?
Otherwise, I see lots of suspicious NPEs (due to BZ 900549), which very well may be the culprit.

Comment 11 Paul Ferraro 2013-04-26 00:47:50 UTC
I've established that this is a side effect of BZ 900549, and should be fixed by:
https://github.com/jbossas/jboss-eap/pull/122

Comment 12 Paul Ferraro 2013-04-26 12:50:53 UTC
Whoops - the above pull request was not yet merged - correcting Status...

Comment 13 Paul Ferraro 2013-04-30 15:51:12 UTC
This may also be due to ISPN-2974.

Comment 14 Jitka Kozana 2013-05-03 09:03:49 UTC
For update on testing this issue with ISPN 5.2.6.Final, please see BZ 900549, comment 86.

Comment 15 Jitka Kozana 2013-05-03 09:22:45 UTC
Some statistics on the occurrences of this issue, test scenario: graceful shutdown, REPL_SYNC.

   ER5: 1132 occurrences of 901164
     (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync/82/artifact/report/parsed_logs_client/index.html)
   ER6 with ISPN 5.2.6.Final: 206 occurrences of 901164
     (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/3/artifact/report/parsed_logs_client/index.html)

Comment 16 Radoslav Husar 2013-05-03 10:28:13 UTC
Very good; so ISPN-2974 seems to account for 82% of broken sessions on failover, the rest should be fixed by PR #122.

Comment 17 Jitka Kozana 2013-05-03 11:05:35 UTC
@Rado: Please see BZ 900549, comment 86. So was the pull request 122 made obsolete by the ISPN upgrade to 5.2.6.Final as you suggest in the discussion on the pull request?

Comment 18 Paul Ferraro 2013-05-03 12:22:27 UTC
@Jitka
No, we need both of them. The commit in comment 86 effectively makes eviction/passivation transaction-safe, which prevents concurrent session access from incorrectly reading the data of a session that is being passivated. The ISPN upgrade to 5.2.6.Final is necessary to prevent loss of session information/metadata during passivation.

Comment 19 Jitka Kozana 2013-05-03 12:48:45 UTC
@Paul, thank you for the clarification.

Comment 20 Jitka Kozana 2013-05-04 08:47:18 UTC
I have tested ER6 with both ISPN 5.2.6.Final and pull request 122, and the issue is still present.

Parsed client logs: 252 occurrences of 901164
(https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/4/artifact/report/parsed_logs_client/index.html)

Comment 21 Rostislav Svoboda 2013-05-04 15:38:19 UTC
Based on the last comment about the results with ISPN 5.2.6 plus the pull 122 changes, I'm changing the status back to ASSIGNED.
This needs additional work: the number of exceptions is lower with the new ISPN, but it is not fully fixed.

This BZ is not referenced by any PR I'm aware of.
  https://github.com/jbossas/jboss-eap/pull/122 seems to be merged, according to https://github.com/jbossas/jboss-eap/commits/6.x
  Infinispan upgrade is tracked on https://bugzilla.redhat.com/show_bug.cgi?id=956988

Comment 22 Dimitris Andreadis 2013-05-06 07:15:07 UTC
Unless there is some breakthrough, I believe a decision will be needed on whether we keep working on this or waive it for 6.1 and reschedule it for 6.2.

Comment 25 Jitka Kozana 2013-05-14 13:00:43 UTC
For future reference: 6.1.0.ER8 test run to reproduce this:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync/85/

Comment 26 Paul Ferraro 2013-07-05 03:54:48 UTC
Jitka,
Can you validate that this commit fixes this issue?
https://github.com/pferraro/jboss-eap/commit/f4b710931da30a401cdd2b3aada16e9cc25d5631

Comment 27 Paul Ferraro 2013-07-05 15:35:28 UTC
https://github.com/jbossas/jboss-eap/pull/222

Comment 29 Jitka Kozana 2013-07-29 06:45:47 UTC
Still seeing this in 6.1.1.ER3 (but it was expected; see comment #28).

Comment 36 Jitka Kozana 2013-08-19 13:57:23 UTC
The issue was seen again in EAP 6.1.1.ER6.

Comment 38 Paul Ferraro 2013-08-19 15:12:33 UTC
All of the failures noted above since the fix in #c27 have been with ASYNC tests (see #c28 and #c30). As I've mentioned already (see #c6 and #c8): stale session data is expected behavior when using ASYNC mode!

The only SYNC test that still seems to be failing is the eap-6x-failover-ejb-ejbservlet-shutdown-repl-sync test. I need more details as to what exactly is happening in this test. However, I suspect that while this test might be using SYNC mode for web sessions, it may very well still be using ASYNC mode for SFSBs (the default is ASYNC). If this is the case, then stale session data is expected behavior.

@Jitka, can you comment?

Comment 39 Radoslav Husar 2013-08-19 15:23:04 UTC
Looking at the configuration [1], it uses ASYNC for web sessions and SYNC for EJB sessions. 

[1] http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-shutdown-repl-sync/lastSuccessfulBuild/artifact/report/config/jboss-perf18/standalone-ha.xml

Comment 41 Paul Ferraro 2013-08-19 18:25:39 UTC
@Radoslav Thanks!  Just as I suspected - this is no longer an issue and should be closed.

Comment 42 Jitka Kozana 2013-08-20 08:08:40 UTC
We have re-configured the cache setup as suggested in comment #29. Now the jobs [1] and [2] use SYNC for both cache containers (web and ejb). The issue is still present.

In [1], in the server log [3], after the failover is finished (i.e. the application is redeployed) and the node is back in the cluster, we are still seeing new SFSBs being created. This is the error on the client side:

Stale session data received. Expected 93, received 0

The same issue can be seen in the test scenario [2]. 

Therefore, this issue is not fixed. Customers will lose their session data.

We suggest not closing the issue and continuing the investigation.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-undeploy-repl-sync/30/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-shutdown-repl-sync/13/
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-undeploy-repl-sync/30/console-perf18

Comment 44 Radoslav Husar 2013-08-20 11:35:03 UTC
Is ejbservlet the only scenario where you are seeing this issue?

Can we please get results for the jvmkill/repl/sync combination? It looks like we don't have this job yet: http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-jvmkill-repl-sync/

Comment 45 Scott Mumford 2013-08-29 03:45:19 UTC
Marking for exclusion from the 6.1.1 Release Notes document as an entry for this bug could not be completed or verified in time.

Comment 46 Paul Ferraro 2013-08-29 23:15:39 UTC
This will be addressed by the new web session clustering implementation scheduled for 6.3.

Comment 47 Radoslav Husar 2014-03-18 16:27:09 UTC
Needs to be revalidated following the Infinispan upgrade.

Comment 49 Michal Karm Babacek 2014-04-17 20:10:46 UTC
OMG, what's this TCMS-went-crazy stuff? :-)

Comment 50 Ladislav Thon 2014-04-22 06:30:31 UTC
Please filter by the X-Bugzilla-Who header. No other fast & easy way, I'm afraid.

Comment 51 Ladislav Thon 2014-07-08 13:09:27 UTC
Still an issue, moving to 6.4.

Comment 52 Kabir Khan 2014-10-08 12:25:29 UTC
Should be fixed by the Infinispan 5.2.11.CR1 upgrade (BZ 1149197).

Comment 53 Ladislav Thon 2014-10-17 12:27:12 UTC
Wasn't fixed by the Infinispan upgrade in EAP 6.4.0.DR5. Moving back to ASSIGNED.

Comment 54 Ladislav Thon 2014-10-30 13:08:05 UTC
During EAP 6.4 testing, we've seen this only in the ejb-servlet scenario.

Comment 55 Radoslav Husar 2014-10-30 13:15:45 UTC
Discussing with Jitka, it looks like the answer to the question in comment #44 is yes for EAP 6.4 builds with the latest Infinispan upgrade. The only scenario where this is consistently seen now is /ejbservlet (Servlet @Inject-ing an SFSB).

There are numerous error 500 responses in the logs, hinting at timeouts on the server side, with the load balancer's timeout kicking in.

A slowdown after the failover is expected, so please try with a smaller number of sessions (e.g. 200) and extra logging (e.g. for the EJBs). Also, please post the latest runs on this BZ, as some of the older links are broken. Thanks!

Comment 56 Radoslav Husar 2014-10-30 13:29:11 UTC
Please also get results for all combinations, especially this one: eap-6x-failover-ejb-ejbservlet-jvmkill-repl-sync. Thanks.

Comment 57 Ladislav Thon 2014-10-30 13:48:11 UTC
Adding Michal, who started on clustering QA recently and will be running those investigation jobs.

Comment 59 Paul Ferraro 2014-11-12 18:51:33 UTC
Status: I'm not yet convinced that these failures are indicative of a bug and not a flaw in the test itself - therefore I am holding off on ACK'ing this.

Comment 60 Radoslav Husar 2014-11-13 13:36:29 UTC
I could actually reproduce this locally and manually with only 1 session outside the system tests. I haven't yet extracted the pattern though.

What I can get to looks like a deadlock/locking issue: the servlet can be requested from one node, but when it is requested from the other two nodes the client gets no response and no timeouts seem to kick in (the timeouts seen in the system tests come from the load balancer, whose configured timeout is 30 seconds).

A thread dump should point to the cause; at the moment I only have the shutdown log with exceptions: https://gist.github.com/rhusar/d74d301225733c8a22d8

Comment 61 Paul Ferraro 2014-11-23 15:26:26 UTC
Question about the test itself: how are the @Stateful EJB references persisted across requests?  Are these EJBs @SessionScoped?

Comment 62 Ladislav Thon 2014-11-26 09:33:24 UTC
Yes, the EJB is @SessionScoped.

Specifically, the servlet that is being accessed is: https://github.com/clusterbench/clusterbench/blob/master/clusterbench-ee6-web/src/main/java/org/jboss/test/clusterbench/web/ejb/LocalEjbServlet.java

And the injected EJB is: https://github.com/clusterbench/clusterbench/blob/master/clusterbench-ee6-ejb/src/main/java/org/jboss/test/clusterbench/ejb/stateful/LocalStatefulSB.java

Rado should be able to answer all questions about the code, since he is the primary author :-)
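
For readers without the clusterbench sources at hand, here is a minimal sketch of the pattern under test: a servlet @Inject-ing a @SessionScoped @Stateful bean. Class and method names are illustrative only; the real code is in the LocalEjbServlet and LocalStatefulSB classes linked above:
{code}
import java.io.IOException;
import java.io.Serializable;

import javax.ejb.Stateful;
import javax.enterprise.context.SessionScoped;
import javax.inject.Inject;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Session-scoped SFSB holding a per-HTTP-session counter (illustrative).
@Stateful
@SessionScoped
class CounterBean implements Serializable {
    private int serial;

    public int getSerialAndIncrement() {
        return serial++;
    }
}

// Servlet injecting the SFSB; each GET returns the next serial for the caller's session.
@WebServlet("/ejbservlet")
public class EjbServlet extends HttpServlet {

    @Inject
    private CounterBean counter;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // The load driver expects this value to increase monotonically per session;
        // a lower value after failover is reported as stale session data.
        resp.getWriter().print(counter.getSerialAndIncrement());
    }
}
{code}
When the session fails over to a node holding an older replica of the bean's state, the returned serial falls behind the client's counter, which matches the "Expected 35, received 34" (or "Expected 93, received 0" when the bean is recreated) failures reported in this BZ.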

Comment 63 Kabir Khan 2014-12-12 10:53:27 UTC
Devel-nacking since there is no time to do this in the 6.4.0 timeframe, and it is not a blocker

