Bug 901164 (JBPAPP6-1281) - Servlet @Inject-ing SFSB timeouts/receives stale data/sessions are lost after failover
Summary: Servlet @Inject-ing SFSB timeouts/receives stale data/sessions are lost after failover
Keywords:
Status: CLOSED EOL
Alias: JBPAPP6-1281
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Clustering
Version: 6.0.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: EAP 6.4.0
Assignee: Paul Ferraro
QA Contact: Michal Vinkler
URL: http://jira.jboss.org/jira/browse/JBP...
Whiteboard:
Depends On: 959495 1149197
Blocks:
 
Reported: 2012-11-05 16:26 UTC by Richard Janík
Modified: 2019-08-19 12:49 UTC
CC: 15 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
CCWR from pferraro Cause: Consequence: Fix: Result:
Clone Of:
Environment:
Last Closed: 2019-08-19 12:49:21 UTC
Type: Bug
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker JBPAPP6-1281 0 Major Closed Stale session data received with ejb servlet using DIST on undeploy 2017-11-08 20:30:22 UTC

Description Richard Janík 2012-11-05 16:26:12 UTC
project_key: JBPAPP6

There were many RequestProcessingExceptions seen in EAP 6.0.1.ER3:
{code}

2012/10/25 11:48:02:723 EDT [WARN ][Runner - 1404] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data:  <org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 35, received 34, Runner: 1404>
        org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 35, received 34, Runner: 1404
	at org.jboss.smartfrog.loaddriver.http.AbstractSerialNumberValidatorFactoryImpl$SerialNumberValidator.processRequest(AbstractSerialNumberValidatorFactoryImpl.java:125)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)

2012/10/25 11:48:02:723 EDT [WARN ][Runner - 1404] SFCORE_LOG - Error sampling data:  <org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 35, received 34, Runner: 1404>
        org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 35, received 34, Runner: 1404
	at org.jboss.smartfrog.loaddriver.http.AbstractSerialNumberValidatorFactoryImpl$SerialNumberValidator.processRequest(AbstractSerialNumberValidatorFactoryImpl.java:125)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:87)
	at java.lang.Thread.run(Thread.java:662)
{code}

These errors occur on undeploy with the DIST cache mode, in both SYNC and ASYNC configurations, with the ejb servlet (not with remote invocations). After a node failure, some sessions appear not to be migrated successfully:

before failover (active sessions = 2000):
{code}
2012/10/25 11:45:51:745 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Total: Sessions: 2000, active: 2000, samples: 4967, throughput: 496.7 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 24 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 4967 (100%)
2012/10/25 11:45:51:745 EDT [DEBUG][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Updated totals: Sessions: 0, active: 20000, samples: 66706, throughput: 6,669.5 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 297 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 66706 (100%)
2012/10/25 11:45:51:745 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf18: Sessions: 2000, active: 500, samples: 1239, throughput: 123.9 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 17 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1239 (100%)
2012/10/25 11:45:51:745 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf19: Sessions: 2000, active: 500, samples: 1242, throughput: 124.2 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 24 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1242 (100%)
2012/10/25 11:45:51:746 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf20: Sessions: 2000, active: 500, samples: 1243, throughput: 124.3 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 17 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1243 (100%)
2012/10/25 11:45:51:746 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf21: Sessions: 2000, active: 500, samples: 1243, throughput: 124.3 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 1 ms, max: 15 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1243 (100%)
2012/10/25 11:46:01:729 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Failing node 0 (perf18)
{code}

after failover (active sessions = 1667):
{code}
2012/10/25 11:46:21:749 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Total: Sessions: 2000, active: 1667, samples: 5019, throughput: 501.8 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 25 ms, max: 692 ms, sampling errors: 829, unhealthy samples: 0, valid samples: 4190 (83%)
2012/10/25 11:46:21:750 EDT [DEBUG][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Updated totals: Sessions: 0, active: 25667, samples: 81725, throughput: 8,171.1 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 2 ms, max: 692 ms, sampling errors: 1162, unhealthy samples: 0, valid samples: 80563 (98%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - UNKNOWN: Sessions: 2000, active: 0, samples: 0, throughput: 0.0 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 0 ms, max: 0 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 0 (0%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf18: Sessions: 2000, active: 0, samples: 0, throughput: 0.0 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 0 ms, max: 0 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 0 (0%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf19: Sessions: 2000, active: 667, samples: 1679, throughput: 167.9 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 24 ms, max: 684 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1679 (100%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf20: Sessions: 2000, active: 500, samples: 1256, throughput: 125.6 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 26 ms, max: 688 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1256 (100%)
2012/10/25 11:46:21:750 EDT [INFO ][StatsRunner] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - perf21: Sessions: 2000, active: 500, samples: 1255, throughput: 125.5 samples/s, bandwidth: 0.0 MB/s, response min: 0 ms, mean: 25 ms, max: 692 ms, sampling errors: 0, unhealthy samples: 0, valid samples: 1255 (100%)
{code}

The number later rises to 1833 (in this case), and more sessions may be recovered when another node fails and the load is redistributed. Runners whose sessions were owned by the failing node detect the failover successfully, but from then on they repeatedly receive stale session data.

Valid samples constitute 85-95% of all samples taken, as you can see in failover.txt (linked).
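
For context, here is a minimal sketch of the serial-number check behind the "Stale session data received" errors above. The class and method names are illustrative, not the actual load-driver code (the real logic lives in AbstractSerialNumberValidatorFactoryImpl, visible in the stack traces):
{code}
// Hedged sketch of the load driver's per-session serial check (illustrative only).
public final class SerialNumberCheck {

    /**
     * @param expected the client's own request count for this session
     * @param received the counter value returned by the SFSB behind the ejb servlet
     */
    public static void validate(int expected, int received) {
        if (received != expected) {
            // After failover this fires repeatedly for sessions whose new owner holds
            // an older copy of the state, e.g. "Expected 35, received 34".
            throw new IllegalStateException("Stale session data received. Expected "
                    + expected + ", received " + received);
        }
    }
}
{code}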

builds (failover.txt artifact is here):
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-ejb-ejbservlet-undeploy-dist-async/7/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-ejb-ejbservlet-undeploy-dist-sync/7/

perf17 (client side, with exceptions):
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-ejb-ejbservlet-undeploy-dist-async/7/console-perf17/
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-ejb-ejbservlet-undeploy-dist-sync/7/console-perf17/

Possibly linked to JBPAPP-9086?

Comment 1 Anne-Louise Tangring 2012-11-13 20:53:14 UTC
Docs QE Status: Removed: NEW 


Comment 3 Ladislav Thon 2013-03-28 14:36:22 UTC
This is marked as a blocker because it means that when failure happens, sessions are lost.

(If it's REPL, all sessions come back when the failure is recovered. If it's DIST, some sessions are lost forever.)

Comment 4 Paul Ferraro 2013-04-04 11:37:08 UTC
Since DIST is not the default configuration, I suggest that this issue not be flagged as a blocker.

Comment 5 Jitka Kozana 2013-04-05 07:02:05 UTC
Please see comment #2: this issue is no longer limited to the DIST cache and application undeploy. It was seen with the REPL cache and other failure types as well.

This BZ name contains DIST because the issue was originally seen only with the DIST cache. I will rename this BZ so that the name better reflects the current situation.

I suggest the blocker stays.

Comment 6 Paul Ferraro 2013-04-18 16:01:17 UTC
It is expected behavior that failures using REPL_ASYNC can result in stale session data on the failover nodes. If a use case has *zero* tolerance for stale session data upon failover, then REPL_SYNC should be used instead. As far as I can tell, this issue does not affect REPL_SYNC, correct?

The comment above applies to DIST_ASYNC as well.

As far as lost sessions seen with DIST_SYNC - was this issue reproduced using ER4?

Comment 7 Jitka Kozana 2013-04-19 08:46:07 UTC
Yes, it was seen again in the ER4 runs, with DIST SYNC, and even during graceful shutdown. Here is the link to the job:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-dist-sync/49/

Comment 8 Paul Ferraro 2013-04-19 12:38:46 UTC
So, to summarize, the issue is with DIST_SYNC only (since we can expect stale sessions on failover using REPL_ASYNC and DIST_ASYNC), and given that this is not our default mode, it should not be a blocker. I suspect the issue is due to a race condition between the invalidation of locally stored sessions on view change and the rebalancing of the distributed cache.

FYI - I've filed an upstream jira to more gracefully handle clean shutdown/undeploy for ASYNC modes so that there are no stale sessions - since these scenarios are not exception conditions (as opposed to jvmkill).  The clean shutdown logic that was added back in EAP5 is not sufficient to prevent stale sessions on clean shutdown/undeploy.  This will be done as part of a larger effort to redesign web session clustering entirely.
https://issues.jboss.org/browse/AS7-6947

Comment 9 Jitka Kozana 2013-04-19 13:55:52 UTC
I see my last comment #7 was not clear. 
Let me rephrase and sum up. 

During ER4 testing, we saw this error in many scenarios; links to some of them are below ([1], [2], [3], [4]). We are aware that a small amount of stale data can be seen in the jvmkill scenario, so I have selected links to shutdown scenarios.

We saw this issue in _all_ cache setups we tested: REPL_ASYNC, REPL_SYNC, DIST_ASYNC, DIST_SYNC.

Yes, the DIST cache is not the default setting, but it is a supported cache configuration, is it not?

Moreover, with REPL_SYNC (see [1]), no data should be lost. Therefore, the blocker stays.

All these links: httpsession replication, failuretype: graceful shutdown
[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-sync/81/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-dist-sync/49/
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-async/66/
[4] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-dist-async/48/

Comment 10 Paul Ferraro 2013-04-23 15:55:08 UTC
Question about the REPL_SYNC test...
When the server is shut down, there can still be failed requests (the web subsystem does not yet correctly implement clean shutdown); we're not considering these when identifying stale session data on failover, correct?
Otherwise, I see lots of suspicious NPEs (due to BZ 900549), which very well may be the culprit.

Comment 11 Paul Ferraro 2013-04-26 00:47:50 UTC
I've established that this is a side effect of BZ 900549, and should be fixed by:
https://github.com/jbossas/jboss-eap/pull/122

Comment 12 Paul Ferraro 2013-04-26 12:50:53 UTC
Whoops - the above pull request was not yet merged - correcting Status...

Comment 13 Paul Ferraro 2013-04-30 15:51:12 UTC
This may also be due to ISPN-2974.

Comment 14 Jitka Kozana 2013-05-03 09:03:49 UTC
For update on testing this issue with ISPN 5.2.6.Final, please see BZ 900549, comment 86.

Comment 15 Jitka Kozana 2013-05-03 09:22:45 UTC
Some statistics on the occurrences of this issue, test scenario: graceful shutdown, REPL_SYNC.

   ER5: 1132 occurrences of 901164
     (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync/82/artifact/report/parsed_logs_client/index.html)
   ER6 with ISPN 5.2.6.Final: 206 occurrences of 901164
     (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/3/artifact/report/parsed_logs_client/index.html)

Comment 16 Radoslav Husar 2013-05-03 10:28:13 UTC
Very good; so ISPN-2974 seems to account for 82% of broken sessions on failover, the rest should be fixed by PR #122.

Comment 17 Jitka Kozana 2013-05-03 11:05:35 UTC
@Rado: Please see BZ 900549, comment 86. So was the pull request 122 made obsolete by the ISPN upgrade to 5.2.6.Final as you suggest in the discussion on the pull request?

Comment 18 Paul Ferraro 2013-05-03 12:22:27 UTC
@Jitka
No, we need both of them. The commit in comment 86 effectively makes eviction/passivation transaction-safe, which prevents concurrent session access from incorrectly reading the data of a session that is being passivated. The ISPN upgrade to 5.2.6.Final is necessary to prevent loss of session information/metadata during passivation.

Comment 19 Jitka Kozana 2013-05-03 12:48:45 UTC
@Paul, thank you for the clarification.

Comment 20 Jitka Kozana 2013-05-04 08:47:18 UTC
I have tested ER6 with both ISPN 5.2.6.Final and pull request 122, and the issue is still present.

Parsed client logs: 252 occurrences of 901164
(https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/4/artifact/report/parsed_logs_client/index.html)

Comment 21 Rostislav Svoboda 2013-05-04 15:38:19 UTC
Based on the last comment about the results with ISPN 5.2.6 plus the pull 122 changes, I'm changing the status back to ASSIGNED.
This needs additional work: the number of exceptions is lower with the new ISPN, but it is not fully fixed.

This BZ is not referenced by any PR I'm aware of.
  https://github.com/jbossas/jboss-eap/pull/122 seems to be merged, according to https://github.com/jbossas/jboss-eap/commits/6.x
  Infinispan upgrade is tracked on https://bugzilla.redhat.com/show_bug.cgi?id=956988

Comment 22 Dimitris Andreadis 2013-05-06 07:15:07 UTC
Unless there is some breakthrough, I believe a decision will be needed on whether we keep working on this or waive it for 6.1 and reschedule it for 6.2.

Comment 25 Jitka Kozana 2013-05-14 13:00:43 UTC
For future reference: 6.1.0.ER8 test run to reproduce this:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync/85/

Comment 26 Paul Ferraro 2013-07-05 03:54:48 UTC
Jitka,
Can you validate that this commit fixes this issue?
https://github.com/pferraro/jboss-eap/commit/f4b710931da30a401cdd2b3aada16e9cc25d5631

Comment 27 Paul Ferraro 2013-07-05 15:35:28 UTC
https://github.com/jbossas/jboss-eap/pull/222

Comment 29 Jitka Kozana 2013-07-29 06:45:47 UTC
Still seeing this in 6.1.1.ER3 (but it was expected; see comment #28).

Comment 36 Jitka Kozana 2013-08-19 13:57:23 UTC
The issue was seen again in EAP 6.1.1.ER6.

Comment 38 Paul Ferraro 2013-08-19 15:12:33 UTC
All of the failures noted above since the fix in #c27 have been with ASYNC tests (see #c28 and #c30). As I've mentioned already (see #c6 and #c8): stale session data is expected behavior when using ASYNC mode!

The only SYNC test that still seems to be failing is the eap-6x-failover-ejb-ejbservlet-shutdown-repl-sync test. I need more details as to what exactly is happening in this test. However, I suspect that while this test might be using SYNC mode for web sessions, it may very well still be using ASYNC mode for SFSBs (the default is ASYNC). If this is the case, then stale session data is expected behavior.

@Jitka, can you comment?

Comment 39 Radoslav Husar 2013-08-19 15:23:04 UTC
Looking at the configuration [1], it uses ASYNC for web sessions and SYNC for EJB sessions. 

[1] http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-shutdown-repl-sync/lastSuccessfulBuild/artifact/report/config/jboss-perf18/standalone-ha.xml

Comment 41 Paul Ferraro 2013-08-19 18:25:39 UTC
@Radoslav Thanks!  Just as I suspected - this is no longer an issue and should be closed.

Comment 42 Jitka Kozana 2013-08-20 08:08:40 UTC
We have re-configured the cache setup as suggested in comment #29. Now the jobs [1] and [2] use SYNC for both cache containers (web and ejb). The issue is still present.

In [1], in the server log [3], after the failover is finished (i.e. the application is redeployed) and the node is back in the cluster, we are still seeing new SFSBs being created. This is the error on the client side:

Stale session data received. Expected 93, received 0

The same issue can be seen in the test scenario [2]. 

Therefore, this issue is not fixed. Customers will lose their session data.

We suggest not closing the issue and continuing the investigation.

[1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-undeploy-repl-sync/30/
[2] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-shutdown-repl-sync/13/
[3] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-undeploy-repl-sync/30/console-perf18

Comment 44 Radoslav Husar 2013-08-20 11:35:03 UTC
Is ejbservlet the only scenario where you are seeing this issue?

Can we please get results for the jvmkill/repl/sync combination? It looks like we don't have this job yet: http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbservlet-jvmkill-repl-sync/

Comment 45 Scott Mumford 2013-08-29 03:45:19 UTC
Marking for exclusion from the 6.1.1 Release Notes document as an entry for this bug could not be completed or verified in time.

Comment 46 Paul Ferraro 2013-08-29 23:15:39 UTC
This will be addressed by the new web session clustering implementation scheduled for 6.3.

Comment 47 Radoslav Husar 2014-03-18 16:27:09 UTC
Needs to be revalidated following the Infinispan upgrade.

Comment 49 Michal Karm Babacek 2014-04-17 20:10:46 UTC
OMG, what's this TCMS-went-crazy stuff? :-)

Comment 50 Ladislav Thon 2014-04-22 06:30:31 UTC
Please filter by the X-Bugzilla-Who header. No other fast & easy way, I'm afraid.

Comment 51 Ladislav Thon 2014-07-08 13:09:27 UTC
Still an issue, moving to 6.4.

Comment 52 Kabir Khan 2014-10-08 12:25:29 UTC
Should be fixed by the Infinispan 5.2.11.CR1 upgrade (BZ 1149197).

Comment 53 Ladislav Thon 2014-10-17 12:27:12 UTC
Wasn't fixed by the Infinispan upgrade in EAP 6.4.0.DR5. Moving back to ASSIGNED.

Comment 54 Ladislav Thon 2014-10-30 13:08:05 UTC
During EAP 6.4 testing, we've seen this only in the ejb-servlet scenario.

Comment 55 Radoslav Husar 2014-10-30 13:15:45 UTC
Discussing with Jitka, it looks like the answer to the question in comment #44 is yes for EAP 6.4 builds with the latest Infinispan upgrade. The only scenario where this is consistently seen now is /ejbservlet (Servlet @Inject-ing an SFSB).

There are numerous error 500 responses in the logs, hinting at timeouts on the server side, with the load balancer's timeout kicking in.

A slowdown after the failover is expected, so please try with a smaller number of sessions (e.g. 200) and extra logging (e.g. for the EJBs). Also, please post the latest runs on this BZ, as some of the older links are broken. Thanks!

Comment 56 Radoslav Husar 2014-10-30 13:29:11 UTC
Please also get results for all combinations, especially this one: eap-6x-failover-ejb-ejbservlet-jvmkill-repl-sync. Thanks.

Comment 57 Ladislav Thon 2014-10-30 13:48:11 UTC
Adding Michal, who started on clustering QA recently and will be running those investigation jobs.

Comment 59 Paul Ferraro 2014-11-12 18:51:33 UTC
Status: I'm not yet convinced that these failures are indicative of a bug and not a flaw in the test itself - therefore I am holding off on ACK'ing this.

Comment 60 Radoslav Husar 2014-11-13 13:36:29 UTC
I could actually reproduce this locally and manually with only 1 session outside the system tests. I haven't yet extracted the pattern though.

What I can get to looks like a deadlock/locking issue: the servlet can be requested from one node, but when it is requested from the other two nodes the client gets no response and no timeouts seem to kick in (the timeouts seen in the system tests come from the load balancer, whose configured timeout is 30 seconds).

A thread dump should point to the cause; at the moment I only have the shutdown log with exceptions: https://gist.github.com/rhusar/d74d301225733c8a22d8

Comment 61 Paul Ferraro 2014-11-23 15:26:26 UTC
Question about the test itself: how are the @Stateful EJB references persisted across requests?  Are these EJBs @SessionScoped?

Comment 62 Ladislav Thon 2014-11-26 09:33:24 UTC
Yes, the EJB is @SessionScoped.

Specifically, the servlet that is being accessed is: https://github.com/clusterbench/clusterbench/blob/master/clusterbench-ee6-web/src/main/java/org/jboss/test/clusterbench/web/ejb/LocalEjbServlet.java

And the injected EJB is: https://github.com/clusterbench/clusterbench/blob/master/clusterbench-ee6-ejb/src/main/java/org/jboss/test/clusterbench/ejb/stateful/LocalStatefulSB.java

Rado should be able to answer all questions about the code, since he is the primary author :-)
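
For readers without the clusterbench sources at hand, here is a minimal sketch of the pattern under test: a servlet @Inject-ing a @SessionScoped @Stateful bean. Class and method names are illustrative only; the real code is in the LocalEjbServlet and LocalStatefulSB classes linked above:
{code}
import java.io.IOException;
import java.io.Serializable;

import javax.ejb.Stateful;
import javax.enterprise.context.SessionScoped;
import javax.inject.Inject;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Session-scoped SFSB holding a per-HTTP-session counter (illustrative).
@Stateful
@SessionScoped
class CounterBean implements Serializable {
    private int serial;

    public int getSerialAndIncrement() {
        return serial++;
    }
}

// Servlet injecting the SFSB; each GET returns the next serial for the caller's session.
@WebServlet("/ejbservlet")
public class EjbServlet extends HttpServlet {

    @Inject
    private CounterBean counter;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // The load driver expects this value to increase monotonically per session;
        // a lower value after failover is reported as stale session data.
        resp.getWriter().print(counter.getSerialAndIncrement());
    }
}
{code}
When the session fails over to a node holding an older replica of the bean's state, the returned serial falls behind the client's counter, which matches the "Expected 35, received 34" (or "Expected 93, received 0" when the bean is recreated) failures reported in this BZ.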

Comment 63 Kabir Khan 2014-12-12 10:53:27 UTC
Devel-nacking since there is no time to do this in the 6.4.0 timeframe, and it is not a blocker

