Bug 900549 (JBPAPP6-1169) - Transient NullPointerException in org.jboss.as.web.session.ClusteredSession.update() on failover
Summary: Transient NullPointerException in org.jboss.as.web.session.ClusteredSession.u...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: JBPAPP6-1169
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Clustering
Version: 6.1.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ER4
: EAP 6.1.1
Assignee: Radoslav Husar
QA Contact: Jitka Kozana
Russell Dickenson
URL: http://jira.jboss.org/jira/browse/JBP...
Whiteboard:
Depends On: 918562 920193 959495 959875
Blocks:
TreeView+ depends on / blocked
 
Reported: 2012-05-29 15:19 UTC by Radoslav Husar
Modified: 2014-05-27 01:28 UTC (History)
22 users (show)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
As this is now and "Bug Fix", the information required for an appropriate Release Notes entry is: Cause: What was the root cause of this issue? Consequence: What were the consequences of this issue. In other words, what would have been the effect of this issue on a customer? Fix: What was done to resolve this issue? Result: What are the consequences now for a customer?
Clone Of:
Environment:
Last Closed: 2013-09-16 20:26:23 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
server-6.0.2.ER2.log.gz (1005.93 KB, application/gzip)
2012-10-18 16:03 UTC, Aleksandar Kostadinov
no flags Details
jar for eap 6.1 (80.20 KB, application/x-java-archive)
2013-05-03 14:01 UTC, Radoslav Husar
no flags Details
server.log when NPE without passivation was thrown (251.89 KB, text/x-log)
2013-05-17 15:33 UTC, Aleksandar Kostadinov
no flags Details
clusterbench app with passivation disabled (33.15 KB, application/octet-stream)
2013-05-17 15:35 UTC, Aleksandar Kostadinov
no flags Details
patched *ClusteredSession implementation to always replicate metadata (600.34 KB, application/x-java-archive)
2013-07-16 16:09 UTC, Radoslav Husar
no flags Details
EAP 6.1.1 ER3 server log (349.47 KB, text/plain)
2013-07-30 21:20 UTC, Aleksandar Kostadinov
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 900552 0 low CLOSED Fix logging of "Standard expiration of session X failed; switching to a brute force cleanup. Problem is null" 2021-02-22 00:41:40 UTC
Red Hat Issue Tracker AS7-4932 0 Blocker Resolved CLONE - Using session passivation results in WeldListener: java.lang.NullPointerException on normal operation 2019-06-17 10:58:18 UTC
Red Hat Issue Tracker ISPN-2903 0 Critical Resolved Manual eviction should not delete entry from cache store 2019-06-17 10:58:18 UTC
Red Hat Issue Tracker JBPAPP6-1169 0 Major Closed transient "WeldListener: java.lang.NullPointerException" with clustering and session replication 2019-06-17 10:58:18 UTC

Internal Links: 900552

Description Radoslav Husar 2012-05-29 15:19:04 UTC
Affects: Release Notes
Complexity: High
Workaround: Workaround Exists
Workaround Description: Turn off session passivation completely.
project_key: JBPAPP6

NPE is thrown in roughly 0.36% of HTTP session request processing in the test.

This results in response code 503 returned to the client with the exception.

The test is using passivation-enabled WAR of clusterbench 
https://github.com/rhusar/clusterbench

Here is a shorter soak test run that uncovered the issue 
https://hudson.qa.jboss.com/hudson/view/EAP6/view/EAP6-Clustering-Soak/job/eap-6x-soak-http-repl-async-web-perf17only/6/

{noformat}
[JBossINF] 19:52:10,113 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host]] (ajp-perf20/10.16.90.58:8009-2570) Exception sending request initialized lifecycle event to listener instance of class org.jboss.weld.servlet.WeldListener: java.lang.NullPointerException
[JBossINF] 	at org.jboss.as.web.session.ClusteredSession.update(ClusteredSession.java:972) [jboss-as-web-7.1.2.Final-redhat-1.jar:7.1.2.Final-redhat-1]
[JBossINF] 	at org.jboss.as.web.session.DistributableSessionManager.loadSession(DistributableSessionManager.java:1377) [jboss-as-web-7.1.2.Final-redhat-1.jar:7.1.2.Final-redhat-1]
[JBossINF] 	at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:673) [jboss-as-web-7.1.2.Final-redhat-1.jar:7.1.2.Final-redhat-1]
[JBossINF] 	at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:84) [jboss-as-web-7.1.2.Final-redhat-1.jar:7.1.2.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.connector.Request.doGetSession(Request.java:2618) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at org.apache.catalina.connector.Request.getSession(Request.java:2375) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:841) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.getSession(LazySessionBeanStore.java:72) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.<init>(LazySessionBeanStore.java:58) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:31) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:16) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF] 	at org.jboss.weld.servlet.WeldListener.requestInitialized(WeldListener.java:134) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF] 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:143) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:368) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:505) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:452) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF] 	at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_30]
{noformat}

Comment 2 Radoslav Husar 2012-05-31 15:31:58 UTC
Security: Removed: JBoss Internal Added: Public


Comment 3 Radoslav Husar 2012-05-31 15:32:59 UTC
Linking to related passivation issue JBPAPP-9186 which is a logging issue but might be covering another issue.

Comment 4 Radoslav Husar 2012-05-31 15:32:59 UTC
Link: Added: This issue relates to JBPAPP-9186


Comment 5 Radoslav Husar 2012-05-31 15:33:21 UTC
Link: Added: This issue Cloned to JBPAPP-9206


Comment 6 Radoslav Husar 2012-05-31 15:36:27 UTC
Labels: Added: eap6_need_triage


Comment 7 Misty Stanley-Jones 2012-06-01 05:25:29 UTC
Release Notes Docs Status: Added: Not Yet Documented
Affects: Added: Release Notes


Comment 8 Paul Ferraro 2012-06-01 12:02:53 UTC
This is not a blocker (it doesn't affect the default clustering configuration for web sessions) - though it is critical enough that it needs to be fixed for 6.0.1.
I noticed that the NPE is triggered outside the scope of the Infinispan batch/transaction (notice ClusteredSessionValve does not appear in the stacktrace) - which might be the source of the problem.

Comment 9 Misty Stanley-Jones 2012-06-01 13:25:08 UTC
Release Notes Docs Status: Removed: Not Yet Documented Added: Documented as Known Issue
Release Notes Text: Added: A <code>WeldListener: java.lang.NullPointerException</code> is thrown occasionally when using session passivation. The client receives a HTTP response <code>503</code>, along with the exception text. The cause of this bug is under investigation.


Comment 10 Radoslav Husar 2012-06-01 13:59:47 UTC
Docs QE Status: Removed: NEW 


Comment 12 Radoslav Husar 2012-06-04 11:37:00 UTC
Workaround Description: Added: Turn off session passivation completely.
Workaround: Added: Workaround Exists
Complexity: Added: High


Comment 14 Anne-Louise Tangring 2012-06-04 20:02:41 UTC
This is not a blocker for the EAP 6 release, but we need to document in the release notes that it is broken. 

Comment 15 Paul Ferraro 2012-06-05 16:40:51 UTC
Link: Added: This issue depends ISPN-2095


Comment 16 Rajesh Rajasekaran 2012-07-11 19:41:28 UTC
Labels: Removed: eap6_need_triage Added: eap601candidate


Comment 18 Brian Stansberry 2012-08-15 15:24:02 UTC
Labels: Removed: eap601candidate Added: as713tracking eap601candidate


Comment 19 Brian Stansberry 2012-08-17 21:11:54 UTC
Resolved in upstream for 7.1.3 tag.

Comment 20 Jitka Kozana 2012-10-01 14:49:41 UTC
Still seeing this NPE exception in EAP 6.0.1.ER2, please find the server logs below:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-sync/74/console-perf19/
Reopening.

Comment 22 Shelly McGowan 2012-10-03 00:14:38 UTC
Reopened today.   Moving to 6.0.1.ER4.

Comment 24 Radoslav Husar 2012-10-03 12:08:53 UTC
The linked job was definitely using a SNAPSHOT and not EAP, just look for this line:
{noformat}[JBossINF] 16:49:45,510 INFO  [org.jboss.as] (Controller Boot Thread) JBAS015874: JBoss AS 7.1.3.Final-SNAPSHOT "Arges" started in 5639ms - Started 397 of 559 services (161 services are passive or on-demand){noformat}

I fixed the job and it needs to be simply rerun:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-sync/

Comment 28 Jitka Kozana 2012-10-08 11:38:11 UTC
Verified, closing.

Comment 29 Dana Mison 2012-10-16 05:29:58 UTC
Writer: Added: mistysj


Comment 30 Dana Mison 2012-10-17 22:51:17 UTC
reopening to update release notes (these are still documented as known issues from the previous release)

Comment 31 Dana Mison 2012-10-17 22:52:03 UTC
Writer: Removed: mistysj Added: Darrin


Comment 32 Dana Mison 2012-10-17 23:04:39 UTC
Setting these to "Not Yet Documented" since they required updating from known to resolved issues.

Comment 33 Dana Mison 2012-10-17 23:04:39 UTC
Release Notes Docs Status: Removed: Documented as Known Issue Added: Not Yet Documented


Comment 34 Aleksandar Kostadinov 2012-10-18 15:56:08 UTC
Link: Added: This issue is duplicated by JBPAPP-10226


Comment 35 Aleksandar Kostadinov 2012-10-18 16:03:16 UTC
Attaching full server log from 6.0.2 ER2 where the problem manifested itself. It seems like that the problem was not completely fixed. I'm not sure though the clusterbench program has exact same configuration as what Rado had before. Additionally the environment I am running (unlike Rado) is: EC2 large instances, RHEL6, S3_PING discovery protocol, jgroups tcp stack. Additionally the problem is hard to reproduce. From 289165 only 34 samples failed and it took several hours for that to happen.
Please let me know if you need any additional details.

Comment 36 Aleksandar Kostadinov 2012-10-18 16:03:16 UTC
Attachment: Added: server-6.0.2.ER2.log.gz


Comment 37 Aleksandar Kostadinov 2012-10-18 16:28:04 UTC
Link: Added: This issue depends WELD-831


Comment 38 Misty Stanley-Jones 2012-10-18 16:50:13 UTC
Resolving as docs edits can be made in RESOLVED but not in CLOSED. Darrin, in future can you open CLOSED bugs and then just immediately resolve them with the same status they were closed with?

Comment 40 Anne-Louise Tangring 2012-10-25 14:46:17 UTC
This issue is not a blocker nor a critical for EAP 6.0.1. Thus it is moved to TBD EAP 6 to be triaged for the next EAP 6.x release.

Comment 41 Dana Mison 2012-11-05 03:25:05 UTC
Release Notes Docs Status: Removed: Not Yet Documented Added: Documented as Known Issue


Comment 42 Dana Mison 2012-11-05 03:25:25 UTC
Release Notes Text: Removed: A <code>WeldListener: java.lang.NullPointerException</code> is thrown occasionally when using session passivation. The client receives a HTTP response <code>503</code>, along with the exception text. The cause of this bug is under investigation. Added: A `WeldListener: java.lang.NullPointerException` is thrown occasionally when using session passivation. The client receives a HTTP response `503`, along with the exception text. The cause of this bug is under investigation.


Comment 43 Jitka Kozana 2012-11-05 10:23:29 UTC
{code}
02:05:31,548 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host]] (ajp-perf18/10.16.90.54:8009-360) Exception sending request initialized lifecycle event to listener instance of class org.jboss.weld.servlet.WeldListener: java.lang.NullPointerException
	at org.jboss.as.web.session.ClusteredSession.update(ClusteredSession.java:979) [jboss-as-web-7.1.3.Final-redhat-3.jar:7.1.3.Final-redhat-3]
	at org.jboss.as.web.session.DistributableSessionManager.loadSession(DistributableSessionManager.java:1377) [jboss-as-web-7.1.3.Final-redhat-3.jar:7.1.3.Final-redhat-3]
	at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:673) [jboss-as-web-7.1.3.Final-redhat-3.jar:7.1.3.Final-redhat-3]
	at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:84) [jboss-as-web-7.1.3.Final-redhat-3.jar:7.1.3.Final-redhat-3]
	at org.apache.catalina.connector.Request.doGetSession(Request.java:2618) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at org.apache.catalina.connector.Request.getSession(Request.java:2375) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:841) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.getSession(LazySessionBeanStore.java:72) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
	at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.<init>(LazySessionBeanStore.java:58) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
	at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:31) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
	at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:16) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
	at org.jboss.weld.servlet.WeldListener.requestInitialized(WeldListener.java:134) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:143) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:505) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:453) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931) [jbossweb-7.0.17.Final-redhat-1.jar:]
	at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_30]
{code}
seen again during ER3 testing here:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-http-session-netDown-dist-sync/lastSuccessfulBuild/artifact/report/config/jboss-perf18/server.log

Comment 44 Jitka Kozana 2012-11-06 10:13:08 UTC
Raising to critical.

From our last testing results, it seems the issue is still present, even on application with no passivation configured. It was seen by Aleks during his EC2 testing and now again by us in ER3 testing. 
The effect on customers/users of this NPE is
{code} java.net.SocketTimeoutException: Read timed out {code}
which means customer's request is lost.  

Comment 46 Anne-Louise Tangring 2012-11-06 18:47:55 UTC
This issue has been triaged by the EAP PM team and it was decdied it is not a blocker for the EAP 6.0.1 release. Changing to Major.

Comment 47 Anne-Louise Tangring 2012-11-06 18:47:55 UTC
Labels: Removed: as713tracking eap601candidate Added: as713tracking


Comment 48 Anne-Louise Tangring 2012-11-13 20:43:46 UTC
Release Notes Docs Status: Removed: Documented as Known Issue 
Writer: Removed: Darrin 
Release Notes Text: Removed: A `WeldListener: java.lang.NullPointerException` is thrown occasionally when using session passivation. The client receives a HTTP response `503`, along with the exception text. The cause of this bug is under investigation. 


Comment 49 Anne-Louise Tangring 2012-11-14 17:08:30 UTC
This issue was marked with priority Major and since only Critical or Blocker issues are considered at this time, the issue is moved to TBD EAP 6. Should this be considered, please contact the EAP PM team.

Comment 50 Ladislav Thon 2012-12-07 09:07:59 UTC
Seen this on 6.0.1.ER4.1 as well (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-http-session-netDown-repl-sync/4/console-perf21/). Added 6.0.1.ER4 to _Affects Version/s_ as we don't have 6.0.1.ER4.1 in JIRA.

Comment 51 Ladislav Thon 2013-02-06 09:34:31 UTC
Also affects 6.1.0.DR3.

Comment 54 JBoss JIRA Server 2013-03-07 15:55:06 UTC
Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-2095 to Resolved

Comment 55 JBoss JIRA Server 2013-03-07 15:55:06 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2095

Original issue was fixed, and the fix versions have been released.

Please open a new jira, link to this one, and assign it to me. Please indicate the AS version failing with this, and attach TRACE logs on org.infinispan if available.

Comment 56 JBoss JIRA Server 2013-03-07 16:58:19 UTC
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903

The issue can be reproduced against AS master, the 7.2 branch.

Comment 57 JBoss JIRA Server 2013-03-07 16:58:45 UTC
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903

The issue can be reproduced against AS master or the 7.2 branch.

Comment 59 JBoss JIRA Server 2013-03-07 17:06:48 UTC
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903

TRACE logs from test run on each server.

Comment 60 JBoss JIRA Server 2013-03-08 10:04:51 UTC
Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-2903 to Coding In Progress

Comment 61 JBoss JIRA Server 2013-03-08 11:15:32 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903

I'm a bit confused with this test. I can't find any tests named as AtomicMapTestCase in the AS source tree in https://github.com/jbossas/jboss-as, nor can I find a test that creates a atomic-map.war deployment (in case the name of the test has changed). Why is the test not included there in the first place?? The reason I ask is cos I'm confused by the test HTTP parameters: `operation=put&key=1&value=a`, and the log messages:

{code}Invoked with command PutKeyValueCommand{key=a, value=AtomicHashMap...{code}

Unfortunately, the toString implementation of AtomicHashMap does not show contents any more due to concurrent iterator issues with the previous toString implementation. Looking to see if I can trace back changes to it via atomic hash map delta logging.

Also, is the test still failing in the same line as it did before? Or is failing in a different line?

Comment 62 JBoss JIRA Server 2013-03-08 12:06:17 UTC
Radoslav Husar <rhusar> made a comment on jira ISPN-2903

The test is only attached to this Jira at the moment.

https://issues.jboss.org/secure/attachment/12361174/AtomicMapServlet.java
https://issues.jboss.org/secure/attachment/12361175/AtomicMapTestCase.java

Comment 63 JBoss JIRA Server 2013-03-08 13:42:18 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903

I tried to modify the original test case I created for Infinispan to see if it failed but couldn't get it to fail.

I've tried running it on AS8 and it fails in the same place as it did before:
org.jboss.as.test.clustering.cluster.AtomicMapTestCase.test(AtomicMapTestCase.java:107)

I'll try to add some more TRACE to see more closely how the contents of the AtomicHashMap progress.

Comment 64 JBoss JIRA Server 2013-03-08 13:42:38 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903

I tried to modify the original test case I created for Infinispan to see if it failed but couldn't get it to fail.

I've tried running it on AS8 and it fails in the same place as it did before:
{code}org.jboss.as.test.clustering.cluster.AtomicMapTestCase.test(AtomicMapTestCase.java:107){code}

I'll try to add some more TRACE to see more closely how the contents of the AtomicHashMap progress.

Comment 65 JBoss JIRA Server 2013-03-08 14:56:35 UTC
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903

This test is not part of the AS testsuite because it does not test any integration code.  It's actually testing Infinispan code, i.e. it merely validates that AtomicMap entries are not lost upon eviction.

Comment 66 JBoss JIRA Server 2013-03-08 14:58:45 UTC
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903

This test is not part of the AS testsuite because it does not test any integration code.  If anything, it belongs in the Infinispan testsuite (without the AS test harness, of course).  This test merely validates that AtomicMap entries are not lost upon eviction - and was explicitly setup to reproduce the linked EAP issue.

Comment 67 JBoss JIRA Server 2013-03-08 14:59:47 UTC
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903

Yes - this test is failing at exactly the same spot as it was before.

Comment 68 JBoss JIRA Server 2013-03-08 15:02:19 UTC
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903

Perhaps the cache isn't configured the same way in your test (within the Infinispan testsuite) as it is in ours?  The cache in the AS test has passivation to a file cache store enabled.

Comment 69 JBoss JIRA Server 2013-03-08 15:41:28 UTC
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903

Perhaps the cache isn't configured the same way in your test (within the Infinispan testsuite) as it is in ours?  The cache in the AS test uses passivation to a file cache store.

Comment 70 JBoss JIRA Server 2013-03-08 15:51:09 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903

^ I did check the configuration, there were some differences, but didn't make a difference so far.

Comment 71 JBoss JIRA Server 2013-03-11 11:05:40 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903

Paul/Rado, found the root cause. I was using an unbounded container in testing and the issue only happens with a bounded container, and indeed, it's a side effect of ISPN-2384. I'm working on a fix...

Comment 72 JBoss JIRA Server 2013-03-11 11:34:52 UTC
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903

I've got a fix now, testing... I'll send a pull req as soon as it's ready.

p.s. Paul/Rado, you guys should add org.jboss.as.test.clustering.cluster.AtomicMapTestCase to your automates testsuite to avoid future surprises...

Comment 73 JBoss JIRA Server 2013-03-12 11:51:54 UTC
Mircea Markus <mmarkus> made a comment on jira ISPN-2903

integrated master + 5.2.x

Comment 75 Paul Ferraro 2013-03-20 15:17:14 UTC
The conditions of the NPE look different in 6.1.0.ER3.  Previously the NPE only happened after the eviction period was reached.  In 6.1.0.ER3, the NPE appears immediately after failback.  Investigating...

Comment 76 Jitka Kozana 2013-03-20 15:30:55 UTC
We just started ER3 and we will look for this exception in many more test scenarios. What scenarios/log setup/anything else will be useful for you here?

Comment 77 Paul Ferraro 2013-04-04 14:52:03 UTC
My hunch is that there is still another issue with AtomicMaps and the interaction with the cache store - I'm attempting to reproduce the issue locally.

Comment 78 Paul Ferraro 2013-04-23 15:35:06 UTC
Status: Still under investigation.  I suspect this is to blame for BZ 901164.

Comment 79 Paul Ferraro 2013-04-25 16:22:29 UTC
Quick status update: I've finally managed to reproduce the issue locally.  I'm still isolating the exact cause.

Comment 80 Paul Ferraro 2013-04-26 00:44:35 UTC
Finally...
https://github.com/jbossas/jboss-eap/pull/122

Comment 82 Paul Ferraro 2013-04-30 15:46:50 UTC
The Infinispan team has identified another condition that can trigger this bug, specifically ISPN-2974.  The fix is currently be ported to Infinispan 5.2.x - and should soon be available for consumption by EAP 6.1.
What should we do with the status of this BZ?

Comment 84 Jitka Kozana 2013-05-02 06:22:06 UTC
The issue is still present in ER6.

Link to job (note: this is the first job run today with ER6 where this issue was seen again. If you need to verify if this is present in any other test configuration, please let us know):

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-async/26/

Link to server logs:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-async/26/artifact/report/config/jboss-perf18/server.log

Comment 85 Radoslav Husar 2013-05-02 10:48:25 UTC
Correct, this is to be expected in ER6 since the changes are not in ER6 codebase yet (useful information now would be to build ER6 + the patch and run the test).

Comment 86 Jitka Kozana 2013-05-03 09:03:19 UTC
I have tested this issue with ER6 and ISPN 5.2.6.Final (taken from BZ 956988, comment 36) and the issue is still present. 

We are still seeing the NPE and we are still seeing BZ 901164 (lost sessions after failover). 

For the testing, I have chosen scenario REPL_SYNC and _graceful_ shutdown.

Looking at the test results, it looks like some of the NPE's (and the consequence 901164) were eliminated by the ISPN upgrade, but most likely there is another cause of these NPEs that was not discovered and fixed by the ISPN upgrade.  

@Rado, please elaborate on your comment on https://github.com/jbossas/jboss-eap/pull/122 : the pull 122 was made obsolete by the ISPN upgrade? Brian is asking the same thing on the pull request discussion.

Link to job: 
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/3/

Some statistics on the exception numbers:
client side:
   ER5: 36478 occurences of 900549
     (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync/82/artifact/report/parsed_logs_client/index.html)
   ER6 with ISPN 5.2.6.Final: 6370 occurences of 900549
     (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/3/artifact/report/parsed_logs_client/index.html)

server side:
  ER5: 18241 occurences of 900549
   (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync/82/artifact/report/parsed_logs_server/index.html)
  ER6 with ISPN 5.2.6.Final: 3185 of 900549
   (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/3/artifact/report/parsed_logs_server/index.html)

Comment 87 Paul Ferraro 2013-05-03 12:19:35 UTC
@Jitka
All NPE won't go away completely until ER7, which will contain the following commit:
https://github.com/jbossas/jboss-eap/commit/4db033959d89b46990b54f62b938e7b7158d57e9

Comment 89 Dimitris Andreadis 2013-05-03 13:02:49 UTC
Just a clarification. A second pull request will be used for the Infinispan update on this Bugzilla.

Comment 90 Jitka Kozana 2013-05-03 13:17:11 UTC
@Paul: Can we have a jar (I assume it will be jboss-as-clustering-web-infinispan.jar) with the mentioned commit (comment #87) for testing, please?

Comment 91 Radoslav Husar 2013-05-03 13:43:17 UTC
That would be great, I will shortly attach a jar to test that contains the fix (rather workaround for ISPN-3048) which should be tested together with Infinispan 5.2.6 upgrade that contains the other mentioned fixes.

PS: the comment on PR #122 is about upstream, none of which is relevant to this EAP version.

Comment 92 Radoslav Husar 2013-05-03 14:01:25 UTC
Created attachment 743234 [details]
jar for eap 6.1

Please see attached jar (git revision 4db033959d89b46990b54f62b938e7b7158d57e9).

Just replace over existing jar or update module.xml accordingly.

Comment 94 Rostislav Svoboda 2013-05-04 15:38:22 UTC
Based on last comment about results with ISPN 2.5.6. ++ pull 122 changes I'm changing status back to ASSIGNED.
This needs additional work. Number of exceptions is lower with new ISPN but it's not fully fixed.

This BZ is not used as reference for any PR  I'm aware of.
  https://github.com/jbossas/jboss-eap/pull/122 seems to be merged according https://github.com/jbossas/jboss-eap/commits/6.x
  Infinispan upgrade is tracked on https://bugzilla.redhat.com/show_bug.cgi?id=956988

Comment 95 Paul Ferraro 2013-05-05 01:30:51 UTC
Hmmmm. The test I created locally to reproduce this problem still passes. Can you try re-running this test making the following change to the configuration of the web cache container in the infinispan subsystem?

<file-store fetch-state="false"/>

Comment 96 Paul Ferraro 2013-05-05 03:52:56 UTC
Actually, scratch that...
Instead, re-run with:

<file-store purge="false"/>

Comment 97 Jitka Kozana 2013-05-05 10:44:31 UTC
I have rerun with 
<file-store purge="false"/>
and the issue is still present.

Link to job:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/6/
Link to server log:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/6/artifact/report/config/jboss-perf01/server.log

Please note, the logs are not parsed in this particular run, because I have migrated the test to another segment of lab, so we can test performance.

Comment 98 Jitka Kozana 2013-05-05 11:13:35 UTC
I have rerun with 
<file-store fetch-state="false"/>
and the issue is still present. 

Link to job:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/7/

Link to server log:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/7/artifact/report/config/jboss-perf01/server.log

Again, the logs are not parsed, therefore no exceptions will be found in the output of logparser.

Comment 99 Paul Ferraro 2013-05-05 11:52:22 UTC
Hmm.  It looks like the default isolation level isn't getting applied correctly.  The web (and ejb, for that matter) cache containers should use REPEATABLE_READ isolation (which is the default in the appropriate managed resource), but it looks like, in practice, they are only using READ_COMMITTED.
Please re-run this test using the following change to the caches of the web cache container:
<locking isolation="REPEATABLE_READ"/>

Comment 100 Paul Ferraro 2013-05-05 11:59:02 UTC
If no <locking/> element is called, the default from the LockingResource never gets applied - and Infinispan's default isolation is used.

Comment 101 Jitka Kozana 2013-05-05 12:01:10 UTC
Thanks for the update Paul, just to be clear: no change to the <file-store> element, just the locking isolation, right?

Comment 102 Paul Ferraro 2013-05-05 12:32:55 UTC
Yes, you can drop the changes to <file-store/>.

Comment 104 Paul Ferraro 2013-05-05 15:33:44 UTC
OK - there must still be an issue with merging deltas with passivated atomic maps in Infinispan.  My recommendation at this point, is to remove the blocker status - and document this as still (albeit less of) an issue (the same as it was in 6.0.1) - and recommend that users not enable passivation (it is not on by default).  If memory usage is a concern, it can be mitigated using distribution mode.  I will reopen this issue with the Infinispan team.

Comment 105 Mark Little 2013-05-05 17:59:16 UTC
I agree with Paul, but we should pull in Mircea, Galder and whomever else is necessary to chase this down post 6.1.

Comment 106 John Doyle 2013-05-06 00:15:50 UTC
I agree as well.

Comment 108 Paul Ferraro 2013-05-06 20:12:35 UTC
I've set this to POST, meaning that this issue should be reverified following the resolution of BZ 959875.

Comment 110 Jitka Kozana 2013-05-14 12:25:35 UTC
For future reference: 6.1.0.ER8 run of job with this issue:

https://jenkins.mw.lab.eng.bos.redhat.com/job/eap-6x-failover-http-session-shutdown-repl-async/70/

Comment 111 Aleksandar Kostadinov 2013-05-17 15:33:41 UTC
Created attachment 749389 [details]
server.log when NPE without passivation was thrown

@Paul, I've run clustering failover tests in EC2 and I'm still hitting NPEs and looking at the clustedbench app it does NOT have passivation active. I'm attaching 6.1.0 ER8 full server log and the app itself so you can review.

Comment 112 Aleksandar Kostadinov 2013-05-17 15:35:04 UTC
Created attachment 749390 [details]
clusterbench app with passivation disabled

Comment 114 Paul Ferraro 2013-07-05 03:58:52 UTC
Jitka,
Can you validate that this commit fixes this issue?
https://github.com/pferraro/jboss-eap/commit/f4b710931da30a401cdd2b3aada16e9cc25d5631

Comment 115 Paul Ferraro 2013-07-05 15:35:25 UTC
https://github.com/jbossas/jboss-eap/pull/222

Comment 116 Jitka Kozana 2013-07-08 13:28:44 UTC
Please see: 
https://bugzilla.redhat.com/show_bug.cgi?id=901164#c28

Comment 118 Radoslav Husar 2013-07-16 16:09:21 UTC
Created attachment 774403 [details]
patched *ClusteredSession implementation to always replicate metadata

@Jitka: attaching the patched jar, please run the test with this one. Thanks!

The jar contains patched *ClusteredSession implementation that always replicates metadata by which maintains session correctness but has performance implications.

Comment 119 Jitka Kozana 2013-07-17 06:32:49 UTC
@Rado, it works! no NPE in my test run:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-async/78/

See the parsed logs: 
client: 
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-async/78/artifact/report/parsed_logs_client/index.html

server:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-async/78/artifact/report/parsed_logs_server/index.html

However, I have 2 concerns:
1. I am getting quite a few 500 error code. Please take a look.
2. Please elaborate on "maintains session correctness but has performance implications."

Thank you.

Comment 120 Radoslav Husar 2013-07-17 09:15:18 UTC
Great to see we have one solution to the problem, thanks for the runs. 

@Jitka Could we also get performance stress tests for both write-heavy and read-heavy scenarios to see what is the actual performance implication?


I was looking at the logs and there are no relevant exceptions in the server logs. The SC 500 are due to non-graceful shutdown, i.e. the AJP connector shuts down and the LB tries to connect resulting in SC 500. This is normally solved by session draining or using mod_cluster which this test does not do, moreover we do not retry in these situations as to not mask other problems (we currently retry on 404 and 503 only). The corresponding load balancer log message is as follow:

[Wed Jul 17 02:13:12.288 2013] [7603:139679944914688] [error] ajp_send_request::jk_ajp_common.c (1585): (perf18) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=111)


Just to clear up some confusion from the previous assumtions and comments: this issue happens *on failover* and is not related passivation, nor dist/repl mode, nor to CDI/Weld. The stacktrace for non-CDI application is as follows (if anybody searches for the BZ):

17:15:27,682 ERROR [org.apache.catalina.connector] (http-/127.0.0.1:8080-2) JBWEB001018: An exception or error occurred in the container during the request processing: java.lang.NullPointerException
	at org.jboss.as.web.session.ClusteredSession.update(ClusteredSession.java:973) [jboss-as-web-7.2.0.Final-redhat-X.jar:7.2.0.Final-redhat-X]
	at org.jboss.as.web.session.DistributableSessionManager.loadSession(DistributableSessionManager.java:1397) [jboss-as-web-7.2.0.Final.jar:7.2.0.Final]
	at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:681) [jboss-as-web-7.2.0.Final.jar:7.2.0.Final]
	at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:84) [jboss-as-web-7.2.0.Final.jar:7.2.0.Final]
	at org.apache.catalina.connector.Request.doGetSession(Request.java:2605) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
	at org.apache.catalina.connector.Request.getSession(Request.java:2357) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
	at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:99) [jboss-as-web-7.2.0.Final.jar:7.2.0.Final]
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:653) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
	at java.lang.Thread.run(Thread.java:724) [rt.jar:1.7.0_25]

Updating BZ summary, setting flags, fields..

Comment 121 Jitka Kozana 2013-07-17 10:10:09 UTC
Rado,
thanks for the quick update. So here is what I will do:

1. will try the test with mod_cluster to look at the 500 codes
2. will run the following performance tests:
 - write heavy, scenario session, repl async
 - write heavy, scenario session, repl sync
 - read heavy, sceanario session, repl async

I'll update this BZ once the runs are ready.

Comment 123 Dimitris Andreadis 2013-07-18 13:25:50 UTC
Rado, have you fixed that upstream, too? JIRA/pull requests?

Comment 124 Radoslav Husar 2013-07-23 08:42:00 UTC
Just noting that the fix is not relevant upstream (different implementation).

Comment 125 Aleksandar Kostadinov 2013-07-30 21:20:26 UTC
Created attachment 780843 [details]
EAP 6.1.1 ER3 server log

I just ran a cluster failover test with EAP 6.1.1 ER3 in EC2 and hit the exceptions again. Attaching server log.

Comment 126 Radoslav Husar 2013-07-31 13:29:16 UTC
@Aleks, the fix have made it to 6.1.1.ER4, not earlier.

Comment 127 Ladislav Thon 2013-08-05 11:12:49 UTC
Bug 992988 is possibly related. For now, I decided to file it as a new bug instead of moving this one back to assigned, feel free to close 992988 as a duplicate.

Comment 128 Jitka Kozana 2013-08-05 12:29:25 UTC
If we decide to mark BZ 992988 as a duplicate of this bug, than this bug is still alive and cannot make a transition to the VERIFIED state.

Comment 129 Radoslav Husar 2013-08-05 14:41:05 UTC
BZ 992988 is not duplicate and neither is related to this bug. Treat them separately.

Any intermediate results to share regarding verification of this bug?

Comment 130 Scott Mumford 2013-08-06 03:10:50 UTC
Since this bug now seems to be fixed, the release note text above is out of date.

Is it possible to get an executive summary of what was causing this bug (beyond 'failover') and how it was resolved for the updated note.

Thanks in advance.

Comment 131 Jitka Kozana 2013-08-06 08:20:45 UTC
No more NPEs seen: verified during EAP 6.1.1.ER4 testing cycle.

Comment 132 Aleksandar Kostadinov 2013-08-07 05:56:28 UTC
I'm also positive the problem is finally fixed. I run the tests several times with ER4 in the EC2 where the issue was very easy to reproduce. Then increased the load and ran for a day. There was no single NPE.

Cheers!

Comment 133 Radoslav Husar 2013-08-07 10:23:55 UTC
Thanks Aleks for confirming on EC2 too.

Comment 134 Scott Mumford 2013-08-26 00:57:16 UTC
Setting for exclusion from 6.1.1 Release notes and removing NEEDINFO request.

Comment 136 Radoslav Husar 2013-08-30 17:23:00 UTC
Correcting documentation related fields: should be documented as fixed issue.

Comment 137 Russell Dickenson 2013-09-01 23:57:07 UTC
Attention: Jitka Kudrnacova

The reason this issue was marked as to be excluded from Release Notes is that the deadline for information on the status of bugs for inclusion was Friday, August 30. Although this issue has now been fixed, we have no draft information from which to write a Release Notes entry. In the absence of such information we had no choice but to exclude it.


Note You need to log in before you can comment on or make changes to this bug.