Affects: Release Notes
Complexity: High
Workaround: Workaround Exists
Workaround Description: Turn off session passivation completely.
project_key: JBPAPP6

An NPE is thrown in roughly 0.36% of HTTP session request processing in the test. This results in response code 503 being returned to the client along with the exception. The test uses the passivation-enabled WAR of clusterbench: https://github.com/rhusar/clusterbench

Here is a shorter soak test run that uncovered the issue: https://hudson.qa.jboss.com/hudson/view/EAP6/view/EAP6-Clustering-Soak/job/eap-6x-soak-http-repl-async-web-perf17only/6/

{noformat}
[JBossINF] 19:52:10,113 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host]] (ajp-perf20/10.16.90.58:8009-2570) Exception sending request initialized lifecycle event to listener instance of class org.jboss.weld.servlet.WeldListener: java.lang.NullPointerException
[JBossINF]     at org.jboss.as.web.session.ClusteredSession.update(ClusteredSession.java:972) [jboss-as-web-7.1.2.Final-redhat-1.jar:7.1.2.Final-redhat-1]
[JBossINF]     at org.jboss.as.web.session.DistributableSessionManager.loadSession(DistributableSessionManager.java:1377) [jboss-as-web-7.1.2.Final-redhat-1.jar:7.1.2.Final-redhat-1]
[JBossINF]     at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:673) [jboss-as-web-7.1.2.Final-redhat-1.jar:7.1.2.Final-redhat-1]
[JBossINF]     at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:84) [jboss-as-web-7.1.2.Final-redhat-1.jar:7.1.2.Final-redhat-1]
[JBossINF]     at org.apache.catalina.connector.Request.doGetSession(Request.java:2618) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at org.apache.catalina.connector.Request.getSession(Request.java:2375) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:841) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.getSession(LazySessionBeanStore.java:72) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF]     at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.<init>(LazySessionBeanStore.java:58) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF]     at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:31) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF]     at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:16) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF]     at org.jboss.weld.servlet.WeldListener.requestInitialized(WeldListener.java:134) [weld-core-1.1.8.Final-redhat-1.jar:1.1.8.Final-redhat-1]
[JBossINF]     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:143) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:368) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:505) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:452) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931) [jbossweb-7.0.16.Final-redhat-1.jar:]
[JBossINF]     at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_30]
{noformat}
Security: Removed: JBoss Internal Added: Public
Linking to related passivation issue JBPAPP-9186 which is a logging issue but might be covering another issue.
Link: Added: This issue relates to JBPAPP-9186
Link: Added: This issue Cloned to JBPAPP-9206
Labels: Added: eap6_need_triage
Release Notes Docs Status: Added: Not Yet Documented Affects: Added: Release Notes
This is not a blocker (it doesn't affect the default clustering configuration for web sessions) - though it is critical enough that it needs to be fixed for 6.0.1. I noticed that the NPE is triggered outside the scope of the Infinispan batch/transaction (notice ClusteredSessionValve does not appear in the stacktrace) - which might be the source of the problem.
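For illustration only, the race suspected above can be sketched in a few lines of plain Java (all names here are hypothetical stand-ins, not the actual ClusteredSession code): a request thread loads the session outside the Infinispan batch, a concurrent passivation clears its in-memory state, and the subsequent update() dereferences null.

```java
// Minimal, self-contained sketch of the suspected failure mode. Hypothetical
// names throughout; the real code lives in org.jboss.as.web.session.*.

final class DistributableMetadata {
    long timestamp;
}

final class SessionSketch {
    private volatile DistributableMetadata metadata = new DistributableMetadata();

    // Simulates a concurrent passivation clearing in-memory state.
    void passivate() {
        this.metadata = null;
    }

    // Mirrors the unguarded access pattern that would throw the NPE.
    void update(long timestamp) {
        this.metadata.timestamp = timestamp; // NPE if passivated in between
    }
}

public class NpeSketch {
    public static void main(String[] args) {
        SessionSketch session = new SessionSketch();
        session.passivate(); // races with the request thread in the real server
        try {
            session.update(System.currentTimeMillis());
            System.out.println("no-npe");
        } catch (NullPointerException e) {
            System.out.println("npe"); // what the ERROR log above records
        }
    }
}
```

Running the update inside the batch/transaction (via ClusteredSessionValve) would serialize it against passivation, which is why its absence from the stack trace is suspicious.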
Release Notes Docs Status: Removed: Not Yet Documented Added: Documented as Known Issue Release Notes Text: Added: A <code>WeldListener: java.lang.NullPointerException</code> is thrown occasionally when using session passivation. The client receives a HTTP response <code>503</code>, along with the exception text. The cause of this bug is under investigation.
Docs QE Status: Removed: NEW
Workaround Description: Added: Turn off session passivation completely. Workaround: Added: Workaround Exists Complexity: Added: High
This is not a blocker for the EAP 6 release, but we need to document in the release notes that it is broken.
Link: Added: This issue depends ISPN-2095
Labels: Removed: eap6_need_triage Added: eap601candidate
Labels: Removed: eap601candidate Added: as713tracking eap601candidate
Resolved in upstream for 7.1.3 tag.
Still seeing this NPE exception in EAP 6.0.1.ER2, please find the server logs below: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-sync/74/console-perf19/ Reopening.
Reopened today. Moving to 6.0.1.ER4.
The linked job was definitely using a SNAPSHOT and not EAP; just look for this line: {noformat}[JBossINF] 16:49:45,510 INFO [org.jboss.as] (Controller Boot Thread) JBAS015874: JBoss AS 7.1.3.Final-SNAPSHOT "Arges" started in 5639ms - Started 397 of 559 services (161 services are passive or on-demand){noformat} I fixed the job and it simply needs to be rerun: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-sync/
Reran the job [1]; the NPE is not present. [1] https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-sync/
Verified, closing.
Writer: Added: mistysj
reopening to update release notes (these are still documented as known issues from the previous release)
Writer: Removed: mistysj Added: Darrin
Setting these to "Not Yet Documented" since they required updating from known to resolved issues.
Release Notes Docs Status: Removed: Documented as Known Issue Added: Not Yet Documented
Link: Added: This issue is duplicated by JBPAPP-10226
Attaching the full server log from 6.0.2 ER2 where the problem manifested itself. It seems that the problem was not completely fixed. I'm not sure, though, that the clusterbench program has the exact same configuration Rado had before. Additionally, my environment (unlike Rado's) is: EC2 large instances, RHEL6, S3_PING discovery protocol, jgroups tcp stack. The problem is also hard to reproduce: out of 289165 samples, only 34 failed, and it took several hours for that to happen. Please let me know if you need any additional details.
Attachment: Added: server-6.0.2.ER2.log.gz
Link: Added: This issue depends WELD-831
Resolving, as docs edits can be made in RESOLVED but not in CLOSED. Darrin, in the future can you open CLOSED bugs and then just immediately resolve them with the same status they were closed with?
This issue is neither a blocker nor critical for EAP 6.0.1. Thus it is moved to TBD EAP 6 to be triaged for the next EAP 6.x release.
Release Notes Docs Status: Removed: Not Yet Documented Added: Documented as Known Issue
Release Notes Text: Removed: A <code>WeldListener: java.lang.NullPointerException</code> is thrown occasionally when using session passivation. The client receives a HTTP response <code>503</code>, along with the exception text. The cause of this bug is under investigation. Added: A `WeldListener: java.lang.NullPointerException` is thrown occasionally when using session passivation. The client receives a HTTP response `503`, along with the exception text. The cause of this bug is under investigation.
{code}
02:05:31,548 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[default-host]] (ajp-perf18/10.16.90.54:8009-360) Exception sending request initialized lifecycle event to listener instance of class org.jboss.weld.servlet.WeldListener: java.lang.NullPointerException
    at org.jboss.as.web.session.ClusteredSession.update(ClusteredSession.java:979) [jboss-as-web-7.1.3.Final-redhat-3.jar:7.1.3.Final-redhat-3]
    at org.jboss.as.web.session.DistributableSessionManager.loadSession(DistributableSessionManager.java:1377) [jboss-as-web-7.1.3.Final-redhat-3.jar:7.1.3.Final-redhat-3]
    at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:673) [jboss-as-web-7.1.3.Final-redhat-3.jar:7.1.3.Final-redhat-3]
    at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:84) [jboss-as-web-7.1.3.Final-redhat-3.jar:7.1.3.Final-redhat-3]
    at org.apache.catalina.connector.Request.doGetSession(Request.java:2618) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at org.apache.catalina.connector.Request.getSession(Request.java:2375) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at org.apache.catalina.connector.RequestFacade.getSession(RequestFacade.java:841) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.getSession(LazySessionBeanStore.java:72) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
    at org.jboss.weld.context.beanstore.http.LazySessionBeanStore.<init>(LazySessionBeanStore.java:58) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
    at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:31) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
    at org.jboss.weld.context.http.HttpSessionContextImpl.associate(HttpSessionContextImpl.java:16) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
    at org.jboss.weld.servlet.WeldListener.requestInitialized(WeldListener.java:134) [weld-core-1.1.10.Final-redhat-1.jar:1.1.10.Final-redhat-1]
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:143) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:505) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:453) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931) [jbossweb-7.0.17.Final-redhat-1.jar:]
    at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_30]
{code}
Seen again during ER3 testing here: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-http-session-netDown-dist-sync/lastSuccessfulBuild/artifact/report/config/jboss-perf18/server.log
Raising to critical. From our last testing results, it seems the issue is still present, even on an application with no passivation configured. It was seen by Aleks during his EC2 testing and now again by us in ER3 testing. The effect on customers/users of this NPE is {code} java.net.SocketTimeoutException: Read timed out {code} which means the customer's request is lost.
This issue has been triaged by the EAP PM team and it was decided it is not a blocker for the EAP 6.0.1 release. Changing to Major.
Labels: Removed: as713tracking eap601candidate Added: as713tracking
Release Notes Docs Status: Removed: Documented as Known Issue Writer: Removed: Darrin Release Notes Text: Removed: A `WeldListener: java.lang.NullPointerException` is thrown occasionally when using session passivation. The client receives a HTTP response `503`, along with the exception text. The cause of this bug is under investigation.
This issue was marked with priority Major and since only Critical or Blocker issues are considered at this time, the issue is moved to TBD EAP 6. Should this be considered, please contact the EAP PM team.
Seen this on 6.0.1.ER4.1 as well (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Failover/job/eap-6x-failover-http-session-netDown-repl-sync/4/console-perf21/). Added 6.0.1.ER4 to _Affects Version/s_ as we don't have 6.0.1.ER4.1 in JIRA.
Also affects 6.1.0.DR3.
Affects 6.1.0.ER1: Seen in majority of our tests. For example here: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/view/EAP6/view/EAP6-Clustering/view/EAP6-Failover/job/eap-6x-failover-http-session-shutdown-repl-async/62/artifact/report/config/jboss-perf18/server.log
Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-2095 to Resolved
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2095 Original issue was fixed, and the fix versions have been released. Please open a new jira, link to this one, and assign it to me. Please indicate the AS version failing with this, and attach TRACE logs on org.infinispan if available.
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903 The issue can be reproduced against AS master or the 7.2 branch.
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903 TRACE logs from test run on each server.
Galder Zamarreño <galder.zamarreno> updated the status of jira ISPN-2903 to Coding In Progress
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903 I'm a bit confused by this test. I can't find any test named AtomicMapTestCase in the AS source tree in https://github.com/jbossas/jboss-as, nor can I find a test that creates an atomic-map.war deployment (in case the name of the test has changed). Why is the test not included there in the first place? The reason I ask is because I'm confused by the test HTTP parameters: `operation=put&key=1&value=a`, and the log messages: {code}Invoked with command PutKeyValueCommand{key=a, value=AtomicHashMap...{code} Unfortunately, the toString implementation of AtomicHashMap does not show contents any more due to concurrent iterator issues with the previous toString implementation. Looking to see if I can trace back changes to it via atomic hash map delta logging. Also, is the test still failing on the same line as it did before? Or is it failing on a different line?
Radoslav Husar <rhusar> made a comment on jira ISPN-2903 The test is only attached to this Jira at the moment. https://issues.jboss.org/secure/attachment/12361174/AtomicMapServlet.java https://issues.jboss.org/secure/attachment/12361175/AtomicMapTestCase.java
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903 I tried to modify the original test case I created for Infinispan to see if it failed but couldn't get it to fail. I've tried running it on AS8 and it fails in the same place as it did before: {code}org.jboss.as.test.clustering.cluster.AtomicMapTestCase.test(AtomicMapTestCase.java:107){code} I'll try to add some more TRACE to see more closely how the contents of the AtomicHashMap progress.
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903 This test is not part of the AS testsuite because it does not test any integration code. If anything, it belongs in the Infinispan testsuite (without the AS test harness, of course). This test merely validates that AtomicMap entries are not lost upon eviction - and was explicitly setup to reproduce the linked EAP issue.
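The invariant this test checks can be illustrated with a self-contained simulation (plain HashMaps standing in for Infinispan's data container and file cache store; hypothetical names throughout — this is not the attached test case):

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained illustration of the invariant described above: with
// passivation, an entry evicted from memory is moved to the store, and a
// later read must transparently activate it - the entry must not be lost.
public class PassivationSketch {
    private final Map<String, Map<String, String>> memory = new HashMap<>();
    private final Map<String, Map<String, String>> store = new HashMap<>();

    void put(String key, Map<String, String> atomicMap) {
        memory.put(key, atomicMap);
    }

    // Eviction with passivation: move (not drop) the entry to the store.
    void evict(String key) {
        Map<String, String> value = memory.remove(key);
        if (value != null) {
            store.put(key, value);
        }
    }

    // Activation: a miss in memory falls back to the store.
    Map<String, String> get(String key) {
        Map<String, String> value = memory.get(key);
        if (value == null) {
            value = store.remove(key);
            if (value != null) {
                memory.put(key, value);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        PassivationSketch cache = new PassivationSketch();
        Map<String, String> entries = new HashMap<>();
        entries.put("1", "a");
        cache.put("session", entries);
        cache.evict("session");
        // The bug class here manifested as entries missing after eviction.
        System.out.println(cache.get("session").get("1"));
    }
}
```

The bug under investigation was exactly a violation of this round trip for AtomicMap values: after eviction/passivation, a delta merge could leave the map's entries (or the session metadata derived from them) missing.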
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903 Yes - this test is failing at exactly the same spot as it was before.
Paul Ferraro <paul.ferraro> made a comment on jira ISPN-2903 Perhaps the cache isn't configured the same way in your test (within the Infinispan testsuite) as it is in ours? The cache in the AS test uses passivation to a file cache store.
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903 ^ I did check the configuration; there were some differences, but they haven't made a difference so far.
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903 Paul/Rado, found the root cause. I was using an unbounded container in testing and the issue only happens with a bounded container, and indeed, it's a side effect of ISPN-2384. I'm working on a fix...
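For context, the bounded vs. unbounded distinction above corresponds to whether the cache carries an <eviction/> element in the infinispan subsystem; a bounded (and thus affected) configuration would look roughly like this (a sketch, with illustrative values, not the exact configuration used in the failing tests):

```xml
<replicated-cache name="repl" mode="ASYNC" batching="true">
    <!-- The eviction element makes the data container bounded; once
         max-entries is exceeded, entries are passivated to the store. -->
    <eviction strategy="LRU" max-entries="10000"/>
    <file-store/>
</replicated-cache>
```

An unbounded container (no <eviction/> element) never passivates under memory pressure, which is why the issue did not reproduce in Galder's original test.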
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2903 I've got a fix now, testing... I'll send a pull req as soon as it's ready. P.S. Paul/Rado, you guys should add org.jboss.as.test.clustering.cluster.AtomicMapTestCase to your automated testsuite to avoid future surprises...
Mircea Markus <mmarkus> made a comment on jira ISPN-2903 integrated master + 5.2.x
I'm still seeing this in EAP 6.1.0.ER3, see https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-async/65/console-perf18/
The conditions of the NPE look different in 6.1.0.ER3. Previously the NPE only happened after the eviction period was reached. In 6.1.0.ER3, the NPE appears immediately after failback. Investigating...
We just started ER3 and we will look for this exception in many more test scenarios. What scenarios/log setup/anything else will be useful for you here?
My hunch is that there is still another issue with AtomicMaps and the interaction with the cache store - I'm attempting to reproduce the issue locally.
Status: Still under investigation. I suspect this is to blame for BZ 901164.
Quick status update: I've finally managed to reproduce the issue locally. I'm still isolating the exact cause.
Finally... https://github.com/jbossas/jboss-eap/pull/122
The Infinispan team has identified another condition that can trigger this bug, specifically ISPN-2974. The fix is currently being ported to Infinispan 5.2.x and should soon be available for consumption by EAP 6.1. What should we do with the status of this BZ?
The issue is still present in ER6. Link to job (note: this is the first job run today with ER6 where this issue was seen again. If you need to verify if this is present in any other test configuration, please let us know): https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-async/26/ Link to server logs: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-undeploy-dist-async/26/artifact/report/config/jboss-perf18/server.log
Correct, this is to be expected in ER6 since the changes are not in the ER6 codebase yet (useful information now would be to build ER6 + the patch and run the test).
I have tested this issue with ER6 and ISPN 5.2.6.Final (taken from BZ 956988, comment 36) and the issue is still present. We are still seeing the NPE and we are still seeing BZ 901164 (lost sessions after failover). For the testing, I chose the REPL_SYNC scenario and _graceful_ shutdown. Looking at the test results, it looks like some of the NPEs (and the consequent 901164) were eliminated by the ISPN upgrade, but most likely there is another cause of these NPEs that was not discovered and fixed by the ISPN upgrade. @Rado, please elaborate on your comment on https://github.com/jbossas/jboss-eap/pull/122 : was pull 122 made obsolete by the ISPN upgrade? Brian is asking the same thing on the pull request discussion. Link to job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/3/

Some statistics on the exception numbers:

Client side:
- ER5: 36478 occurrences of 900549 (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync/82/artifact/report/parsed_logs_client/index.html)
- ER6 with ISPN 5.2.6.Final: 6370 occurrences of 900549 (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/3/artifact/report/parsed_logs_client/index.html)

Server side:
- ER5: 18241 occurrences of 900549 (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync/82/artifact/report/parsed_logs_server/index.html)
- ER6 with ISPN 5.2.6.Final: 3185 occurrences of 900549 (https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/3/artifact/report/parsed_logs_server/index.html)
@Jitka The NPEs won't go away completely until ER7, which will contain the following commit: https://github.com/jbossas/jboss-eap/commit/4db033959d89b46990b54f62b938e7b7158d57e9
Just a clarification. A second pull request will be used for the Infinispan update on this Bugzilla.
@Paul: Can we have a jar (I assume it will be jboss-as-clustering-web-infinispan.jar) with the mentioned commit (comment #87) for testing, please?
That would be great. I will shortly attach a jar to test that contains the fix (rather, a workaround for ISPN-3048), which should be tested together with the Infinispan 5.2.6 upgrade that contains the other mentioned fixes. PS: the comment on PR #122 is about upstream, none of which is relevant to this EAP version.
Created attachment 743234 [details] jar for eap 6.1 Please see attached jar (git revision 4db033959d89b46990b54f62b938e7b7158d57e9). Just replace over existing jar or update module.xml accordingly.
I have tested ER6 with the new jar (comment #92) and ISPN 5.2.6.Final, and this issue is still present. I have double-checked my EAP zip: the jars are present and their module.xml files are properly edited.

Link to server logs:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/4/artifact/report/config/jboss-perf18/server.log
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/4/artifact/report/config/jboss-perf19/server.log
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/4/artifact/report/config/jboss-perf20/server.log
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/4/artifact/report/config/jboss-perf21/server.log
Client logs: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/4/console-perf17
Parsed server logs: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/4/artifact/report/parsed_logs_server/index.html
Parsed client logs: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/4/artifact/report/parsed_logs_client/index.html
Based on the last comment about results with ISPN 5.2.6.Final plus the pull 122 changes, I'm changing status back to ASSIGNED. This needs additional work. The number of exceptions is lower with the new ISPN, but it's not fully fixed. This BZ is not used as a reference for any PR I'm aware of. https://github.com/jbossas/jboss-eap/pull/122 seems to be merged according to https://github.com/jbossas/jboss-eap/commits/6.x The Infinispan upgrade is tracked on https://bugzilla.redhat.com/show_bug.cgi?id=956988
Hmmmm. The test I created locally to reproduce this problem still passes. Can you try re-running this test making the following change to the configuration of the web cache container in the infinispan subsystem? <file-store fetch-state="false"/>
Actually, scratch that... Instead, re-run with: <file-store purge="false"/>
I have rerun with <file-store purge="false"/> and the issue is still present. Link to job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/6/ Link to server log: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/6/artifact/report/config/jboss-perf01/server.log Please note, the logs are not parsed in this particular run, because I have migrated the test to another segment of lab, so we can test performance.
I have rerun with <file-store fetch-state="false"/> and the issue is still present. Link to job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/7/ Link to server log: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/7/artifact/report/config/jboss-perf01/server.log Again, the logs are not parsed, therefore no exceptions will be found in the output of logparser.
Hmm. It looks like the default isolation level isn't getting applied correctly. The web (and ejb, for that matter) cache containers should use REPEATABLE_READ isolation (which is the default in the appropriate managed resource), but it looks like, in practice, they are only using READ_COMMITTED. Please re-run this test using the following change to the caches of the web cache container: <locking isolation="REPEATABLE_READ"/>
If no <locking/> element is specified, the default from the LockingResource never gets applied - and Infinispan's default isolation is used.
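Concretely, the suggested change spells the element out inside each cache of the web cache container in the infinispan subsystem; a sketch following the shape of the default standalone-ha.xml (other attributes illustrative):

```xml
<cache-container name="web" default-cache="repl" module="org.jboss.as.clustering.web.infinispan">
    <transport lock-timeout="60000"/>
    <replicated-cache name="repl" mode="SYNC" batching="true">
        <!-- Explicit element: without it, Infinispan's own default isolation
             is used instead of the managed-resource default. -->
        <locking isolation="REPEATABLE_READ"/>
        <file-store/>
    </replicated-cache>
</cache-container>
```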
Thanks for the update Paul, just to be clear: no change to the <file-store> element, just the locking isolation, right?
Yes, you can drop the changes to <file-store/>.
No, <locking isolation="REPEATABLE_READ"/> is not helping either. Still seeing NPEs and lost sessions (BZ 901164). Link to job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/9/ Parsed server logs: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/9/artifact/report/parsed_logs_server/index.html Parsed client logs: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-sync-900549/9/artifact/report/parsed_logs_client/index.html
OK - there must still be an issue with merging deltas with passivated atomic maps in Infinispan. My recommendation at this point is to remove the blocker status, document this as still (albeit less of) an issue (the same as it was in 6.0.1), and recommend that users not enable passivation (it is not on by default). If memory usage is a concern, it can be mitigated using distribution mode. I will reopen this issue with the Infinispan team.
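The distribution-mode mitigation amounts to pointing the web cache container at its dist cache instead of the default repl cache; a sketch following the shape of the default standalone-ha.xml (attribute values illustrative, not a verified configuration):

```xml
<cache-container name="web" default-cache="dist" module="org.jboss.as.clustering.web.infinispan">
    <transport lock-timeout="60000"/>
    <!-- Distribution keeps only `owners` copies of each session cluster-wide,
         limiting per-node memory use without relying on passivation. -->
    <distributed-cache name="dist" mode="ASYNC" batching="true" owners="2" l1-lifespan="0"/>
</cache-container>
```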
I agree with Paul, but we should pull in Mircea, Galder and whomever else is necessary to chase this down post 6.1.
I agree as well.
I've set this to POST, meaning that this issue should be reverified following the resolution of BZ 959875.
For future reference: 6.1.0.ER8 run of job with this issue: https://jenkins.mw.lab.eng.bos.redhat.com/job/eap-6x-failover-http-session-shutdown-repl-async/70/
Created attachment 749389 [details] server.log when NPE without passivation was thrown @Paul, I've run clustering failover tests in EC2 and I'm still hitting NPEs, and looking at the clusterbench app it does NOT have passivation active. I'm attaching the 6.1.0 ER8 full server log and the app itself so you can review.
Created attachment 749390 [details] clusterbench app with passivation disabled
Jitka, Can you validate that this commit fixes this issue? https://github.com/pferraro/jboss-eap/commit/f4b710931da30a401cdd2b3aada16e9cc25d5631
https://github.com/jbossas/jboss-eap/pull/222
Please see: https://bugzilla.redhat.com/show_bug.cgi?id=901164#c28
The patch does NOT fix this problem. See this job: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-async/77/ And the parsed client logs: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-async/77/artifact/report/parsed_logs_client/index.html
Created attachment 774403 [details] patched *ClusteredSession implementation to always replicate metadata @Jitka: attaching the patched jar, please run the test with this one. Thanks! The jar contains a patched *ClusteredSession implementation that always replicates metadata, which maintains session correctness but has performance implications.
@Rado, it works! No NPE in my test run: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-async/78/ See the parsed logs: client: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-async/78/artifact/report/parsed_logs_client/index.html server: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-shutdown-repl-async/78/artifact/report/parsed_logs_server/index.html However, I have 2 concerns: 1. I am getting quite a few 500 error codes. Please take a look. 2. Please elaborate on "maintains session correctness but has performance implications." Thank you.
Great to see we have one solution to the problem, thanks for the runs. @Jitka Could we also get performance stress tests for both write-heavy and read-heavy scenarios to see what the actual performance implication is? I was looking at the logs and there are no relevant exceptions in the server logs. The SC 500 are due to non-graceful shutdown, i.e. the AJP connector shuts down and the LB tries to connect, resulting in SC 500. This is normally solved by session draining or by using mod_cluster, which this test does not do; moreover we do not retry in these situations so as not to mask other problems (we currently retry on 404 and 503 only). The corresponding load balancer log message is as follows: [Wed Jul 17 02:13:12.288 2013] [7603:139679944914688] [error] ajp_send_request::jk_ajp_common.c (1585): (perf18) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=111) Just to clear up some confusion from the previous assumptions and comments: this issue happens *on failover* and is not related to passivation, nor to dist/repl mode, nor to CDI/Weld.
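The retry rule mentioned above (retry on 404 and 503 only, so that other failures stay visible) boils down to a one-line predicate; a sketch with hypothetical naming, not the actual test-client code:

```java
// Sketch of the failover test client's retry policy as described above:
// 404/503 are treated as transient failover artifacts and retried, while
// anything else (e.g. the SC 500 from a closed AJP connector) is surfaced
// so that real problems are not masked.
public class RetryPolicy {
    static boolean shouldRetry(int statusCode) {
        return statusCode == 404 || statusCode == 503;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(503)); // transient during failover
        System.out.println(shouldRetry(500)); // surfaced, not retried
    }
}
```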
The stack trace for a non-CDI application is as follows (for anybody searching for the BZ):

{code}
17:15:27,682 ERROR [org.apache.catalina.connector] (http-/127.0.0.1:8080-2) JBWEB001018: An exception or error occurred in the container during the request processing: java.lang.NullPointerException
    at org.jboss.as.web.session.ClusteredSession.update(ClusteredSession.java:973) [jboss-as-web-7.2.0.Final-redhat-X.jar:7.2.0.Final-redhat-X]
    at org.jboss.as.web.session.DistributableSessionManager.loadSession(DistributableSessionManager.java:1397) [jboss-as-web-7.2.0.Final.jar:7.2.0.Final]
    at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:681) [jboss-as-web-7.2.0.Final.jar:7.2.0.Final]
    at org.jboss.as.web.session.DistributableSessionManager.findSession(DistributableSessionManager.java:84) [jboss-as-web-7.2.0.Final.jar:7.2.0.Final]
    at org.apache.catalina.connector.Request.doGetSession(Request.java:2605) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
    at org.apache.catalina.connector.Request.getSession(Request.java:2357) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
    at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:99) [jboss-as-web-7.2.0.Final.jar:7.2.0.Final]
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:653) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920) [jbossweb-7.2.0.Final.jar:7.2.0.Final]
    at java.lang.Thread.run(Thread.java:724) [rt.jar:1.7.0_25]
{code}

Updating BZ summary, setting flags and fields.
Rado, thanks for the quick update. So here is what I will do:
1. will try the test with mod_cluster to look at the 500 codes
2. will run the following performance tests:
- write heavy, scenario session, repl async
- write heavy, scenario session, repl sync
- read heavy, scenario session, repl async
I'll update this BZ once the runs are ready.
Rado, have you fixed that upstream, too? JIRA/pull requests?
Just noting that the fix is not relevant upstream (different implementation).
Created attachment 780843 [details] EAP 6.1.1 ER3 server log I just ran a cluster failover test with EAP 6.1.1 ER3 in EC2 and hit the exceptions again. Attaching server log.
@Aleks, the fix has made it into 6.1.1.ER4, not earlier.
Bug 992988 is possibly related. For now, I decided to file it as a new bug instead of moving this one back to assigned, feel free to close 992988 as a duplicate.
If we decide to mark BZ 992988 as a duplicate of this bug, then this bug is still alive and cannot make a transition to the VERIFIED state.
BZ 992988 is not a duplicate, nor is it related to this bug. Treat them separately. Any intermediate results to share regarding verification of this bug?
Since this bug now seems to be fixed, the release note text above is out of date. Is it possible to get an executive summary of what was causing this bug (beyond 'failover') and how it was resolved, for the updated note? Thanks in advance.
No more NPEs seen: verified during EAP 6.1.1.ER4 testing cycle.
I'm also positive the problem is finally fixed. I ran the tests several times with ER4 in EC2, where the issue was very easy to reproduce. Then I increased the load and ran for a day. There was not a single NPE. Cheers!
Thanks Aleks for confirming on EC2 too.
Setting for exclusion from 6.1.1 Release notes and removing NEEDINFO request.
Correcting documentation related fields: should be documented as fixed issue.
Attention: Jitka Kudrnacova The reason this issue was marked as to be excluded from Release Notes is that the deadline for information on the status of bugs for inclusion was Friday, August 30. Although this issue has now been fixed, we have no draft information from which to write a Release Notes entry. In the absence of such information we had no choice but to exclude it.