Created attachment 884568 [details]
perf13.log

Description of problem:
There is a TimeoutException when starting the portal in cluster mode. Follow the steps below to reproduce.

Version-Release number of selected component (if applicable):
rhjp6.2.dr02

Steps to Reproduce:
1. I used two machines in the lab (perf13, perf14).
2. Start the H2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server
3. Start the portal (on perf13): sh standalone.sh -b perf13.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf13
4. Start the portal (on perf14): sh standalone.sh -b perf14.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf14
5. There are exceptions in the log on perf13 while perf14 is starting (see the attached perf13.log).
6. When I tried a quick sanity check on perf13, I got other exceptions (perf13_01.log).

Additional info:
There are additional pages on perf14 (see perf14_pages.png).
Created attachment 884569 [details] perf13_01.log
Created attachment 884570 [details] perf14_pages
Hi,

I have one important doubt about the steps that can affect this case:

"2. start h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server"

In a cluster environment the database should be unique; running two different databases at the same time can create unexpected behaviour. Please, could you try to reproduce the steps using a shared database for both nodes? The issue will probably remain, but then we can get a closer trace of it.

Meanwhile I'm going to set up a similar environment to reproduce it.

Thanks,
Lucas
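To make the shared-database suggestion concrete, a sketch of one possible setup (the host name is taken from the reproduce steps, but the database name and flags are assumptions, not from the reported environment):

```shell
# Run ONE H2 server, reachable from both nodes
# (-tcp and -tcpAllowOthers are standard org.h2.tools.Server options)
java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar \
     org.h2.tools.Server -tcp -tcpAllowOthers

# Both portal nodes would then point their datasource at the same server,
# with a connection-url of the form (database name "jdprtdb" is hypothetical):
#   jdbc:h2:tcp://perf13.mw.lab.eng.bos.redhat.com/~/jdprtdb
```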
I confirm I could reproduce the issue with a single H2 database for both nodes. Investigating.
Issue found:
- clusterName="" is not set up properly because the ${infinispan-cluster-name} system property is not defined.

Workaround:
- Start the ha configuration with this property set to the same value on every node, for example:

bin/standalone.sh -b node1 -c standalone-ha.xml -Djboss.node.name=node1 -Dinfinispan-cluster-name=gatein-portal
bin/standalone.sh -b node2 -c standalone-ha.xml -Djboss.node.name=node2 -Dinfinispan-cluster-name=gatein-portal

I'm going to prepare a fix to define this property by default.
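For reference, the same default could also be wired into the server configuration instead of being passed on every start; a minimal sketch, assuming the usual JBoss standalone.xml layout (the value "gatein-portal" is the one used elsewhere in this report, not a confirmed default):

```xml
<!-- in standalone-ha.xml, directly after </extensions> -->
<system-properties>
    <!-- must match on every node so they join the same cluster -->
    <property name="infinispan-cluster-name" value="gatein-portal"/>
</system-properties>
```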
Another issue found:
- RSVP.ack_on_delivery=true in the JGroups configuration, where Infinispan recommends setting it to false to avoid deadlocks.

Preliminary tests suggest this may also be related to the overall issue.

[1] https://issues.jboss.org/browse/ISPN-2612
[2] https://issues.jboss.org/browse/ISPN-2713
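For clarity, the change is a one-attribute edit on the RSVP protocol in the JGroups stack; a sketch (the resend_interval and timeout values here are illustrative, only ack_on_delivery is the point):

```xml
<!-- JGroups protocol stack fragment (e.g. gatein-udp.xml) -->
<RSVP resend_interval="500"
      timeout="60000"
      ack_on_delivery="false"/>
```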
I tried applying the workaround suggested by Dan Berindei [1] and it removed the TimeoutException. However, the redundant pages (which shouldn't be visible) are still there, even after I added the "infinispan-cluster-name" property.

Additional steps:
- Set RSVP.ack_on_delivery=false in gatein/gatein.ear/portal.war/WEB-INF/classes/jgroups/gatein-udp.xml, as suggested in [1], on both nodes (all the other gatein JGroups configs already have this set).
- Set -Dinfinispan-cluster-name=gatein-portal on both nodes.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1087244#c2
Created attachment 888599 [details]
No repeated pages screenshot

I've sent a PR for master: https://github.com/gatein/gatein-portal/pull/832

With this fix and a clean database I can't reproduce the repeated pages. Please, could you repeat the test with a clean database? Maybe there is an issue with the initial database; that would help us to scope it.

Thanks,
Lucas
Yes, it was the db. The additional pages were caused by data from Portal 6.1, which we used for comparison. I don't see them with a clean db. (Standalone H2 stores the dbs in ${user.home} instead of /data and I forgot to clean them between the tests.)
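For future test runs, the standalone H2 server can be pointed at an explicit directory so stale databases don't accumulate under ${user.home}; a sketch, assuming /data/h2 as the chosen path:

```shell
# Start the H2 TCP server with an explicit database directory
# (-baseDir is a standard org.h2.tools.Server option; /data/h2 is an assumed path)
java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar \
     org.h2.tools.Server -baseDir /data/h2

# Cleaning up between test runs is then just removing the database files there
rm -f /data/h2/*.h2.db
```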
https://github.com/gatein/gatein-portal/pull/832 has been merged upstream.
Verified in ER02