Bug 1085927
| Field | Value |
|---|---|
| Summary | org.jgroups.TimeoutException when starting two nodes in cluster |
| Product | [JBoss] JBoss Enterprise Portal Platform 6 |
| Component | Portal |
| Version | 6.2.0 |
| Status | CLOSED UPSTREAM |
| Severity | urgent |
| Priority | unspecified |
| Reporter | vramik |
| Assignee | Lucas Ponce <lponce> |
| QA Contact | vramik |
| CC | epp-bugs, ppalaga, tkyjovsk |
| Target Milestone | ER02 |
| Target Release | 6.2.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2025-02-10 03:35:35 UTC |

Doc Text: It was discovered that some variables in JGroups and Infinispan (ack_on_delivery and clusterName) were not properly defined. The improperly defined variables were causing TimeoutException errors in a clustered environment. The JGroups and Infinispan configurations have been updated to define the ack_on_delivery and clusterName variables correctly, which fixes the TimeoutException errors originally encountered.
Created attachment 884569 [details]
perf13_01.log
Created attachment 884570 [details]
perf14_pages
Hi, I have one important doubt about the steps that can affect this case:

"2. start h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server"

In a clustered environment the database should be unique; running two separate databases at the same time can cause unexpected behaviour. Please, could you try to reproduce the steps using a single shared database for both nodes? The issue will probably remain, but we would then have a cleaner trace of it. Meanwhile I'm going to set up a similar environment to reproduce it.

Thanks, Lucas

I confirm I could reproduce the issue with a single h2 database for both nodes. Investigating.

Issue found:
- clusterName="" is not set up properly because the ${infinispan-cluster-name} system variable is not defined.
Workaround:
- Start the ha configuration with the property defined, for example:
bin/standalone.sh -b node1 -c standalone-ha.xml -Djboss.node.name=node1 -Dinfinispan-cluster-name=gatein-cluster
bin/standalone.sh -b node2 -c standalone-ha.xml -Djboss.node.name=node2 -Dinfinispan-cluster-name=gatein-portal
I'm going to prepare a fix to define this property by default.
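A default could be supplied directly in the configuration via a JBoss-style property expression with an inline fallback, so the cluster name resolves even when -Dinfinispan-cluster-name is not passed. The sketch below is illustrative only; the exact file and surrounding elements in the GateIn Infinispan configuration may differ, and the actual fix is in the PR referenced later:

```xml
<!-- Illustrative sketch, not the exact shipped configuration.
     The expression falls back to "gatein-cluster" when the
     infinispan-cluster-name system property is undefined. -->
<global>
  <transport clusterName="${infinispan-cluster-name:gatein-cluster}" />
</global>
```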
Another issue found:
- RSVP.ack_on_delivery=true in the JGroups configuration, where Infinispan recommends setting it to false to avoid deadlocks. Preliminary tests suggest this may also be related to the overall issue.

[1] https://issues.jboss.org/browse/ISPN-2612
[2] https://issues.jboss.org/browse/ISPN-2713

I tried applying the workaround suggested by Dan Berindei [1] and it removed the TimeoutException. However the redundant pages (which shouldn't be visible) are still there, even after I added the "infinispan-cluster-name" property.

Additional Steps:
- set RSVP.ack_on_delivery=false in gatein/gatein.ear/portal.war/WEB-INF/classes/jgroups/gatein-udp.xml as suggested in [1] for both nodes (all the other gatein jgroups configs already have this set)
- set -Dinfinispan-cluster-name=gatein-portal for both nodes

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1087244#c2

Created attachment 888599 [details]
No repeated pages screenshot

I've sent a PR for master: https://github.com/gatein/gatein-portal/pull/832

With this fix and a clean database I can't reproduce the repeated pages. Please, could you repeat the test with a clean database? There may be an issue with the initial database; this could help us scope it.

Thanks, Lucas

Yes, it was the db. The additional pages were caused by data from Portal 6.1 which we used for comparison. I don't see them with a clean db.
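The RSVP change above amounts to flipping one attribute on the RSVP protocol entry in gatein-udp.xml. A minimal sketch of what that entry might look like (the timeout and resend_interval values shown are illustrative, not the exact shipped values):

```xml
<!-- Sketch of the RSVP protocol entry in
     gatein/gatein.ear/portal.war/WEB-INF/classes/jgroups/gatein-udp.xml.
     ack_on_delivery="false" is the setting Infinispan recommends to
     avoid deadlocks (see ISPN-2612 / ISPN-2713). -->
<RSVP timeout="60000"
      resend_interval="500"
      ack_on_delivery="false"/>
```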
(Standalone h2 stores the dbs in ${user.home} instead of /data and I forgot to clean them between the tests.)
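Since standalone H2 resolves relative database paths against ${user.home}, stale database files there can carry data from a previous portal version into the next test run. A minimal cleanup sketch, assuming file-based H2 databases in $HOME with the *.h2.db suffix (the default naming for H2 1.3.x page-store databases):

```shell
# Assumption: file-based H2 databases live in $HOME as *.h2.db files.
H2_DIR="${H2_DIR:-$HOME}"
if ls "$H2_DIR"/*.h2.db >/dev/null 2>&1; then
  echo "stale H2 databases found in $H2_DIR:"
  ls -1 "$H2_DIR"/*.h2.db
  # rm -f "$H2_DIR"/*.h2.db   # uncomment to delete them between test runs
else
  echo "no stale H2 databases in $H2_DIR"
fi
```

The deletion line is left commented out so the script is safe to run as an inspection step first.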
https://github.com/gatein/gatein-portal/pull/832 was merged upstream.

Verified in ER02.

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.
Created attachment 884568 [details]
perf13.log

Description of problem:
There is a TimeoutException when starting the portal in cluster mode. Follow the steps to reproduce.

Version-Release number of selected component (if applicable): rhjp6.2.dr02

Steps to Reproduce:
1. I've used two machines in the lab (perf13, perf14)
2. start the h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server
3. start the portal (on perf13): sh standalone.sh -b perf13.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf13
4. start the portal (on perf14): sh standalone.sh -b perf14.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf14
5. There are exceptions in the log on perf13 when perf14 is being started (see the attached perf13.log)
6. When I tried to do a quick sanity check on perf13, I got other exceptions (perf13_01.log)

Additional info:
There are additional pages on perf14 (see perf14_pages.png)