Bug 1085927 - org.jgroups.TimeoutException when starting two nodes in cluster
Status: VERIFIED
Product: JBoss Enterprise Portal Platform 6
Classification: JBoss
Component: Portal
Version: 6.2.0
Hardware: Unspecified Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ER02
Target Release: 6.2.0
Assigned To: Lucas Ponce
QA Contact: vramik
Docs Contact:
Reported: 2014-04-09 12:24 EDT by vramik
Modified: 2014-07-09 01:35 EDT
Doc Type: Bug Fix
Doc Text:
It was discovered that some variables in JGroups and Infinispan (ack_on_delivery and clusterName) were not properly defined. The improperly defined variables were causing TimeoutException errors in a clustered environment. The JGroups and Infinispan configurations have been updated to define the ack_on_delivery and clusterName variables correctly, which fixes the TimeoutException errors originally encountered.
Type: Bug

Attachments
perf13.log (8.22 KB, text/x-log)
2014-04-09 12:24 EDT, vramik
perf13_01.log (29.72 KB, text/x-log)
2014-04-09 12:25 EDT, vramik
perf14_pages (100.42 KB, image/png)
2014-04-09 12:25 EDT, vramik
No repeated pages screenshot (181.19 KB, image/png)
2014-04-22 13:11 EDT, Lucas Ponce


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
JBoss Issue Tracker | GTNPORTAL-3451 | Major | Resolved | Infinispan transport clusterName property not properly defined | 2018-09-07 12:32 EDT

Description vramik 2014-04-09 12:24:59 EDT
Created attachment 884568
perf13.log

Description of problem:
A TimeoutException is thrown when starting the portal in cluster mode. See the steps to reproduce below.

Version-Release number of selected component (if applicable):
rhjp6.2.dr02

Steps to Reproduce:
1. I used two machines in the lab (perf13, perf14).
2. Start the h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server
3. Start the portal (on perf13): sh standalone.sh -b perf13.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf13
4. Start the portal (on perf14): sh standalone.sh -b perf14.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf14
5. Exceptions appear in the log on perf13 while perf14 is starting (see attached perf13.log).
6. When I tried a quick sanity check on perf13, I got further exceptions (see perf13_01.log).

Additional info:
There are additional pages on perf14 (see perf14_pages.png).
Comment 1 vramik 2014-04-09 12:25:32 EDT
Created attachment 884569
perf13_01.log
Comment 2 vramik 2014-04-09 12:25:59 EDT
Created attachment 884570
perf14_pages
Comment 4 Lucas Ponce 2014-04-22 05:44:02 EDT
Hi,

I have one important question about a step that could affect this case:

"2. start h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server"

In a cluster environment, all nodes should share a single database; running two separate databases at the same time can cause unexpected behaviour.

Could you please try to reproduce the steps using one shared database for both nodes?

The issue will probably remain, but then we will have a cleaner trace of it.

Meanwhile, I'm going to set up a similar environment to reproduce it.

Thanks,
Lucas
Comment 5 Lucas Ponce 2014-04-22 07:38:04 EDT
I confirm that I could reproduce the issue with a single h2 database shared by both nodes.

Investigating.
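
One possible way to run a single H2 instance that both nodes can reach is sketched below. This is only an assumption about the setup, not the configuration actually used here: the helper name `h2_shared_server_cmd` is hypothetical, the jar path is the one from the steps to reproduce, and pointing the portal datasource at the shared server is not shown. The script prints the command rather than running it.

```shell
#!/bin/sh
# Jar path as given in the steps to reproduce (relative to the portal home).
H2_JAR="modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar"

# Hypothetical helper: print a command that starts one H2 TCP server
# reachable from other hosts (-tcp and -tcpAllowOthers are H2 Server options).
h2_shared_server_cmd() {
    printf 'java -cp %s org.h2.tools.Server -tcp -tcpAllowOthers\n' "$H2_JAR"
}

h2_shared_server_cmd
```

The printed command would be run on one machine only; the second node would then connect to that server instead of starting its own.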
Comment 7 Lucas Ponce 2014-04-22 08:34:43 EDT
Issue found:

- clusterName="" is not set up properly because the ${infinispan-cluster-name} system variable is not defined.

Workaround:

- Start the ha configuration with this variable defined, using the same value on every node, for example:

bin/standalone.sh -b node1 -c standalone-ha.xml -Djboss.node.name=node1 -Dinfinispan-cluster-name=gatein-portal

bin/standalone.sh -b node2 -c standalone-ha.xml -Djboss.node.name=node2 -Dinfinispan-cluster-name=gatein-portal


I'm going to prepare a fix to define this property by default.
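
Until that fix is available, the workaround can be wrapped in a small launcher so that no node is started without the property. This is a minimal sketch, assuming (as comment 9 does) that both nodes should use the same cluster name; the function name `portal_start_cmd` and the `CLUSTER_NAME` variable are hypothetical, and the script prints the commands rather than running them.

```shell
#!/bin/sh
# CLUSTER_NAME is an assumed variable; the value must be identical on every node.
CLUSTER_NAME="${CLUSTER_NAME:-gatein-portal}"

# Hypothetical helper: print the start command for one node.
portal_start_cmd() {
    # $1 = bind address, also used as the node name (node1, node2, ...)
    printf 'bin/standalone.sh -b %s -c standalone-ha.xml -Djboss.node.name=%s -Dinfinispan-cluster-name=%s\n' \
        "$1" "$1" "$CLUSTER_NAME"
}

portal_start_cmd node1
portal_start_cmd node2
```

Keeping the cluster name in one variable avoids the nodes drifting apart through a typo on one command line.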
Comment 8 Lucas Ponce 2014-04-22 11:00:14 EDT
Another issue found:

- RSVP.ack_on_delivery=true is set in the JGroups configuration, whereas Infinispan recommends setting it to false to avoid deadlocks.

Some preliminary tests suggest this may also be related to the overall issue.

[1] https://issues.jboss.org/browse/ISPN-2612
[2] https://issues.jboss.org/browse/ISPN-2713
Comment 9 Tomas Kyjovsky 2014-04-22 12:40:12 EDT
I tried applying the workaround suggested by Dan Berindei [1] and it removed the TimeoutException. However, the redundant pages (which shouldn't be visible) are still there, even after I added the "infinispan-cluster-name" property.


Additional Steps:

- set RSVP.ack_on_delivery=false in gatein/gatein.ear/portal.war/WEB-INF/classes/jgroups/gatein-udp.xml as suggested in [1] for both nodes
   (all the other gatein jgroups configs already have this set)

- set -Dinfinispan-cluster-name=gatein-portal for both nodes


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1087244#c2
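
The first additional step can be checked on each node with a quick grep. This is a rough sketch, assuming the setting appears in the file as the XML attribute ack_on_delivery="false"; the function name `rsvp_ack_disabled` is hypothetical and the default path is the one quoted above, relative to the local deployment directory.

```shell
#!/bin/sh
# Hypothetical check: succeed if a JGroups config file contains
# ack_on_delivery="false" (assumed to be written as an XML attribute).
rsvp_ack_disabled() {
    # $1 = path to a JGroups XML file
    grep -q 'ack_on_delivery="false"' "$1"
}

# Default path taken from the step above; adjust the prefix locally.
CONFIG="${1:-gatein/gatein.ear/portal.war/WEB-INF/classes/jgroups/gatein-udp.xml}"
if [ -f "$CONFIG" ]; then
    if rsvp_ack_disabled "$CONFIG"; then
        echo "ack_on_delivery is false in $CONFIG"
    else
        echo "ack_on_delivery is not set to false in $CONFIG"
    fi
else
    echo "config not found: $CONFIG"
fi
```

The same function can be pointed at the other gatein jgroups configs to confirm they already carry the setting.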
Comment 10 Lucas Ponce 2014-04-22 13:11:04 EDT
Created attachment 888599
No repeated pages screenshot

I've sent a PR against master:

https://github.com/gatein/gatein-portal/pull/832

With this fix and a clean database I can't reproduce the repeated pages.

Could you please repeat the test with a clean database?

There may be an issue with the initial database; this would help us narrow it down.

Thanks,
Lucas
Comment 11 Tomas Kyjovsky 2014-04-23 13:02:25 EDT
Yes, it was the db. The additional pages were caused by data from portal 6.1 which we used for comparison. I don't see them with clean db.

(Standalone h2 stores the dbs in ${user.home} instead of /data and I forgot to clean them between the tests.)
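
Leftovers of this kind can be spotted before a test run. The following is a minimal sketch, assuming the H2 1.3.x on-disk naming (*.h2.db); the function name `list_h2_files` is hypothetical, and the script only lists files, it does not delete anything.

```shell
#!/bin/sh
# Hypothetical helper: list H2 database files directly inside a directory.
# Assumes H2 1.3.x file naming (*.h2.db); nothing is removed.
list_h2_files() {
    # $1 = directory to inspect (defaults to the user's home directory)
    find "${1:-$HOME}" -maxdepth 1 -type f -name '*.h2.db'
}

list_h2_files "$HOME"
```

Any file it prints is a candidate for removal before starting a run that is meant to use a clean database.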
Comment 12 Peter Palaga 2014-04-23 15:40:50 EDT
https://github.com/gatein/gatein-portal/pull/832 was merged upstream.
Comment 13 vramik 2014-05-13 17:50:53 EDT
Verified in ER02
