Bug 1085927 - org.jgroups.TimeoutException when starting two nodes in cluster
Summary: org.jgroups.TimeoutException when starting two nodes in cluster
Keywords:
Status: VERIFIED
Alias: None
Product: JBoss Enterprise Portal Platform 6
Classification: JBoss
Component: Portal
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ER02
Target Release: 6.2.0
Assignee: Lucas Ponce
QA Contact: vramik
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-04-09 16:24 UTC by vramik
Modified: 2014-07-09 05:35 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
It was discovered that some variables in the JGroups and Infinispan configurations (ack_on_delivery and clusterName) were not properly defined. The improperly defined variables were causing TimeoutException errors in a clustered environment. The JGroups and Infinispan configurations have been updated to define the ack_on_delivery and clusterName variables correctly, which fixes the TimeoutException errors originally encountered.
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments
perf13.log (8.22 KB, text/x-log), 2014-04-09 16:24 UTC, vramik
perf13_01.log (29.72 KB, text/x-log), 2014-04-09 16:25 UTC, vramik
perf14_pages (100.42 KB, image/png), 2014-04-09 16:25 UTC, vramik
No repeated pages screenshot (181.19 KB, image/png), 2014-04-22 17:11 UTC, Lucas Ponce


Links
Red Hat Issue Tracker GTNPORTAL-3451 (Major, Resolved): Infinispan transport clusterName property not properly defined. Last updated: 2018-12-27 07:45:47 UTC

Description vramik 2014-04-09 16:24:59 UTC
Created attachment 884568 [details]
perf13.log

Description of problem:
There is a TimeoutException when starting the portal in cluster mode. Follow the steps below to reproduce.

Version-Release number of selected component (if applicable):
rhjp6.2.dr02

Steps to Reproduce:
1. I've used two machines in the lab (perf13, perf14)
2. start h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server
3. start portal (on perf13): sh standalone.sh -b perf13.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf13
4. start portal (on perf14): sh standalone.sh -b perf14.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf14
5. There are exceptions in the log on perf13 when perf14 is being started (see attached perf13.log)
6. When I tried to do a quick sanity check on perf13, I got more exceptions (see perf13_01.log)

Additional info:
There are additional pages on perf14 (see perf14_pages.png)

Comment 1 vramik 2014-04-09 16:25:32 UTC
Created attachment 884569 [details]
perf13_01.log

Comment 2 vramik 2014-04-09 16:25:59 UTC
Created attachment 884570 [details]
perf14_pages

Comment 4 Lucas Ponce 2014-04-22 09:44:02 UTC
Hi,

I have one important doubt about the steps that can affect this case:

"2. start h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server"

In a cluster environment, there should be a single shared database; having two different databases at the same time can cause unexpected behaviour.

Please, could you try to reproduce the steps using a shared database for both nodes?

The issue will probably remain, but then we will have a closer trace of it.

Meanwhile I'm going to setup a similar environment to reproduce it.

Thanks,
Lucas

Comment 5 Lucas Ponce 2014-04-22 11:38:04 UTC
I confirm I could reproduce the issue with a single h2 database for both nodes.

Investigating.

Comment 7 Lucas Ponce 2014-04-22 12:34:43 UTC
Issue found:

- clusterName="" is not set up properly because the ${infinispan-cluster-name} system property is not defined.
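
For reference, a minimal sketch of how the Infinispan transport cluster name is typically wired to that system property (illustrative only; the element names follow the Infinispan 5.x schema, and the exact GateIn configuration file and surrounding attributes may differ):

   <global>
      <!-- resolves to an empty value when the system property is not defined -->
      <transport clusterName="${infinispan-cluster-name}"/>
   </global>

Per the report above, the undefined property leaves clusterName empty, so the nodes do not join the intended cluster.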

Workaround:

- Start the HA configuration with this variable properly defined (using the same value on both nodes), for example:

 bin/standalone.sh -b node1 -c standalone-ha.xml -Djboss.node.name=node1 -Dinfinispan-cluster-name=gatein-portal

 bin/standalone.sh -b node2 -c standalone-ha.xml -Djboss.node.name=node2 -Dinfinispan-cluster-name=gatein-portal


I'm going to prepare a fix to define this property by default.
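
One possible shape for such a default (a sketch only, not the actual fix; the property name matches the workaround above and the value gatein-portal is just an example) is to declare the system property in standalone-ha.xml:

   <!-- hypothetical default so -Dinfinispan-cluster-name is no longer required on the command line -->
   <system-properties>
      <property name="infinispan-cluster-name" value="gatein-portal"/>
   </system-properties>

With that in place, both nodes resolve ${infinispan-cluster-name} to the same value without extra startup arguments.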

Comment 8 Lucas Ponce 2014-04-22 15:00:14 UTC
Another issue found:

- RSVP.ack_on_delivery=true in the JGroups configuration, whereas Infinispan recommends setting it to false to avoid deadlocks.

Some preliminary tests suggest this can also be related to the overall issue (see the configuration sketch after the references below).

[1] https://issues.jboss.org/browse/ISPN-2612
[2] https://issues.jboss.org/browse/ISPN-2713
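
For illustration, the relevant fragment of a JGroups stack file such as gatein-udp.xml with the recommended setting might look like the following; the timeout and resend_interval values are only examples:

   <config xmlns="urn:org:jgroups">
      <!-- other protocols in the stack (UDP, PING, etc.) omitted -->
      <RSVP timeout="10000" resend_interval="2000" ack_on_delivery="false"/>
   </config>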

Comment 9 Tomas Kyjovsky 2014-04-22 16:40:12 UTC
I tried applying the workaround suggested by Dan Berindei [1] and it removed the TimeoutException. However, the redundant pages (which shouldn't be visible) are still there, even after I added the "infinispan-cluster-name" property.


Additional Steps:

- set RSVP.ack_on_delivery=false in gatein/gatein.ear/portal.war/WEB-INF/classes/jgroups/gatein-udp.xml as suggested in [1] for both nodes
   (all the other gatein jgroups configs already have this set)

- set -Dinfinispan-cluster-name=gatein-portal for both nodes


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1087244#c2

Comment 10 Lucas Ponce 2014-04-22 17:11:04 UTC
Created attachment 888599 [details]
No repeated pages screenshot

I've sent a PR for master in

https://github.com/gatein/gatein-portal/pull/832

With this fix and a clean database, I can't reproduce the repeated pages.

Please, could you repeat the test with a clean database?

Maybe there is an issue with the initial database; this can help us to scope it.

Thanks,
Lucas

Comment 11 Tomas Kyjovsky 2014-04-23 17:02:25 UTC
Yes, it was the db. The additional pages were caused by data from Portal 6.1, which we used for comparison. I don't see them with a clean db.

(Standalone h2 stores the dbs in ${user.home} instead of /data and I forgot to clean them between the tests.)

Comment 12 Peter Palaga 2014-04-23 19:40:50 UTC
https://github.com/gatein/gatein-portal/pull/832 was merged in upstream.

Comment 13 vramik 2014-05-13 21:50:53 UTC
Verified in ER02

