Bug 1085927 - org.jgroups.TimeoutException when starting two nodes in cluster
Summary: org.jgroups.TimeoutException when starting two nodes in cluster
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: JBoss Enterprise Portal Platform 6
Classification: JBoss
Component: Portal
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ER02
Target Release: 6.2.0
Assignee: Lucas Ponce
QA Contact: vramik
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-04-09 16:24 UTC by vramik
Modified: 2025-02-10 03:35 UTC

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2025-02-10 03:35:35 UTC
Type: Bug
Embargoed:


Attachments
perf13.log (8.22 KB, text/x-log)
2014-04-09 16:24 UTC, vramik
perf13_01.log (29.72 KB, text/x-log)
2014-04-09 16:25 UTC, vramik
perf14_pages (100.42 KB, image/png)
2014-04-09 16:25 UTC, vramik
No repeated pages screenshot (181.19 KB, image/png)
2014-04-22 17:11 UTC, Lucas Ponce


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Issue Tracker | GTNPORTAL-3451 | 0 | Major | Resolved | Infinispan transport clusterName property not properly defined | 2018-12-27 07:45:47 UTC

Description vramik 2014-04-09 16:24:59 UTC
Created attachment 884568
perf13.log

Description of problem:
A TimeoutException is thrown when starting the portal in cluster mode. Follow the steps below to reproduce.

Version-Release number of selected component (if applicable):
rhjp6.2.dr02

Steps to Reproduce:
1. I've used two machines in lab (perf13, perf14)
2. start h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server
3. start portal (on perf13): sh standalone.sh -b perf13.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf13
4. start portal (on perf14): sh standalone.sh -b perf14.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf14
5. Exceptions appear in the log on perf13 while perf14 is starting (see attached perf13.log)
6. When I tried a quick sanity check on perf13, I got further exceptions (perf13_01.log)
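
A quick way to check a node's log for the exception from steps 5 and 6 is sketched below. This is only an illustration: `count_timeouts` is a hypothetical helper, and the log path in the usage line is an assumption (the report does not say which log file the attachments were taken from).

```shell
#!/bin/sh
# Hypothetical helper: count log lines that mention the JGroups timeout.
# Prints the count; a missing file is reported as 0.
count_timeouts() {
    log_file="$1"
    if [ -f "$log_file" ]; then
        # grep -c exits non-zero when nothing matches but still prints 0,
        # so the exit status is ignored here.
        grep -c 'org.jgroups.TimeoutException' "$log_file" || true
    else
        echo 0
    fi
}
```

Possible usage, assuming the server log is at standalone/log/server.log: `count_timeouts standalone/log/server.log`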

Additional info:
There are additional pages on perf14 (see perf14_pages.png)

Comment 1 vramik 2014-04-09 16:25:32 UTC
Created attachment 884569
perf13_01.log

Comment 2 vramik 2014-04-09 16:25:59 UTC
Created attachment 884570
perf14_pages

Comment 4 Lucas Ponce 2014-04-22 09:44:02 UTC
Hi,

I have one important question about a step that may affect this case:

"2. start h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server"

In a cluster environment the database should be shared; running two different databases at the same time can cause unexpected behaviour.

Could you please try to reproduce the steps using a single shared database for both nodes?

The issue will probably remain, but then we will have a cleaner trace of it.

Meanwhile, I'm going to set up a similar environment to reproduce it.

Thanks,
Lucas

Comment 5 Lucas Ponce 2014-04-22 11:38:04 UTC
I confirm I could reproduce the issue with a single h2 database shared by both nodes.

Investigating.

Comment 7 Lucas Ponce 2014-04-22 12:34:43 UTC
Issue found:

- clusterName="" is not set up properly because the ${infinispan-cluster-name} system variable is not defined.

Workaround:

- Start the ha configuration with this variable defined, using the same cluster name on every node, for example:

bin/standalone.sh -b node1 -c standalone-ha.xml -Djboss.node.name=node1 -Dinfinispan-cluster-name=gatein-portal

bin/standalone.sh -b node2 -c standalone-ha.xml -Djboss.node.name=node2 -Dinfinispan-cluster-name=gatein-portal


I'm going to prepare a fix to define this property by default.
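
A minimal sketch of the workaround as a script, assuming both nodes are meant to join one cluster and therefore pass the same name (gatein-portal here, the value a later comment sets on both nodes). `start_command` and `CLUSTER_NAME` are hypothetical names used for illustration; the script only prints the commands and starts nothing.

```shell
#!/bin/sh
# Shared cluster name (assumption: one value for all nodes).
CLUSTER_NAME="${CLUSTER_NAME:-gatein-portal}"

# Hypothetical helper: print the start command for one node.
# The node name is used for both the bind address and jboss.node.name,
# as in the commands from this comment.
start_command() {
    node="$1"
    printf 'bin/standalone.sh -b %s -c standalone-ha.xml -Djboss.node.name=%s -Dinfinispan-cluster-name=%s\n' \
        "$node" "$node" "$CLUSTER_NAME"
}

start_command node1
start_command node2
```

Keeping the name in a single variable avoids the two nodes drifting to different values.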

Comment 8 Lucas Ponce 2014-04-22 15:00:14 UTC
Another issue found:

- RSVP.ack_on_delivery=true is set in the JGroups configuration, whereas Infinispan recommends setting it to false to avoid deadlocks.

Some preliminary tests suggest this may also be related to the overall issue.

[1] https://issues.jboss.org/browse/ISPN-2612
[2] https://issues.jboss.org/browse/ISPN-2713

Comment 9 Tomas Kyjovsky 2014-04-22 16:40:12 UTC
I tried applying the workaround suggested by Dan Berindei [1] and it removed the TimeoutException. However the redundant pages (that shouldn't be visible) are still there, even when I added the "infinispan-cluster-name" property.


Additional Steps:

- set RSVP.ack_on_delivery=false in gatein/gatein.ear/portal.war/WEB-INF/classes/jgroups/gatein-udp.xml as suggested in [1] for both nodes
   (all the other gatein jgroups configs already have this set)

- set -Dinfinispan-cluster-name=gatein-portal for both nodes


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1087244#c2
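
The first additional step could be scripted roughly as follows. This is a sketch under the assumption that the setting appears in the XML as the attribute ack_on_delivery="true" with double quotes; `disable_ack_on_delivery` is a hypothetical helper, not part of the portal distribution.

```shell
#!/bin/sh
# Hypothetical helper: rewrite ack_on_delivery="true" to "false" in a
# JGroups XML file. The result is written to a temporary file first, so
# the original is replaced only if sed succeeds.
disable_ack_on_delivery() {
    config_file="$1"
    tmp_file="${config_file}.tmp"
    sed 's/ack_on_delivery="true"/ack_on_delivery="false"/g' "$config_file" > "$tmp_file" \
        && mv "$tmp_file" "$config_file"
}
```

Possible usage on each node: `disable_ack_on_delivery gatein/gatein.ear/portal.war/WEB-INF/classes/jgroups/gatein-udp.xml`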

Comment 10 Lucas Ponce 2014-04-22 17:11:04 UTC
Created attachment 888599
No repeated pages screenshot

I've sent a PR against master:

https://github.com/gatein/gatein-portal/pull/832

With this fix and a clean database I can't reproduce repeated pages.

Could you please repeat the test with a clean database?

Maybe there is an issue with the initial database; this can help us scope it.

Thanks,
Lucas

Comment 11 Tomas Kyjovsky 2014-04-23 17:02:25 UTC
Yes, it was the db. The additional pages were caused by data from portal 6.1 which we used for comparison. I don't see them with clean db.

(Standalone h2 stores the dbs in ${user.home} instead of /data and I forgot to clean them between the tests.)
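
To spot leftover databases between test runs, something like the following could be used. It is a sketch only: the file suffixes (.h2.db, .lock.db, .trace.db) are an assumption about how this H2 version names its files, and `list_h2_files` is a hypothetical helper that lists files without deleting anything.

```shell
#!/bin/sh
# Hypothetical helper: list H2 database files directly under a directory
# (default: the user's home). Dry run only, nothing is removed.
list_h2_files() {
    dir="${1:-$HOME}"
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.h2.db' -o -name '*.lock.db' -o -name '*.trace.db' \)
}
```

Reviewing the output of `list_h2_files` before each run, and removing stale files by hand, keeps data from an earlier portal version out of the test.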

Comment 12 Peter Palaga 2014-04-23 19:40:50 UTC
https://github.com/gatein/gatein-portal/pull/832 was merged in upstream.

Comment 13 vramik 2014-05-13 21:50:53 UTC
Verified in ER02

Comment 15 Red Hat Bugzilla 2025-02-10 03:35:35 UTC
This product has been discontinued or is no longer tracked in Red Hat Bugzilla.

