Bug 1085927 - org.jgroups.TimeoutException when starting two nodes in cluster
Summary: org.jgroups.TimeoutException when starting two nodes in cluster
Keywords:
Status: VERIFIED
Alias: None
Product: JBoss Enterprise Portal Platform 6
Classification: JBoss
Component: Portal
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ER02
Target Release: 6.2.0
Assignee: Lucas Ponce
QA Contact: vramik
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-04-09 16:24 UTC by vramik
Modified: 2014-07-09 05:35 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
It was discovered that some variables in the JGroups and Infinispan configurations (ack_on_delivery and clusterName) were not properly defined. The improperly defined variables were causing TimeoutException errors in a clustered environment. The JGroups and Infinispan configurations have been updated to define the ack_on_delivery and clusterName variables correctly, which fixes the TimeoutException errors originally encountered.
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments
perf13.log (8.22 KB, text/x-log), 2014-04-09 16:24 UTC, vramik
perf13_01.log (29.72 KB, text/x-log), 2014-04-09 16:25 UTC, vramik
perf14_pages (100.42 KB, image/png), 2014-04-09 16:25 UTC, vramik
No repeated pages screenshot (181.19 KB, image/png), 2014-04-22 17:11 UTC, Lucas Ponce


Links
Red Hat Issue Tracker GTNPORTAL-3451 (Major, Resolved): Infinispan transport clusterName property not properly defined. Last updated: 2018-12-27 07:45:47 UTC

Description vramik 2014-04-09 16:24:59 UTC
Created attachment 884568 [details]
perf13.log

Description of problem:
There is a TimeoutException when starting the portal in cluster mode. Follow the steps below to reproduce.

Version-Release number of selected component (if applicable):
rhjp6.2.dr02

Steps to Reproduce:
1. I've used two machines in the lab (perf13, perf14)
2. start h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server
3. start portal (on perf13): sh standalone.sh -b perf13.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf13
4. start portal (on perf14): sh standalone.sh -b perf14.mw.lab.eng.bos.redhat.com -c standalone-ha.xml -Djboss.node.name=perf14
5. There are exceptions in the log on perf13 when perf14 is being started (see attached perf13.log)
6. When I tried to do a quick sanity check on perf13, I got more exceptions (see perf13_01.log)

Additional info:
There are additional pages on perf14 (see perf14_pages.png)

Comment 1 vramik 2014-04-09 16:25:32 UTC
Created attachment 884569 [details]
perf13_01.log

Comment 2 vramik 2014-04-09 16:25:59 UTC
Created attachment 884570 [details]
perf14_pages

Comment 4 Lucas Ponce 2014-04-22 09:44:02 UTC
Hi,

I have one important doubt about the steps that can affect this case:

"2. start h2 db on both machines: java -cp modules/system/layers/base/com/h2database/h2/main/h2-1.3.168-redhat-2.jar org.h2.tools.Server"

In a cluster environment, there should be a single shared database; having two different databases at the same time can cause unexpected behaviour.

Please, could you try to reproduce the steps using a shared database for both nodes?

The issue will probably remain, but then we will have a closer trace of it.

Meanwhile I'm going to setup a similar environment to reproduce it.

Thanks,
Lucas

Comment 5 Lucas Ponce 2014-04-22 11:38:04 UTC
I confirm I could reproduce the issue with a single h2 database for both nodes.

Investigating.

Comment 7 Lucas Ponce 2014-04-22 12:34:43 UTC
Issue found:

- clusterName="" is not set up properly because the ${infinispan-cluster-name} system property is not defined.
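
For reference, a minimal sketch of how the Infinispan transport cluster name is typically wired to that system property (illustrative only; the element names follow the Infinispan 5.x schema, and the exact GateIn configuration file and surrounding attributes may differ):

   <global>
      <!-- resolves to an empty value when the system property is not defined -->
      <transport clusterName="${infinispan-cluster-name}"/>
   </global>

Per the report above, the undefined property leaves clusterName empty, so the nodes do not join the intended cluster.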

Workaround:

- Start the HA configuration with this variable properly defined (using the same value on both nodes), for example:

 bin/standalone.sh -b node1 -c standalone-ha.xml -Djboss.node.name=node1 -Dinfinispan-cluster-name=gatein-portal

 bin/standalone.sh -b node2 -c standalone-ha.xml -Djboss.node.name=node2 -Dinfinispan-cluster-name=gatein-portal


I'm going to prepare a fix to define this property by default.
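
One possible shape for such a default (a sketch only, not the actual fix; the property name matches the workaround above and the value gatein-portal is just an example) is to declare the system property in standalone-ha.xml:

   <!-- hypothetical default so -Dinfinispan-cluster-name is no longer required on the command line -->
   <system-properties>
      <property name="infinispan-cluster-name" value="gatein-portal"/>
   </system-properties>

With that in place, both nodes resolve ${infinispan-cluster-name} to the same value without extra startup arguments.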

Comment 8 Lucas Ponce 2014-04-22 15:00:14 UTC
Another issue found:

- RSVP.ack_on_delivery=true in the JGroups configuration, whereas Infinispan recommends setting it to false to avoid deadlocks.

Some preliminary tests suggest this can also be related to the overall issue (see the configuration sketch after the references below).

[1] https://issues.jboss.org/browse/ISPN-2612
[2] https://issues.jboss.org/browse/ISPN-2713
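
For illustration, the relevant fragment of a JGroups stack file such as gatein-udp.xml with the recommended setting might look like the following; the timeout and resend_interval values are only examples:

   <config xmlns="urn:org:jgroups">
      <!-- other protocols in the stack (UDP, PING, etc.) omitted -->
      <RSVP timeout="10000" resend_interval="2000" ack_on_delivery="false"/>
   </config>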

Comment 9 Tomas Kyjovsky 2014-04-22 16:40:12 UTC
I tried applying the workaround suggested by Dan Berindei [1] and it removed the TimeoutException. However, the redundant pages (which shouldn't be visible) are still there, even after I added the "infinispan-cluster-name" property.


Additional Steps:

- set RSVP.ack_on_delivery=false in gatein/gatein.ear/portal.war/WEB-INF/classes/jgroups/gatein-udp.xml as suggested in [1] for both nodes
   (all the other gatein jgroups configs already have this set)

- set -Dinfinispan-cluster-name=gatein-portal for both nodes


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1087244#c2

Comment 10 Lucas Ponce 2014-04-22 17:11:04 UTC
Created attachment 888599 [details]
No repeated pages screenshot

I've sent a PR for master in

https://github.com/gatein/gatein-portal/pull/832

With this fix and a clean database, I can't reproduce the repeated pages.

Please, could you repeat the test with a clean database?

Maybe there is an issue with the initial database; this can help us to scope it.

Thanks,
Lucas

Comment 11 Tomas Kyjovsky 2014-04-23 17:02:25 UTC
Yes, it was the db. The additional pages were caused by data from Portal 6.1, which we used for comparison. I don't see them with a clean db.

(Standalone h2 stores the dbs in ${user.home} instead of /data and I forgot to clean them between the tests.)

Comment 12 Peter Palaga 2014-04-23 19:40:50 UTC
https://github.com/gatein/gatein-portal/pull/832 was merged in upstream.

Comment 13 vramik 2014-05-13 21:50:53 UTC
Verified in ER02

