Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 903457

Summary: JBoss clustering broken
Product: OpenShift Container Platform
Reporter: Brenton Leanhardt <bleanhar>
Component: Containers
Assignee: Brenton Leanhardt <bleanhar>
Status: CLOSED ERRATA
QA Contact: libra bugs <libra-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 1.0.0
CC: gpei, jcrossley, jhonce, jialiu, libra-onpremise-devel, lmeyer, mmcgrath, plarsen, qgong, wdecoste, xjia
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: JGroups, the clustering mechanism used with JBoss, had a restriction that kept it from connecting properly via ports on the gear IP to gears on other nodes.
Consequence: Session replication was broken in scaled JBoss applications if the gears ended up on different nodes. User HTTP sessions (the users' application state) could therefore be lost if a gear was scaled down or crashed.
Fix: An updated JGroups JAR was created and is included as part of the JBoss 6.1 update.
Result: Clustering and session replication work properly for scaled JBoss apps.
Story Points: ---
Clone Of: 883944
Environment:
Last Closed: 2013-07-09 18:19:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 883944
Bug Blocks:

Description Brenton Leanhardt 2013-01-24 03:09:00 UTC
+++ This bug was initially created as a clone of Bug #883944 +++

Description of problem:
AS/EAP will not form a cluster in prod


Version-Release number of selected component (if applicable):


How reproducible:
100%


Steps to Reproduce:
1. Create scaled AS/EAP applications
2. Look for cluster size in logs
  
Actual results:
cluster size remains 1


Expected results:
cluster size should be 2


Additional info:

--- Additional comment from Bill DeCoste on 2012-12-05 13:04:31 EST ---

Cluster forms in a devenv, but not stage or prod.

--- Additional comment from Bill DeCoste on 2012-12-06 17:18:21 EST ---

The issue with clustering in prod is that the 2 scaled gears are on different nodes, and the loopback bind address in the JVM args causes a failure when creating a socket from a loopback to an actual remote address.

To work around the problem, add the following to .openshift/action_hooks/pre_start_jbosseap-6.0:

export JAVA_OPTS="-Xmx256m -XX:MaxPermSize=128m -Dorg.jboss.resolver.warning=true -Djava.net.preferIPv4Stack=true -Dfile.encoding=UTF-8 -Djboss.node.name=${OPENSHIFT_GEAR_DNS} -Djgroups.bind_addr=${OPENSHIFT_GEAR_DNS} -Dorg.apache.coyote.http11.Http11Protocol.COMPRESSION=on"

--- Additional comment from Jim Crossley on 2012-12-07 09:55:06 EST ---

With the above workaround, I get the following error in server.log and my app fails to deploy:

2012/12/07 09:49:18,258 ERROR [org.jboss.msc.service.fail] (ServerService Thread Pool -- 80) MSC00001: Failed to start service jboss.jgroups.channel.web: org.jboss.msc.service.StartException in service jboss.jgroups.channel.web: java.net.BindException: No available port to bind to in range [7600 .. 7650]
	at org.jboss.as.clustering.jgroups.subsystem.ChannelService.start(ChannelService.java:51)
	at org.jboss.as.clustering.msc.AsynchronousService$1.run(AsynchronousService.java:82) [jboss-as-clustering-common-7.1.x.incremental.129.jar:7.1.x.incremental.129]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [rt.jar:1.7.0_09-icedtea]
	at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_09-icedtea]
	at org.jboss.threads.JBossThread.run(JBossThread.java:122) [jboss-threads-2.0.0.GA.jar:2.0.0.GA]
Caused by: java.net.BindException: No available port to bind to in range [7600 .. 7650]
	at org.jgroups.util.Util.createServerSocket(Util.java:3168)
	at org.jgroups.blocks.TCPConnectionMap.<init>(TCPConnectionMap.java:90)
	at org.jgroups.blocks.TCPConnectionMap.<init>(TCPConnectionMap.java:55)
	at org.jgroups.protocols.TCP.createConnectionMap(TCP.java:132)
	at org.jgroups.protocols.TCP.start(TCP.java:64)
	at org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:938)
	at org.jgroups.JChannel.startStack(JChannel.java:841)
	at org.jgroups.JChannel.connect(JChannel.java:277)
	at org.jgroups.JChannel.connect(JChannel.java:261)
	at org.jboss.as.clustering.jgroups.subsystem.ChannelService.start(ChannelService.java:48)
	... 5 more
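
The "No available port to bind to in range" message above is what a port-range scan reports when every bind attempt in the range is refused. The following is a minimal sketch of such a scan, not the JGroups implementation; the class and method names (`PortRangeBindSketch`, `bindInRange`) are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical illustration only; this is not JGroups code.
final class PortRangeBindSketch {

    // Tries each port in [startPort, endPort] on the given address and
    // returns the first server socket that binds successfully.
    static ServerSocket bindInRange(InetAddress addr, int startPort, int endPort)
            throws IOException {
        for (int port = startPort; port <= endPort; port++) {
            ServerSocket srv = new ServerSocket();
            try {
                srv.bind(new InetSocketAddress(addr, port));
                return srv;
            } catch (IOException | SecurityException e) {
                // Port in use, or the bind was refused for this address:
                // release the socket and try the next port.
                srv.close();
            }
        }
        // Every port failed. If binds to this address are refused outright,
        // the whole range is exhausted even though the ports are free.
        throw new BindException("No available port to bind to in range ["
                + startPort + " .. " + endPort + "]");
    }
}
```

If binds to the chosen address are refused regardless of port, a loop like this fails on every port and reports the range as exhausted even though no port is actually in use.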

--- Additional comment from Jim Crossley on 2012-12-11 12:10:20 EST ---

Per Bill:

1) We bind to a loopback via jgroups.bind_addr. We can only bind to a
loopback due to security via SELinux.

2) org.jgroups.blocks.TCPConnectionMap tries to connect to the remote
node and gets an Invalid argument exception from java.net. I did some
digging and this is because the JVM will not allow a socket bound to a
loopback to connect to a remote address.

3) If we try to bind via jgroups.bind_addr to a routable address (e.g.
10.whatever), then SELinux blows us up.
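
To see which of the two cases above a given bind address falls into, one can check whether it resolves to a loopback address. This is a small sketch using only the standard `java.net.InetAddress` API; the class name `BindAddrSketch` is hypothetical, and the sample addresses in the usage note are taken from the logs later in this report.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for illustration; not part of JBoss or JGroups.
final class BindAddrSketch {

    // Returns true if the host name or IP literal resolves to a loopback
    // address (127.0.0.0/8 for IPv4), false for a routable address.
    static boolean isLoopback(String hostOrIp) throws UnknownHostException {
        return InetAddress.getByName(hostOrIp).isLoopbackAddress();
    }
}
```

For example, `BindAddrSketch.isLoopback("127.11.207.1")` is true, while `BindAddrSketch.isLoopback("10.4.59.149")` is false.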

--- Additional comment from Bill DeCoste on 2012-12-11 21:27:17 EST ---

It looks like the first line below (JGroups 3.0.9), in the constructor of TCPConnectionMap.TCPConnection, is causing the problem:

    this.sock.bind(new InetSocketAddress(bind_addr, 0));
    Util.connect(this.sock, destAddr, sock_conn_timeout);

If this line is commented out, the cluster forms.

If bind_addr is the loopback, I get the Invalid argument error going from loopback to remote.

If bind_addr is the routable IP/hostname, SELinux denies the binding.

I tested a C app in the same OpenShift environment and if there is no explicit client binding or if the binding is explicitly "eth0" then it works. If the client binding is "lo" then we see the same behavior as JGroups (fails for remote, works for local).

I've also tested connectivity between JBoss to remote MySQL and that works fine.

I propose we add config to JGroups that disables that explicit client binding.
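
The proposal can be sketched as a client-socket factory in which the explicit bind is optional. This is only an illustration of the idea under stated assumptions: the class name `ClientSocketSketch` and the flag name `deferClientBind` are hypothetical and are not the actual JGroups option or code.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketAddress;

// Hypothetical illustration of an optional client bind; not JGroups code.
final class ClientSocketSketch {

    // Opens a client connection to dest. When deferClientBind is true, the
    // explicit bind is skipped and the OS picks the local address.
    static Socket open(InetAddress bindAddr, SocketAddress dest,
                       int timeoutMs, boolean deferClientBind) throws IOException {
        Socket sock = new Socket();
        try {
            if (!deferClientBind && bindAddr != null) {
                // Explicit client bind: the step that fails when bindAddr is
                // a loopback and dest is on another node.
                sock.bind(new InetSocketAddress(bindAddr, 0));
            }
            sock.connect(dest, timeoutMs);
            return sock;
        } catch (IOException e) {
            sock.close();
            throw e;
        }
    }
}
```

With `deferClientBind` set to true, the server socket can still listen on the loopback address while outgoing connections use whatever local address the OS routes through, which matches the behavior seen with the C test app when no explicit client binding was made.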

--- Additional comment from Bill DeCoste on 2012-12-12 12:34:39 EST ---

Created attachment 662542 [details]
AS7 JGroups patch

--- Additional comment from Bill DeCoste on 2012-12-12 12:35:16 EST ---

Created attachment 662543 [details]
EAP6 JGroups patch

--- Additional comment from Bill DeCoste on 2012-12-12 12:36:31 EST ---

The above attachments can be added and pushed to a git repo to correct this issue. Working on an official patch from the JGroups, AS, and EAP teams.

--- Additional comment from Bill DeCoste on 2012-12-13 10:36:26 EST ---

https://issues.jboss.org/browse/JGRP-1555

--- Additional comment from Bill DeCoste on 2012-12-17 11:58:07 EST ---

https://github.com/belaban/JGroups/pull/69

--- Additional comment from Bill DeCoste on 2012-12-19 11:23:56 EST ---

JGroups 3.0.16.Final has been released. It contains the config change required to fix this problem.

--- Additional comment from Mike McGrath on 2013-01-02 14:19:12 EST ---

ping, any news since the break?

--- Additional comment from Bill DeCoste on 2013-01-02 14:34:45 EST ---

We have patches for AS7 and EAP6 that a user can apply. I am still waiting to hear from engineering on the status of an official patch for both.

--- Additional comment from  on 2013-01-22 04:25:46 EST ---

Hi, Bill
QE is trying to reproduce this bug. After I created a scalable JBoss EAP app, I checked server.log and boot.log under <app-dir>/jbosseap-6.0/logs/, but I couldn't find any log lines that contain "cluster size".
Could you tell us where we can find the cluster size of the JBoss EAP application?

--- Additional comment from qgong on 2013-01-22 05:12:35 EST ---

Can't reproduce in devenv_2607 multi_node_env.
1. Set up multi_node_env with 2 nodes.
2. Create a scalable jbossas application in multi_node_env and make sure the gears are on different nodes.
check the haproxy_gear_dns

[qsjbossas-qgong16.dev.rhcloud.com ~]\> env|grep DNS
OPENSHIFT_APP_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_GEAR_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_MYSQL_DB_GEAR_DNS=4f820aa384-qgong16.dev.rhcloud.com

check the db_gear_dns
[4f820aa384-qgong16.dev.rhcloud.com ~]\> env|grep DNS
OPENSHIFT_APP_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_GEAR_DNS=4f820aa384-qgong16.dev.rhcloud.com

check the web_gear_dns
[3068d95dd5-qgong16.dev.rhcloud.com ~]\> env|grep DNS
OPENSHIFT_APP_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_GEAR_DNS=3068d95dd5-qgong16.dev.rhcloud.com
OPENSHIFT_MYSQL_DB_GEAR_DNS=4f820aa384-qgong16.dev.rhcloud.com


3. Add mongodb, update index.jsp, and git push; found that any gear could connect to the db.
4. Check the event.log of this app; no errors found.

--- Additional comment from Jhon Honce on 2013-01-22 12:42:38 EST ---

See comment#13. Waiting on upstream to apply fix

Comment 10 Luke Meyer 2013-06-28 13:47:14 UTC
The fix has been to apply a rebuilt jgroups JAR manually. Has JBoss released updated JBoss RPMs with this fix?

Comment 11 Gaoyun Pei 2013-07-01 09:42:19 UTC
(In reply to Luke Meyer from comment #10)
> The fix has been to apply a rebuilt jgroups JAR manually. Has JBoss released
> updated JBoss RPMs with this fix?

QE tested this on the OSE 1.2 RC2 puddle without replacing the jgroups packages on the nodes.
The jgroups version used on the nodes is jgroups-3.2.7.Final-redhat-1.jar.

A scalable jbosseap app was created and scaled up so that its two gears were located on two nodes. After deploying the WAR built from sfsbTest, the app was restarted and the "cluster"-related log lines were grepped on the two gears:

[app0-1234.moverc2.com 51d1445fd6bfd2f9ca000084]\> tailf jbosseap/logs/server.log  |grep cluster
2013/07/01 02:24:20,781 INFO  [org.jboss.as.controller.management-deprecated] (ServerService Thread Pool -- 11) JBAS014627: Attribute clustered is deprecated, and it might be removed in future version!
2013/07/01 02:24:22,589 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 39) JBAS010280: Activating Infinispan subsystem.
2013/07/01 02:24:24,481 INFO  [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 45) JBAS010260: Activating JGroups subsystem.
2013/07/01 02:24:33,183 INFO  [org.hornetq.core.server] (MSC service thread 1-2) HQ221000: live server is starting with configuration HornetQ Configuration (clustered=false,backup=false,sharedStore=true,journalDirectory=/var/lib/openshift/51d1445fd6bfd2f9ca000084/jbosseap/standalone/data/messagingjournal,bindingsDirectory=/var/lib/openshift/51d1445fd6bfd2f9ca000084/jbosseap/standalone/data/messagingbindings,largeMessagesDirectory=/var/lib/openshift/51d1445fd6bfd2f9ca000084/jbosseap/standalone/data/messaginglargemessages,pagingDirectory=/var/lib/openshift/51d1445fd6bfd2f9ca000084/jbosseap/standalone/data/messagingpaging)
2013/07/01 02:24:49,092 WARN  [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 64) JBAS010265: property bind_addr for protocol TCP attempting to override socket binding value 127.11.207.1 : property value 127.11.207.1 will be ignored
2013/07/01 02:24:49,188 WARN  [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 64) JBAS010265: property bind_port for protocol TCP attempting to override socket binding value 7600 : property value 7600 will be ignored
2013/07/01 02:24:53,300 INFO  [stdout] (ServerService Thread Pool -- 64) GMS: address=app0-1234.moverc2.com/web, cluster=web, physical address=10.4.59.149:63262
2013/07/01 02:24:56,404 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ServerService Thread Pool -- 64) ISPN000094: Received new cluster view: [app0-1234.moverc2.com/web|0] [app0-1234.moverc2.com/web]
2013/07/01 02:24:58,988 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 65) JBAS010281: Started default-host/sfsbTest-1.0 cache from web container
2013/07/01 02:24:58,980 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 64) JBAS010281: Started repl cache from web container
2013/07/01 02:24:59,193 INFO  [org.jboss.as.clustering] (MSC service thread 1-4) JBAS010238: Number of cluster members: 1

[51d144dcd6bfd2f9ca0000a5-1234.moverc2.com 51d144dcd6bfd2f9ca0000a5]\> tailf jbosseap/logs/server.log |grep cluster
2013/07/01 02:24:53,483 INFO  [org.jboss.as.controller.management-deprecated] (ServerService Thread Pool -- 7) JBAS014627: Attribute clustered is deprecated, and it might be removed in future version!
2013/07/01 02:24:55,487 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 39) JBAS010280: Activating Infinispan subsystem.
2013/07/01 02:24:56,288 INFO  [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 45) JBAS010260: Activating JGroups subsystem.
2013/07/01 02:25:09,797 INFO  [org.hornetq.core.server] (MSC service thread 1-3) HQ221000: live server is starting with configuration HornetQ Configuration (clustered=false,backup=false,sharedStore=true,journalDirectory=/var/lib/openshift/51d144dcd6bfd2f9ca0000a5/jbosseap/standalone/data/messagingjournal,bindingsDirectory=/var/lib/openshift/51d144dcd6bfd2f9ca0000a5/jbosseap/standalone/data/messagingbindings,largeMessagesDirectory=/var/lib/openshift/51d144dcd6bfd2f9ca0000a5/jbosseap/standalone/data/messaginglargemessages,pagingDirectory=/var/lib/openshift/51d144dcd6bfd2f9ca0000a5/jbosseap/standalone/data/messagingpaging)
2013/07/01 02:25:27,684 WARN  [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 66) JBAS010265: property bind_addr for protocol TCP attempting to override socket binding value 127.13.19.1 : property value 127.13.19.1 will be ignored
2013/07/01 02:25:27,780 WARN  [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 66) JBAS010265: property bind_port for protocol TCP attempting to override socket binding value 7600 : property value 7600 will be ignored
2013/07/01 02:25:31,480 INFO  [stdout] (ServerService Thread Pool -- 66) GMS: address=51d144dcd6bfd2f9ca0000a5-1234.moverc2.com/web, cluster=web, physical address=10.4.59.149:36497
2013/07/01 02:25:34,593 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ServerService Thread Pool -- 66) ISPN000094: Received new cluster view: [51d144dcd6bfd2f9ca0000a5-1234.moverc2.com/web|0] [51d144dcd6bfd2f9ca0000a5-1234.moverc2.com/web]
2013/07/01 02:25:37,778 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 66) JBAS010281: Started repl cache from web container
2013/07/01 02:25:37,777 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 65) JBAS010281: Started default-host/sfsbTest-1.0 cache from web container
2013/07/01 02:25:37,879 INFO  [org.jboss.as.clustering] (MSC service thread 1-2) JBAS010238: Number of cluster members: 1
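
The cluster size asked about earlier in this report appears in server.log on the JBAS010238 lines shown above. As a minimal sketch, the member count can be pulled out of such lines with a regular expression; the class and method names (`ClusterLogSketch`, `lastMemberCount`) are hypothetical, and the pattern assumes the exact message text seen in these logs.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper for illustration; not part of JBoss.
final class ClusterLogSketch {

    // Matches the message format seen in the server.log excerpts above.
    private static final Pattern MEMBERS =
            Pattern.compile("JBAS010238: Number of cluster members: (\\d+)");

    // Returns the member count from the last matching line, or empty if no
    // line reports a cluster size.
    static OptionalInt lastMemberCount(Iterable<String> lines) {
        OptionalInt result = OptionalInt.empty();
        for (String line : lines) {
            Matcher m = MEMBERS.matcher(line);
            if (m.find()) {
                result = OptionalInt.of(Integer.parseInt(m.group(1)));
            }
        }
        return result;
    }
}
```

Feeding it the lines of a gear's server.log (for example via `java.nio.file.Files.readAllLines`) gives the most recently reported cluster size for that gear.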

Comment 13 errata-xmlrpc 2013-07-09 18:19:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2013-1030.html