Bug 883944 - JBoss clustering broken in prod
Summary: JBoss clustering broken in prod
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OKD
Classification: Red Hat
Component: Containers
Version: 2.x
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Bill DeCoste
QA Contact: libra bugs
URL:
Whiteboard:
Depends On:
Blocks: 903457
Reported: 2012-12-05 15:55 UTC by Bill DeCoste
Modified: 2015-05-14 23:03 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 903457 (view as bug list)
Environment:
Last Closed: 2013-04-20 03:34:10 UTC
Target Upstream Version:


Attachments
AS7 JGroups patch (1.61 MB, application/x-gzip)
2012-12-12 17:34 UTC, Bill DeCoste
EAP6 JGroups patch (1.61 MB, application/x-gzip)
2012-12-12 17:35 UTC, Bill DeCoste

Description Bill DeCoste 2012-12-05 15:55:16 UTC
Description of problem:
AS/EAP will not form a cluster in prod


Version-Release number of selected component (if applicable):


How reproducible:
100%


Steps to Reproduce:
1. Create scaled AS/EAP applications
2. Look for cluster size in logs
  
Actual results:
cluster size remains 1


Expected results:
cluster size should be 2


Additional info:

Comment 1 Bill DeCoste 2012-12-05 18:04:31 UTC
Cluster forms in a devenv, but not stage or prod.

Comment 2 Bill DeCoste 2012-12-06 22:18:21 UTC
The issue with clustering in prod is that the 2 scaled gears are on different nodes, and the loopback bind address in the JVM args causes a failure when creating a socket from a loopback to an actual remote address.

To work around the problem, add the following to .openshift/action_hooks/pre_start_jbosseap-6.0:

export JAVA_OPTS="-Xmx256m -XX:MaxPermSize=128m -Dorg.jboss.resolver.warning=true -Djava.net.preferIPv4Stack=true -Dfile.encoding=UTF-8 -Djava.net.preferIPv4Stack=true -Djboss.node.name=${OPENSHIFT_GEAR_DNS} -Djgroups.bind_addr=${OPENSHIFT_GEAR_DNS} -Dorg.apache.coyote.http11.Http11Protocol.COMPRESSION=on"

Comment 3 Jim Crossley 2012-12-07 14:55:06 UTC
With the above workaround, I get the following error in server.log and my app fails to deploy:

2012/12/07 09:49:18,258 ERROR [org.jboss.msc.service.fail] (ServerService Thread Pool -- 80) MSC00001: Failed to start service jboss.jgroups.channel.web: org.jboss.msc.service.StartException in service jboss.jgroups.channel.web: java.net.BindException: No available port to bind to in range [7600 .. 7650]
	at org.jboss.as.clustering.jgroups.subsystem.ChannelService.start(ChannelService.java:51)
	at org.jboss.as.clustering.msc.AsynchronousService$1.run(AsynchronousService.java:82) [jboss-as-clustering-common-7.1.x.incremental.129.jar:7.1.x.incremental.129]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [rt.jar:1.7.0_09-icedtea]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [rt.jar:1.7.0_09-icedtea]
	at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_09-icedtea]
	at org.jboss.threads.JBossThread.run(JBossThread.java:122) [jboss-threads-2.0.0.GA.jar:2.0.0.GA]
Caused by: java.net.BindException: No available port to bind to in range [7600 .. 7650]
	at org.jgroups.util.Util.createServerSocket(Util.java:3168)
	at org.jgroups.blocks.TCPConnectionMap.<init>(TCPConnectionMap.java:90)
	at org.jgroups.blocks.TCPConnectionMap.<init>(TCPConnectionMap.java:55)
	at org.jgroups.protocols.TCP.createConnectionMap(TCP.java:132)
	at org.jgroups.protocols.TCP.start(TCP.java:64)
	at org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:938)
	at org.jgroups.JChannel.startStack(JChannel.java:841)
	at org.jgroups.JChannel.connect(JChannel.java:277)
	at org.jgroups.JChannel.connect(JChannel.java:261)
	at org.jboss.as.clustering.jgroups.subsystem.ChannelService.start(ChannelService.java:48)
	... 5 more
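
For readers unfamiliar with this error: the message suggests a scan over a port range on one bind address. The following is a minimal Java sketch of that idea, not the JGroups implementation. The class and method names are hypothetical, and swallowing every IOException per port is a simplification. If the address itself cannot be bound (for example, because it is not permitted on this host), every port in the range fails and the same BindException results.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical illustration only; not taken from JGroups.
public class PortRangeSketch {

    /** Tries each port in [startPort, endPort] on addr and returns the first socket that binds. */
    public static ServerSocket bindInRange(InetAddress addr, int startPort, int endPort)
            throws IOException {
        for (int port = startPort; port <= endPort; port++) {
            try {
                return new ServerSocket(port, 50, addr);
            } catch (IOException e) {
                // Port in use, or the address is not bindable at all: try the next port.
            }
        }
        throw new BindException(
            "No available port to bind to in range [" + startPort + " .. " + endPort + "]");
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = bindInRange(loopback, 7600, 7650)) {
            System.out.println("bound to " + server.getLocalSocketAddress());
        }
    }
}
```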

Comment 4 Jim Crossley 2012-12-11 17:10:20 UTC
Per Bill:

1) We bind to a loopback via jgroups.bind_addr. We can only bind to a
loopback due to security via selinux

2) org.jgroups.blocks.TCPConnectionMap tries to connect to the remote
node and gets an "Invalid argument" exception from java.net. I did some
digging and this is because the JVM will not allow a socket bound to a
loopback to connect to a remote address.

3) If we try to bind via jgroups.bind_addr to a routable address (e.g.
10.whatever) then selinux blows us up.

Comment 5 Bill DeCoste 2012-12-12 02:27:17 UTC
It looks like the first line below (JGroups 3.0.9), in the constructor of TCPConnectionMap.TCPConnection, is causing the problem:

    this.sock.bind(new InetSocketAddress(bind_addr, 0));
    Util.connect(this.sock, destAddr, sock_conn_timeout);

If this line is commented out, the cluster forms.

If bind_addr is the loopback, I get the "Invalid argument" error going from loopback to remote.

If bind_addr is the routable IP/hostname, selinux denies the binding.

I tested a C app in the same OpenShift environment and if there is no explicit client binding or if the binding is explicitly "eth0" then it works. If the client binding is "lo" then we see the same behavior as JGroups (fails for remote, works for local).

I've also tested connectivity between JBoss to remote MySQL and that works fine.

I propose we add config to JGroups that disables that explicit client binding.
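
The proposal above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the JGroups patch: the class name, the method, and the deferClientBind parameter are hypothetical. When the flag is set, the explicit client-side bind is skipped and the OS picks the local address at connect time.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketAddress;

// Hypothetical illustration only; names are not taken from JGroups.
public class ClientBindSketch {

    /** Opens a client socket, binding it to bindAddr first unless deferClientBind is true. */
    public static Socket open(InetAddress bindAddr, SocketAddress dest,
                              boolean deferClientBind, int timeoutMs) throws IOException {
        Socket sock = new Socket();
        try {
            if (!deferClientBind && bindAddr != null) {
                // Explicit client bind: the step reported as failing for loopback-to-remote.
                sock.bind(new InetSocketAddress(bindAddr, 0));
            }
            sock.connect(dest, timeoutMs);
            return sock;
        } catch (IOException e) {
            sock.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // A local listener stands in for the peer; a real peer would be on another node.
        try (ServerSocket server = new ServerSocket(0, 50, loopback)) {
            SocketAddress dest = server.getLocalSocketAddress();
            try (Socket bound = open(loopback, dest, false, 2000);
                 Socket deferred = open(loopback, dest, true, 2000)) {
                System.out.println("explicit bind, local address: " + bound.getLocalAddress());
                System.out.println("deferred bind, local address: " + deferred.getLocalAddress());
            }
        }
    }
}
```

Both calls succeed here because the destination is local. The failure described in this comment only appears when the socket is bound to the loopback and the destination is on another host.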

Comment 6 Bill DeCoste 2012-12-12 17:34:39 UTC
Created attachment 662542 [details]
AS7 JGroups patch

Comment 7 Bill DeCoste 2012-12-12 17:35:16 UTC
Created attachment 662543 [details]
EAP6 JGroups patch

Comment 8 Bill DeCoste 2012-12-12 17:36:31 UTC
The above attachments can be added and pushed to a git repo to correct this issue. Working on an official patch from the JGroups, AS, and EAP teams.

Comment 9 Bill DeCoste 2012-12-13 15:36:26 UTC
https://issues.jboss.org/browse/JGRP-1555

Comment 10 Bill DeCoste 2012-12-17 16:58:07 UTC
https://github.com/belaban/JGroups/pull/69

Comment 11 Bill DeCoste 2012-12-19 16:23:56 UTC
JGroups 3.0.16.Final has been released. This contains the config change required to fix this problem.

Comment 12 Mike McGrath 2013-01-02 19:19:12 UTC
ping, any news since the break?

Comment 13 Bill DeCoste 2013-01-02 19:34:45 UTC
We have patches for AS7 and EAP6 that a user can apply. I am still waiting to hear from engineering on the status of an official patch for both.

Comment 14 Gaoyun Pei 2013-01-22 09:25:46 UTC
Hi, Bill
QE is trying to reproduce this bug. After I created a scalable JBoss EAP app, I checked server.log and boot.log under <app-dir>/jbosseap-6.0/logs/, but I couldn't find any log entries that contain "cluster size".
Could you tell us where we can find the cluster size of the JBoss EAP application?

Comment 15 Rony Gong 🔥 2013-01-22 10:12:35 UTC
Can't reproduce in devenv_2607 multi_node_env
1. Set up multi_node_env with 2 nodes
2. Create a scalable jbossas application in multi_node_env and make sure the gears are on different nodes
check the haproxy_gear_dns

[qsjbossas-qgong16.dev.rhcloud.com ~]\> env|grep DNS
OPENSHIFT_APP_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_GEAR_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_MYSQL_DB_GEAR_DNS=4f820aa384-qgong16.dev.rhcloud.com

check the db_gear_dns
[4f820aa384-qgong16.dev.rhcloud.com ~]\> env|grep DNS
OPENSHIFT_APP_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_GEAR_DNS=4f820aa384-qgong16.dev.rhcloud.com

check the web_gear_dns
[3068d95dd5-qgong16.dev.rhcloud.com ~]\> env|grep DNS
OPENSHIFT_APP_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_GEAR_DNS=3068d95dd5-qgong16.dev.rhcloud.com
OPENSHIFT_MYSQL_DB_GEAR_DNS=4f820aa384-qgong16.dev.rhcloud.com


3. Add mongodb, update index.jsp, and git push; the app could connect to the db from either gear.
4. Check the event.log of this app; no errors found.

Comment 16 Jhon Honce 2013-01-22 17:42:38 UTC
See comment#13. Waiting on upstream to apply fix

Comment 17 manoj 2013-03-01 20:08:54 UTC
Bill,

Any updates on this?

Comment 18 manoj 2013-04-04 19:07:16 UTC
Bill any updates on this?

Comment 19 Bill DeCoste 2013-04-04 21:09:35 UTC
The official patch for EAP is in dev. Should go to prod on the next update. 

For AS7 I'm going to have to patch it myself - there won't be an official, backwards compatible AS release.

Comment 21 Bill DeCoste 2013-04-09 19:23:02 UTC
Fixed for AS7 with jboss-as7 7.1.0.Final-14

Fixed for EAP6 with jbossas-modules-eap Release 4.2.Final_redhat_4.ep6.el6

Comment 22 Gaoyun Pei 2013-04-10 03:42:46 UTC
QE tested this bug on prod. After modifying standalone.xml as in step 6, the cluster works well.

It is unclear whether the modification to standalone.xml should be made by the user, or whether we should merge this change into our code.

If standalone.xml is designed to be modified this way, then the bug would be VERIFIED.

Steps:

1. Create a scalable jbosseap-6.0 app named app0 

   rhc app create app0 jbosseap-6.0 -p $PASSWORD -s

2. Clone and build a sample application that uses clustering

   git clone git://github.com/bdecoste/sfsbTest.git
   cd sfsbTest
   mvn clean package

3. Copy the built war into the deployments directory of app0 

   cp sfsbTest/target/sfsbTest-1.0.war app0/deployments/

4. Disable auto scaling by:

   touch $APP_Dir/.openshift/markers/disable_auto_scaling
   
   Then scale up this app.

5. Git push all the changes, ssh into app0, and check the log using ' tailf jbosseap-6.0/logs/server.log | grep "cluster" '. The server.log output looks like:

2013/04/09 22:54:14,513 INFO  [org.hornetq.core.server.impl.HornetQServerImpl] (MSC service thread 1-3) live server is starting with configuration HornetQ Configuration (clustered=true,backup=false,sharedStore=true,journalDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingjournal,bindingsDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingbindings,largeMessagesDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messaginglargemessages,pagingDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingpaging)
2013/04/09 22:54:32,335 INFO  [stdout] (ServerService Thread Pool -- 62) GMS: address=app0-pgy.rhcloud.com/web, cluster=web, physical address=10.28.80.117:57572
2013/04/09 22:54:39,206 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ServerService Thread Pool -- 62) ISPN000094: Received new cluster view: [app0-pgy.rhcloud.com/web|0] [app0-pgy.rhcloud.com/web]
2013/04/09 22:54:41,605 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 62) JBAS010281: Started repl cache from web container
2013/04/09 22:54:41,705 INFO  [org.jboss.as.clustering] (MSC service thread 1-3) JBAS010238: Number of cluster members: 1
2013/04/09 22:54:42,403 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 63) JBAS010281: Started default-host/sfsbTest-1.0 cache from web container

6. Add line 312 in app0's standalone.xml.

    310                                         <property name="bind_port">7600</property>
    311                                         <property name="bind_addr">${env.OPENSHIFT_INTERNAL_IP}</property>
+   312                                         <property name="defer_client_bind_addr">true</property>

7. Git push all the changes, ssh into app0, and check the log using ' tailf jbosseap-6.0/logs/server.log | grep "cluster" '. The server.log output looks like:

2013/04/09 23:12:56,928 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 38) JBAS010280: Activating Infinispan subsystem.
2013/04/09 23:12:57,532 INFO  [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 44) JBAS010260: Activating JGroups subsystem.
2013/04/09 23:13:09,432 INFO  [org.hornetq.core.server.impl.HornetQServerImpl] (MSC service thread 1-4) live server is starting with configuration HornetQ Configuration (clustered=true,backup=false,sharedStore=true,journalDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingjournal,bindingsDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingbindings,largeMessagesDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messaginglargemessages,pagingDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingpaging)
2013/04/09 23:13:28,116 INFO  [stdout] (ServerService Thread Pool -- 63) GMS: address=app0-pgy.rhcloud.com/web, cluster=web, physical address=10.28.80.117:57572
2013/04/09 23:13:32,611 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ServerService Thread Pool -- 63) ISPN000094: Received new cluster view: [5164d335e0b8cdee8200008a-pgy.rhcloud.com/web|1] [5164d335e0b8cdee8200008a-pgy.rhcloud.com/web, app0-pgy.rhcloud.com/web]
2013/04/09 23:13:35,703 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 63) JBAS010281: Started repl cache from web container
2013/04/09 23:13:35,606 INFO  [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 64) JBAS010281: Started default-host/sfsbTest-1.0 cache from web container
2013/04/09 23:13:35,795 INFO  [org.jboss.as.clustering] (MSC service thread 1-1) JBAS010238: Number of cluster members: 2
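
The cluster size check in steps 5 and 7 can also be scripted. Below is a minimal Java sketch, assuming the "JBAS010238: Number of cluster members: N" line format shown in the logs above. The class and method names are hypothetical. It returns the most recent member count found in a list of log lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the message format is copied from the server.log excerpts above.
public class ClusterSizeSketch {

    private static final Pattern MEMBERS =
        Pattern.compile("JBAS010238: Number of cluster members: (\\d+)");

    /** Returns the member count from the last matching line, or empty if none matches. */
    public static OptionalInt lastClusterSize(List<String> lines) {
        OptionalInt result = OptionalInt.empty();
        for (String line : lines) {
            Matcher m = MEMBERS.matcher(line);
            if (m.find()) {
                result = OptionalInt.of(Integer.parseInt(m.group(1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "2013/04/09 22:54:41,705 INFO  [org.jboss.as.clustering] (MSC service thread 1-3) "
                + "JBAS010238: Number of cluster members: 1",
            "2013/04/09 23:13:35,795 INFO  [org.jboss.as.clustering] (MSC service thread 1-1) "
                + "JBAS010238: Number of cluster members: 2");
        System.out.println(lastClusterSize(sample).getAsInt()); // prints 2
    }
}
```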

Comment 23 Bill DeCoste 2013-04-10 12:15:23 UTC
defer_client_bind_addr has been added to the default standalone.xml for both EAP6 and AS7. Users will have to manually modify existing applications to add this element.

Comment 24 Gaoyun Pei 2013-04-11 01:53:42 UTC
Verifying this bug according to Comment 23.

