Description of problem:
AS/EAP will not form a cluster in prod

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Create scaled AS/EAP applications
2. Look for cluster size in logs
3.

Actual results:
cluster size remains 1

Expected results:
cluster size should be 2

Additional info:
Cluster forms in a devenv, but not stage or prod.
The issue with clustering in prod is that the 2 scaled gears are on different nodes, and this value in the JVM args causes a failure when creating a socket from a loopback to an actual remote address. To work around the problem, add the following to .openshift/action_hooks/pre_start_jbosseap-6.0:

export JAVA_OPTS="-Xmx256m -XX:MaxPermSize=128m -Dorg.jboss.resolver.warning=true -Djava.net.preferIPv4Stack=true -Dfile.encoding=UTF-8 -Djava.net.preferIPv4Stack=true -Djboss.node.name=${OPENSHIFT_GEAR_DNS} -Djgroups.bind_addr=${OPENSHIFT_GEAR_DNS} -Dorg.apache.coyote.http11.Http11Protocol.COMPRESSION=on"
With the above workaround, I get the following error in server.log and my app fails to deploy:

2012/12/07 09:49:18,258 ERROR [org.jboss.msc.service.fail] (ServerService Thread Pool -- 80) MSC00001: Failed to start service jboss.jgroups.channel.web: org.jboss.msc.service.StartException in service jboss.jgroups.channel.web: java.net.BindException: No available port to bind to in range [7600 .. 7650]
    at org.jboss.as.clustering.jgroups.subsystem.ChannelService.start(ChannelService.java:51)
    at org.jboss.as.clustering.msc.AsynchronousService$1.run(AsynchronousService.java:82) [jboss-as-clustering-common-7.1.x.incremental.129.jar:7.1.x.incremental.129]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [rt.jar:1.7.0_09-icedtea]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [rt.jar:1.7.0_09-icedtea]
    at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_09-icedtea]
    at org.jboss.threads.JBossThread.run(JBossThread.java:122) [jboss-threads-2.0.0.GA.jar:2.0.0.GA]
Caused by: java.net.BindException: No available port to bind to in range [7600 .. 7650]
    at org.jgroups.util.Util.createServerSocket(Util.java:3168)
    at org.jgroups.blocks.TCPConnectionMap.<init>(TCPConnectionMap.java:90)
    at org.jgroups.blocks.TCPConnectionMap.<init>(TCPConnectionMap.java:55)
    at org.jgroups.protocols.TCP.createConnectionMap(TCP.java:132)
    at org.jgroups.protocols.TCP.start(TCP.java:64)
    at org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:938)
    at org.jgroups.JChannel.startStack(JChannel.java:841)
    at org.jgroups.JChannel.connect(JChannel.java:277)
    at org.jgroups.JChannel.connect(JChannel.java:261)
    at org.jboss.as.clustering.jgroups.subsystem.ChannelService.start(ChannelService.java:48)
    ... 5 more
Per Bill:
1) We bind to a loopback via jgroups.bind.addr. We can only bind to a loopback due to security via selinux.
2) org.jgroups.blocks.TCPConnectionMap tries to connect to the remote node and gets an Invalid Arguments exception from java.net. I did some digging and this is because the JVM will not allow a socket bound to a loopback to connect to a remote address.
3) If we try to bind via jgroups.bind.addr to a routable address (e.g. 10.whatever) then selinux blows us up.
Looks like it's the first of these two lines that's causing the problem (JGroups 3.0.9), in the constructor of TCPConnectionMap.TCPConnection:

    this.sock.bind(new InetSocketAddress(bind_addr, 0));
    Util.connect(this.sock, destAddr, sock_conn_timeout);

If this line is commented out, the cluster forms.
If bind_addr is the loopback, then I get the Invalid argument error going from loopback to remote.
If bind_addr is the routable IP/hostname, then selinux denies the binding.

I tested a C app in the same OpenShift environment. If there is no explicit client binding, or if the binding is explicitly "eth0", then it works. If the client binding is "lo", then we see the same behavior as JGroups (fails for remote, works for local). I've also tested connectivity from JBoss to a remote MySQL and that works fine.

I propose we add config to JGroups that disables that explicit client binding.
Created attachment 662542: AS7 JGroups patch
Created attachment 662543: EAP6 JGroups patch
The above attachments can be added and pushed to a git repo to correct this issue. Working on an official patch from the JGroups, AS, and EAP teams.
https://issues.jboss.org/browse/JGRP-1555
https://github.com/belaban/JGroups/pull/69
JGroups 3.0.16.Final has been released. It contains the config change required to fix this problem.
ping, any news since the break?
We have patches for AS7 and EAP6 that a user can apply. I am still waiting to hear from engineering on the status of an official patch for both.
Hi Bill, QE is trying to reproduce this bug. After I created a scalable JBoss EAP app, I checked server.log and boot.log under <app-dir>/jbosseap-6.0/logs/, but I couldn't find any log entries that contain "cluster size". Could you tell us where we can find the cluster size of the JBoss EAP application?
Can't reproduce in devenv_2607 multi_node_env.

1. Set up multi_node_env with 2 nodes.
2. Create a scalable jbossas application in multi_node_env and make sure the gears are on different nodes.

check the haproxy_gear_dns
[qsjbossas-qgong16.dev.rhcloud.com ~]\> env|grep DNS
OPENSHIFT_APP_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_GEAR_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_MYSQL_DB_GEAR_DNS=4f820aa384-qgong16.dev.rhcloud.com

check the db_gear_dns
[4f820aa384-qgong16.dev.rhcloud.com ~]\> env|grep DNS
OPENSHIFT_APP_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_GEAR_DNS=4f820aa384-qgong16.dev.rhcloud.com

check the web_gear_dns
[3068d95dd5-qgong16.dev.rhcloud.com ~]\> env|grep DNS
OPENSHIFT_APP_DNS=qsjbossas-qgong16.dev.rhcloud.com
OPENSHIFT_GEAR_DNS=3068d95dd5-qgong16.dev.rhcloud.com
OPENSHIFT_MYSQL_DB_GEAR_DNS=4f820aa384-qgong16.dev.rhcloud.com

3. Add mongodb, update index.jsp, and git push. The app can connect to the db from either gear.
4. Check the event.log of this app. No error found.
See comment#13. Waiting on upstream to apply fix
Bill, Any updates on this?
Bill any updates on this?
The official patch for EAP is in dev. Should go to prod on the next update. For AS7 I'm going to have to patch it myself - there won't be an official, backwards compatible AS release.
Commits pushed to master at https://github.com/openshift/origin-server

https://github.com/openshift/origin-server/commit/d41e78345e2d572308f33fd380d962266b6e168c
Bug 883944

https://github.com/openshift/origin-server/commit/ef8b75908bdac7c5be2ea62db4d477c80e7e25dd
Merge pull request #1917 from bdecoste/master

Bug 883944 [merge]
Fixed for AS7 with jboss-as7 7.1.0.Final-14
Fixed for EAP6 with jbossas-modules-eap Release 4.2.Final_redhat_4.ep6.el6
QE tested this bug on prod. After modifying standalone.xml in step 6, the cluster works. It is unclear whether the modification to standalone.xml should be made by the user, or whether we should merge this change into our code. If standalone.xml is designed like that, then the bug would be VERIFIED.

Steps:
1. Create a scalable jbosseap-6.0 app named app0
rhc app create app0 jbosseap-6.0 -p $PASSWORD -s

2. Clone and build a sample application that uses clustering
git clone git://github.com/bdecoste/sfsbTest.git
cd sfsbTest
mvn clean package

3. Copy the built war into the deployments directory of app0
cp sfsbTest/target/sfsbTest-1.0.war app0/deployments/

4. Disable auto scaling by:
touch $APP_Dir/.openshift/markers/disable_auto_scaling
Then scale up this app.

5. Git push all the changes, ssh into app0, and check the log using ' tailf jbosseap-6.0/logs/server.log | grep "cluster" '. The server.log shows:
2013/04/09 22:54:14,513 INFO [org.hornetq.core.server.impl.HornetQServerImpl] (MSC service thread 1-3) live server is starting with configuration HornetQ Configuration (clustered=true,backup=false,sharedStore=true,journalDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingjournal,bindingsDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingbindings,largeMessagesDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messaginglargemessages,pagingDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingpaging)
2013/04/09 22:54:32,335 INFO [stdout] (ServerService Thread Pool -- 62) GMS: address=app0-pgy.rhcloud.com/web, cluster=web, physical address=10.28.80.117:57572
2013/04/09 22:54:39,206 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ServerService Thread Pool -- 62) ISPN000094: Received new cluster view: [app0-pgy.rhcloud.com/web|0] [app0-pgy.rhcloud.com/web]
2013/04/09 22:54:41,605 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 62) JBAS010281: Started repl cache from web container
2013/04/09 22:54:41,705 INFO [org.jboss.as.clustering] (MSC service thread 1-3) JBAS010238: Number of cluster members: 1
2013/04/09 22:54:42,403 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 63) JBAS010281: Started default-host/sfsbTest-1.0 cache from web container

6. Add line 312 in app0's standalone.xml.
310 <property name="bind_port">7600</property>
311 <property name="bind_addr">${env.OPENSHIFT_INTERNAL_IP}</property>
+ 312 <property name="defer_client_bind_addr">true</property>

7. Git push all the changes, ssh into app0, and check the log using ' tailf jbosseap-6.0/logs/server.log | grep "cluster" '. The server.log shows:
2013/04/09 23:12:56,928 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 38) JBAS010280: Activating Infinispan subsystem.
2013/04/09 23:12:57,532 INFO [org.jboss.as.clustering.jgroups] (ServerService Thread Pool -- 44) JBAS010260: Activating JGroups subsystem.
2013/04/09 23:13:09,432 INFO [org.hornetq.core.server.impl.HornetQServerImpl] (MSC service thread 1-4) live server is starting with configuration HornetQ Configuration (clustered=true,backup=false,sharedStore=true,journalDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingjournal,bindingsDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingbindings,largeMessagesDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messaginglargemessages,pagingDirectory=/var/lib/openshift/5164c90f500446a72b000380/jbosseap-6.0/jbosseap-6.0/standalone/data/messagingpaging)
2013/04/09 23:13:28,116 INFO [stdout] (ServerService Thread Pool -- 63) GMS: address=app0-pgy.rhcloud.com/web, cluster=web, physical address=10.28.80.117:57572
2013/04/09 23:13:32,611 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (ServerService Thread Pool -- 63) ISPN000094: Received new cluster view: [5164d335e0b8cdee8200008a-pgy.rhcloud.com/web|1] [5164d335e0b8cdee8200008a-pgy.rhcloud.com/web, app0-pgy.rhcloud.com/web]
2013/04/09 23:13:35,703 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 63) JBAS010281: Started repl cache from web container
2013/04/09 23:13:35,606 INFO [org.jboss.as.clustering.infinispan] (ServerService Thread Pool -- 64) JBAS010281: Started default-host/sfsbTest-1.0 cache from web container
2013/04/09 23:13:35,795 INFO [org.jboss.as.clustering] (MSC service thread 1-1) JBAS010238: Number of cluster members: 2
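The cluster size in these logs comes from the JBAS010238 message. A minimal sketch for pulling out the latest reported value, assuming server.log uses the message format shown above; the function name cluster_members is illustrative and not part of OpenShift or JBoss:

```shell
#!/bin/sh
# Print the most recent cluster size reported in a JBoss server.log.
# Assumes the "JBAS010238: Number of cluster members: N" format seen above.
# cluster_members is a hypothetical helper name.
cluster_members() {
    log_file="$1"
    # Keep only the last matching entry, then strip everything but the count.
    grep 'JBAS010238: Number of cluster members:' "$log_file" \
        | tail -n 1 \
        | sed 's/.*Number of cluster members: *\([0-9][0-9]*\).*/\1/'
}
```

From inside the gear this could be called as cluster_members jbosseap-6.0/logs/server.log. It prints nothing if the message has not been logged yet.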
defer_client_bind_addr has been added to the default standalone.xml for both EAP6 and AS7. Users will have to manually modify existing applications to add this element.
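For existing applications, the manual change could be scripted roughly as follows. This is a sketch, not an official tool: the function name add_defer_client_bind is hypothetical, and it assumes the first bind_addr property in standalone.xml belongs to the JGroups TCP transport. Check that assumption against your own file, and back the file up before running it.

```shell
#!/bin/sh
# Insert defer_client_bind_addr after the first bind_addr property in a
# standalone.xml, unless the file already contains it.
# Assumption: the first bind_addr property is the JGroups TCP transport's.
# add_defer_client_bind is a hypothetical helper name.
add_defer_client_bind() {
    xml_file="$1"
    if grep -q 'name="defer_client_bind_addr"' "$xml_file"; then
        return 0   # already present, nothing to do
    fi
    tmp_file="$xml_file.tmp.$$"
    awk '
        { print }
        !added && /<property name="bind_addr">/ {
            indent = $0
            sub(/<property.*/, "", indent)   # keep only the leading whitespace
            print indent "<property name=\"defer_client_bind_addr\">true</property>"
            added = 1
        }
    ' "$xml_file" > "$tmp_file" && mv "$tmp_file" "$xml_file"
}
```

After running it against the application's standalone.xml, review the diff, then commit and git push as in the verification steps above. Running it a second time leaves the file unchanged.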
Per comment 23, marking this bug as verified.