Description of problem:

In a BRMS cluster with a ZooKeeper/Helix environment, if we start the BRMS cluster without starting helix-controller, the boot sequence gets stuck. The Controller Boot Thread logs an error message after 5 minutes, but...

====
[Server:server-one] 18:08:02,814 INFO [org.apache.helix.manager.zk.CallbackHandler] (ZkClient-EventThread-204-localhost:2181,localhost:2182,localhost:2183) nodeOne_12345 subscribes child-change. path: /brms-cluster/INSTANCES/nodeOne_12345/MESSAGES, listener: org.apache.helix.messaging.handling.HelixTaskExecutor@7a95354f
[Server:server-one] 18:08:02,815 INFO [org.apache.helix.messaging.handling.HelixTaskExecutor] (ZkClient-EventThread-204-localhost:2181,localhost:2182,localhost:2183) No Messages to process
[Server:server-one] 18:08:02,815 INFO [org.apache.helix.manager.zk.CallbackHandler] (ZkClient-EventThread-204-localhost:2181,localhost:2182,localhost:2183) 204 END:INVOKE /brms-cluster/INSTANCES/nodeOne_12345/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor Took: 2ms
[Server:server-one] 18:12:09,543 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) JBAS013412: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[("interface" => "management")]'
...
====

The thread in question is blocked forever.

====
"MSC service thread 1-1" prio=10 tid=0x00007fca4450f800 nid=0x199c sleeping[0x00007fca39490000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.uberfire.io.impl.cluster.helix.ClusterServiceHelix.enablePartition(ClusterServiceHelix.java:146)
    at org.uberfire.io.impl.cluster.helix.ClusterServiceHelix.lock(ClusterServiceHelix.java:172)
    at org.uberfire.commons.lock.LockExecuteReleaseTemplate.execute(LockExecuteReleaseTemplate.java:10)
    at org.uberfire.io.impl.cluster.IOServiceClusterImpl.start(IOServiceClusterImpl.java:148)
    at org.uberfire.io.impl.cluster.IOServiceClusterImpl.<init>(IOServiceClusterImpl.java:142)
    at org.uberfire.backend.server.io.ConfigIOServiceProducer.setup(ConfigIOServiceProducer.java:44)
    ...
====

https://github.com/uberfire/uberfire/blob/0.7.x/uberfire-io/src/main/java/org/uberfire/io/impl/cluster/helix/ClusterServiceHelix.java#L144-L149

Of course, starting the domain without the controller is an operator error, but a hang is not desirable anyway. The deployment should fail with informative error messages.

Steps to Reproduce:
1. Set up a BRMS cluster following https://access.redhat.com/documentation/en-US/Red_Hat_JBoss_BRMS/6.2/html-single/Installation_Guide/index.html#Clustering_JAR_Installer
2. Start the BRMS cluster domain properly (run helix-core/startCluster.sh first, then run domain.sh)
3. Shut down the domain
4. Kill the HelixControllerMain process
5. Start the domain

Actual results:
The boot sequence gets stuck as illustrated in the description.

Expected results:
The business-central.war deployment fails, ideally with an error message that identifies the root cause.

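To illustrate the expected result only, here is a minimal sketch of a polling wait that gives up after a deadline instead of sleeping indefinitely. It assumes the wait in enablePartition is a sleep-based polling loop, which the stack trace suggests but does not prove. The names BoundedWait, awaitOrFail, and all parameters are hypothetical and are not part of UberFire or Helix. It does not address the cluster state left behind after a timeout.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not UberFire or Helix code.
final class BoundedWait {

    private BoundedWait() {
    }

    /**
     * Polls the condition until it becomes true, or fails with a descriptive
     * exception once timeoutMillis has elapsed.
     */
    static void awaitOrFail(BooleanSupplier condition,
                            long timeoutMillis,
                            long pollMillis,
                            String what) {
        // nanoTime is monotonic, so wall-clock adjustments do not affect the deadline.
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                        "Timed out after " + timeoutMillis + " ms waiting for " + what
                                + "; is the Helix controller running?");
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + what, e);
            }
        }
    }
}
```

A caller would pass the readiness check as the condition, for example BoundedWait.awaitOrFail(() -> ready.get(), 60_000, 10, "partition lock"), where ready is some flag the caller owns. The resulting exception would surface as a deployment failure rather than a hang.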
Unfortunately, this is not an easy fix: we can't just check whether the controller is missing and then throw an exception or log an error. The problem is tied to Helix behavior and, more complicated still, to the cluster state. Sorry, but this won't be fixed. The good news is that for the 7 series we plan to change BxMS clustering, so this issue will no longer exist.