Bug 1292748 - BRMS cluster boot gets stuck when helix-controller is not up
Status: CLOSED WONTFIX
Product: JBoss BRMS Platform 6
Classification: JBoss
Component: Business Central
Version: 6.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Assigned To: Alexandre Porcelli
QA Contact: Radovan Synek
Reported: 2015-12-18 04:34 EST by Toshiya Kobayashi
Modified: 2016-10-05 04:44 EDT
Doc Type: Bug Fix
Last Closed: 2016-01-06 16:50:20 EST
Type: Bug


Description Toshiya Kobayashi 2015-12-18 04:34:56 EST
Description of problem:

In a BRMS cluster with a ZooKeeper/Helix environment, if we start the BRMS cluster without starting helix-controller, the boot sequence gets stuck. The Controller Boot Thread logs an error after 5 minutes (the 300-second timeout shown below), but the boot remains stuck.

====
[Server:server-one] 18:08:02,814 INFO  [org.apache.helix.manager.zk.CallbackHandler] (ZkClient-EventThread-204-localhost:2181,localhost:2182,localhost:2183) nodeOne_12345 subscribes child-change. path: /brms-cluster/INSTANCES/nodeOne_12345/MESSAGES, listener: org.apache.helix.messaging.handling.HelixTaskExecutor@7a95354f
[Server:server-one] 18:08:02,815 INFO  [org.apache.helix.messaging.handling.HelixTaskExecutor] (ZkClient-EventThread-204-localhost:2181,localhost:2182,localhost:2183) No Messages to process
[Server:server-one] 18:08:02,815 INFO  [org.apache.helix.manager.zk.CallbackHandler] (ZkClient-EventThread-204-localhost:2181,localhost:2182,localhost:2183) 204 END:INVOKE /brms-cluster/INSTANCES/nodeOne_12345/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor Took: 2ms
[Server:server-one] 18:12:09,543 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) JBAS013412: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[("interface" => "management")]'
...
====

The thread in question is blocked forever.

====
"MSC service thread 1-1" prio=10 tid=0x00007fca4450f800 nid=0x199c sleeping[0x00007fca39490000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.uberfire.io.impl.cluster.helix.ClusterServiceHelix.enablePartition(ClusterServiceHelix.java:146)
	at org.uberfire.io.impl.cluster.helix.ClusterServiceHelix.lock(ClusterServiceHelix.java:172)
	at org.uberfire.commons.lock.LockExecuteReleaseTemplate.execute(LockExecuteReleaseTemplate.java:10)
	at org.uberfire.io.impl.cluster.IOServiceClusterImpl.start(IOServiceClusterImpl.java:148)
	at org.uberfire.io.impl.cluster.IOServiceClusterImpl.<init>(IOServiceClusterImpl.java:142)
	at org.uberfire.backend.server.io.ConfigIOServiceProducer.setup(ConfigIOServiceProducer.java:44)
...
====

https://github.com/uberfire/uberfire/blob/0.7.x/uberfire-io/src/main/java/org/uberfire/io/impl/cluster/helix/ClusterServiceHelix.java#L144-L149

Of course, starting the cluster this way is an operator error, but a hang is not desirable anyway. The deployment should fail with informative error messages.
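
The stack trace above shows the MSC service thread sleeping inside ClusterServiceHelix.enablePartition. One generic way to turn an open-ended wait of that kind into a failure with a clear message is a deadline-bounded poll. The following is only a minimal sketch of that idea, not the UberFire code and not a proposed patch: BoundedWait, await, and the condition passed to it are hypothetical names, and the sketch says nothing about how Helix or the cluster state would need to be handled.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of UberFire or Helix. */
final class BoundedWait {

    private BoundedWait() {
    }

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Throws IllegalStateException naming what was awaited instead of sleeping forever.
     */
    static void await(BooleanSupplier condition, long timeoutMillis, long pollMillis, String what) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException(
                        "Timed out after " + timeoutMillis + " ms waiting for " + what);
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and fail fast rather than keep waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + what, e);
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "has this node reached the expected cluster state?".
        // It never becomes true here, simulating a missing helix-controller.
        BooleanSupplier neverReady = () -> false;
        try {
            await(neverReady, 200, 20, "partition to be enabled (is helix-controller running?)");
        } catch (IllegalStateException e) {
            System.out.println("Deployment would fail with: " + e.getMessage());
        }
    }
}
```

With a wait like this, the caller gets an exception that names what was being waited for, which is the kind of informative failure asked for here.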

Steps to Reproduce:
1. Set up a BRMS cluster following https://access.redhat.com/documentation/en-US/Red_Hat_JBoss_BRMS/6.2/html-single/Installation_Guide/index.html#Clustering_JAR_Installer
2. Start the BRMS cluster domain properly (run helix-core/startCluster.sh first, then run domain.sh)
3. Shut down the domain
4. Kill the HelixControllerMain process
5. Start the domain

Actual results:

The boot sequence gets stuck as illustrated in the description.

Expected results:

The business-central.war deployment fails, ideally with an error message that states the root cause.
Comment 1 Alexandre Porcelli 2016-01-06 16:50:20 EST
Unfortunately, this is not an easy fix: we can't just check whether the controller is missing and throw an exception or log a message. This is tied to Helix behavior and, more complicated still, to the cluster state.
Sorry, but this won't be fixed. The good news is that for the 7 series we plan to change BxMS clustering, so this issue won't exist anymore.