Bug 1292748 - BRMS cluster boot gets stuck when helix-controller is not up
Product: JBoss BRMS Platform 6
Classification: JBoss
Component: Business Central
Version: unspecified
Severity: high
Assigned To: Alexandre Porcelli
QA Contact: Radovan Synek
Reported: 2015-12-18 04:34 EST by Toshiya Kobayashi
Modified: 2016-10-05 04:44 EDT

Doc Type: Bug Fix
Last Closed: 2016-01-06 16:50:20 EST
Type: Bug

Attachments: None
Description Toshiya Kobayashi 2015-12-18 04:34:56 EST
Description of problem:

In a BRMS cluster with a ZooKeeper/Helix environment, if the BRMS cluster is started without starting helix-controller first, the boot sequence gets stuck. The Controller Boot Thread logs an error after 5 minutes, but the boot never completes:

[Server:server-one] 18:08:02,814 INFO  [org.apache.helix.manager.zk.CallbackHandler] (ZkClient-EventThread-204-localhost:2181,localhost:2182,localhost:2183) nodeOne_12345 subscribes child-change. path: /brms-cluster/INSTANCES/nodeOne_12345/MESSAGES, listener: org.apache.helix.messaging.handling.HelixTaskExecutor@7a95354f
[Server:server-one] 18:08:02,815 INFO  [org.apache.helix.messaging.handling.HelixTaskExecutor] (ZkClient-EventThread-204-localhost:2181,localhost:2182,localhost:2183) No Messages to process
[Server:server-one] 18:08:02,815 INFO  [org.apache.helix.manager.zk.CallbackHandler] (ZkClient-EventThread-204-localhost:2181,localhost:2182,localhost:2183) 204 END:INVOKE /brms-cluster/INSTANCES/nodeOne_12345/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor Took: 2ms
[Server:server-one] 18:12:09,543 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) JBAS013412: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[("interface" => "management")]'

The affected thread remains blocked forever:

"MSC service thread 1-1" prio=10 tid=0x00007fca4450f800 nid=0x199c sleeping[0x00007fca39490000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.uberfire.io.impl.cluster.helix.ClusterServiceHelix.enablePartition(ClusterServiceHelix.java:146)
	at org.uberfire.io.impl.cluster.helix.ClusterServiceHelix.lock(ClusterServiceHelix.java:172)
	at org.uberfire.commons.lock.LockExecuteReleaseTemplate.execute(LockExecuteReleaseTemplate.java:10)
	at org.uberfire.io.impl.cluster.IOServiceClusterImpl.start(IOServiceClusterImpl.java:148)
	at org.uberfire.io.impl.cluster.IOServiceClusterImpl.<init>(IOServiceClusterImpl.java:142)
	at org.uberfire.backend.server.io.ConfigIOServiceProducer.setup(ConfigIOServiceProducer.java:44)
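
For reference, the stack trace above is consistent with an unbounded poll-and-sleep loop: enablePartition keeps checking the partition state reported by Helix and sleeping until it becomes ONLINE, which never happens because no controller is running to drive the state transition. A minimal sketch of that pattern (hypothetical names and helper, not the actual ClusterServiceHelix source):

import java.util.function.Supplier;

// Hypothetical illustration of the blocking pattern behind the stack trace above.
public class PartitionEnableLoop {

    // partitionState stands in for a lookup of the partition's state in the
    // Helix external view (e.g. OFFLINE/ONLINE).
    public static void enablePartition(Supplier<String> partitionState) {
        // With no Helix controller running, no state transition is ever issued,
        // so the state never becomes ONLINE and this loop never exits.
        while (!"ONLINE".equals(partitionState.get())) {
            try {
                Thread.sleep(10); // matches the TIMED_WAITING (sleeping) state in the thread dump
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}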


Of course, starting the cluster without the controller is an incorrect operation, but a hang is not desirable in any case. The deployment should fail with informative error messages.

Steps to Reproduce:
1. Set up a BRMS cluster following https://access.redhat.com/documentation/en-US/Red_Hat_JBoss_BRMS/6.2/html-single/Installation_Guide/index.html#Clustering_JAR_Installer
2. Start the BRMS cluster domain properly (run helix-core/startCluster.sh first, then run domain.sh)
3. Shut down the domain
4. Kill the HelixControllerMain process
5. Start the domain

Actual results:

The boot sequence gets stuck as illustrated in the description.

Expected results:

The business-central.war deployment fails, ideally with an error message that indicates the root cause.
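
One way to achieve that would be to bound the wait and raise a descriptive exception instead of sleeping indefinitely, so the deployment fails fast and points at the missing controller. A minimal sketch under those assumptions (hypothetical names, not the actual fix):

import java.util.function.Supplier;

// Hypothetical illustration of a bounded wait that fails with a descriptive error
// instead of blocking the boot thread forever.
public class BoundedPartitionWait {

    public static void awaitOnline(Supplier<String> partitionState,
                                   String partition, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!"ONLINE".equals(partitionState.get())) {
            if (System.currentTimeMillis() >= deadline) {
                throw new IllegalStateException(
                        "Partition " + partition + " did not become ONLINE within "
                        + timeoutMillis + " ms. Is the Helix controller (HelixControllerMain) running?");
            }
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(
                        "Interrupted while waiting for partition " + partition, e);
            }
        }
    }
}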
Comment 1 Alexandre Porcelli 2016-01-06 16:50:20 EST
Unfortunately this is not an easy fix: we can't simply check whether the controller is missing and throw an exception or log a message, because the behavior depends on Helix and, even more complicated, on the cluster state.
Sorry, but this won't be fixed. The good news is that for the 7 series we plan to change BxMS clustering, so this issue will no longer exist.
