Description of problem:
Assume there is a cluster of 3 instances and there are 8 HASingleton services. The election policy is based on the number of nodes and the service name, so it should be deterministic as long as every instance sees the same number of nodes.
When the nodes are started, the services are distributed as follows:
- node1 and node2: 6/2
- all nodes: 2/2/4
The expectation is that the same distribution is used if node3 is stopped and started again. This is the case if nodes are stopped gracefully. But if nodes are suspected by JGroups (crashed or not responding), the services might be stopped and switched to a different node, and not all services are started correctly.
If a node crashed and is restarted, the services are corrected during the restart. If a node was suspected and then merged, it takes longer, but the cluster heals in the end.

How reproducible:
Use the quickstart cluster-ha-singleton from git:wfink/jboss-eap-quickstarts.git branch:6.4.x-develop_MultiHaSingleton
Follow the README, but install 3 servers and start them with:
bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node1 -Djboss.socket.binding.port-offset=0
bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node2 -Djboss.socket.binding.port-offset=100
bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node3 -Djboss.socket.binding.port-offset=200
Each of the 8 services should be started on exactly one of the nodes.
Kill one node with "kill -9" on Linux. After that, services might be started or stopped on the remaining nodes, but not all 8 are started in the end.

Expected results:
After the cluster is stable again and all remaining nodes see the same view, all services should be started.

Additional info:
Note there is a difference between EAP6 and EAP7+/WildFly.
EAP6: each node starts a new election
EAP7/WildFly: only the cluster coordinator starts the election
Because of this the election must be deterministic for EAP6; from EAP7 on the election can be non-deterministic/random because the coordinator manages the distribution.
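To illustrate what "deterministic" means here, the following is a minimal sketch of an election that depends only on the service name and the set of members, so every node that sees the same view picks the same owner. The class DeterministicElection and its elect method are hypothetical names made up for this example; this is not the policy used in the quickstart.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not the quickstart's election policy.
final class DeterministicElection {

    // Picks the owner of a service from the current view.
    // The result depends only on the service name and the member names,
    // not on the order in which a node happens to list the members.
    static String elect(String serviceName, List<String> members) {
        if (members.isEmpty()) {
            throw new IllegalArgumentException("no members in view");
        }
        List<String> sorted = new ArrayList<>(members);
        Collections.sort(sorted); // same order on every node
        int sum = 0;
        for (byte b : serviceName.getBytes(StandardCharsets.UTF_8)) {
            sum += b & 0xff; // simple, stable checksum of the name
        }
        return sorted.get(sum % sorted.size());
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("node3", "node1", "node2");
        for (String service : Arrays.asList("One", "Two", "Three")) {
            System.out.println(service + " -> " + elect(service, view));
        }
    }
}
```

Because the member list is sorted before the index is applied, two nodes that hold the same view in a different order still agree on the owner, which is the property EAP6 needs when each node runs its own election.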
Created attachment 1145184 [details] Logfile node1
Created attachment 1145185 [details] Logfile node2
Created attachment 1145186 [details] Logfile node3
This is a bit confusing but I think this is the description of the problem. When a cluster node is killed, the only node that reacts is the coordinator (it properly restarts all the services of that node): https://github.com/jbossas/jboss-eap/blob/EAP_6.4.7.CR3-dev/clustering/service/src/main/java/org/jboss/as/clustering/service/ServiceProviderRegistryService.java#L149 When this happens, the logic for starting/stopping singleton services does not run on the other nodes: https://github.com/jbossas/jboss-eap/blob/EAP_6.4.7.CR3-dev/clustering/singleton/src/main/java/org/jboss/as/clustering/singleton/SingletonService.java#L163 This means that the only singleton services running on the slave nodes are those that were already running there (in some cases they are stopped by the coordinator). When a node joins the cluster again, the listeners are notified on all nodes, so they start the services again.
Not sure whether only the coordinator is reacting; I did not try to reproduce with more nodes. But in my test it is not 'only' the coordinator that reacts: services which move to the coordinator are stopped on the slave nodes, but additional services are not started there. In my test one service was stopped on the slave and started by the coordinator, but another service was not started on that slave.
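The behaviour described in the two comments above can be reproduced with a toy model. This is only a sketch under stated assumptions, not the EAP code: the class ViewChangeModel, the index-based election and the rule that a takeover stops the service elsewhere are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model (assumption): shows how services go missing when only
// some of the surviving nodes re-run the election after a view change.
final class ViewChangeModel {
    static final int SERVICES = 8;

    // Toy election: service i is owned by the (i mod n)-th member of the sorted view.
    static String elect(int service, List<String> view) {
        List<String> sorted = new ArrayList<>(view);
        Collections.sort(sorted);
        return sorted.get(service % sorted.size());
    }

    // Initial state: every service runs on its elected node.
    static Map<String, Set<Integer>> start(List<String> view) {
        Map<String, Set<Integer>> running = new HashMap<>();
        for (String node : view) running.put(node, new TreeSet<>());
        for (int s = 0; s < SERVICES; s++) running.get(elect(s, view)).add(s);
        return running;
    }

    // Applies a new view; only the nodes in 'reacting' re-run the election.
    // Returns the set of services running anywhere afterwards.
    static Set<Integer> applyView(Map<String, Set<Integer>> running,
                                  List<String> newView, List<String> reacting) {
        running.keySet().retainAll(newView); // services on dead nodes are gone
        for (String node : reacting) {
            for (int s = 0; s < SERVICES; s++) {
                if (elect(s, newView).equals(node)) {
                    // takeover: the service is stopped wherever else it ran
                    for (Set<Integer> other : running.values()) other.remove(s);
                    running.get(node).add(s);
                } else {
                    running.get(node).remove(s);
                }
            }
        }
        Set<Integer> all = new TreeSet<>();
        for (Set<Integer> perNode : running.values()) all.addAll(perNode);
        return all;
    }

    public static void main(String[] args) {
        List<String> full = Arrays.asList("node1", "node2", "node3");
        List<String> survivors = Arrays.asList("node1", "node2");
        System.out.println("only coordinator reacts: "
                + applyView(start(full), survivors, Arrays.asList("node1")));
        System.out.println("all survivors react: "
                + applyView(start(full), survivors, survivors));
    }
}
```

In this model, if only node1 reacts to the loss of node3, the services that the new view assigns to node2 but that node2 was not already running are never started, so fewer than 8 services remain. If both survivors react, all 8 are running again.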
Hi wfink: The problem in those cases is that the coordinator is the master but is not elected as an executing node. This means that it will stop the service on the coordinator; because the slave is not notified of the group change, it never starts the service in those cases. So it will stop the service on the slave. It seems the condition master != not elected is not working properly (not sure whether it makes sense in this case): https://github.com/jbossas/jboss-eap/blob/EAP_6.4.7.CR3-dev/clustering/singleton/src/main/java/org/jboss/as/clustering/singleton/SingletonService.java#L178 I will keep looking into this.
6.4.x: https://github.com/jbossas/jboss-eap/pull/2762 no upstream required.
Created attachment 1152348 [details] Logs for start failure.zip
It seems that the situation has improved, as failover works. I've tried several times to kill or shut down a node. The distribution of the services is mostly the same:
two nodes: (123456)(78)
three nodes: (37)(45)(1268)
But after I killed one node and stopped another, the remaining node keeps all services. When I restart the other nodes, one of them does not start its services. See the attached logs, last start.
I think the problem is a wholly different one. The 3rd node tries to start the services (1268) while the others are running (37) (45). The 3rd node throws an exception like this for those services:
{code}
jboss.quickstart.ha.singleton.timer.Eight.service: Could not initialize timer
	at org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.HATimerService.start(HATimerService.java:80)
	at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1980) [jboss-msc-1.1.5.Final-redhat-1.jar:1.1.5.Final-redhat-1]
	at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1913) [jboss-msc-1.1.5.Final-redhat-1.jar:1.1.5.Final-redhat-1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_91]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_91]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91]
Caused by: javax.naming.NameNotFoundException: Error looking up global/jboss-cluster-ha-singleton-service/SchedulerEightBean!org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.Scheduler, service service jboss.naming.context.java.global.jboss-cluster-ha-singleton-service."SchedulerEightBean!org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.Scheduler" is not started
	at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:133)
	at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:81)
	at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:197)
	at org.jboss.as.naming.InitialContext$DefaultInitialContext.lookup(InitialContext.java:243)
	at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:183)
	at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:179)
	at javax.naming.InitialContext.lookup(InitialContext.java:417) [rt.jar:1.8.0_91]
	at javax.naming.InitialContext.lookup(InitialContext.java:417) [rt.jar:1.8.0_91]
	at org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.HATimerService.start(HATimerService.java:77)
	... 5 more
{code}
It seems that the lookup is done before the EJB has started, which causes this exception. I will keep looking into it, but I think it is a different bug.
#c9 added a new stack trace, but it is not relevant to this bug. The problem is in the quickstart (it is a bug IMO). In this case the HATimerServiceActivator does not declare any dependency on the EJB component create/start service, which causes a race condition: https://github.com/jboss-developer/jboss-eap-quickstarts/blob/6.4.x/cluster-ha-singleton/service/src/main/java/org/jboss/as/quickstarts/cluster/hasingleton/service/ejb/HATimerServiceActivator.java#L59 .addDependency should be called there with the right EJB component service. I don't really know how quickstarts are handled, so I am not sure whether I should open a new BZ (quickstarts + doc?).
The PR is not merged yet and there is still an ongoing technical discussion about how to fix the problem. It may not get into 6.4.9 at all. Changing devel_ack to '?'; I will set it back to '+' once this gets merged into the 6.4.x branch. https://github.com/jbossas/jboss-eap/pull/2762
Verified with EAP 6.4.9.CP.CR1
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.