| Summary: | [GSS](6.4.z) Failover of HASingleton service is not correct if a Custom election policy is used and a node crash | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [JBoss] JBoss Enterprise Application Platform 6 | Reporter: | wfink | ||||||||||
| Component: | Clustering | Assignee: | Enrique Gonzalez Martinez <egonzale> | ||||||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Michal Vinkler <mvinkler> | ||||||||||
| Severity: | high | Docs Contact: | |||||||||||
| Priority: | high | ||||||||||||
| Version: | 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5, 6.4.6, 6.4.7 | CC: | bmaxwell, cdewolf, egonzale, jbilek, jtruhlar, msochure, paul.ferraro, rnetuka, thies.rubarth | ||||||||||
| Target Milestone: | CR1 | ||||||||||||
| Target Release: | EAP 6.4.9 | ||||||||||||
| Hardware: | Unspecified | ||||||||||||
| OS: | Unspecified | ||||||||||||
| Whiteboard: | |||||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||||
| Doc Text: | Story Points: | --- | |||||||||||
| Clone Of: | |||||||||||||
| : | 1344378 (view as bug list) | Environment: | |||||||||||
| Last Closed: | 2017-01-17 12:54:37 UTC | Type: | Bug | ||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||
| Documentation: | --- | CRM: | |||||||||||
| Verified Versions: | Category: | --- | |||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||
| Bug Depends On: | |||||||||||||
| Bug Blocks: | 1344378, 1324262 | ||||||||||||
| Attachments: |
Description
wfink
2016-04-08 15:36:08 UTC
Created attachment 1145184 [details]
Logfile node1
Created attachment 1145185 [details]
Logfile node2
Created attachment 1145186 [details]
Logfile node3
This is a bit confusing, but I think this describes the problem. When a cluster node is killed, the only node that reacts is the coordinator, which properly restarts all the services from that node:
https://github.com/jbossas/jboss-eap/blob/EAP_6.4.7.CR3-dev/clustering/service/src/main/java/org/jboss/as/clustering/service/ServiceProviderRegistryService.java#L149
When this happens, the logic for starting/stopping singleton services does not run on the other nodes:
https://github.com/jbossas/jboss-eap/blob/EAP_6.4.7.CR3-dev/clustering/singleton/src/main/java/org/jboss/as/clustering/singleton/SingletonService.java#L163
This means that the only singleton services running on the slave nodes are those that were already running there (in some cases they are stopped by the coordinator). When a node rejoins the cluster, the listeners are notified again on all nodes, so they start the services again.

I am not sure whether only the coordinator is reacting; I did not try to reproduce with more nodes. But in my test it is not 'only' the coordinator that reacts: services that move to the coordinator are stopped on the slave nodes, but additional services are not started there. In my test one service was stopped on the slave and started on the coordinator, but another service was not started on that slave.

Hi wfink: The problem in those cases is that the coordinator is the master but it is not elected as an executing node. This means that it will stop the service on the coordinator; because the slave is not notified of the group change, it never starts the service in those cases. So it will stop the service on the slave. It seems the condition master != not elected is not working properly (not sure whether it makes sense in this case):
https://github.com/jbossas/jboss-eap/blob/EAP_6.4.7.CR3-dev/clustering/singleton/src/main/java/org/jboss/as/clustering/singleton/SingletonService.java#L178
I will keep looking into this.
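To make the failover gap described above easier to follow, here is a minimal sketch in plain Java. It is only a model: `SingletonFailoverSketch`, `ElectionPolicy`, `Member` and the node names are hypothetical and are not the EAP clustering classes, and the election preference (node3, then node2) is an assumption chosen to mirror a custom policy that does not pick the coordinator. It compares what happens when only the coordinator re-runs the election after a crash with what happens when every surviving member does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified model only: these types are hypothetical and are not the EAP clustering classes.
class SingletonFailoverSketch {

    /** Hypothetical stand-in for a custom election policy. */
    interface ElectionPolicy {
        String elect(List<String> candidates);
    }

    /** One cluster member's local state for a single singleton service. */
    static final class Member {
        final String name;
        boolean running;

        Member(String name) {
            this.name = name;
        }

        // Re-run the election locally; start if elected, stop otherwise.
        void onMembershipChange(List<String> view, ElectionPolicy policy) {
            running = name.equals(policy.elect(view));
        }
    }

    /** Simulates a crash of node3 and returns the members running the service afterwards. */
    static Set<String> failover(boolean onlyCoordinatorReacts) {
        // Assumed policy: prefer node3, then node2, otherwise the first candidate.
        ElectionPolicy policy = candidates -> candidates.contains("node3") ? "node3"
                : candidates.contains("node2") ? "node2" : candidates.get(0);

        List<String> view = new ArrayList<>(Arrays.asList("node1", "node2", "node3"));
        List<Member> members = new ArrayList<>();
        for (String name : view) {
            Member m = new Member(name);
            m.onMembershipChange(view, policy); // initial election: node3 runs the service
            members.add(m);
        }

        // node3 crashes; node1 is the coordinator of the remaining view.
        view.remove("node3");
        members.removeIf(m -> m.name.equals("node3"));
        String coordinator = view.get(0);

        for (Member m : members) {
            if (!onlyCoordinatorReacts || m.name.equals(coordinator)) {
                m.onMembershipChange(view, policy);
            }
        }

        Set<String> running = new TreeSet<>();
        for (Member m : members) {
            if (m.running) {
                running.add(m.name);
            }
        }
        return running;
    }
}
```

In this model, `failover(true)` leaves the service running nowhere, because the coordinator is not elected and node2 never re-evaluates, while `failover(false)` ends with node2 running it. Whether the actual fix takes this shape is not stated in this report.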
6.4.x: https://github.com/jbossas/jboss-eap/pull/2762
no upstream required.
Created attachment 1152348 [details]
Logs for start failure.zip
It seems that the situation has improved, as failover works; I've tried several times to kill or shut down a node.
The distribution of the services is mostly the same:
two nodes: (123456)(78)
three nodes: (37)(45)(1268)
But after I killed one node and stopped another, the remaining node keeps all services.
When I restart the other nodes, one of them does not start the services.
See attached logs, last start.
I think the problem is a whole different one. The 3rd node tries to start the services (1268) while the others are running (37) (45). The 3rd node throws an exception like this for those services:
{code}
jboss.quickstart.ha.singleton.timer.Eight.service: Could not initialize timer
at org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.HATimerService.start(HATimerService.java:80)
at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1980) [jboss-msc-1.1.5.Final-redhat-1.jar:1.1.5.Final-redhat-1]
at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1913) [jboss-msc-1.1.5.Final-redhat-1.jar:1.1.5.Final-redhat-1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91]
Caused by: javax.naming.NameNotFoundException: Error looking up global/jboss-cluster-ha-singleton-service/SchedulerEightBean!org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.Scheduler, service service jboss.naming.context.java.global.jboss-cluster-ha-singleton-service."SchedulerEightBean!org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.Scheduler" is not started
at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:133)
at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:81)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:197)
at org.jboss.as.naming.InitialContext$DefaultInitialContext.lookup(InitialContext.java:243)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:183)
at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:179)
at javax.naming.InitialContext.lookup(InitialContext.java:417) [rt.jar:1.8.0_91]
at javax.naming.InitialContext.lookup(InitialContext.java:417) [rt.jar:1.8.0_91]
at org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.HATimerService.start(HATimerService.java:77)
... 5 more
{code}
It seems that the lookup is done before the EJB has started, which causes this exception. I will keep looking into it, but I think it is a different bug.
#c9 added a new stack trace, but it is not relevant to this bug. The problem is in the quickstart (it is a bug IMO). In this case the HATimerServiceActivator does not create any dependency on the EJB component create/start service, causing a race condition:
https://github.com/jboss-developer/jboss-eap-quickstarts/blob/6.4.x/cluster-ha-singleton/service/src/main/java/org/jboss/as/quickstarts/cluster/hasingleton/service/ejb/HATimerServiceActivator.java#L59
.addDependency should be called there against the right EJB component. I don't really know how quickstarts are handled, so I am not sure whether I should open a new BZ (quickstarts + doc?).

The PR is not merged yet and there is still an ongoing technical discussion about how to fix the problem. It may not get into 6.4.9 at all. Changing devel_ack to '?'; I will set it back to '+' once this gets merged into the 6.4.x branch.
https://github.com/jbossas/jboss-eap/pull/2762

Verified with EAP 6.4.9.CP.CR1

Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.
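For reference, the quickstart race discussed in this thread (the timer service looking up the EJB before the EJB has started) can be illustrated with a small model of start ordering. This is a sketch under stated assumptions, not JBoss MSC: `DependencyOrderSketch` and `MiniContainer` are hypothetical, the service names are plain labels borrowed from the stack trace, and the real container starts services concurrently rather than in install order. It also does not handle dependency cycles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical mini container, not JBoss MSC: it only models start ordering.
class DependencyOrderSketch {

    static final class MiniContainer {
        private final Map<String, List<String>> dependencies = new LinkedHashMap<>();
        private final List<String> started = new ArrayList<>();

        MiniContainer install(String service, String... dependsOn) {
            dependencies.put(service, Arrays.asList(dependsOn));
            return this;
        }

        // Starts services in install order, but always starts declared dependencies first.
        List<String> startAll() {
            for (String service : dependencies.keySet()) {
                start(service);
            }
            return started;
        }

        private void start(String service) {
            if (started.contains(service)) {
                return;
            }
            for (String dep : dependencies.getOrDefault(service, Collections.<String>emptyList())) {
                start(dep);
            }
            started.add(service);
        }
    }

    /** Returns the start order when the timer service is installed before the EJB. */
    static List<String> startOrder(boolean declareDependency) {
        MiniContainer container = new MiniContainer();
        if (declareDependency) {
            container.install("HATimerService", "SchedulerEightBean");
        } else {
            container.install("HATimerService");
        }
        container.install("SchedulerEightBean");
        return container.startAll();
    }
}
```

Without the declared dependency the model starts HATimerService first, which corresponds to the lookup failing with NameNotFoundException; with it, SchedulerEightBean is always started before the timer service.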