Bug 1325376 - [GSS](6.4.z) Failover of HASingleton service is not correct if a custom election policy is used and a node crashes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: JBoss Enterprise Application Platform 6
Classification: JBoss
Component: Clustering
Version: 6.4.0,6.4.1,6.4.2,6.4.3,6.4.4,6.4.5,6.4.6,6.4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: CR1
Target Release: EAP 6.4.9
Assignee: Enrique Gonzalez Martinez
QA Contact: Michal Vinkler
URL:
Whiteboard:
Depends On:
Blocks: eap649-payload 1344378
 
Reported: 2016-04-08 15:36 UTC by wfink
Modified: 2019-11-14 07:45 UTC
CC List: 9 users

Fixed In Version:
Clone Of:
Clones: 1344378
Environment:
Last Closed: 2017-01-17 12:54:37 UTC
Type: Bug
Embargoed:


Attachments:
Logfile node1 (112.06 KB, text/plain), 2016-04-08 15:37 UTC, wfink
Logfile node2 (95.47 KB, text/plain), 2016-04-08 15:37 UTC, wfink
Logfile node3 (58.57 KB, text/plain), 2016-04-08 15:37 UTC, wfink
Logs for start failure.zip (86.86 KB, application/zip), 2016-04-29 15:59 UTC, wfink


Links:
- Red Hat Bugzilla 1335495 (CLOSED): Clustering quickstart race condition, last updated 2021-02-22 00:41:40 UTC
- Red Hat Knowledge Base (Solution) 2361191, last updated 2016-06-09 15:08:32 UTC

Internal Links: 1335495

Description wfink 2016-04-08 15:36:08 UTC
Description of problem:
Assume there is a cluster of 3 instances
and there are 8 HASingleton services; the election policy is based on the number of nodes and the service name.
- this should be deterministic if the node numbers are the same for each instance!

When the nodes are started, the services are distributed as follows:
- node1 and node2: 6/2
- all three nodes: 2/2/4
The expectation is that the same distribution is used if node3 is stopped and started again.

This is the case if nodes are stopped gracefully.
But if nodes are suspected by JGroups (crashed or not responding), the services might be stopped and switched to a different node, yet not all services are started correctly.

If a node crashed and is restarted, the distribution is corrected during the restart.
If a node was only suspected and the cluster merges again, it takes longer but heals in the end.


How reproducible:
Use the quickstart cluster-ha-singleton from here
  git:wfink/jboss-eap-quickstarts.git  branch:6.4.x-develop_MultiHaSingleton

Follow the README, but install 3 servers and start them with:
 bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node1 -Djboss.socket.binding.port-offset=0
 bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node2 -Djboss.socket.binding.port-offset=100
 bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node3 -Djboss.socket.binding.port-offset=200

Each of the 8 services should be started on exactly one of the nodes.
Kill one node, e.g. with "kill -9" on Linux.
After that, services might be started or stopped on the remaining nodes, but not all 8 are started at the end.


Expected results:
After the cluster is stable again and all remaining nodes see the same view, all services should be started.


Additional info:
Note there is a difference between EAP 6 and EAP 7+/WildFly.
EAP 6:
  each node starts its own election
EAP 7/WildFly:
  only the cluster coordinator starts the election

Because of this, the election must be deterministic for EAP 6; in EAP 7+ the election can be non-deterministic/random because the coordinator manages the distribution.
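
For illustration only (this is not the quickstart's actual policy), an election of this kind can be made deterministic by hashing the service name over the sorted member list; every node then computes the same owner for the same view, which is what EAP 6 requires:

{code}
// Hypothetical sketch of a deterministic election: pick the executing node for
// a given singleton service from the current view, using only the sorted node
// names and the service name. Every node computes the same result for the same
// view, independent of join order.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class DeterministicElection {

    public static String elect(List<String> view, String serviceName) {
        List<String> sorted = new ArrayList<String>(view);
        Collections.sort(sorted); // same order on every node
        int index = Math.floorMod(serviceName.hashCode(), sorted.size());
        return sorted.get(index);
    }
}
{code}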

Comment 1 wfink 2016-04-08 15:37:03 UTC
Created attachment 1145184 [details]
Logfile node1

Comment 2 wfink 2016-04-08 15:37:31 UTC
Created attachment 1145185 [details]
Logfile node2

Comment 3 wfink 2016-04-08 15:37:55 UTC
Created attachment 1145186 [details]
Logfile node3

Comment 5 Enrique Gonzalez Martinez 2016-04-14 14:33:29 UTC
This is a bit confusing, but I think this is the description of the problem.

When a cluster node is killed, the only node that reacts is the coordinator (which properly restarts the services that were running on the killed node):

https://github.com/jbossas/jboss-eap/blob/EAP_6.4.7.CR3-dev/clustering/service/src/main/java/org/jboss/as/clustering/service/ServiceProviderRegistryService.java#L149

When this happens, the logic for starting/stopping singleton services is not triggered
on the other nodes:

https://github.com/jbossas/jboss-eap/blob/EAP_6.4.7.CR3-dev/clustering/singleton/src/main/java/org/jboss/as/clustering/singleton/SingletonService.java#L163

This means that the only singleton services running on the slave nodes are those that were already running there (and in some cases those are stopped by the coordinator).

When a node joins the cluster again, the listeners are notified on all nodes, so they start the services again.

Comment 6 wfink 2016-04-15 06:37:12 UTC
I'm not sure whether only the coordinator is reacting; I did not try to reproduce with more nodes.
But from my test it is not 'only' the coordinator that reacts: services which move to the coordinator are stopped on the slave nodes, but additional services are not started there.

In my test one service was stopped on the slave and started on the coordinator, but another service was not started on that slave.

Comment 7 Enrique Gonzalez Martinez 2016-04-15 08:50:25 UTC
Hi wfink:

The problem in those cases is that the coordinator is the master but it is not elected as the executing node. This means the service is stopped on the coordinator; because the slave is not notified of the group change, it never starts the service in those cases, so the service ends up stopped on the slave as well.

It seems the condition 'master but not elected' is not handled properly (not sure if it even makes sense in this case):

https://github.com/jbossas/jboss-eap/blob/EAP_6.4.7.CR3-dev/clustering/singleton/src/main/java/org/jboss/as/clustering/singleton/SingletonService.java#L178
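
A sketch of the reaction described above, reusing the hypothetical DeterministicElection helper from the description (this is not the actual SingletonService code, just the behaviour every node would need on a view change):

{code}
// Sketch only (not the real EAP 6 SingletonService code): the reaction each
// node would need to perform on a membership change. If only the coordinator
// runs this, a slave that becomes the elected node never starts the service.
import java.util.List;

abstract class MembershipReactionSketch {

    private final String localNode;
    private boolean started;

    MembershipReactionSketch(String localNode) {
        this.localNode = localNode;
    }

    void membershipChanged(List<String> newView, String serviceName) {
        // Any policy works as long as all nodes agree on the result for the same view.
        String elected = DeterministicElection.elect(newView, serviceName);

        if (elected.equals(localNode) && !started) {
            startService();   // newly elected node (possibly a slave) must start
            started = true;
        } else if (!elected.equals(localNode) && started) {
            stopService();    // previous owner (e.g. the coordinator) must stop
            started = false;
        }
    }

    abstract void startService();

    abstract void stopService();
}
{code}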

I will keep looking into this.

Comment 8 Enrique Gonzalez Martinez 2016-04-20 08:17:23 UTC
6.4.x: https://github.com/jbossas/jboss-eap/pull/2762

No upstream fix is required.

Comment 9 wfink 2016-04-29 15:59:40 UTC
Created attachment 1152348 [details]
Logs for start failure.zip

It seems the situation has improved, as failover works; I've tried several times to kill or shut down a node.
The distribution of the services is mostly the same:
- two nodes: (123456)(78)
- three nodes: (37)(45)(1268)

But after I killed one node and stopped another, the remaining node kept all services.
When I restarted the other nodes, one of them did not start its services.
See attached logs, last start.

Comment 10 Enrique Gonzalez Martinez 2016-05-02 10:31:09 UTC
I think the problem is a completely different one. The third node tries to start the services (1268) while the others are running (37) and (45). The third node throws an exception like this for those services:

{code}
jboss.quickstart.ha.singleton.timer.Eight.service: Could not initialize timer
	at org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.HATimerService.start(HATimerService.java:80)
	at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1980) [jboss-msc-1.1.5.Final-redhat-1.jar:1.1.5.Final-redhat-1]
	at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1913) [jboss-msc-1.1.5.Final-redhat-1.jar:1.1.5.Final-redhat-1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_91]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_91]
	at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_91]
Caused by: javax.naming.NameNotFoundException: Error looking up global/jboss-cluster-ha-singleton-service/SchedulerEightBean!org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.Scheduler, service service jboss.naming.context.java.global.jboss-cluster-ha-singleton-service."SchedulerEightBean!org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.Scheduler" is not started
	at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:133)
	at org.jboss.as.naming.ServiceBasedNamingStore.lookup(ServiceBasedNamingStore.java:81)
	at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:197)
	at org.jboss.as.naming.InitialContext$DefaultInitialContext.lookup(InitialContext.java:243)
	at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:183)
	at org.jboss.as.naming.NamingContext.lookup(NamingContext.java:179)
	at javax.naming.InitialContext.lookup(InitialContext.java:417) [rt.jar:1.8.0_91]
	at javax.naming.InitialContext.lookup(InitialContext.java:417) [rt.jar:1.8.0_91]
	at org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.HATimerService.start(HATimerService.java:77)
	... 5 more

{code}

It seems the lookup is done before the EJB is started, which causes this failure. I will keep looking into it, but I think it is a different bug.
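
For context, the failing start corresponds to a JNDI lookup roughly like the following (reconstructed from the stack trace above, not copied from the quickstart source):

{code}
// Reconstructed from the stack trace, not from the actual quickstart code:
// HATimerService.start() looks up the scheduler bean via JNDI. If the EJB
// component's binding service has not started yet, the lookup fails with
// NameNotFoundException and the whole singleton service fails to start.
InitialContext ctx = new InitialContext();
Scheduler scheduler = (Scheduler) ctx.lookup(
        "global/jboss-cluster-ha-singleton-service/SchedulerEightBean"
        + "!org.jboss.as.quickstarts.cluster.hasingleton.service.ejb.Scheduler");
{code}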

Comment 11 Enrique Gonzalez Martinez 2016-05-02 14:07:20 UTC
Comment 9 added a new stack trace, but it is not relevant to this bug. The problem is in the quickstart (it is a bug IMO): the HATimerServiceActivator does not create any dependency on the EJB component create/start service, causing a race condition:

https://github.com/jboss-developer/jboss-eap-quickstarts/blob/6.4.x/cluster-ha-singleton/service/src/main/java/org/jboss/as/quickstarts/cluster/hasingleton/service/ejb/HATimerServiceActivator.java#L59

.addDependency should be called there against the right EJB component, for example as in the sketch below.
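
A minimal sketch of the kind of fix meant here, assuming the activator installs the singleton through an MSC ServiceBuilder via SingletonService.build(ServiceTarget) as the quickstart's activator does; the deployment and bean names in the ServiceName are assumptions and must match the real deployment:

{code}
// Illustrative only, not the actual quickstart code: make the singleton
// service depend on the EJB component START service so MSC does not start
// HATimerService (and its JNDI lookup) before the bean is available.
// The deployment/bean names below are assumptions.
import org.jboss.as.clustering.singleton.SingletonService;
import org.jboss.msc.service.ServiceActivatorContext;
import org.jboss.msc.service.ServiceController;
import org.jboss.msc.service.ServiceName;

final class ActivatorFixSketch {

    void install(ServiceActivatorContext context, SingletonService<String> singleton) {
        ServiceName ejbStart = ServiceName.of("jboss", "deployment", "unit",
                "jboss-cluster-ha-singleton-service.jar", "component",
                "SchedulerEightBean", "START");

        singleton.build(context.getServiceTarget())
                .addDependency(ejbStart) // removes the race against the EJB start
                .setInitialMode(ServiceController.Mode.ACTIVE)
                .install();
    }
}
{code}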

I don't really know how quickstarts are handled, so I'm not sure whether I should open a new BZ (quickstarts + docs?).

Comment 14 Miroslav Sochurek 2016-06-06 10:11:07 UTC
The PR is not merged yet and there is still an ongoing technical discussion about how to fix the problem; it may not get into 6.4.9 at all. Changing devel_ack to '?'; I will set it back to '+' once this gets merged into the 6.4.x branch.

https://github.com/jbossas/jboss-eap/pull/2762

Comment 17 Jiří Bílek 2016-06-22 14:12:31 UTC
Verified with EAP 6.4.9.CP.CR1

Comment 18 Petr Penicka 2017-01-17 12:54:37 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.

