927615 – Failed to recover cluster state after the current node became the coordinator: IllegalStateException

Bug 927615 - Failed to recover cluster state after the current node became the coordinator: IllegalStateException

Summary: Failed to recover cluster state after the current node became the coordinator...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Enterprise Application Platform 6
Classification:	JBoss
Component:	Clustering
Sub Component:
Version:	6.1.1
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	GA
Target Release:	EAP 6.4.4
Assignee:	Paul Ferraro
QA Contact:	Jitka Kozana
Docs Contact:	Russell Dickenson
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-03-26 10:44 UTC by Ladislav Thon
Modified:	2019-09-12 07:43 UTC (History)
CC List:	11 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-12-07 13:37:30 UTC
Type:	Bug
Embargoed:

Attachments	(Terms of Use)
6.3.0.DR2_ISPN000196_NPE (19.65 KB, text/plain) 2014-03-06 10:45 UTC, Michal Karm Babacek	no flags	Details
ISPN000196: Failed to recover cluster state after the current node became the coordinator: java.lang.NullPointerException (26.07 KB, text/plain) 2015-12-04 09:19 UTC, Michal Karm Babacek	no flags	Details
Show Obsolete (1) View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	ISPN-2935	0	Critical	Closed	After view change single node is left in cluster, but still tries to invoke stuff remotely	2020-03-09 02:11:18 UTC
Red Hat Issue Tracker	ISPN-3395	0	Major	Closed	ISPN000196: Failed to recover cluster state after the current node became the coordinator	2020-03-09 02:11:18 UTC

Description Ladislav Thon 2013-03-26 10:44:18 UTC

During EAP 6.1.0.ER3 testing, I've seen the following set of exceptions in server logs. The job is https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-netDown-repl-async/7/

On perf18:

09:46:01,772 INFO  [org.jboss.as.clustering] (Incoming-8,shared=udp) JBAS010225: New cluster view for partition ejb (id: 5, delta: 1, merge: true) : [perf18/ejb, perf21/ejb, perf19/ejb]
09:46:01,773 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-8,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[perf18/ejb|5] [perf18/ejb, perf21/ejb, perf19/ejb], subgroups=[perf21/ejb|3] [perf21/ejb, perf20/ejb], [perf19/ejb|0] [perf19/ejb], [perf18/ejb|4] [perf18/ejb, perf20/ejb]
09:46:01,817 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] (notification-thread-0) ISPN000196: Failed to recover cluster state after the current node became the coordinator: java.lang.IllegalStateException: Trying to set a topology with invalid members for cache org.jboss.test.clusterbench.ejb.stateful.RemoteStatefulSBImpl: members = [perf18/ejb, perf21/ejb, perf19/ejb], topology = CacheTopology{id=10, currentCH=ReplicatedConsistentHash{members=[perf18/ejb, perf20/ejb, perf21/ejb, perf19/ejb]}, pendingCH=null}
	at org.infinispan.topology.ClusterCacheStatus.updateCacheTopology(ClusterCacheStatus.java:165)
	at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheStatusAfterMerge(ClusterTopologyManagerImpl.java:315)
	at org.infinispan.topology.ClusterTopologyManagerImpl.handleNewView(ClusterTopologyManagerImpl.java:236)
	at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.handleViewChange(ClusterTopologyManagerImpl.java:579)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.6.0_43]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [rt.jar:1.6.0_43]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [rt.jar:1.6.0_43]
	at java.lang.reflect.Method.invoke(Method.java:597) [rt.jar:1.6.0_43]
	at org.infinispan.notifications.AbstractListenerImpl$ListenerInvocation$1.run(AbstractListenerImpl.java:212)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_43]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_43]
	at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_43]

At the same time on perf19:

09:46:01,772 ERROR [org.jboss.as.clustering] (Incoming-20,shared=udp) JBAS010223: ViewAccepted failed: java.lang.IllegalStateException: JBAS010240: Address 36d32c21-a759-5f85-76e6-1fe25f038bc6 not registered in transport layer
	at org.jboss.as.clustering.impl.CoreGroupCommunicationService$ClusterNodeFactoryImpl.getClusterNode(CoreGroupCommunicationService.java:1350)
	at org.jboss.as.clustering.impl.CoreGroupCommunicationService.translateAddresses(CoreGroupCommunicationService.java:1010)
	at org.jboss.as.clustering.impl.CoreGroupCommunicationService$GroupView.<init>(CoreGroupCommunicationService.java:1109)
	at org.jboss.as.clustering.impl.CoreGroupCommunicationService.processViewChange(CoreGroupCommunicationService.java:945)
	at org.jboss.as.clustering.impl.CoreGroupCommunicationService$MembershipListenerImpl.viewAccepted(CoreGroupCommunicationService.java:1403)
	at org.jgroups.blocks.MessageDispatcher.handleUpEvent(MessageDispatcher.java:523)
	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:600)
	at org.jgroups.blocks.mux.MuxUpHandler.passToAllHandlers(MuxUpHandler.java:156)
	at org.jgroups.blocks.mux.MuxUpHandler.up(MuxUpHandler.java:123)
	at org.jgroups.JChannel.up(JChannel.java:707)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1020)
	at org.jgroups.protocols.RSVP.up(RSVP.java:188)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
	at org.jgroups.protocols.pbcast.GMS.installView(GMS.java:638)
	at org.jgroups.protocols.pbcast.CoordGmsImpl.handleViewChange(CoordGmsImpl.java:251)
	at org.jgroups.protocols.pbcast.GMS.up(GMS.java:799)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:245)
	at org.jgroups.protocols.UNICAST2.up(UNICAST2.java:453)
	at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:763)
	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:574)
	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:143)
	at org.jgroups.protocols.FD.up(FD.java:253)
	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
	at org.jgroups.protocols.MERGE3.up(MERGE3.java:290)
	at org.jgroups.protocols.Discovery.up(Discovery.java:359)
	at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2616)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1263)
	at org.jgroups.protocols.TP$4.run(TP.java:1181)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) [rt.jar:1.6.0_43]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) [rt.jar:1.6.0_43]
	at java.lang.Thread.run(Thread.java:662) [rt.jar:1.6.0_43]

At the same time, perf20 is separated from the cluster by network failure.

At the same time on perf21:

09:46:01,781 INFO  [org.jboss.as.clustering] (Incoming-19,shared=udp) JBAS010226: New cluster view for partition ejb: 5 (org.jboss.as.clustering.impl.CoreGroupCommunicationService$GroupView@4e21ce1 delta: 1, merge: true)
09:46:01,784 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-19,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[perf18/ejb|5] [perf18/ejb, perf21/ejb, perf19/ejb], subgroups=[perf21/ejb|3] [perf21/ejb, perf20/ejb], [perf19/ejb|0] [perf19/ejb], [perf18/ejb|4] [perf18/ejb, perf20/ejb]

Searching for keywords, bug 924220 and https://issues.jboss.org/browse/ISPN-2752 come up, but I don't really think they are related.

Comment 1 Ladislav Thon 2013-03-26 12:22:43 UTC

And also seen in https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-netDown-repl-sync/7/console-perf19/ (only the first exception, the second doesn't seem to appear anywhere).

Comment 2 Ladislav Thon 2013-05-15 09:26:14 UTC

Still seeing this with EAP 6.1.0.ER8. For example:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-netDown-repl-sync/11/artifact/report/config/jboss-perf21/server.log
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-http-session-netDown-repl-async/11/artifact/report/config/jboss-perf21/server.log

Comment 3 Jitka Kozana 2013-08-27 08:57:34 UTC

The error

02:31:14,431 ERROR [org.jboss.as.clustering] (Incoming-12,shared=tcp) JBAS010223: ViewAccepted failed: java.lang.IllegalStateException: JBAS010240: Address ec1de289-4782-a45c-cb61-d662009b71e7 not registered in transport layer

was seen during 6.1.1.ER3 testing here:

https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-6x-failover-ejb-ejbremote-shutdown-dist-sync-tcpStack/17/artifact/report/config/jboss-perf20/server.log

however without the first exception (Failed to recover cluster state after the current node became the coordinator: ...).

Comment 4 Paul Ferraro 2013-12-12 15:04:38 UTC

Not seen in 6.1.1.  Please retest against 6.3.0.GA.

Comment 5 Richard Janík 2014-03-04 14:07:25 UTC

I was instructed to retest against 6.3.0.DR1. Haven't seen it there so I consider it verified.

Comment 6 Michal Karm Babacek 2014-03-06 10:44:28 UTC

6.3.0.DR2

I've got a lovely

{noformat}
ISPN000196: Failed to recover cluster state after the current node became the coordinator: java.lang.NullPointerException
{noformat}

with a dead-simple scenario, start, deploy, undeploy, shutdown. Two nodes, <distributable/> app.

I'm attaching the log.

Comment 7 Michal Karm Babacek 2014-03-06 10:45:42 UTC

Created attachment 871331 [details]
6.3.0.DR2_ISPN000196_NPE

Comment 8 Michal Karm Babacek 2014-03-06 10:47:56 UTC

Added the upstream counterpart.

Comment 9 Michal Karm Babacek 2014-03-06 10:58:28 UTC


I'm sorry, the extend of my attention disorder can be hardly exaggerated.
I accidentally swapped logs with the previous test run :-)

Comment 10 Michal Karm Babacek 2014-03-06 11:00:10 UTC

Comment on attachment 871331 [details]
6.3.0.DR2_ISPN000196_NPE

The [attachment 871331 [details]] is invalid. Ignore it.

Comment 11 Scott Mumford 2014-05-05 01:35:03 UTC

Paul, what version was this actually fixed in? Your comment (4) seem to suggest it's been fixed since 6.1.1.

Does it need a release note for 6.3.0? If so, can you please supply an outline of the situation using the prompts in the Doc Text field above.

Cheers

Comment 12 Paul Ferraro 2014-05-14 15:26:51 UTC

This does not need a release note since it was fixed in a previous release cycle, even though it was only verified this cycle.

Comment 13 JBoss JIRA Server 2014-12-03 10:55:35 UTC

Tristan Tarrant <ttarrant> updated the status of jira ISPN-3395 to Resolved

Comment 14 dereed 2015-02-05 00:12:56 UTC

The original issue in this BZ was fixed as part of ISPN-2935.

Comment 15 Michal Karm Babacek 2015-12-04 09:19:44 UTC

Created attachment 1102190 [details]
ISPN000196: Failed to recover cluster state after the current node became the coordinator: java.lang.NullPointerException

Comment 16 Michal Karm Babacek 2015-12-04 09:21:38 UTC

I hit the issue again with JBoss EAP 6.4.4.GA during a simple mod_cluster failover scenario on 3 nodes.

Comment 17 Ladislav Thon 2015-12-04 10:49:17 UTC

I believe that what mbabacek is seeing in comment 15 is in fact bug 1283465.

Comment 18 Michal Karm Babacek 2015-12-04 11:51:14 UTC

@LaDiSlAv, hmm, yes, it appears to be the case.

Comment 19 Min Woo Park 2015-12-07 04:31:03 UTC

When will this bug be fixed in EAP 6.3.2 GA ?

Comment 20 Ivo Studensky 2015-12-07 13:37:30 UTC

The issue is tracked by bug 1283465.

Note You need to log in before you can comment on or make changes to this bug.