Description of problem: EJB remote invocations from a client server (not sure whether it is the same if the client is a standalone application) to a clustered server with the EJB via remote outbound connection are not updated with the cluster-view after a node is suspected by a network or GC issue. It seems that this is related to the following facts: - The leaving node is not listed in the initial connections (r-o-c config) - The node is suspected - because of GC (or suspend) so it is not working Here the issue is seen more often - because of network disconnect Here the behaviour can be different because all nodes update the internal state - Cluster heal until the client is still alive (here the case as it is a server which act as ejb-client) The following issues can be seen: - Invocation made during split reach the unavailable node hung - JGroups show message that a node is suspected and a new view applied - after suspecting nodeX client-server get an update "nodeX removed from cluster-view" - JGroups log messages with the new merged cluster view - after re-join nodeX the client-server get an update "node removed from cluster-view" where the "node" here is the still connected node not the one which left! - in this case the client might restart the initial node detection - No client message for a view after cluster (JGroups) is correct again - in this case the client continue with only one node How reproducible: A reproducer is based on the ejb-multi-server quickstart here git:wfink/jboss-eap-quickstarts.git branch 6.4.x-ClusteredEJBTest-failoverForServer2Server* Steps to Reproduce: 1. build the QS 2. install 3 servers EAP6.4.x 3. remove the security-realm from remoting subsystem <connector> (standalone-ha.xml) start two as a cluster with standalone-ha.xml config deploy the EAR app-one/ear/target/jboss-ejb-multi-server-app-one.ear 4. Add an outbound connector to standalone.xml for the "client" <outbound-connections> <remote-outbound-connection name="remote-ejb-connection" outbound-socket-binding-ref="remote-ejb-1"> <properties> <property name="SASL_POLICY_NOANONYMOUS" value="false"/> <property name="SSL_ENABLED" value="false"/> <property name="org.jboss.remoting3.RemotingOptions.HEARTBEAT_INTERVAL" value="15000"/> </properties> </remote-outbound-connection> </outbound-connections> .... <outbound-socket-binding name="remote-ejb-1"> <remote-destination host="node1" port="4547"/> </outbound-socket-binding> 5. start the client-server with default 4447 port 6. use QS client/ mvn exec:java to show LB work 7. disconnect or suspend node2 (not in r-o-c) 8. see cluster message for split 9. maybe invoke client again to see only node1 respond and invocation which go to node2 might hung 10. reconnect node2 11. see cluster messages for new view 12. invoke client again to see whether LB is back to work Actual results: As mentioned above the client will not receive correct updates and might continue with a wrong LB Expected results: Client should receive correct updates and use all available cluster nodes for LB
Within the Clustering subsystem, the RegistryService is responsible for detecting changes in group membership and merges and calling listeners, broadly speaking. The RegistryService depends on callbacks from the ISPN via the CacheManagerListener interface, in order to keep its view of cluster state up to date. The CacheManagerListener interface supports four annotations: @ViewChanged, @Merged, @CacheStopped and @CacheStarted. In particular, when a JGroups view change occurs, the ISPN component JGroupsTransport calls viewAccepted which handles two cases: notifying that a view change has occurred, or notifying that a merge has occurred. Components that make use of the @ViewChanged annotation get NotifyViewChanged callbacks from CacheManagerNotifier; components that make use of the @Merged annotation get NotifyMerge callbacks from the CacheManagerNotifier. The problem here seems to be that the RegistryService was registering for @ViewChanged callbacks but not @Merged callbacks. So view change events were being handled properly but not merges.
For example, here is a trace of two nodes, node1+ and node2+ before adding the @Merged annotation: node1+ ------ 14:03:24,665 INFO [org.jboss.as.server] (management-handler-thread - 1) JBAS015859: Deployed "jboss-ejb-multi-server-app-one.ear" (runtime-name : "jboss-ejb-multi-server-app-one.ear") 14:03:34,986 INFO [org.jboss.as.clustering] (Incoming-1,shared=udp) JBAS010225: New cluster view for partition ejb (id: 1, delta: 1, merge: false) : [node1+/ejb, node2+/ejb] 14:03:34,987 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-1,shared=udp) ISPN000094: Received new cluster view: [node1+/ejb|1] [node1+/ejb, node2+/ejb] 14:03:34,988 INFO [stdout] (Incoming-1,shared=udp) ISPN: notify view change called!!! 14:03:34,988 INFO [stdout] (Incoming-1,shared=udp) ISPN: notify view change listener called: org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange 14:03:34,988 INFO [stdout] (Incoming-1,shared=udp) ISPN: notify view change listener called: org.jboss.as.clustering.registry.RegistryService, method = viewChanged 14:03:34,989 INFO [stdout] (notification-thread-0) HEY - view is changing - merge = false 14:03:35,210 DEBUG [org.jboss.as.ejb3.remote.LocalEjbReceiver] (notification-thread-0) Skipping cluster node additions to EJB client context org.jboss.ejb.client.EJBClientContext@21e8b3fc since it can only handle local node --- (node2+ is suspended) 14:06:00,502 INFO [org.jboss.as.clustering] (VERIFY_SUSPECT.TimerThread,ejb,node1+/ejb) JBAS010232: Suspected member: node2+/ejb 14:06:00,899 INFO [org.jboss.as.clustering] (Incoming-5,shared=udp) JBAS010225: New cluster view for partition ejb (id: 2, delta: -1, merge: false) : [node1+/ejb] 14:06:00,901 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-5,shared=udp) ISPN000094: Received new cluster view: [node1+/ejb|2] [node1+/ejb] 14:06:00,902 INFO [stdout] (Incoming-5,shared=udp) ISPN: notify view change called!!! 14:06:00,903 INFO [stdout] (Incoming-5,shared=udp) ISPN: notify view change listener called: org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange 14:06:00,904 INFO [stdout] (Incoming-5,shared=udp) ISPN: notify view change listener called: org.jboss.as.clustering.registry.RegistryService, method = viewChanged 14:06:00,904 INFO [stdout] (notification-thread-0) HEY - view is changing - merge = false --- (node2+ is resumed) 14:08:40,906 INFO [org.jboss.as.clustering] (Incoming-12,shared=udp) JBAS010226: New cluster view for partition ejb: 3 (org.jboss.as.clustering.impl.CoreGroupCommunicationService$GroupView@15c3fe9c delta: 1, merge: true) 14:08:40,907 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-12,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[node2+/ejb|3] [node2+/ejb, node1+/ejb], subgroups=[node1+/ejb|1] [node2+/ejb], [node1+/ejb|2] [node1+/ejb] 14:08:40,907 INFO [stdout] (Incoming-12,shared=udp) ISPN: notify merge called!!! 14:08:40,907 INFO [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange node2+ ------- 14:03:35,217 DEBUG [org.jboss.as.ejb3.remote.LocalEjbReceiver] (MSC service thread 1-1) Skipping cluster node additions to EJB client context org.jboss.ejb.client.EJBClientContext@5ce4ad3d since it can only handle local node 14:03:35,222 DEBUG [org.jboss.as.ejb3] (MSC service thread 1-4) Using pool config StrictMaxPoolConfig{name=slsb-strict-max-pool, maxPoolSize=20, timeoutUnit=MINUTES, timeout=5} to create pool for Stateless EJB AppOneBean 14:03:35,288 INFO [org.jboss.as.server] (management-handler-thread - 1) JBAS015859: Deployed "jboss-ejb-multi-server-app-one.ear" (runtime-name : "jboss-ejb-multi-server-app-one.ear") --- (node2+ is suspended) ^Z [1]+ Stopped ./standalone.sh -c standalone-ha.xml -Djboss.node.name=node2+ -Djboss.socket.binding.port-offset=200 --- (node2+ is resumed) [nrla@lenovo bin]$ fg ./standalone.sh -c standalone-ha.xml -Djboss.node.name=node2+ -Djboss.socket.binding.port-offset=200 14:06:40,057 WARN [org.jgroups.protocols.FD] (OOB-19,shared=udp) I was suspected by node1+/ejb; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK 14:06:40,057 WARN [org.jgroups.protocols.pbcast.GMS] (Incoming-5,shared=udp) node2+/ejb: not member of view [node1+/ejb|2]; discarding it 14:08:40,908 INFO [org.jboss.as.clustering] (Incoming-11,shared=udp) JBAS010225: New cluster view for partition ejb (id: 3, delta: 0, merge: true) : [node2+/ejb, node1+/ejb] 14:08:40,908 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-11,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[node2+/ejb|3] [node2+/ejb, node1+/ejb], subgroups=[node1+/ejb|1] [node2+/ejb], [node1+/ejb|2] [node1+/ejb] 14:08:40,909 INFO [stdout] (Incoming-11,shared=udp) ISPN: notify merge called!!! 14:08:40,909 INFO [stdout] (Incoming-11,shared=udp) ISPN: notify merge listener called: class = org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange You can see that only one listener is registered for merge events and that "HEY -view is changing" is not written by the RegistryService when the merge occurs. Here is the same example after adding a @Merged annotation: node1+ ------ 14:32:45,031 INFO [org.jboss.as.clustering] (VERIFY_SUSPECT.TimerThread,ejb,node1+/ejb) JBAS010232: Suspected member: node2+/ejb 14:32:45,479 INFO [org.jboss.as.clustering] (Incoming-4,shared=udp) JBAS010225: New cluster view for partition ejb (id: 2, delta: -1, merge: false) : [node1+/ejb] 14:32:45,480 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-4,shared=udp) ISPN000094: Received new cluster view: [node1+/ejb|2] [node1+/ejb] 14:32:45,480 INFO [stdout] (Incoming-4,shared=udp) ISPN: notify view change called!!! 14:32:45,480 INFO [stdout] (Incoming-4,shared=udp) ISPN: notify view change listener called: org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange 14:32:45,481 INFO [stdout] (Incoming-4,shared=udp) ISPN: notify view change listener called: org.jboss.as.clustering.registry.RegistryService, method = viewChanged 14:32:45,519 INFO [stdout] (notification-thread-0) HEY - view is changing - merge = false 14:35:25,494 INFO [org.jboss.as.clustering] (Incoming-12,shared=udp) JBAS010225: New cluster view for partition ejb (id: 3, delta: 1, merge: true) : [node1+/ejb, node2+/ejb] 14:35:25,495 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-12,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[node1+/ejb|3] [node1+/ejb, node2+/ejb], subgroups=[node1+/ejb|2] [node1+/ejb], [node2+/ejb|1] [node2+/ejb] 14:35:25,496 INFO [stdout] (Incoming-12,shared=udp) ISPN: notify merge called!!! 14:35:25,497 INFO [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange 14:35:25,497 INFO [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.jboss.as.clustering.registry.RegistryService, method = viewChanged 14:35:25,498 INFO [stdout] (notification-thread-0) HEY - view is changing - merge = true node2+ ------ ^Z [1]+ Stopped ./standalone.sh -c standalone-ha.xml -Djboss.node.name=node2+ -Djboss.socket.binding.port-offset=200 [nrla@lenovo bin]$ fg ./standalone.sh -c standalone-ha.xml -Djboss.node.name=node2+ -Djboss.socket.binding.port-offset=200 14:32:53,744 WARN [org.jgroups.protocols.FD] (OOB-7,shared=udp) I was suspected by node1+/ejb; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK 14:32:53,747 WARN [org.jgroups.protocols.pbcast.GMS] (Incoming-3,shared=udp) node2+/ejb: not member of view [node1+/ejb|2]; discarding it 14:35:25,495 INFO [org.jboss.as.clustering] (Incoming-12,shared=udp) JBAS010226: New cluster view for partition ejb: 3 (org.jboss.as.clustering.impl.CoreGroupCommunicationService$GroupView@6de84f42 delta: 0, merge: true) 14:35:25,496 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-12,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[node1+/ejb|3] [node1+/ejb, node2+/ejb], subgroups=[node1+/ejb|2] [node1+/ejb], [node2+/ejb|1] [node2+/ejb] 14:35:25,497 INFO [stdout] (Incoming-12,shared=udp) ISPN: notify merge called!!! 14:35:25,497 INFO [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange 14:35:25,498 INFO [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.jboss.as.clustering.registry.RegistryService, method = viewChanged 14:35:25,498 INFO [stdout] (notification-thread-0) HEY - view is changing - merge = true
I've done a quick manual test of the issue with the fix applied (the fix was to add a @Merged annotation to the RegistryService.viewChanged() method and the initial problem seems to have been fixed. When the cluster is suspended and then resumed, the node1+ sends out a topology update to the server client, which then updates its ClusterContext and later invocations on the server-client are again directed to node2+. I was expecting to see node2+ also send a topology update back to the server-client, but don't see any evidence of that. Need to check out why not.
The previous comment should read "When node2+ is suspended and then resumed ...".
Do we need an upstream JIRA for this? The issue does not exist (or has been fixed, depending on how you want to look at it) in WF10.
If it doesn't affect upstream, there's no upstream JIRA. (unless there's an existing JIRA from when it was changed that could be linked)
Here is the PR for 6.4.4: https://github.com/jbossas/jboss-eap/pull/2661
There was no fix required in upstream as the implementation had changed significantly since 6.3.3. and Paul had added the @Merged annotation there.
Verified with EAP 6.4.7.CP.CR2.
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.