Bug 1290848

Summary: [GSS](6.4.z) EJB client loadbalancing will not work correct after a node is dropped from the cluster because of network or GC issues
Product: [JBoss] JBoss Enterprise Application Platform 6 Reporter: wfink
Component: EJBAssignee: Richard Achmatowicz <rachmato>
Status: CLOSED CURRENTRELEASE QA Contact: Jan Martiska <jmartisk>
Severity: unspecified Docs Contact:
Priority: medium    
Version: 6.3.3, 6.4.0, 6.4.1, 6.4.2, 6.4.3, 6.4.4, 6.4.5CC: bmaxwell, david.lloyd, dereed, dpospisi, jbilek, jochen.riedlinger, jtruhlar, rachmato
Target Milestone: CR1   
Target Release: EAP 6.4.7   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-17 11:55:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1279552    

Description wfink 2015-12-11 16:24:27 UTC
Description of problem:
EJB remote invocations from a client server (not sure whether it is the same if the client is a standalone application) to a clustered server with the EJB via remote outbound connection are not updated with the cluster-view after a node is suspected by a network or GC issue.

It seems that this is related to the following facts:
- The leaving node is not listed in the initial connections (r-o-c config)
- The node is suspected
  - because of GC (or suspend) so it is not working
    Here the issue is seen more often
  - because of network disconnect
    Here the behaviour can be different because all nodes update the internal state
- Cluster heal until the client is still alive (here the case as it is a server which act as ejb-client)


The following issues can be seen:
- Invocation made during split reach the unavailable node hung
- JGroups show message that a node is suspected and a new view applied
- after suspecting nodeX client-server get an update "nodeX removed from cluster-view"
- JGroups log messages with the new merged cluster view
- after re-join nodeX the client-server get an update "node removed from cluster-view" where the "node" here is the still connected node not the one which left!
  - in this case the client might restart the initial node detection
- No client message for a view after cluster (JGroups) is correct again
  - in this case the client continue with only one node


How reproducible:
A reproducer is based on the ejb-multi-server quickstart here
   git:wfink/jboss-eap-quickstarts.git
branch
   6.4.x-ClusteredEJBTest-failoverForServer2Server*

Steps to Reproduce:
1. build the QS
2. install 3 servers EAP6.4.x
3. remove the security-realm from remoting subsystem <connector> (standalone-ha.xml)
   start two as a cluster with standalone-ha.xml config
   deploy the EAR  app-one/ear/target/jboss-ejb-multi-server-app-one.ear
4. Add an outbound connector to standalone.xml for the "client"

           <outbound-connections>
                <remote-outbound-connection name="remote-ejb-connection" outbound-socket-binding-ref="remote-ejb-1">
                    <properties>
                        <property name="SASL_POLICY_NOANONYMOUS" value="false"/>
                        <property name="SSL_ENABLED" value="false"/>
                        <property name="org.jboss.remoting3.RemotingOptions.HEARTBEAT_INTERVAL" value="15000"/>
                    </properties>
                </remote-outbound-connection>
            </outbound-connections>
    ....
        <outbound-socket-binding name="remote-ejb-1">
            <remote-destination host="node1" port="4547"/>
        </outbound-socket-binding>

5. start the client-server with default 4447 port
6. use QS client/ mvn exec:java to show LB work
7. disconnect or suspend node2 (not in r-o-c)
8. see cluster message for split
9. maybe invoke client again to see only node1 respond 
   and invocation which go to node2 might hung
10. reconnect node2
11. see cluster messages for new view
12. invoke client again to see whether LB is back to work


Actual results:
As mentioned above the client will not receive correct updates and might continue with a wrong LB

Expected results:
Client should receive correct updates and use all available cluster nodes for LB

Comment 1 Richard Achmatowicz 2015-12-14 20:06:19 UTC
Within the Clustering subsystem, the RegistryService is responsible for detecting changes in group membership and merges and calling listeners, broadly speaking.

The RegistryService depends on callbacks from the ISPN via the CacheManagerListener interface, in order to keep its view of cluster state up to date. The CacheManagerListener interface supports four annotations: @ViewChanged, @Merged, @CacheStopped and @CacheStarted. 

In particular, when a JGroups view change occurs, the ISPN component JGroupsTransport calls viewAccepted which handles two cases: notifying that a view change has occurred, or notifying that a merge has occurred. Components that make use of the @ViewChanged annotation get NotifyViewChanged callbacks from CacheManagerNotifier; components that make use of the @Merged annotation get NotifyMerge callbacks from the CacheManagerNotifier. 

The problem here seems to be that the RegistryService was registering for @ViewChanged callbacks but not @Merged callbacks. So view change events were being handled properly but not merges.

Comment 2 Richard Achmatowicz 2015-12-14 20:09:21 UTC
For example, here is a trace of two nodes, node1+ and node2+ before adding the @Merged annotation:

node1+
------
14:03:24,665 INFO  [org.jboss.as.server] (management-handler-thread - 1) JBAS015859: Deployed "jboss-ejb-multi-server-app-one.ear" (runtime-name : "jboss-ejb-multi-server-app-one.ear")
14:03:34,986 INFO  [org.jboss.as.clustering] (Incoming-1,shared=udp) JBAS010225: New cluster view for partition ejb (id: 1, delta: 1, merge: false) : [node1+/ejb, node2+/ejb]
14:03:34,987 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-1,shared=udp) ISPN000094: Received new cluster view: [node1+/ejb|1] [node1+/ejb, node2+/ejb]
14:03:34,988 INFO  [stdout] (Incoming-1,shared=udp) ISPN: notify view change called!!!
14:03:34,988 INFO  [stdout] (Incoming-1,shared=udp) ISPN: notify view change listener called: org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange
14:03:34,988 INFO  [stdout] (Incoming-1,shared=udp) ISPN: notify view change listener called: org.jboss.as.clustering.registry.RegistryService, method = viewChanged
14:03:34,989 INFO  [stdout] (notification-thread-0) HEY - view is changing - merge = false
14:03:35,210 DEBUG [org.jboss.as.ejb3.remote.LocalEjbReceiver] (notification-thread-0) Skipping cluster node additions to EJB client context org.jboss.ejb.client.EJBClientContext@21e8b3fc since it can only handle local node
--- (node2+ is suspended)
14:06:00,502 INFO  [org.jboss.as.clustering] (VERIFY_SUSPECT.TimerThread,ejb,node1+/ejb) JBAS010232: Suspected member: node2+/ejb
14:06:00,899 INFO  [org.jboss.as.clustering] (Incoming-5,shared=udp) JBAS010225: New cluster view for partition ejb (id: 2, delta: -1, merge: false) : [node1+/ejb]
14:06:00,901 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-5,shared=udp) ISPN000094: Received new cluster view: [node1+/ejb|2] [node1+/ejb]
14:06:00,902 INFO  [stdout] (Incoming-5,shared=udp) ISPN: notify view change called!!!
14:06:00,903 INFO  [stdout] (Incoming-5,shared=udp) ISPN: notify view change listener called: org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange
14:06:00,904 INFO  [stdout] (Incoming-5,shared=udp) ISPN: notify view change listener called: org.jboss.as.clustering.registry.RegistryService, method = viewChanged
14:06:00,904 INFO  [stdout] (notification-thread-0) HEY - view is changing - merge = false
--- (node2+ is resumed)
14:08:40,906 INFO  [org.jboss.as.clustering] (Incoming-12,shared=udp) JBAS010226: New cluster view for partition ejb: 3 (org.jboss.as.clustering.impl.CoreGroupCommunicationService$GroupView@15c3fe9c delta: 1, merge: true)
14:08:40,907 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-12,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[node2+/ejb|3] [node2+/ejb, node1+/ejb], subgroups=[node1+/ejb|1] [node2+/ejb], [node1+/ejb|2] [node1+/ejb]
14:08:40,907 INFO  [stdout] (Incoming-12,shared=udp) ISPN: notify merge called!!!
14:08:40,907 INFO  [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange


node2+
-------
14:03:35,217 DEBUG [org.jboss.as.ejb3.remote.LocalEjbReceiver] (MSC service thread 1-1) Skipping cluster node additions to EJB client context org.jboss.ejb.client.EJBClientContext@5ce4ad3d since it can only handle local node
14:03:35,222 DEBUG [org.jboss.as.ejb3] (MSC service thread 1-4) Using pool config StrictMaxPoolConfig{name=slsb-strict-max-pool, maxPoolSize=20, timeoutUnit=MINUTES, timeout=5} to create pool for Stateless EJB AppOneBean
14:03:35,288 INFO  [org.jboss.as.server] (management-handler-thread - 1) JBAS015859: Deployed "jboss-ejb-multi-server-app-one.ear" (runtime-name : "jboss-ejb-multi-server-app-one.ear")
--- (node2+ is suspended)
^Z
[1]+  Stopped                 ./standalone.sh -c standalone-ha.xml -Djboss.node.name=node2+ -Djboss.socket.binding.port-offset=200
--- (node2+ is resumed)
[nrla@lenovo bin]$ fg
./standalone.sh -c standalone-ha.xml -Djboss.node.name=node2+ -Djboss.socket.binding.port-offset=200
14:06:40,057 WARN  [org.jgroups.protocols.FD] (OOB-19,shared=udp) I was suspected by node1+/ejb; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
14:06:40,057 WARN  [org.jgroups.protocols.pbcast.GMS] (Incoming-5,shared=udp) node2+/ejb: not member of view [node1+/ejb|2]; discarding it
14:08:40,908 INFO  [org.jboss.as.clustering] (Incoming-11,shared=udp) JBAS010225: New cluster view for partition ejb (id: 3, delta: 0, merge: true) : [node2+/ejb, node1+/ejb]
14:08:40,908 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-11,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[node2+/ejb|3] [node2+/ejb, node1+/ejb], subgroups=[node1+/ejb|1] [node2+/ejb], [node1+/ejb|2] [node1+/ejb]
14:08:40,909 INFO  [stdout] (Incoming-11,shared=udp) ISPN: notify merge called!!!
14:08:40,909 INFO  [stdout] (Incoming-11,shared=udp) ISPN: notify merge listener called: class = org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange

You can see that only one listener is registered for merge events and that "HEY -view is changing" is not written by the RegistryService when the merge occurs.

Here is the same example after adding a @Merged annotation:

node1+
------
14:32:45,031 INFO  [org.jboss.as.clustering] (VERIFY_SUSPECT.TimerThread,ejb,node1+/ejb) JBAS010232: Suspected member: node2+/ejb
14:32:45,479 INFO  [org.jboss.as.clustering] (Incoming-4,shared=udp) JBAS010225: New cluster view for partition ejb (id: 2, delta: -1, merge: false) : [node1+/ejb]
14:32:45,480 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-4,shared=udp) ISPN000094: Received new cluster view: [node1+/ejb|2] [node1+/ejb]
14:32:45,480 INFO  [stdout] (Incoming-4,shared=udp) ISPN: notify view change called!!!
14:32:45,480 INFO  [stdout] (Incoming-4,shared=udp) ISPN: notify view change listener called: org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange
14:32:45,481 INFO  [stdout] (Incoming-4,shared=udp) ISPN: notify view change listener called: org.jboss.as.clustering.registry.RegistryService, method = viewChanged
14:32:45,519 INFO  [stdout] (notification-thread-0) HEY - view is changing - merge = false
14:35:25,494 INFO  [org.jboss.as.clustering] (Incoming-12,shared=udp) JBAS010225: New cluster view for partition ejb (id: 3, delta: 1, merge: true) : [node1+/ejb, node2+/ejb]
14:35:25,495 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-12,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[node1+/ejb|3] [node1+/ejb, node2+/ejb], subgroups=[node1+/ejb|2] [node1+/ejb], [node2+/ejb|1] [node2+/ejb]
14:35:25,496 INFO  [stdout] (Incoming-12,shared=udp) ISPN: notify merge called!!!
14:35:25,497 INFO  [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange
14:35:25,497 INFO  [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.jboss.as.clustering.registry.RegistryService, method = viewChanged
14:35:25,498 INFO  [stdout] (notification-thread-0) HEY - view is changing - merge = true

node2+
------
^Z
[1]+  Stopped                 ./standalone.sh -c standalone-ha.xml -Djboss.node.name=node2+ -Djboss.socket.binding.port-offset=200
[nrla@lenovo bin]$ fg
./standalone.sh -c standalone-ha.xml -Djboss.node.name=node2+ -Djboss.socket.binding.port-offset=200
14:32:53,744 WARN  [org.jgroups.protocols.FD] (OOB-7,shared=udp) I was suspected by node1+/ejb; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
14:32:53,747 WARN  [org.jgroups.protocols.pbcast.GMS] (Incoming-3,shared=udp) node2+/ejb: not member of view [node1+/ejb|2]; discarding it
14:35:25,495 INFO  [org.jboss.as.clustering] (Incoming-12,shared=udp) JBAS010226: New cluster view for partition ejb: 3 (org.jboss.as.clustering.impl.CoreGroupCommunicationService$GroupView@6de84f42 delta: 0, merge: true)
14:35:25,496 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-12,shared=udp) ISPN000093: Received new, MERGED cluster view: MergeView::[node1+/ejb|3] [node1+/ejb, node2+/ejb], subgroups=[node1+/ejb|2] [node1+/ejb], [node2+/ejb|1] [node2+/ejb]
14:35:25,497 INFO  [stdout] (Incoming-12,shared=udp) ISPN: notify merge called!!!
14:35:25,497 INFO  [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener, method = handleViewChange
14:35:25,498 INFO  [stdout] (Incoming-12,shared=udp) ISPN: notify merge listener called: class = org.jboss.as.clustering.registry.RegistryService, method = viewChanged
14:35:25,498 INFO  [stdout] (notification-thread-0) HEY - view is changing - merge = true

Comment 3 Richard Achmatowicz 2015-12-14 21:04:31 UTC
I've done a quick manual test of the issue with the fix applied (the fix was to add a @Merged annotation to the RegistryService.viewChanged() method and the initial problem seems to have been fixed. When the cluster is suspended and then resumed, the node1+ sends out a topology update to the server client, which then updates its ClusterContext and later invocations on the server-client are again directed to node2+.

I was expecting to see node2+ also send a topology update back to the server-client, but don't see any evidence of that. Need to check out why not.

Comment 4 Richard Achmatowicz 2015-12-14 21:07:00 UTC
The previous comment should read "When node2+ is suspended and then resumed ...".

Comment 5 Richard Achmatowicz 2015-12-14 21:13:29 UTC
Do we need an upstream JIRA for this? The issue does not exist (or has been fixed, depending on how you want to look at it) in WF10.

Comment 6 dereed 2015-12-14 22:10:14 UTC
If it doesn't affect upstream, there's no upstream JIRA.
(unless there's an existing JIRA from when it was changed that could be linked)

Comment 7 Richard Achmatowicz 2015-12-15 16:40:05 UTC
Here is the PR for 6.4.4: https://github.com/jbossas/jboss-eap/pull/2661

Comment 11 Richard Achmatowicz 2016-02-22 14:24:55 UTC
There was no fix required in upstream as the implementation had changed significantly since 6.3.3. and Paul had added the @Merged annotation there.

Comment 13 Jiří Bílek 2016-03-15 11:09:47 UTC
Verified with EAP 6.4.7.CP.CR2.

Comment 14 Petr Penicka 2017-01-17 11:55:30 UTC
Retroactively bulk-closing issues from released EAP 6.4 cumulative patches.