Bug 745881 (EDG-96) - SuspectException not handled in hotrod client
Summary: SuspectException not handled in hotrod client
Keywords:
Status: CLOSED NEXTRELEASE
Alias: EDG-96
Product: JBoss Data Grid 5
Classification: JBoss
Component: Infinispan
Version: EAP 5.1.0 EDG TP
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
: EAP 5.1.0 EDG TP
Assignee: Default User
QA Contact:
URL: http://jira.jboss.org/jira/browse/EDG-96
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-05-06 10:24 UTC by Michal Linhard
Modified: 2014-03-17 04:02 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-17 08:06:22 UTC
Type: Bug


Links
Red Hat Issue Tracker EDG-96 (Priority: Major, Status: Closed): SuspectException not handled in hotrod client - Last Updated: 2013-05-19 21:50:42 UTC

Description Michal Linhard 2011-05-06 10:24:35 UTC
project_key: EDG

As a client that supports failover, the HotRod Java client should be able to handle org.infinispan.remoting.transport.jgroups.SuspectException, which originates from JGroups-layer failure detection on the server side.

The example scenario in which we commonly observe this:
We have a cluster of node1, node2, node3, node4.
We use dist mode; the owners of key A are node1 and node3.
A PUT operation for key A goes to node1, and Infinispan on node1 tries to replicate to node3, which has failed.
node1 gets a SuspectException and rethrows it to the client.
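
Below is a minimal, hypothetical sketch (the endpoint, retry count and string check are assumptions, not part of any agreed fix) of what user code currently has to do to survive this; the whole point of this bug is that the HotRod client should handle it internally:

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.exceptions.HotRodClientException;

public class PutWithRetry {
    public static void main(String[] args) {
        // Assumed endpoint; any cluster member running a HotRod server would do.
        RemoteCacheManager rcm = new RemoteCacheManager("node1", 11222);
        RemoteCache<String, String> cache = rcm.getCache();

        int attempts = 0;
        while (true) {
            try {
                cache.put("A", "value"); // may hit node1, which replicates to the failed node3
                break;                   // success, stop retrying
            } catch (HotRodClientException e) {
                // Today the server-side SuspectException is only visible through the message,
                // so user code ends up string-matching - exactly what should be hidden.
                boolean suspected = e.getMessage() != null
                        && e.getMessage().contains("SuspectException");
                if (suspected && attempts++ < 3) {
                    continue;            // retry; failover should route to another owner
                }
                throw e;                 // unrelated failure, or retries exhausted
            }
        }
        rcm.stop();
    }
}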

Comment 1 Michal Linhard 2011-05-06 10:25:22 UTC
Link: Added: This issue is incorporated by JBPAPP-6054


Comment 2 Michal Linhard 2011-05-06 10:37:34 UTC
Second thought:
Actually, maybe the client is not the place to deal with this at all, and the SuspectException should not be thrown at all when we know we have the possibility to fail over to a different node. Maybe this could be dealt with on node1 (from the example).

Anyway, the SuspectException is something we can deal with, and it should not bubble up to the user code.

Comment 3 Galder Zamarreño 2011-05-09 14:03:39 UTC
Michal, I agree w/ you that something needs to be done to handle this. The only thing the server can do here is maybe launder it into something that's easier to deal w/ by the client, but in the end it is the client that needs to decide what to do about it. Otherwise, if the suspect exception was not returned to the client, what would the server return instead?

Ideally we'd solve this with a new error status code, but I'd leave that until we come up with a new version of the protocol (see ISPN-1090). In the meantime clients can check the error message. I'll create a JIRA so that clients can deal w/ this in its current form.
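
A rough sketch of the "laundering" option mentioned above, purely for illustration (RetriableServerException is a hypothetical name, not an existing Infinispan class): the server endpoint could catch the raw SuspectException and rewrap it with a stable, recognisable meaning until a dedicated status code exists:

import org.infinispan.Cache;
import org.infinispan.remoting.transport.jgroups.SuspectException;

// Hypothetical wrapper type; it just stands for "an error the client may retry elsewhere".
class RetriableServerException extends RuntimeException {
    RetriableServerException(String msg, Throwable cause) { super(msg, cause); }
}

class LaunderingPut {
    static <K, V> void put(Cache<K, V> cache, K key, V value) {
        try {
            cache.put(key, value);
        } catch (SuspectException e) {
            // Translate the raw JGroups-level suspicion into something the client can act on.
            throw new RetriableServerException("Owner suspected; safe to fail over", e);
        }
    }
}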

Comment 4 Galder Zamarreño 2011-05-09 14:21:01 UTC
Link: Added: This issue depends ISPN-1091


Comment 5 Richard Achmatowicz 2011-05-09 15:39:26 UTC
I had a look at this code on Friday and I'll add my comments so that my swimming through the code was not in vain.

Roughly, the call chain is something like this:
HotRod client -> 
HotRod server -> 
AbstractProtocolDecoder.decode() -> 
CacheDelegate.put() -> 
(invoker chain).invoke() -> 
DistributedInterceptor.handleWriteCommand() -> 
RpcManager.invokeRemotely(recipients, command, sync) -> 
JGroupsTransport.invokeRemotely() 

At this point, JGroups makes a group RPC call to all the nodes in the cluster which hold the key, so that the value may be updated. The group RPC returns a set of Rsp values, some of which completed successfully and some of which indicate that the node in question was suspected during the RPC call. In GroupRequest, a node can be suspected during an RPC call if (i) a SUSPECT event is received by the RequestCorrelator or (ii) a VIEW_CHANGE event is received and the node in question is not part of the new view. The JGroupsTransport throws the exception in question if one of the nodes was suspected. Part of the problem here is that we don't know whether the node was suspected because it was slow, or suspected because it really did fail and is not part of the new view. If the node was falsely suspected, then we may have a key one of whose copies was updated and one of whose copies was not (even if the client receives the exception). Yes? No?
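
A simplified sketch of that response check, assuming the JGroups 2.x Rsp/RspList API (this is not the actual JGroupsTransport code, just an illustration of where the suspicion is detected):

import org.infinispan.remoting.transport.jgroups.SuspectException;
import org.jgroups.util.Rsp;
import org.jgroups.util.RspList;

class RspCheck {
    // Scan the group RPC results and fail fast if any target was suspected,
    // which is roughly what ends up surfacing to the HotRod client as a SuspectException.
    static void checkForSuspects(RspList rsps) {
        for (Object o : rsps.values()) {
            Rsp rsp = (Rsp) o;
            if (rsp.wasSuspected()) {
                throw new SuspectException("Suspected member: " + rsp.getSender());
            }
            if (!rsp.wasReceived()) {
                // No reply and no suspicion: most likely a timeout for this member.
                throw new RuntimeException("No response from " + rsp.getSender());
            }
        }
    }
}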

From a client's point of view, shouldn't a put just succeed or fail (and if it does fail, the client should retry)? Shouldn't it be the responsibility of the HotRod server to determine whether the RPC update completed successfully, deal with any potential inconsistency among key copies in the cluster, and mask all the details from the client?

My 2c.
 
Comment 6 Galder Zamarreño 2011-05-10 08:59:43 UTC
Well, if the operation is synchronous and there's a suspicion, it'd be rolled back on the other nodes, so there should not be a situation where some nodes have the change applied and others don't. With async, it's fire-and-forget, so if someone is suspected, some nodes might contain the right data and others stale data.

No, I don't think the server should do any failover; otherwise you start having to duplicate failover code in both the client and the server. What the server needs to do is figure out whether an operation was successful and tell the client accordingly. The client can then decide whether to fail over based on the error code or error message and the cluster formation.

Comment 7 Galder Zamarreño 2011-05-10 09:02:03 UTC
Besides, the server cannot send cache requests to particular nodes. All it does is call the local Infinispan instance, which hides the details away via JGroups. Well, maybe the server could call other servers because it knows the endpoints, but it'd be repeating the same code as in the client. Let's keep things simple and responsibilities separated.

Comment 8 Michal Linhard 2011-05-10 09:23:54 UTC
Richard's idea, which we considered, was to isolate the client from abnormal states of the cluster. But if, as you say, this would lead to code duplication and unclean code, I guess let's go with the client-side failover approach. Since the SuspectException (in the case of synchronous replication) doesn't lead to data integrity problems and the operation fails as a whole, there's no problem there.
The main thing is to isolate the user of the HotRod client from this kind of exception.

Another related thing is the replication timeout exception that is thrown from the same place. Shouldn't the client be able to recover from this kind of exception as well? (Just pointing it out so that we can possibly add it to ISPN-1090 too, or define a more general error code for clustering-related problems.)

Comment 9 Galder Zamarreño 2011-05-11 08:23:32 UTC
I'd need to check the code to remember, but IIRC the replication timeout exception would be covered by the 0x86 "command timed out" status code in http://community.jboss.org/docs/DOC-14421, and if that's the case, the client should fail over already. I'll add a note to verify this.
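
For illustration only, a sketch of that kind of client-side dispatch; 0x86 ("command timed out") is documented, while treating 0x85 as the generic server-error code that would carry the SuspectException message is an assumption here:

class FailoverDecision {
    // Decide whether a HotRod error response should trigger failover to another server.
    static boolean shouldFailOver(int status, String errorMessage) {
        switch (status) {
            case 0x86: // command timed out: covers replication timeouts, retried already
                return true;
            case 0x85: // assumed generic server error: fall back to inspecting the message
                return errorMessage != null && errorMessage.contains("SuspectException");
            default:
                return false;
        }
    }
}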

Comment 10 Michal Linhard 2011-05-16 15:26:23 UTC
Run with 4.2.2-SNAPSHOT-20110516.115014:
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/60

SuspectExceptions disappeared...

Comment 11 Trustin Lee 2011-05-17 08:06:22 UTC
Release Notes Text: Added: Marking as resolved according to Michal's report.  Please feel free to reopen if not resolved.


Comment 12 Trustin Lee 2011-05-17 08:07:36 UTC
Marking as resolved according to Michal's report.  Please feel free to reopen if not resolved.
(It seems the comment I entered when I marked the issue as resolved was inserted into the release notes text field.)

Comment 13 Trustin Lee 2011-05-17 08:07:36 UTC
Release Notes Text: Removed: Marking as resolved according to Michal's report.  Please feel free to reopen if not resolved. 


Comment 14 Michal Linhard 2011-05-17 08:11:31 UTC
There are two text areas; the comment area is at the bottom of the form (when resolving).

Comment 15 Anne-Louise Tangring 2011-10-11 17:06:18 UTC
Docs QE Status: Removed: NEW 


