| Summary: | SuspectException not handled in hotrod client | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 5 | Reporter: | Michal Linhard <mlinhard> |
| Component: | Infinispan | Assignee: | Default User <jbpapp-maint> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | EAP 5.1.0 EDG TP | CC: | galder.zamarreno, mlinhard, nobody, rachmato, rhusar |
| Target Milestone: | --- | | |
| Target Release: | EAP 5.1.0 EDG TP | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| URL: | http://jira.jboss.org/jira/browse/EDG-96 | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-05-17 08:06:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Michal Linhard
2011-05-06 10:24:35 UTC
Link: Added: This issue is incorporated by JBPAPP-6054

Second thought: actually, maybe the client is not the place to deal with this at all, and the SuspectException should not be thrown at all when we know we have the possibility to fail over to a different node. Maybe this could be dealt with in node1 (from the example). Anyway, the suspect exception is something we can deal with, and it should not bubble up to the user code.

Michal, I agree w/ you that something needs to be done to handle this. The only thing the server can do here is maybe launder it into something that's easier to deal w/ by the client, but in the end it is the client that needs to decide what to do about it. Otherwise, if the suspect exception was not returned back to the client, what would the server return instead? Ideally we'd solve this with a new error status code, but I'd leave that until we come up with a new version of the protocol (see ISPN-1090). In the meantime clients can check the error message. I'll create a JIRA so that clients can deal w/ this in its current form.

Link: Added: This issue depends on ISPN-1091

I had a look at this code on Friday and I'll add my comments so that my swimming through the code was not in vain. Roughly, the call chain is something like this:

HotRod client -> HotRod server -> AbstractProtocolDecoder.decode() -> CacheDelegate.put() -> (invoker chain).invoke() -> DistributedInterceptor.handleWriteCommand() -> RpcManager.invokeRemotely(recipients, command, sync) -> JGroupsTransport.invokeRemotely()

At this point, JGroups makes a group RPC call to all nodes in the cluster which hold the key, so that the value may be updated. The group RPC returns a set of Rsp values, some of which completed successfully, and some of which show that the node in question was suspected during the RPC call. In GroupRequest, a node can be suspected during an RPC call if (i) a SUSPECT event is received by the RequestCorrelator, or (ii) a VIEW_CHANGE event is received and the node in question is not part of the new view. The JGroupsTransport throws the exception in question if one of the nodes was suspected. Part of the problem here is that we don't know if the node was suspected because it was slow, or suspected because it really did fail and is not part of the new view. If the node was falsely suspected, then we may have a key one of whose copies was updated and one of whose copies was not (even if the client receives the exception). Yes? No? From a client's point of view, shouldn't a put just succeed or fail (and if it does fail, the client should retry)? Should it not be the responsibility of the HotRod server to determine whether the RPC update completed successfully, deal with any potential inconsistency between key copies in the cluster, and mask all the details from the client? My 2c.

Well, if the operation is synchronous and there's a suspicion, it'd be rolled back on the other nodes, so there should not be a situation where some nodes have the changes applied and others don't. With async, it's fire and forget, so if someone is suspected, some nodes might contain the right data and others stale data. No, I don't think the server should do any failover, otherwise you start having to duplicate failover code in both client and server. What the server needs to do is figure out whether an operation was successful and tell the client accordingly. The client can then decide whether to fail over based on the error code or error message and the cluster formation.
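(For illustration only, a rough Java sketch of the check described above: inspect the group RPC responses and raise an exception if any member was suspected. It assumes a JGroups 3.x-style Rsp API, treats the response list as a plain map to stay self-contained, and uses a stand-in SuspectException class; this is not the actual Infinispan code path.)

```java
import java.util.Map;

import org.jgroups.Address;
import org.jgroups.util.Rsp;

// Illustrative sketch: how a transport layer can turn a suspected member in a
// group RPC result into an exception, per the discussion above.
public final class SuspectCheck {

    // Stand-in for Infinispan's SuspectException, not the real class.
    public static class SuspectException extends RuntimeException {
        public SuspectException(String message) { super(message); }
    }

    // JGroups' RspList behaves like a Map<Address, Rsp<T>>, so a plain map is
    // used here to keep the example compilable without guessing its exact API.
    public static <T> void assertNoSuspects(Map<Address, Rsp<T>> responses) {
        for (Map.Entry<Address, Rsp<T>> entry : responses.entrySet()) {
            Rsp<T> rsp = entry.getValue();
            if (rsp.wasSuspected()) {
                // Set when a SUSPECT event arrived during the RPC, or when the
                // member dropped out of the new view (false suspicion included).
                throw new SuspectException("Member suspected during RPC: " + entry.getKey());
            }
            if (!rsp.wasReceived()) {
                // Slow or unreachable member: typically surfaces as a timeout
                // rather than a suspicion.
                throw new RuntimeException("No response from " + entry.getKey());
            }
        }
    }
}
```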
Besides, the server cannot send cache requests to particular nodes. All it does is call the local Infinispan instance, which hides the details away via JGroups. Well, maybe the server could call other servers because it knows the endpoints, but it'd be repeating the same code as in the client. Let's keep things simple and responsibilities separated.

Richard's idea, which we considered, was to isolate the client from abnormal states of the cluster. But if, as you say, this would lead to code duplication and unclean code, I guess let's go with the client-side failover approach. Since the suspect exception (in the case of synchronous replication) doesn't lead to data integrity problems and the operation fails as a whole, it's no problem there. The main thing is to isolate the user of the HotRod client from this kind of exception. Another related thing is the replication timeout exception, which is thrown from the same place. Shouldn't the client be able to recover from this kind of exception as well? (Just pointing it out so that we can possibly add it to ISPN-1090 as well, or make a more general error code for clustering-related problems.)

I'd need to check the code to remember, but IIRC the replication timeout exception would have been covered under the 0x86 "command timed out" status code in http://community.jboss.org/docs/DOC-14421, and if that's the case, it should fail over already. I'll add a note to verify this.

Run with 4.2.2-SNAPSHOT-20110516.115014: http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/60 — the SuspectExceptions disappeared...

Release Notes Text: Added: Marking as resolved according to Michal's report. Please feel free to reopen if not resolved.

Marking as resolved according to Michal's report. Please feel free to reopen if not resolved. (It seems like the comment I entered when I mark an issue as resolved is inserted into the release note text field.)

Release Notes Text: Removed: Marking as resolved according to Michal's report. Please feel free to reopen if not resolved.

There are two text areas; the comment area is at the bottom of the form (when resolving).

Docs QE Status: Removed: NEW
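(For illustration only, a rough Java sketch of the client-side failover approach discussed above: catch the client exception, check the error message for a suspicion as a stop-gap until a dedicated status code exists, and retry so the operation can go to another node. The wrapper class, method, and retry count are hypothetical, not part of the HotRod client API.)

```java
import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.exceptions.HotRodClientException;

// Hypothetical caller-side wrapper around a HotRod put, sketching the
// "check the error message and fail over" approach from the discussion.
public final class SuspectAwarePut {

    public static <K, V> void putWithRetry(RemoteCache<K, V> cache, K key, V value, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                cache.put(key, value);
                return;
            } catch (HotRodClientException e) {
                boolean suspectRelated = String.valueOf(e.getMessage()).contains("SuspectException");
                if (!suspectRelated || attempt >= maxAttempts) {
                    throw e; // unrelated failure, or retries exhausted
                }
                // On the next attempt the client picks another server from its
                // topology view, so the operation can fail over to a live node.
            }
        }
    }
}
```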