Bug 1058887

Summary: HotRod client keeps trying to recover connections to a failed cluster for a long time
Product: [JBoss] JBoss Data Grid 6 Reporter: wfink
Component: Infinispan Assignee: Tristan Tarrant <ttarrant>
Status: VERIFIED --- QA Contact: Martin Gencur <mgencur>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.1.0 CC: jdg-bugs, pruivo, vjuranek
Target Milestone: CR1   
Target Release: 6.2.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
If a cluster is no longer reachable for some reason, e.g. a network disconnect, the hot-rod client tries to re-establish the lost connections. The client library retries using a fixed calculation based on the maximum number of connections in the pool, or 10, multiplied by the number of available servers. This may result in a long delay before the application can continue and react, as it waits for the read timeout on each try.
This has been fixed by adding a new configuration property, infinispan.client.hotrod.max_retries. This property defines the maximum number of retries in case of a recoverable error. A valid value must be greater than or equal to 0 (zero). Zero means no retry. The default is 10.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1060199, 1075061, 1060655    

Description wfink 2014-01-28 17:09:23 UTC
If a JDG cluster is no longer reachable for some reason, e.g. a network disconnect, the hot-rod client tries to re-establish the lost connections.
The client library retries using a fixed calculation based on the maximum number of connections in the pool, or 10, multiplied by the number of available servers.
This can take a very long time before the application can continue and react, as it waits for the read or connect timeout on each try.
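
To give a rough sense of the scale involved, here is a minimal sketch of the arithmetic. It assumes one reading of the calculation above (the pool's maximum connection count, falling back to 10 when the pool is unbounded, multiplied by the server count); the class name, method names, and the 60-second timeout are hypothetical and not taken from the client library.

```java
public class RetryDelayEstimate {

    // Assumed interpretation of the report: use the pool's max connections,
    // or 10 if the pool has no positive limit, times the number of servers.
    static int retryAttempts(int maxPoolConnections, int serverCount) {
        int perServer = maxPoolConnections > 0 ? maxPoolConnections : 10;
        return perServer * serverCount;
    }

    // Upper bound on blocking time if every attempt runs into the full timeout.
    static long worstCaseWaitMillis(int attempts, long timeoutMillis) {
        return (long) attempts * timeoutMillis;
    }

    public static void main(String[] args) {
        // Hypothetical setup: unbounded pool, 3 servers, 60 s timeout per try.
        int attempts = retryAttempts(-1, 3);
        long waitMillis = worstCaseWaitMillis(attempts, 60_000L);
        System.out.println(attempts + " attempts, up to " + (waitMillis / 1000) + " s");
    }
}
```

Under these assumed numbers the caller could block for many minutes before an error surfaces.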

To improve this behaviour, there should be a configurable limit on retries per server and/or a total timeout.

This gives the application the chance to handle a remote-cache failure and reply to the user instead of hanging for minutes (with the default settings).
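
A limit of this kind corresponds to the infinispan.client.hotrod.max_retries property named in the Doc Text. The following is a minimal sketch of how a client might set it through java.util.Properties. The helper class and method are hypothetical, and handing the resulting Properties to the Hot Rod client is assumed rather than shown, so that the example needs only the standard library.

```java
import java.util.Properties;

public class HotRodRetryConfig {

    // Property name as given in this bug's Doc Text.
    static final String MAX_RETRIES = "infinispan.client.hotrod.max_retries";

    // Hypothetical helper: builds client properties with a bounded retry count.
    static Properties withMaxRetries(int maxRetries) {
        if (maxRetries < 0) {
            // The Doc Text states valid values are >= 0; zero means no retry.
            throw new IllegalArgumentException("max_retries must be >= 0");
        }
        Properties props = new Properties();
        props.setProperty(MAX_RETRIES, Integer.toString(maxRetries));
        return props;
    }

    public static void main(String[] args) {
        // Fail fast: allow at most 2 retries instead of the default of 10.
        Properties props = withMaxRetries(2);
        // Assumption: these properties would then be passed to the Hot Rod
        // client when it is constructed.
        System.out.println(MAX_RETRIES + "=" + props.getProperty(MAX_RETRIES));
    }
}
```

A small value lets the application report the remote-cache failure quickly, while zero disables retries entirely.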

Comment 2 Dan Berindei 2014-01-31 07:55:41 UTC
Pull request integrated: https://github.com/infinispan/jdg/pull/17