Bug 745882 (EDG-116)

Summary: HotRod server refuses connections shortly after start
Product: [JBoss] JBoss Data Grid 5
Reporter: Michal Linhard <mlinhard>
Component: Infinispan
Assignee: Default User <jbpapp-maint>
Status: CLOSED NEXTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: EAP 5.1.0 EDG TP
CC: boniek, galder.zamarreno, mlinhard, nobody, trustin
Target Milestone: ---   
Target Release: EAP 5.1.0 EDG TP   
Hardware: Unspecified   
OS: Unspecified   
URL: http://jira.jboss.org/jira/browse/EDG-116
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-09-26 19:31:58 UTC
Type: Bug
Attachments:
Description Flags
results.ods none

Description Michal Linhard 2011-05-11 11:25:09 UTC
project_key: EDG

In resilience tests we're seeing connection refused errors shortly after a node restarts.
We have 4 nodes, perf17-perf20, and we're failing perf19.
Right after perf19 finishes its join rehash ([JBoss] 05:00:07,062 INFO  [JoinTask] perf19-58461 completed join rehash in 16.22 seconds!),
the driver nodes (perf02-perf10) start trying to connect to it, but it is not yet ready to accept connections.
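
One way a test driver could measure how long the restarted node keeps refusing connections is to poll its endpoint with plain TCP connects. The following is only a minimal sketch: `wait_for_port` is a hypothetical helper, not part of any HotRod client, and the host and port in the usage note are assumptions to be replaced with the real endpoint.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll host:port until a TCP connect succeeds or the timeout expires.

    Returns the number of seconds waited on success, or None on timeout.
    (Hypothetical helper for illustration only.)
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Connection refused or timed out: the endpoint is not ready yet.
            if time.monotonic() >= deadline:
                return None
            time.sleep(interval)
```

For example, calling `wait_for_port("perf19", 11222)` right after the join rehash log line appears would give a rough lower bound on the gap, assuming the HotRod endpoint listens on port 11222 on that host. A successful TCP connect only shows that the socket is bound, not that the server is fully able to serve requests.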

Is there a period of time between the moment the new node is officially in the cluster (so that HotRod clients learn about it via topology change piggybacking) and the moment its HotRod server is started?

Shouldn't we eliminate this period?
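
If such a period exists, one conceivable way to close it is to bind the client endpoint before the node's address is published to clients. The toy model below only illustrates that ordering; `Node` and the shared `topology` dict are hypothetical names and do not reflect how Infinispan or the HotRod server actually start up.

```python
import socket


class Node:
    """Toy node that binds its client endpoint before it is announced."""

    def __init__(self, name, topology):
        self.name = name
        self.topology = topology  # shared dict: node name -> (host, port)
        self.listener = None

    def start(self, host="127.0.0.1", port=0):
        # Step 1: bind and listen first, so incoming connections are
        # accepted into the backlog from this point on.
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.bind((host, port))
        self.listener.listen(16)
        # Step 2: only now publish the address that clients will use.
        self.topology[self.name] = self.listener.getsockname()

    def stop(self):
        # Withdraw from the topology before closing the socket, so
        # clients are never pointed at a closed endpoint.
        self.topology.pop(self.name, None)
        if self.listener is not None:
            self.listener.close()
            self.listener = None
```

With this ordering, any client that reads an address from `topology` can connect to it immediately, because the listening socket already exists by the time the entry appears.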

The affected run is:
http://hudson.qa.jboss.com/hudson/view/EDG/job/edg-51x-resilience-client-size4-hotrod/58/
I realized that sampling errors occur not only during node failure but also during node recovery (even more than during failure), and these are the connection refused exceptions mentioned above.

Comment 1 Michal Linhard 2011-05-11 11:26:30 UTC
results.ods: attaching compiled data from the Hudson run. The approximate times of the fail and restore events are marked in the table.

Comment 2 Michal Linhard 2011-05-11 11:26:30 UTC
Attachment: Added: results.ods


Comment 3 Galder Zamarreño 2011-08-03 07:11:27 UTC
Michal, does this need looking into?

Comment 4 Michal Linhard 2011-08-03 08:38:04 UTC
I'll verify this one; it might also apply to EDG6 Alpha.

Comment 5 Michal Linhard 2011-08-03 15:57:19 UTC
This will take a bit longer; I'll need to get the resilience tests going.

Comment 6 Michal Linhard 2011-09-26 19:31:58 UTC
This is now obsolete. When a similar thing occurs for EDG6, we'll create a new JIRA.

Comment 7 Anne-Louise Tangring 2011-10-11 17:09:35 UTC
Docs QE Status: Removed: NEW