Bug 1116965

Summary: Messages sent to leavers can clog the JGroups bundler thread
Product: [JBoss] JBoss Data Grid 6
Component: Infinispan
Version: 6.3.0
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Target Milestone: CR2
Target Release: 6.3.0
Hardware: Unspecified
OS: Unspecified
Reporter: Dan Berindei <dberinde>
Assignee: Tristan Tarrant <ttarrant>
QA Contact: Martin Gencur <mgencur>
CC: afield, jdg-bugs
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-01-26 14:04:08 UTC
Bug Blocks: 1104045

Description Dan Berindei 2014-07-07 18:06:38 UTC
In a stress test that repeatedly kills nodes while performing read/write operations, the TransferQueueBundler thread seems to spend a lot of time waiting for physical addresses:

06:40:10,316 WARN  [org.radargun.utils.Utils] (pool-5-thread-1) Stack for thread TransferQueueBundler,default,apex953-14666:
java.lang.Thread.sleep(Native Method)
org.jgroups.util.Util.sleep(Util.java:1504)
org.jgroups.util.Util.sleepRandom(Util.java:1574)
org.jgroups.protocols.TP.sendToSingleMember(TP.java:1685)
org.jgroups.protocols.TP.doSend(TP.java:1670)
org.jgroups.protocols.TP$TransferQueueBundler.sendBundledMessages(TP.java:2476)
org.jgroups.protocols.TP$TransferQueueBundler.sendMessages(TP.java:2392)
org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2383)
java.lang.Thread.run(Thread.java:744)

Two related bugs are already fixed in JGroups 3.5.0.Beta2 and later: JGRP-1814 and JGRP-1815.

There is also a special case where the physical address could be removed from the cache too soon, exacerbating the effect of JGRP-1815: JGRP-1858

We can work around the problem by changing the JGroups configuration:
* TP.logical_addr_cache_expiration=86400000
** Only expire addresses after 1 day
* TP.physical_addr_max_fetch_attempts=1
** Sleep for only 20ms waiting for the physical address (default: 3 attempts, up to 1500ms)
* UNICAST3.conn_close_timeout=10000
** Drop the pending messages to leavers sooner
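
For illustration only, the three settings above might look like this in a JGroups XML stack file. This is a sketch, not the configuration used in the test: it assumes a UDP transport (the report does not say which transport was in use; with TCP the same two attributes would go on the TCP element), and every other protocol and attribute of the stack is omitted.

```xml
<config xmlns="urn:org:jgroups">
    <!-- Transport (TP) settings; UDP is an assumption, not stated in this report -->
    <UDP logical_addr_cache_expiration="86400000"
         physical_addr_max_fetch_attempts="1"/>
    <!-- 86400000 ms = 1 day; 1 fetch attempt instead of the default 3 -->

    <!-- remaining protocols of the stack omitted -->

    <!-- 10000 ms = 10 s before the connection to a leaver is closed
         and its pending messages are dropped -->
    <UNICAST3 conn_close_timeout="10000"/>
</config>
```

The attributes have to be merged into the existing protocol elements of the stack actually in use; the fragment above is not a complete, working stack on its own.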

Comment 2 Alan Field 2014-07-15 11:38:14 UTC
Executed the elasticity test in Hyperion 3 times without a failure, and the resilience test 5 times without a failure with JDG 6.3.0 CR4. VERIFIED

Comment 3 JBoss JIRA Server 2014-07-29 09:11:06 UTC
Dan Berindei <dberinde> updated the status of jira ISPN-4480 to Resolved