Bug 1116965

Summary:	Messages sent to leavers can clog the JGroups bundler thread
Product:	[JBoss] JBoss Data Grid 6	Reporter:	Dan Berindei <dberinde>
Component:	Infinispan	Assignee:	Tristan Tarrant <ttarrant>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Martin Gencur <mgencur>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	6.3.0	CC:	afield, jdg-bugs
Target Milestone:	CR2
Target Release:	6.3.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2015-01-26 14:04:08 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1104045

Description Dan Berindei 2014-07-07 18:06:38 UTC

In a stress test that repeatedly kills nodes while performing read/write operations, the TransferQueueBundler thread seems to spend a lot of time waiting for physical addresses:

06:40:10,316 WARN  [org.radargun.utils.Utils] (pool-5-thread-1) Stack for thread TransferQueueBundler,default,apex953-14666:
java.lang.Thread.sleep(Native Method)
org.jgroups.util.Util.sleep(Util.java:1504)
org.jgroups.util.Util.sleepRandom(Util.java:1574)
org.jgroups.protocols.TP.sendToSingleMember(TP.java:1685)
org.jgroups.protocols.TP.doSend(TP.java:1670)
org.jgroups.protocols.TP$TransferQueueBundler.sendBundledMessages(TP.java:2476)
org.jgroups.protocols.TP$TransferQueueBundler.sendMessages(TP.java:2392)
org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2383)
java.lang.Thread.run(Thread.java:744)

Two related bugs are already fixed in JGroups 3.5.0.Beta2 and later: JGRP-1814 and JGRP-1815.

There is also a special case, JGRP-1858, where the physical address can be removed from the cache too soon, which exacerbates the effect of JGRP-1815.

We can work around the problem by changing the JGroups configuration:
* TP.logical_addr_cache_expiration=86400000
** Only expire addresses after 1 day
* TP.physical_addr_max_fetch_attempts=1
** Sleep for only 20ms waiting for the physical address (default 3 - 1500ms)
* UNICAST3.conn_close_timeout=10000
** Drop the pending messages to leavers sooner
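
For reference, the three settings above might look roughly like this in a JGroups XML stack file. This is only a sketch: it assumes the stack uses UDP as its transport (TP is the base class, so the attributes go on whichever concrete transport element the stack declares, e.g. UDP or TCP), and it shows only the changed attributes. All other attributes and protocols of the existing stack are left out and should stay as they are.

```xml
<!-- Sketch only: UDP is an assumption; put the TP attributes on the transport your stack actually uses. -->
<config xmlns="urn:org:jgroups">
    <!-- TP settings: keep logical addresses cached for 1 day, try only once to fetch a physical address -->
    <UDP logical_addr_cache_expiration="86400000"
         physical_addr_max_fetch_attempts="1" />

    <!-- the remaining protocols of the existing stack go here, unchanged -->

    <!-- Drop pending messages to leavers sooner -->
    <UNICAST3 conn_close_timeout="10000" />
</config>
```

The values are in milliseconds, matching the list above.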

Comment 2 Alan Field 2014-07-15 11:38:14 UTC

Executed the elasticity test in Hyperion 3 times without a failure, and the resilience test 5 times without a failure with JDG 6.3.0 CR4. VERIFIED

Comment 3 JBoss JIRA Server 2014-07-29 09:11:06 UTC

Dan Berindei <dberinde> updated the status of jira ISPN-4480 to Resolved