Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 745904 (EDG-111)

Summary:	UNICAST sender window not found
Product:	[JBoss] JBoss Data Grid 6	Reporter:	Michal Linhard <mlinhard>
Component:	Infinispan	Assignee:	Default User <jbpapp-maint>
Status:	CLOSED NEXTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	high
Version:	6.0.0	CC:	jdg-bugs, mlinhard, nobody, trustin
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
URL:	http://jira.jboss.org/jira/browse/EDG-111
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-07-18 21:09:09 UTC	Type:	Bug

Description Michal Linhard 2011-07-15 08:08:34 UTC

project_key: EDG

EDG6 Alpha revision 65
Infinispan 5.0.0.CR7

In the 4-node REST data stress test
(https://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-rest-size4/1/console-perf17/consoleText)

First, there are a lot of these warnings:
{code}
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-14) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-143) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-134) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-58) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-14) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
{code}

Then we get this:

{code}
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-141) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-18) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-140) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-141) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-18) perf17-26748: sender window for perf18-14537 not found
{code}

We use this config:
https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/

Comment 1 Michal Linhard 2011-07-15 10:14:01 UTC

This also happens with Infinispan 5.0.0-SNAPSHOT:
http://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-hotrod-size4/9/console-perf17/consoleText
http://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-rest-size4/2/console-perf17/consoleText

Comment 2 Trustin Lee 2011-07-15 13:44:54 UTC

Conversation with Bela:
{code}
[21:13:53] <bela> This means that a member left the cluster
[21:14:23] <bela> perf18 is not part of the cluster; perf17, 19 and 20 are
[21:14:28] <bela> trustin:  ^^
[21:19:47] <trustin> bela: does it mean 'left due to some failure'?
[21:20:30] <trustin> bela: .. like perf18 died.
[21:31:52] <bela> Either perf18 crashed, or it failed to respond to heartbeats and was expelled from the cluster
[21:32:19] <bela> If there is a high load on a system, perhaps due to high CPU, a node might fail to reply (in time)
[21:32:56] <bela> https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/standalone/configuration/standalone.xml
[21:33:07] <bela> shows that the config has FD_SOCK *and* FD_ALL
[21:33:43] <bela> FD_ALL has no configuration, which means the timeout is 10 secs
[21:34:11] <bela> This means, if you don't get a reply within 10 secs (the heartbeat itself is sent 3 times), then a node will get suspected and excluded
[21:34:40] <bela> I suggest 3 things for this test:
[21:34:54] <bela> #1 Either remove FD_ALL and only rely on FD_SOCK
[21:34:55] <bela> OR
[21:35:25] <bela> #2 Increase the timeout: FD_ALL timeout="35000" interval="10000"
[21:36:15] <bela> #3 Set msg_counts_as_heartbeat="true" in FD_ALL. This means, when you haven't received a heartbeat from P, but did receive a message from P, that P's counter is reset to 0, and P won't get suspected
{code}
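
For reference, suggestions #2 and #3 might look roughly like this in a plain JGroups protocol stack file. This is only a sketch: the attribute values are the ones Bela quoted, the rest of the stack is omitted, and if the stack is defined inside the server's standalone.xml the same properties would have to be written in that file's own syntax.
{code}
<!-- Sketch only: failure detection part of a JGroups stack, other protocols omitted. -->

<!-- Socket based failure detection, left at its defaults. -->
<FD_SOCK/>

<!-- Suggestion #2: suspect a member only after 35 s without a heartbeat,
     with heartbeats sent every 10 s.
     Suggestion #3: any message received from a member also counts as a heartbeat. -->
<FD_ALL timeout="35000"
        interval="10000"
        msg_counts_as_heartbeat="true"/>

<!-- Suggestion #1 is the alternative: delete the FD_ALL element above
     and rely on FD_SOCK alone. -->
{code}
With these values a member that is slow under load has 35 seconds instead of 10 to get a heartbeat or any other message through before it is suspected.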

Comment 3 Trustin Lee 2011-07-16 02:19:39 UTC

Michal, could you update the configuration as Bela advised and let me know if it's still a problem?

Comment 4 Michal Linhard 2011-07-18 21:08:35 UTC

This probably had to do with the UDP config on the perf machines.
The issue doesn't occur with the new config (rev 67):
http://anonsvn.jboss.org/repos/edg/trunk/dist-dir/src/main/resources/standalone/configuration/standalone.xml

Comment 5 Anne-Louise Tangring 2011-10-11 17:09:33 UTC

Docs QE Status: Removed: NEW