Bug 745904 (EDG-111) - UNICAST sender window not found
Summary: UNICAST sender window not found
Keywords:
Status: CLOSED NEXTRELEASE
Alias: EDG-111
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.0.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Default User
QA Contact:
URL: http://jira.jboss.org/jira/browse/EDG...
Whiteboard:
Depends On:
Blocks:
Reported: 2011-07-15 08:08 UTC by Michal Linhard
Modified: 2014-03-17 04:02 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-18 21:09:09 UTC
Type: Bug




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker EDG-111 0 None Closed UNICAST sender window not found 2012-06-29 05:39:36 UTC

Description Michal Linhard 2011-07-15 08:08:34 UTC
project_key: EDG

EDG6 Alpha revision 65
Infinispan 5.0.0.CR7

In the 4-node REST data stress test
(https://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-rest-size4/1/console-perf17/consoleText)

First, there are a lot of these warnings:
{code}
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-14) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-143) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-134) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-58) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-14) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
{code}

Then we get this:

{code}
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-141) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-18) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-140) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-141) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-18) perf17-26748: sender window for perf18-14537 not found
{code}

We use this config:
https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/

Comment 2 Trustin Lee 2011-07-15 13:44:54 UTC
Conversation with Bela:
{code}
[21:13:53] <bela> This means that a member left the cluster
[21:14:23] <bela> perf18 is not part of the cluster; perf17, 19 and 20 are
[21:14:28] <bela> trustin:  ^^
[21:19:47] <trustin> bela: does it mean 'left due to some failure'?
[21:20:30] <trustin> bela: .. like perf18 died.
[21:31:52] <bela> Either perf18 crashed, or it failed to respond to heartbeats and was expelled from the cluster
[21:32:19] <bela> If there is a high load on a system, perhaps due to high CPU, a node might fail to reply (in time)
[21:32:56] <bela> https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/standalone/configuration/standalone.xml
[21:33:07] <bela> shows that the config has FD_SOCK *and* FD_ALL
[21:33:43] <bela> FD_ALL has no configuration, which means the timeout is 10 secs
[21:34:11] <bela> This means, if you don't get a reply within 10 secs (the heartbeat itself is sent 3 times), then a node will get suspected and excluded
[21:34:40] <bela> I suggest 3 things for this test:
[21:34:54] <bela> #1 Either remove FD_ALL and only rely on FD_SOCK
[21:34:55] <bela> OR
[21:35:25] <bela> #2 Increase the timeout: FD_ALL timeout="35000" interval="10000"
[21:36:15] <bela> #3 Set msg_counts_as_heartbeat="true" in FD_ALL. This means, when you haven't received a heartbeat from P, but did receive a message from P, that P's counter is reset to 0, and P won't get suspected
{code}
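
The failure-detection behaviour described in the conversation can be illustrated with a small sketch. This is a toy model in Python, not the JGroups FD_ALL implementation: the class HeartbeatDetector, the function simulate, and the scenario (perf18 sends data every 2 seconds but no heartbeats for 20 seconds) are hypothetical. Only the 10-second default timeout, the suggested 35000 ms timeout, and the msg_counts_as_heartbeat option name come from the conversation above. The model also checks for suspects once at the end, whereas a real detector would check periodically.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Toy timeout-based failure detector (an assumption, not JGroups code)."""
    timeout_ms: int = 10_000              # default timeout cited in the conversation
    msg_counts_as_heartbeat: bool = False
    last_seen: dict = field(default_factory=dict)

    def add_member(self, member, now_ms):
        self.last_seen[member] = now_ms

    def on_heartbeat(self, member, now_ms):
        # A heartbeat always refreshes the member's timestamp.
        if member in self.last_seen:
            self.last_seen[member] = now_ms

    def on_message(self, member, now_ms):
        # Regular traffic refreshes the timestamp only if the option is enabled.
        if self.msg_counts_as_heartbeat and member in self.last_seen:
            self.last_seen[member] = now_ms

    def suspects(self, now_ms):
        # Members not heard from for longer than the timeout are suspected.
        return sorted(m for m, seen in self.last_seen.items()
                      if now_ms - seen > self.timeout_ms)


def simulate(timeout_ms, msg_counts_as_heartbeat):
    """Hypothetical scenario: perf18 is too loaded to send heartbeats for 20 s
    but still sends data every 2 s; perf19 sends heartbeats normally."""
    fd = HeartbeatDetector(timeout_ms, msg_counts_as_heartbeat)
    for member in ("perf18", "perf19"):
        fd.add_member(member, 0)
    for now in range(2_000, 20_001, 2_000):
        fd.on_heartbeat("perf19", now)    # healthy node
        fd.on_message("perf18", now)      # busy node: data, no heartbeats
    return fd.suspects(20_000)
```

Under these assumptions, simulate(10_000, False) reports perf18 as suspected, while enabling msg_counts_as_heartbeat or raising the timeout to 35000 ms leaves it unsuspected, which corresponds to suggestions #3 and #2.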

Comment 3 Trustin Lee 2011-07-16 02:19:39 UTC
Michal, could you update the configuration as Bela advised and let me know if it's still a problem?

Comment 4 Michal Linhard 2011-07-18 21:08:35 UTC
This probably had to do with the UDP config on the perf machines.
This issue doesn't occur with the new config:
http://anonsvn.jboss.org/repos/edg/trunk/dist-dir/src/main/resources/standalone/configuration/standalone.xml
(rev 67)

Comment 5 Anne-Louise Tangring 2011-10-11 17:09:33 UTC
Docs QE Status: Removed: NEW 


