Bug 745904 (EDG-111)

Summary: UNICAST sender window not found
Product: [JBoss] JBoss Data Grid 6
Reporter: Michal Linhard <mlinhard>
Component: Infinispan
Assignee: Default User <jbpapp-maint>
Status: CLOSED NEXTRELEASE
Severity: high
Priority: high
Version: 6.0.0
CC: jdg-bugs, mlinhard, nobody, trustin
URL: http://jira.jboss.org/jira/browse/EDG-111
Doc Type: Bug Fix
Last Closed: 2011-07-18 21:09:09 UTC
Type: Bug

Description Michal Linhard 2011-07-15 08:08:34 UTC
project_key: EDG

EDG6 Alpha revision 65
Infinispan 5.0.0.CR7

In the 4-node REST data stress test
(https://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-rest-size4/1/console-perf17/consoleText)

First there are a lot of these warnings:
{code}
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-14) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-143) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-134) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-58) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-14) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
{code}

Then we get this:

{code}
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-141) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-18) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-140) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-141) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-18) perf17-26748: sender window for perf18-14537 not found
{code}

We use this config:
https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/

Comment 2 Trustin Lee 2011-07-15 13:44:54 UTC
Conversation with Bela:
{code}
[21:13:53] <bela> This means that a member left the cluster
[21:14:23] <bela> perf18 is not part of the cluster; perf17, perf19 and perf20 are
[21:14:28] <bela> trustin:  ^^
[21:19:47] <trustin> bela: does it mean 'left due to some failure'?
[21:20:30] <trustin> bela: .. like perf18 died.
[21:31:52] <bela> Either perf18 crashed, or it failed to respond to heartbeats and was expelled from the cluster
[21:32:19] <bela> If there is a high load on a system, perhaps due to high CPU, a node might fail to reply (in time)
[21:32:56] <bela> https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/standalone/configuration/standalone.xml
[21:33:07] <bela> shows that the config has FD_SOCK *and* FD_ALL
[21:33:43] <bela> FD_ALL has no configuration, which means the timeout is 10 secs
[21:34:11] <bela> This means, if you don't get a reply within 10 secs (the heartbeat itself is sent 3 times), then a node will get suspected and excluded
[21:34:40] <bela> I suggest 3 things for this test:
[21:34:54] <bela> #1 Either remove FD_ALL and only rely on FD_SOCK
[21:34:55] <bela> OR
[21:35:25] <bela> #2 Increase the timeout: FD_ALL timeout="35000" interval="10000"
[21:36:15] <bela> #3 Set msg_counts_as_heartbeat="true" in FD_ALL. This means, when you haven't received a heartbeat from P, but did receive a message from P, that P's counter is reset to 0, and P won't get suspected
{code}
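
For reference, options #2 and #3 above might look roughly like this in a native JGroups XML protocol stack. This is a sketch only: the attribute names and values are the ones Bela quotes, but the fragment is not taken from the test config, and the standalone.xml used in this test may wrap protocols in a different schema, so the exact syntax there is an assumption.

```xml
<!-- Sketch only: fragment of a native JGroups protocol stack, not the actual test config. -->
<!-- FD_SOCK is left unchanged; per the conversation, the stack already contains it. -->
<FD_SOCK/>
<!-- Option #2: longer timeout and heartbeat interval (values in milliseconds). -->
<!-- Option #3: any message received from a member counts as a heartbeat. -->
<FD_ALL timeout="35000"
        interval="10000"
        msg_counts_as_heartbeat="true"/>
```

With these values a member should only be suspected after roughly three consecutive heartbeats go missing (35 secs) instead of after 10 secs, and a node that is still sending regular traffic under load should not be suspected at all.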

Comment 3 Trustin Lee 2011-07-16 02:19:39 UTC
Michal, could you update the configuration as Bela advised and let me know if it's still a problem?

Comment 4 Michal Linhard 2011-07-18 21:08:35 UTC
This probably had to do with the UDP config on the perf machines.
This issue doesn't occur with the new config:
http://anonsvn.jboss.org/repos/edg/trunk/dist-dir/src/main/resources/standalone/configuration/standalone.xml
(rev 67)

Comment 5 Anne-Louise Tangring 2011-10-11 17:09:33 UTC
Docs QE Status: Removed: NEW