project_key: EDG

EDG6 Alpha revision 65, Infinispan 5.0.0.CR7.

In the 4-node REST data stress test (https://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-rest-size4/1/console-perf17/consoleText), first there are a lot of these warnings:
{code}
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-14) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-143) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-134) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-58) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
[JBoss] 03:39:08,184 WARNING [org.jgroups.protocols.pbcast.NAKACK] (pool-5-thread-14) perf17-26748: dropped message from perf18-14537 (not in table [perf20-25251, perf17-26748, perf19-35065]), view=[perf17-26748|4] [perf17-26748, perf19-35065, perf20-25251]
{code}
Then we start getting this:
{code}
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-141) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-18) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-140) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-141) perf17-26748: sender window for perf18-14537 not found
[JBoss] 03:40:35,809 SEVERE [org.jgroups.protocols.UNICAST] (pool-5-thread-18) perf17-26748: sender window for perf18-14537 not found
{code}
We use this config: https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/
This also happens with Infinispan 5.0.0-SNAPSHOT:
http://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-hotrod-size4/9/console-perf17/consoleText
http://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-rest-size4/2/console-perf17/consoleText
Conversation with Bela:
{code}
[21:13:53] <bela> This means that a member left the cluster
[21:14:23] <bela> perf18 is not part of the cluster; perf 17, 19 and 20 are
[21:14:28] <bela> trustin: ^^
[21:19:47] <trustin> bela: does it mean 'left due to some failure'?
[21:20:30] <trustin> bela: .. like perf18 died.
[21:31:52] <bela> Either perf18 crashed, or it failed to respond to heartbeats and was expelled from the cluster
[21:32:19] <bela> If there is a high load on a system, perhaps due to high CPU, a node might fail to reply (in time)
[21:32:56] <bela> https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/standalone/configuration/standalone.xml
[21:33:07] <bela> shows that the config has FD_SOCK *and* FD_ALL
[21:33:43] <bela> FD_ALL has no configuration, which means the timeout is 10 secs
[21:34:11] <bela> This means, if you don't get a reply within 10 secs (the heartbeat itself is sent 3 times), then a node will get suspected and excluded
[21:34:40] <bela> I suggest 3 things for this test:
[21:34:54] <bela> #1 Either remove FD_ALL and only rely on FD_SOCK
[21:34:55] <bela> OR
[21:35:25] <bela> #2 Increase the timeout: FD_ALL timeout="35000" interval="10000"
[21:36:15] <bela> #3 Set msg_counts_as_heartbeat="true" in FD_ALL. This means, when you haven't received a heartbeat from P, but did receive a message from P, that P's counter is reset to 0, and P won't get suspected
{code}
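For reference, Bela's suggestions #2 and #3 can be combined into a single change to the FD_ALL protocol entry. Below is only a sketch, assuming the JBoss AS 7 style JGroups subsystem syntax used in the standalone.xml linked above; the exact element names may differ in your revision, but `timeout`, `interval`, and `msg_counts_as_heartbeat` are standard FD_ALL properties:
{code:xml}
<!-- Sketch, not a tested config: combines Bela's options #2 and #3. -->
<protocol type="FD_ALL">
    <!-- #2: suspect a node only after 35s without a heartbeat
         (per Bela, the unconfigured default timeout is 10s),
         and send heartbeats every 10s -->
    <property name="timeout">35000</property>
    <property name="interval">10000</property>
    <!-- #3: any received message resets the sender's heartbeat
         counter, so a busy-but-alive node is not suspected -->
    <property name="msg_counts_as_heartbeat">true</property>
</protocol>
{code}
Alternatively, per option #1, the FD_ALL entry could be removed entirely so that failure detection relies on FD_SOCK alone.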
Michal, could you update the configuration as Bela advised and let me know if it's still a problem?
This probably had to do with the UDP config on the perf machines. The issue doesn't occur with the new config: http://anonsvn.jboss.org/repos/edg/trunk/dist-dir/src/main/resources/standalone/configuration/standalone.xml (rev 67)