| Summary: | UNICAST sender window not found | ||
|---|---|---|---|
| Product: | [JBoss] JBoss Data Grid 6 | Reporter: | Michal Linhard <mlinhard> |
| Component: | Infinispan | Assignee: | Default User <jbpapp-maint> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 6.0.0 | CC: | jdg-bugs, mlinhard, nobody, trustin |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| URL: | http://jira.jboss.org/jira/browse/EDG-111 | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-07-18 21:09:09 UTC | Type: | Bug |
Description
Michal Linhard
2011-07-15 08:08:34 UTC
This also happens with ispn 5.0.0-SNAPSHOT:
http://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-hotrod-size4/9/console-perf17/consoleText
http://hudson.qa.jboss.com/hudson/job/edg-60-stress-data-rest-size4/2/console-perf17/consoleText

Conversation with Bela:
{code}
[21:13:53] <bela> This means that a member left the cluster
[21:14:23] <bela> perf18 is not part of the cluster; perf 17, 18 and 20 are
[21:14:28] <bela> trustin: ^^
[21:19:47] <trustin> bela: does it mean 'left due to some failure'?
[21:20:30] <trustin> bela: .. like perf18 died.
[21:31:52] <bela> Either perf18 crashed, or it failed to respond to heartbeats and was expelled from the cluster
[21:32:19] <bela> If there is a high load on a system, perhaps due to high CPU, a node might fail to reply (in time)
[21:32:56] <bela> https://svn.devel.redhat.com/repos/jboss-qa/load-testing/etc/edg-60/stress/standalone/configuration/standalone.xml
[21:33:07] <bela> shows that the config has FD_SOCK *and* FD_ALL
[21:33:43] <bela> FD_ALL has no configuration, which means the timeout is 10 secs
[21:34:11] <bela> This means, if you don't get a reply within 10 secs (the heartbeat itself is sent 3 times), then a node will get suspected and excluded
[21:34:40] <bela> I suggest 3 things for this test:
[21:34:54] <bela> #1 Either remove FD_ALL and only rely on FD_SOCK
[21:34:55] <bela> OR
[21:35:25] <bela> #2 Increase the timeout: FD_ALL timeout="35000" interval="10000"
[21:36:15] <bela> #3 Set msg_counts_as_heartbeat="true" in FD_ALL. This means, when you haven't received a heartbeat from P, but did receive a message from P, that P's counter is reset to 0, and P won't get suspected
{code}
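For reference, a minimal sketch of what options #2 and #3 combined might look like, using only the attribute names and values Bela quotes above. It assumes native JGroups stack syntax; the jgroups section of standalone.xml may express protocol properties differently, so treat the element form as an assumption and adapt it to the existing FD_ALL entry.
{code}
<!-- Sketch only: native JGroups stack syntax is assumed; values are the ones quoted in the chat. -->
<!-- #2: longer timeout and heartbeat interval. #3: regular messages from a member count as heartbeats. -->
<FD_ALL timeout="35000"
        interval="10000"
        msg_counts_as_heartbeat="true"/>
{code}
Option #1 would instead drop the FD_ALL entry from the stack and leave FD_SOCK as the only failure detector.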
Michal, could you update the configuration as Bela advised and let me know if it's still a problem?

This probably had to do with the UDP config on the perf machines. The issue doesn't occur with the new config:
http://anonsvn.jboss.org/repos/edg/trunk/dist-dir/src/main/resources/standalone/configuration/standalone.xml (rev 67)

Docs QE Status: Removed: NEW