https://issues.jboss.org/browse/ISPN-2550 Creating a different BZ to separate this from https://bugzilla.redhat.com/show_bug.cgi?id=875151, which has a different cause.
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2550 Tomas' functional issue has now been separated into ISPN-2624, leaving this JIRA fully focused on the situation when the nodes are killed.
Dan Berindei <dberinde> made a comment on jira ISPN-2550 What is the issue in ISPN-2624? The subject looks the same to me :) Michal, yes, the commit https://github.com/danberindei/infinispan/commit/754b9de995221075e14bba7fa459e597bdb16287 was intended to fix the IndexOutOfBoundsException.
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2550 @Dan, ISPN-2624 is a different scenario. Happens when node starts up and one of the nodes is apparently set up for storage only (no Netty endpoint). To avoid confusion, I'm treating it as a different case right now, cos it smells like a misconfiguration. Michal's case is about killing nodes.
Galder Zamarreño <galder.zamarreno> made a comment on jira ISPN-2550 Plus, if there really is an issue when a node joins (as opposed to being killed), your fix won't work and would result in imbalances in the cluster... but let's not make judgements, let's see what ISPN-2624 is about and then we'll talk...
Michal Linhard <mlinhard> made a comment on jira ISPN-2550 The IndexOutOfBoundsException appears independently of Dan's fix, so I created ISPN-2642.
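[Editorial note: the failure mode being tracked here - an IndexOutOfBoundsException when nodes are killed - can be illustrated with a minimal sketch. This is Python, not Infinispan's actual code; the node names, the `backup_index`, and the modulo guard are all illustrative assumptions, not taken from the fix in Dan's commit.]

```python
# Minimal illustration (NOT Infinispan's actual code) of how an owner index
# chosen under the old cluster topology can overrun the owner list once a
# killed node has been removed from it.
owners = ["node0001", "node0002", "node0003"]
backup_index = 2           # picked while all three nodes were still members

owners.remove("node0002")  # node0002 is killed; the list shrinks to 2 entries

try:
    owner = owners[backup_index]   # stale index is now out of bounds
except IndexError:
    # one possible guard: wrap the stale index around the shrunken list
    owner = owners[backup_index % len(owners)]

print(owner)  # -> node0001
```

The point of the sketch is only that any index cached against the pre-kill membership must be revalidated against the post-kill owner list.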
Michal Linhard <mlinhard> made a comment on jira ISPN-2550 I've patched JDG 6.1.0.ER5 by replacing the infinispan-core and infinispan-server-hotrod jars with ones built from Dan's branch, and ran resilience tests in hyperion: http://www.qa.jboss.com/~mlinhard/hyperion3/run0011/report/stats-throughput.png The issues ISPN-2550 and ISPN-2642 didn't appear, but the run still wasn't OK. After the rejoin of killed node0002, all operations were blocked for more than 5 minutes - i.e. zero throughput in the last stage of the test. I'm investigating what happened there.
Dan Berindei <dberinde> made a comment on jira ISPN-2550 @Galder, could we modify the JIRA subject to say this one happens during leave and the other happens during join, then? @Michal, commit https://github.com/danberindei/infinispan/commit/754b9de995221075e14bba7fa459e597bdb16287 should fix ISPN-2642 as well, have you tested it?
Michal Linhard <mlinhard> made a comment on jira ISPN-2550 As I said, ISPN-2642 didn't appear, so it seems to be fixed. I'm now investigating other problems I have with that test run.
Michal Linhard <mlinhard> made a comment on jira ISPN-2550 I've reduced the number of entries in the cache during the test to 5000 1kB entries and got a clean resilience test run: http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/stats-throughput.png Only expected exceptions: http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/loganalysis/server/ There is still a problem with uneven request balancing (ISPN-2632), and with blocking of the whole system after a join when there's more data (5% of heap filled), but that doesn't have to be related to the issues we're discussing here.
Dan Berindei <dberinde> made a comment on jira ISPN-2550 Sorry Michal, I didn't refresh the JIRA page before posting my comment. I'm glad the fix works, I'll try to get a unit test working as well before issuing a PR though.
Dan Berindei <dberinde> updated the status of jira ISPN-2550 to Coding In Progress
Verified for 6.1.0.ER8