Imagine this situation in a distributed cache with 3 owners:
1) Segment X is owned by nodes A, B and C.
2) Node B fails -> CH_UPDATE and then REBALANCE_START are broadcast.
3) Node D starts the transfer of segment X from C.
4) Node C fails -> another CH_UPDATE is broadcast.
5) D handles the CH_UPDATE and removes the transfer of segment X from C, but does not start another transfer from A.
The addedSegments does not contain the segment whose transfer must be restarted, because all transfers from the write consistent hash are removed from it at the beginning: the segment is considered received here although the transfer is still in progress.
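The bookkeeping problem described above can be illustrated with a minimal sketch. It is not Infinispan code: the class SegmentRestartSketch, its methods and the segment number are hypothetical, and it only models the ordering the description calls for. Transfers whose source has left are dropped first, and only then are in-flight segments subtracted, so a segment that was being fetched from a leaver is requested again instead of being treated as received.

```java
import java.util.*;

// Hypothetical model of one node's inbound transfer bookkeeping (not Infinispan code).
class SegmentRestartSketch {
    // source node -> segments currently being fetched from that node
    private final Map<String, Set<Integer>> transfersBySource = new HashMap<>();

    void startTransfer(String source, int segment) {
        transfersBySource.computeIfAbsent(source, s -> new TreeSet<>()).add(segment);
    }

    // Returns the segments that still have to be requested after a membership change.
    Set<Integer> segmentsToRequest(Set<Integer> wanted, Set<Integer> received, Set<String> members) {
        // 1. Drop transfers whose source is no longer a member.
        transfersBySource.keySet().removeIf(source -> !members.contains(source));
        // 2. Only now exclude segments already received or still in flight from a live source.
        Set<Integer> toRequest = new TreeSet<>(wanted);
        toRequest.removeAll(received);
        for (Set<Integer> inFlight : transfersBySource.values()) {
            toRequest.removeAll(inFlight);
        }
        return toRequest;
    }

    public static void main(String[] args) {
        SegmentRestartSketch d = new SegmentRestartSketch();
        d.startTransfer("C", 7); // D is fetching segment 7 from C
        Set<Integer> wanted = new TreeSet<>(Arrays.asList(7));
        Set<String> membersAfterCLeft = new HashSet<>(Arrays.asList("A", "D"));
        // Segment 7 must be requested again, now from a surviving owner.
        System.out.println(d.segmentsToRequest(wanted, new TreeSet<>(), membersAfterCLeft));
    }
}
```

If step 2 ran before step 1, segment 7 would be subtracted as "in flight" and then its only transfer would be cancelled, which is the situation reported in this issue.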
Adrian Nistor <anistor> updated the status of jira ISPN-2574 to Coding In Progress
Dan Berindei <dberinde> made a comment on jira ISPN-2574 Fix integrated in master, leaving the issue open until we add a test as well.
Tristan Tarrant <ttarrant> made a comment on jira ISPN-2574 Can't we close this issue and create a new one just for the test?
Mircea Markus <mmarkus> made a comment on jira ISPN-2574 That's if Adrian is confident that it's fixed without a test.
Adrian Nistor <anistor> made a comment on jira ISPN-2574 Ok, let's close this and create a separate issue for the unit test.
Adrian Nistor <anistor> made a comment on jira ISPN-2574 Closing this so it can go to QE. Created a separate issue for the unit test: ISPN-2569
Adrian Nistor <anistor> made a comment on jira ISPN-2574 Closing this so it can go to QE. Created a separate issue for the unit test: ISPN-2596
Michal Linhard <mlinhard> updated the status of jira ISPN-2574 to Reopened
Michal Linhard <mlinhard> made a comment on jira ISPN-2574 Adrian, please check out my test case: https://github.com/mlinhard/infinispan/commit/8681c35c95aeba128ae28a1c2aba9609b2b9e2b8 It doesn't work on current master. The test scenario is a simpler one:
config: distribution, num owners 2
1. create cluster {A,B}, fill 1000 entries
2. join C
3. when B is about to send StateResponseCommand to C, fail B, never send the command
4. C should restart the state transfer and ask for the same segments from A
5. cluster {A, C} will form with all segments properly backed up on both A and C
Beta4 didn't restart the state transfer, which meant some entries weren't properly transferred to C.
Beta6 did this correctly.
Current master again fails to restart the state transfer from A.
Michal Linhard <mlinhard> made a comment on jira ISPN-2574 On current master this test crashes on this line:
{code}
final Cache<Object, Object> c2 = cache(2);
{code}
but when I catch the exception (by replacing that line with):
{code}
// Retry until the cache on node C can be obtained
Cache<Object, Object> aCache = null;
while (aCache == null) {
   try {
      aCache = cache(2);
   } catch (Exception e) {
      log.error("Problem obtaining cache: ", e);
   }
}
final Cache<Object, Object> c2 = aCache;
{code}
I still can't see the StateRequestCommand being sent from C to A (after B is killed).
This is fine in ER6 but this functionality has changed and might not be fine in ER8....
Michal Linhard <mlinhard> made a comment on jira ISPN-2574 Just checked, also fails for 5.2.0.CR1
Fails for ER8
Adrian Nistor <anistor> made a comment on jira ISPN-2574 Thanks for the unit test! I'm looking at this issue right now. Surefire wants it renamed to *Test. Will rename and integrate it.
Adrian Nistor <anistor> made a comment on jira ISPN-2574 Two things were wrong:
1. An in-progress task that was fetching from a leaver was replaced by a new task from a new source, but the existing task should also be interrupted; otherwise the transfer thread is blocked forever.
2. The check in StateConsumerImpl.startTransferThread() that prevents two threads from running at the same time was unsafe and could result in no thread running at all, which means tasks pile up in taskQueue but are not processed.
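The second point is a classic "at most one worker, but never zero while work is queued" problem. The following is a minimal sketch of one common way to make such a check safe. It is not the actual fix in StateConsumerImpl: the class SingleWorkerQueueSketch and its methods are hypothetical, and only the name taskQueue is borrowed from the comment above. The check-and-claim is done atomically with compareAndSet, and the worker re-checks the queue after clearing the flag so a task added in that window is not stranded.

```java
import java.util.Queue;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical single-worker task queue (not Infinispan code).
class SingleWorkerQueueSketch {
    private final Queue<Runnable> taskQueue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean workerRunning = new AtomicBoolean(false);
    private final ExecutorService executor;

    SingleWorkerQueueSketch(ExecutorService executor) {
        this.executor = executor;
    }

    void addTask(Runnable task) {
        taskQueue.add(task);
        startWorkerIfNeeded();
    }

    private void startWorkerIfNeeded() {
        // Atomic check-and-claim: at most one caller wins and starts the worker.
        if (workerRunning.compareAndSet(false, true)) {
            executor.execute(this::drain);
        }
    }

    private void drain() {
        try {
            Runnable task;
            while ((task = taskQueue.poll()) != null) {
                try {
                    task.run();
                } catch (RuntimeException e) {
                    // A failing task must not stop the worker.
                }
            }
        } finally {
            workerRunning.set(false);
        }
        // A task may have been queued after the last poll() but before the flag
        // was cleared; re-check so it is not left unprocessed.
        if (!taskQueue.isEmpty()) {
            startWorkerIfNeeded();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        SingleWorkerQueueSketch q = new SingleWorkerQueueSketch(pool);
        CountDownLatch done = new CountDownLatch(100);
        for (int i = 0; i < 100; i++) {
            q.addTask(done::countDown);
        }
        System.out.println("all processed: " + done.await(5, TimeUnit.SECONDS));
        pool.shutdown();
    }
}
```

Without the final re-check, a producer whose compareAndSet fails just before the worker clears the flag would leave its task in the queue with no thread running, which matches the symptom described in point 2.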
StateTransferRestartTest that fails in ER9 passes in ER10.