Bug 882162 - Segment transfer not restarted if the owner fails
Summary: Segment transfer not restarted if the owner fails
Keywords:
Status: VERIFIED
Alias: None
Product: JBoss Data Grid 6
Classification: JBoss
Component: Infinispan
Version: 6.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ER10
Target Release: 6.1.0
Assignee: Tristan Tarrant
QA Contact: Nobody
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-11-30 10:04 UTC by Radim Vansa
Modified: 2023-03-02 08:27 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker ISPN-2574 0 Critical Resolved Segment transfer not restarted if the owner fails 2014-03-17 08:14:35 UTC

Description Radim Vansa 2012-11-30 10:04:40 UTC
Imagine this situation in distributed cache with 3 owners:
1) The segment X is owned by nodes A, B, C
2) Node B fails -> CH_UPDATE and then REBALANCE_START are broadcast
3) Node D starts transfer of segment X from C
4) Node C fails -> another CH_UPDATE is broadcast
5) D handles the CH_UPDATE and removes the transfer of segment X from C, but does not start another transfer from A

addedSegments does not contain the segment whose transfer has to be restarted, because every segment already present in the write consistent hash is removed from it at the beginning. The segment is therefore considered received even though its transfer was still in progress.
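
The bookkeeping problem described above can be illustrated with a small standalone sketch. This is not the Infinispan code: the class, method and parameter names below are hypothetical, and segments are modelled as plain integers and nodes as strings. It only shows why subtracting everything that is already in the write consistent hash drops a segment whose source has just left, and one possible way to add it back.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Infinispan.
public class SegmentRestartSketch {

    // Mirrors the reported behaviour: every segment already assigned to this
    // node in the write consistent hash is treated as received.
    static Set<Integer> segmentsToRequestBuggy(Set<Integer> newlyOwned,
                                               Set<Integer> alreadyInWriteCh) {
        Set<Integer> added = new HashSet<>(newlyOwned);
        added.removeAll(alreadyInWriteCh);
        return added;
    }

    // One possible correction: a segment whose in-progress transfer was
    // coming from a node that is no longer a member must be requested again.
    static Set<Integer> segmentsToRequestFixed(Set<Integer> newlyOwned,
                                               Set<Integer> alreadyInWriteCh,
                                               Map<Integer, String> inProgressSource,
                                               Set<String> members) {
        Set<Integer> added = segmentsToRequestBuggy(newlyOwned, alreadyInWriteCh);
        for (Map.Entry<Integer, String> e : inProgressSource.entrySet()) {
            boolean stillOwned = newlyOwned.contains(e.getKey());
            boolean sourceLeft = !members.contains(e.getValue());
            if (stillOwned && sourceLeft) {
                added.add(e.getKey());
            }
        }
        return added;
    }
}
```

With segment X owned by D, an in-progress transfer of X from C, and members {A, D} after C fails, the first method returns an empty set while the second returns X, so D would ask A for it.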

Comment 1 JBoss JIRA Server 2012-11-30 10:06:27 UTC
Adrian Nistor <anistor> updated the status of jira ISPN-2574 to Coding In Progress

Comment 2 JBoss JIRA Server 2012-11-30 14:37:34 UTC
Dan Berindei <dberinde> made a comment on jira ISPN-2574

Fix integrated in master, leaving the issue open until we add a test as well.

Comment 3 JBoss JIRA Server 2012-12-05 13:33:37 UTC
Tristan Tarrant <ttarrant> made a comment on jira ISPN-2574

Can't we close this issue and create a new one just for the test ?

Comment 4 JBoss JIRA Server 2012-12-05 14:15:07 UTC
Mircea Markus <mmarkus> made a comment on jira ISPN-2574

That's if Adrian is confident that it's fixed without a test.

Comment 5 JBoss JIRA Server 2012-12-06 13:34:55 UTC
Adrian Nistor <anistor> made a comment on jira ISPN-2574

Ok, let's close this and create a separate issue for the unit test.

Comment 6 JBoss JIRA Server 2012-12-06 13:40:50 UTC
Adrian Nistor <anistor> made a comment on jira ISPN-2574

Closing this so it can go to QE.

Created a separate issue for the unit test: ISPN-2569

Comment 7 JBoss JIRA Server 2012-12-06 13:41:48 UTC
Adrian Nistor <anistor> made a comment on jira ISPN-2574

Closing this so it can go to QE.

Created a separate issue for the unit test: ISPN-2596

Comment 8 JBoss JIRA Server 2013-01-10 13:38:32 UTC
Michal Linhard <mlinhard> updated the status of jira ISPN-2574 to Reopened

Comment 9 JBoss JIRA Server 2013-01-10 13:38:32 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2574

Adrian please check out my test case: https://github.com/mlinhard/infinispan/commit/8681c35c95aeba128ae28a1c2aba9609b2b9e2b8

it doesn't work for current master.

the test scenario is a simpler one:
config: distribution, num owners 2

1. create cluster {A,B}, fill 1000 entries
2. join C
3. when B is about to send StateResponseCommand to C, fail B, never send the command
4. C should restart the state transfer and ask the same segments from A
5. cluster {A, C} will form with all segments properly backed up on both A and C

Beta4 didn't restart the state transfer, which meant some entries weren't properly transferred to C
Beta6 did this alright
current master again fails to restart the state transfer from A

Comment 10 JBoss JIRA Server 2013-01-10 13:45:59 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2574

On current master this test crashes on this line:
{code}
 final Cache<Object, Object> c2 = cache(2);
{code}

but when I catch the exception (by replacing that line with):
{code}
      Cache<Object, Object> aCache = null;
      while (aCache == null) {
         try {
            aCache = cache(2);
         } catch (Exception e) {
            log.error("Problem obtaining cache: ", e);
         }
      }
      final Cache<Object, Object> c2 = aCache;
{code}

I still can't see the StateRequestCommand being sent from C to A (after B is killed)

Comment 11 Michal Linhard 2013-01-10 13:48:19 UTC
This is fine in ER6 but this functionality has changed and might not be fine in ER8....

Comment 12 JBoss JIRA Server 2013-01-10 14:02:46 UTC
Michal Linhard <mlinhard> made a comment on jira ISPN-2574

Just checked, also fails for 5.2.0.CR1

Comment 13 Michal Linhard 2013-01-10 16:40:39 UTC
Fails for ER8

Comment 14 JBoss JIRA Server 2013-01-15 14:35:39 UTC
Adrian Nistor <anistor> made a comment on jira ISPN-2574

Thanks for the unit test! I'm looking at this issue right now. Surefire wants it renamed to *Test. Will rename and integrate it.

Comment 15 JBoss JIRA Server 2013-01-16 14:17:18 UTC
Adrian Nistor <anistor> made a comment on jira ISPN-2574

Two things were wrong: 
1. An in-progress task that was fetching from a leaver was replaced by a new task from a new source, but the existing task should also be interrupted; otherwise the transfer thread is blocked forever.

2. The check in StateConsumerImpl.startTransferThread() that prevents two threads running at the same time was unsafe and could result in no thread running at all, which means tasks pile up in taskQueue but are not processed.
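
For readers unfamiliar with this kind of race, here is a minimal standalone sketch of the two points above. It is not the StateConsumerImpl code and does not show the actual fix: the class and method names are hypothetical. It assumes a single worker thread draining a queue, uses compareAndSet so that exactly one caller starts the worker, re-checks the queue after the worker clears its flag, and interrupts the worker so a blocked task can give up.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; not the Infinispan implementation.
public class TransferWorkerSketch {
    private final BlockingQueue<Runnable> taskQueue = new LinkedBlockingQueue<>();
    private final AtomicBoolean running = new AtomicBoolean(false);
    private volatile Thread worker;

    public void addTask(Runnable task) {
        taskQueue.add(task);
        startWorkerIfNeeded();
    }

    // Point 1: interrupt the worker so a task blocked on a departed source
    // can abort instead of blocking the transfer thread forever.
    public void interruptCurrentTask() {
        Thread t = worker;
        if (t != null) {
            t.interrupt();
        }
    }

    // Point 2: compareAndSet lets exactly one caller start the worker.
    private void startWorkerIfNeeded() {
        if (running.compareAndSet(false, true)) {
            Thread t = new Thread(this::drain, "transfer-sketch");
            t.setDaemon(true);
            worker = t;
            t.start();
        }
    }

    private void drain() {
        try {
            Runnable task;
            while ((task = taskQueue.poll()) != null) {
                task.run();
            }
        } finally {
            running.set(false);
            // A task may have been queued after the last poll() but before
            // the flag was cleared; without this re-check it would sit in
            // the queue with no thread running.
            if (!taskQueue.isEmpty()) {
                startWorkerIfNeeded();
            }
        }
    }

    // Usage example: the first task blocks until interrupted, after which
    // the queued second task still gets processed.
    static boolean demo() {
        TransferWorkerSketch w = new TransferWorkerSketch();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch neverReleased = new CountDownLatch(1);
        CountDownLatch secondDone = new CountDownLatch(1);
        AtomicBoolean wasInterrupted = new AtomicBoolean(false);
        w.addTask(() -> {
            started.countDown();
            try {
                neverReleased.await();
            } catch (InterruptedException e) {
                wasInterrupted.set(true);
            }
        });
        w.addTask(secondDone::countDown);
        try {
            if (!started.await(5, TimeUnit.SECONDS)) {
                return false;
            }
            w.interruptCurrentTask();
            return secondDone.await(5, TimeUnit.SECONDS) && wasInterrupted.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A plain "if not running, then start" check on a non-atomic flag can let two callers both skip starting the thread, or let the worker exit just as a task is queued; the compare-and-set plus the re-check in the finally block is one common way to close both windows.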

Comment 16 Michal Linhard 2013-01-30 11:09:43 UTC
StateTransferRestartTest that fails in ER9 passes in ER10.

