882162 – Segment transfer not restarted if the owner fails


Summary: Segment transfer not restarted if the owner fails

Keywords:
Status:	CLOSED UPSTREAM
Alias:	None
Product:	JBoss Data Grid 6
Classification:	JBoss
Component:	Infinispan
Sub Component:
Version:	6.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ER10
Target Release:	6.1.0
Assignee:	Tristan Tarrant
QA Contact:	Nobody
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-11-30 10:04 UTC by Radim Vansa
Modified:	2025-02-10 03:27 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2025-02-10 03:27:05 UTC
Type:	Bug
Embargoed:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	ISPN-2574	0	Critical	Resolved	Segment transfer not restarted if the owner fails	2014-03-17 08:14:35 UTC

Description Radim Vansa 2012-11-30 10:04:40 UTC

Imagine this situation in a distributed cache with 3 owners:
1) Segment X is owned by nodes A, B, C
2) Node B fails -> CH_UPDATE and then REBALANCE_START are broadcast
3) Node D starts a transfer of segment X from C
4) Node C fails -> another CH_UPDATE is broadcast
5) D handles the CH_UPDATE and removes the transfer of segment X from C, but does not start another transfer from A

addedSegments does not contain the restarted transfer, because all transfers from the write consistent hash are removed from it at the beginning; the segment is considered received here although the transfer is still in progress.
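
The expected behaviour in step 5 can be sketched roughly as follows. This is a simplified model, not the Infinispan code: the class name, the method restartFromLiveOwners and the plain maps standing in for in-flight transfers and segment ownership are all hypothetical. The only point is that a segment whose source has left must be requested again from a surviving owner instead of being treated as received.

```java
import java.util.*;

// Hypothetical model of the scenario above; none of these names are Infinispan APIs.
class SegmentRestartSketch {

    // inFlight: segment id -> node currently sending that segment to us.
    // ownersBySegment: segment id -> nodes that held the segment before the failure.
    // liveMembers: members of the cluster after the latest CH_UPDATE.
    // Returns the source each in-flight segment should be fetched from now.
    static Map<Integer, String> restartFromLiveOwners(
            Map<Integer, String> inFlight,
            Map<Integer, List<String>> ownersBySegment,
            Set<String> liveMembers) {
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, String> e : inFlight.entrySet()) {
            int segment = e.getKey();
            String source = e.getValue();
            if (liveMembers.contains(source)) {
                // Source is still alive: keep the running transfer.
                result.put(segment, source);
                continue;
            }
            // Source left: the segment is NOT received yet, so pick another live owner.
            List<String> owners = ownersBySegment.get(segment);
            if (owners == null) {
                owners = Collections.emptyList();
            }
            for (String owner : owners) {
                if (liveMembers.contains(owner)) {
                    result.put(segment, owner);
                    break;
                }
            }
            // If no owner survived, the segment is left out: there is nobody to ask.
        }
        return result;
    }

    public static void main(String[] args) {
        // Segment 7 plays the role of X: owned by A, B, C; B already failed; D fetches from C.
        Map<Integer, String> inFlight = new HashMap<>();
        inFlight.put(7, "C");
        Map<Integer, List<String>> owners = new HashMap<>();
        owners.put(7, Arrays.asList("A", "B", "C"));
        // C fails as well, so only A and D remain.
        Set<String> live = new HashSet<>(Arrays.asList("A", "D"));
        System.out.println(restartFromLiveOwners(inFlight, owners, live)); // prints {7=A}
    }
}
```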

Comment 1 JBoss JIRA Server 2012-11-30 10:06:27 UTC

Adrian Nistor <anistor> updated the status of jira ISPN-2574 to Coding In Progress

Comment 2 JBoss JIRA Server 2012-11-30 14:37:34 UTC

Dan Berindei <dberinde> made a comment on jira ISPN-2574

Fix integrated in master, leaving the issue open until we add a test as well.

Comment 3 JBoss JIRA Server 2012-12-05 13:33:37 UTC

Tristan Tarrant <ttarrant> made a comment on jira ISPN-2574

Can't we close this issue and create a new one just for the test?

Comment 4 JBoss JIRA Server 2012-12-05 14:15:07 UTC

Mircea Markus <mmarkus> made a comment on jira ISPN-2574

That's if Adrian is confident that it's fixed without a test.

Comment 5 JBoss JIRA Server 2012-12-06 13:34:55 UTC

Adrian Nistor <anistor> made a comment on jira ISPN-2574

Ok, let's close this and create a separate issue for the unit test.

Comment 6 JBoss JIRA Server 2012-12-06 13:40:50 UTC

Adrian Nistor <anistor> made a comment on jira ISPN-2574

Closing this so it can go to QE.

Created a separate issue for the unit test: ISPN-2569

Comment 7 JBoss JIRA Server 2012-12-06 13:41:48 UTC

Adrian Nistor <anistor> made a comment on jira ISPN-2574

Closing this so it can go to QE.

Created a separate issue for the unit test: ISPN-2596

Comment 8 JBoss JIRA Server 2013-01-10 13:38:32 UTC

Michal Linhard <mlinhard> updated the status of jira ISPN-2574 to Reopened

Comment 9 JBoss JIRA Server 2013-01-10 13:38:32 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2574

Adrian, please check out my test case: https://github.com/mlinhard/infinispan/commit/8681c35c95aeba128ae28a1c2aba9609b2b9e2b8

It doesn't work on current master.

The test scenario is a simpler one:
Config: distribution, num owners 2

1. Create cluster {A,B}, fill 1000 entries
2. Join C
3. When B is about to send StateResponseCommand to C, fail B and never send the command
4. C should restart the state transfer and request the same segments from A
5. Cluster {A, C} will form with all segments properly backed up on both A and C

Beta4 didn't restart the state transfer, which meant some entries weren't properly transferred to C
Beta6 did this correctly
Current master again fails to restart the state transfer from A

Comment 10 JBoss JIRA Server 2013-01-10 13:45:59 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2574

On current master this test crashes on this line:
{code}
 final Cache<Object, Object> c2 = cache(2);
{code}

but when I catch the exception (by replacing it with):
{code}
      Cache<Object, Object> aCache = null;
      while (aCache == null) {
         try {
            aCache = cache(2);
         } catch (Exception e) {
            log.error("Problem obtaining cache: ", e);
         }
      }
      final Cache<Object, Object> c2 = aCache;
{code}

I still can't see the StateRequestCommand being sent from C to A (after B is killed)

Comment 11 Michal Linhard 2013-01-10 13:48:19 UTC

This is fine in ER6 but this functionality has changed and might not be fine in ER8....

Comment 12 JBoss JIRA Server 2013-01-10 14:02:46 UTC

Michal Linhard <mlinhard> made a comment on jira ISPN-2574

Just checked, also fails for 5.2.0.CR1

Comment 13 Michal Linhard 2013-01-10 16:40:39 UTC

Fails for ER8

Comment 14 JBoss JIRA Server 2013-01-15 14:35:39 UTC

Adrian Nistor <anistor> made a comment on jira ISPN-2574

Thanks for the unit test! I'm looking at this issue right now. Surefire wants it renamed to *Test. Will rename and integrate it.

Comment 15 JBoss JIRA Server 2013-01-16 14:17:18 UTC

Adrian Nistor <anistor> made a comment on jira ISPN-2574

Two things were wrong:
1. An in-progress task that was fetching from a leaver was replaced by a new task from a new source, but the existing task should also be interrupted; otherwise the transfer thread is blocked forever.

2. The check in StateConsumerImpl.startTransferThread() that prevents two threads from running at the same time was unsafe and could result in no thread running at all, which means tasks pile up in taskQueue but are not processed.
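
The second point is a general concurrency hazard, and one common way to avoid it can be sketched as below. This is not the actual fix that went into StateConsumerImpl: the class SingleDrainerSketch and its methods are hypothetical, and only the name taskQueue is borrowed from the comment above. The sketch guards the worker with an atomic flag and re-checks the queue after clearing the flag, so a task added while the worker is exiting still gets picked up.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of "at most one worker, but never zero while tasks are queued".
class SingleDrainerSketch {
    private final Queue<Runnable> taskQueue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final Executor executor;

    SingleDrainerSketch(Executor executor) {
        this.executor = executor;
    }

    // Queue a task and make sure some worker will process it.
    void submit(Runnable task) {
        taskQueue.add(task);
        startWorkerIfNeeded();
    }

    int pendingTasks() {
        return taskQueue.size();
    }

    private void startWorkerIfNeeded() {
        // compareAndSet lets exactly one caller win, so two workers never run together.
        if (running.compareAndSet(false, true)) {
            executor.execute(this::drain);
        }
    }

    private void drain() {
        try {
            Runnable task;
            while ((task = taskQueue.poll()) != null) {
                task.run();
            }
        } finally {
            running.set(false);
            // A task may have been added after the last poll() but before the flag
            // was cleared; its submit() saw running == true and started nothing.
            // Re-check here so that task is not stranded in the queue.
            if (!taskQueue.isEmpty()) {
                startWorkerIfNeeded();
            }
        }
    }
}
```

Without the re-check in the finally block, a submit() that races with the worker's exit can leave a task queued with no thread to run it, which matches the symptom described in point 2.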

Comment 16 Michal Linhard 2013-01-30 11:09:43 UTC

StateTransferRestartTest that fails in ER9 passes in ER10.

Comment 20 Red Hat Bugzilla 2025-02-10 03:27:05 UTC

This product has been discontinued or is no longer tracked in Red Hat Bugzilla.
