1117948 – Members can miss the rebalance cancellation on coordinator change

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Data Grid 6
Classification:	JBoss
Component:	Infinispan
Sub Component:
Version:	6.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	CR3
Target Release:	6.3.0
Assignee:	Tristan Tarrant
QA Contact:	Martin Gencur
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1104045

Reported:	2014-07-09 16:55 UTC by Dan Berindei
Modified:	2015-01-26 14:06 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-01-26 14:06:03 UTC
Type:	Bug
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	ISPN-4490	0	Major	Pull Request Sent	Members can miss the rebalance cancellation on coordinator change	2014-07-15 13:35:29 UTC

Description Dan Berindei 2014-07-09 16:55:44 UTC

The new coordinator first sends a CH_UPDATE command to cancel the existing rebalance, and then a REBALANCE_START command to start a new rebalance. But the CH_UPDATE command is sent asynchronously, so some members can receive it after the REBALANCE_START command.

If that happens, the affected node will assume that it will still receive the segments it requested for the previous rebalance. But with the fix for bug 1116969/ISPN-4484, the provider node cancels its outbound transfer tasks when it receives a CH_UPDATE without a pendingCH, so the state requestor will never receive those segments.
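The sequence above can be sketched as a toy, single-threaded model. This is not Infinispan code: the class and method names (RebalanceRaceSketch, Provider, Requestor, stuckSegments), the topology ids and the segment number are all hypothetical, and the rule that a member drops a command carrying an older topology id is an assumption made for the illustration.

```java
import java.util.Set;
import java.util.TreeSet;

class RebalanceRaceSketch {
    enum Type { CH_UPDATE, REBALANCE_START }

    // Toy provider: tracks the segments it is currently sending to the requestor.
    static final class Provider {
        final Set<Integer> outbound = new TreeSet<>();

        void onCommand(Type type) {
            // In this model every CH_UPDATE has no pendingCH, so it cancels outbound tasks.
            if (type == Type.CH_UPDATE) outbound.clear();
        }

        void onRequest(int segment) { outbound.add(segment); }
    }

    // Toy requestor: tracks the segments it has requested and is still waiting for.
    static final class Requestor {
        final Set<Integer> inbound = new TreeSet<>();
        int topologyId;

        void onCommand(Type type, int id, Set<Integer> wanted, Provider provider) {
            if (id <= topologyId) return; // ASSUMPTION: stale commands are dropped
            topologyId = id;
            if (type == Type.CH_UPDATE) { inbound.clear(); return; }
            // REBALANCE_START: only request segments that are not already in flight.
            for (int segment : wanted) {
                if (inbound.add(segment)) provider.onRequest(segment);
            }
        }
    }

    // Returns the segments the requestor waits for but the provider no longer sends.
    static Set<Integer> stuckSegments(boolean inOrder) {
        Set<Integer> wanted = new TreeSet<>();
        wanted.add(3);
        Provider provider = new Provider();
        Requestor requestor = new Requestor();
        requestor.topologyId = 1;
        requestor.inbound.add(3);  // requested during the previous rebalance
        provider.outbound.add(3);  // provider was serving that request

        // The provider sees the new coordinator's commands in the order they were sent.
        provider.onCommand(Type.CH_UPDATE);
        provider.onCommand(Type.REBALANCE_START);

        if (inOrder) {
            requestor.onCommand(Type.CH_UPDATE, 2, wanted, provider);
            requestor.onCommand(Type.REBALANCE_START, 3, wanted, provider);
        } else {
            // The asynchronous CH_UPDATE arrives after REBALANCE_START.
            requestor.onCommand(Type.REBALANCE_START, 3, wanted, provider);
            requestor.onCommand(Type.CH_UPDATE, 2, wanted, provider);
        }
        Set<Integer> stuck = new TreeSet<>(requestor.inbound);
        stuck.removeAll(provider.outbound);
        return stuck;
    }

    public static void main(String[] args) {
        System.out.println("in order:  " + stuckSegments(true));
        System.out.println("reordered: " + stuckSegments(false));
    }
}
```

In this model the in-order run leaves nothing stuck, while the reordered run leaves segment 3 waiting on a transfer the provider has already cancelled.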

Even without the bug 1116969/ISPN-4484 fix this is a problem, although a less obvious one. Between receiving the CH_UPDATE and the REBALANCE_START commands, the provider node won't have the requestor in its write CH, so the requestor can miss transactions.
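A minimal sketch of that window, assuming the write CH is the union of the current and pending CH while a rebalance is in progress and just the current CH otherwise. The names WriteChGapSketch, writeOwners and writesMissedBy, the node names and the transaction labels are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class WriteChGapSketch {
    // ASSUMPTION: write owners = currentCH plus pendingCH (if a rebalance is running).
    static Set<String> writeOwners(Set<String> currentCH, Set<String> pendingCH) {
        Set<String> owners = new LinkedHashSet<>(currentCH);
        if (pendingCH != null) owners.addAll(pendingCH);
        return owners;
    }

    // Replays three writes as seen by the provider and lists those not sent to 'node'.
    static List<String> writesMissedBy(String node) {
        Set<String> currentCH = new LinkedHashSet<>(Arrays.asList("provider"));
        Set<String> pendingCH = new LinkedHashSet<>(Arrays.asList("provider", "requestor"));
        List<String> missed = new ArrayList<>();

        // tx1: previous rebalance still running, pendingCH is set.
        if (!writeOwners(currentCH, pendingCH).contains(node)) missed.add("tx1");
        // tx2: provider has processed CH_UPDATE (no pendingCH) but not yet REBALANCE_START.
        if (!writeOwners(currentCH, null).contains(node)) missed.add("tx2");
        // tx3: provider has processed REBALANCE_START, pendingCH is set again.
        if (!writeOwners(currentCH, pendingCH).contains(node)) missed.add("tx3");
        return missed;
    }

    public static void main(String[] args) {
        System.out.println("requestor missed: " + writesMissedBy("requestor"));
    }
}
```

A requestor that saw REBALANCE_START before the cancellation keeps the state it already received, so under these assumptions nothing prompts it to fetch the write labelled tx2 again.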

Comment 2 Alan Field 2014-07-15 11:37:32 UTC

Executed the elasticity test in Hyperion 3 times without a failure, and the resilience test 5 times without a failure with JDG 6.3.0 CR4. VERIFIED
