1104031 – StateResponse chunk with lastChunk=true from cancelled ST stops receiving data in next ST

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	JBoss Data Grid 6
Classification:	JBoss
Component:	Infinispan
Sub Component:
Version:	6.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ER7
Target Release:	6.3.0
Assignee:	Tristan Tarrant
QA Contact:	Martin Gencur
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1104639

Reported:	2014-06-03 06:48 UTC by Radim Vansa
Modified:	2015-01-26 14:03 UTC
CC List:	4 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-01-26 14:03:22 UTC
Type:	Bug
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	ISPN-4310	0	Critical	Resolved	StateResponse chunk with lastChunk=true from cancelled ST stops receiving data in next ST	2014-07-07 19:00:19 UTC

Description Radim Vansa 2014-06-03 06:48:10 UTC

1. A requests a segment from B (the segment is sent in multiple chunks).
2. B sends all chunks, but before A receives them, a new topology arrives and A cancels the ST.
3. Another topology arrives and A requests the same segment again.
4. A receives the old StateResponseCommand with lastChunk=true and concludes that the segment is complete; it therefore discards all further chunks.

The result is an inconsistent cluster and, after further rebalances, completely lost data.
This ought to be rare, but it was repeatedly observed when gracefully stopping the coordinator on a 32-node cluster full of data.
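
The four steps above can be replayed with a small, self-contained Java sketch. It is not Infinispan code: `Chunk`, `SegmentTransfer`, and the `topologyId` tag on each chunk are hypothetical names introduced for illustration, and the topology check is only one possible guard, not necessarily the fix that was applied for ISPN-4310.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleChunkSketch {

    /** Hypothetical stand-in for one state response chunk (not StateResponseCommand). */
    static final class Chunk {
        final int topologyId;      // assumption: each chunk carries the topology it was sent for
        final boolean lastChunk;
        final List<String> keys;

        Chunk(int topologyId, boolean lastChunk, String... keys) {
            this.topologyId = topologyId;
            this.lastChunk = lastChunk;
            this.keys = Arrays.asList(keys);
        }
    }

    /** Hypothetical receiver-side bookkeeping on node A for one requested segment. */
    static final class SegmentTransfer {
        private final int requestTopologyId;
        private final boolean checkTopology;
        private final List<String> received = new ArrayList<>();
        private boolean complete;

        SegmentTransfer(int requestTopologyId, boolean checkTopology) {
            this.requestTopologyId = requestTopologyId;
            this.checkTopology = checkTopology;
        }

        void onChunk(Chunk chunk) {
            // Possible guard: ignore chunks that answer a request from another topology.
            if (checkTopology && chunk.topologyId != requestTopologyId) {
                return;
            }
            // Once the segment is marked complete, further chunks are discarded (step 4).
            if (complete) {
                return;
            }
            received.addAll(chunk.keys);
            if (chunk.lastChunk) {
                complete = true;
            }
        }

        List<String> received() {
            return received;
        }
    }

    /** Replays steps 3 and 4: A re-requests the segment in topology 2. */
    static List<String> replay(boolean checkTopology) {
        SegmentTransfer transfer = new SegmentTransfer(2, checkTopology);
        // Late last chunk from the ST that A cancelled in topology 1.
        transfer.onChunk(new Chunk(1, true, "k3"));
        // Chunks that B sends for the new request.
        transfer.onChunk(new Chunk(2, false, "k1", "k2"));
        transfer.onChunk(new Chunk(2, true, "k3"));
        return transfer.received();
    }

    public static void main(String[] args) {
        System.out.println("without check: " + replay(false)); // [k3]
        System.out.println("with check:    " + replay(true));  // [k1, k2, k3]
    }
}
```

Without the check, the stale chunk marks the segment complete and `k1` and `k2` never reach A, which matches the data loss described above. With the check, the stale chunk is dropped and the segment is received in full.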

Comment 2 Sanne Grinovero 2014-06-13 16:51:16 UTC

I haven't followed the recent updates on ST, but we discussed many times in the past that an in-flight ST should never be cancelled once started; it should "move forward" to the next view if the view changes again.
Did the design change?

Comment 3 Dan Berindei 2014-06-16 07:21:03 UTC

Yes, Sanne, there are two cases where ST is cancelled: when there is a merge, or when the coordinator leaves the cluster. Cancelling ST in the second case is not strictly necessary; it's done because it's a bit difficult to distinguish between the two. I've created https://issues.jboss.org/browse/ISPN-4404 to work on this.
