Bug 762268 (GLUSTER-536) - fsx tool fails over stripe
Summary: fsx tool fails over stripe
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-536
Product: GlusterFS
Classification: Community
Component: stripe
Version: 3.0.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Amar Tumballi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-01-12 19:08 UTC by Amar Tumballi
Modified: 2015-12-01 16:45 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTA
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Amar Tumballi 2010-01-12 19:08:52 UTC
| On Mon, Jan 4, 2010 at 10:09 PM, Raghavendra Bhat 
| <raghavendrabhat> wrote:
| 
| Hi Amar,
| 
| 
| fsx test failed on stripe. The command executed was "fsx -R -W -N 100 <file on which fsx is tested>". It's giving a segmentation fault.
| 
| -Raghavendra Bhat
|

This test fails on stripe for the following reason.

Imagine the following operations (on a stripe with stripe-size 131072)

write (offset=140000, size=100000)
read (offset=100000, size=60000)

ideally read should return 50000, but in stripe, it returns just 28928 (160000 - 131072), hence the application fails. 
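
For reference, the sequence can be reproduced standalone with something like the sketch below (the mount path and filename are hypothetical; on a POSIX-compliant filesystem the read returns the full 60000 bytes, the first 40000 of them zeros):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char wbuf[100000], rbuf[60000];
    int fd = open("/mnt/stripe/testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    memset(wbuf, 'A', sizeof(wbuf));
    /* write (offset=140000, size=100000): leaves a hole at 0-140000 */
    if (pwrite(fd, wbuf, sizeof(wbuf), 140000) != (ssize_t)sizeof(wbuf)) {
        perror("pwrite"); return 1;
    }
    /* read (offset=100000, size=60000): spans the hole and the data */
    ssize_t n = pread(fd, rbuf, sizeof(rbuf), 100000);
    printf("pread returned %zd\n", n);   /* buggy stripe returned 28928 */
    close(fd);
    return 0;
}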

We have to come up with a way to send at least a 'truncate' to the other subvolumes (where the write is not going), so that the next read returns zeros.

Regards,

Comment 1 Jeff Darcy 2010-01-13 12:56:44 UTC
(In reply to comment #0)
> Imagine the following operations (on a stripe with stripe-size 131072)
> 
> 
> write (offset=140000, size=100000)
> read (offset=100000, size=60000)
> 
> ideally read should return 50000

I assume you mean 60000.

> but in stripe, it returns just 28928 (160000
> - 131072), hence the application fails. 

Do you mean that it returns 28928 bytes starting at offset 31072 in the user's buffer, or that it returns 28928 starting at offset zero?

> We have to come up with a way to send at least a 'truncate' to the other subvolumes
> (where the write is not going), so that the next read returns zeros.

Alternatively, we could assume that a hole exists on any subvolume file reporting a read past EOF (which must be unambiguously signaled as such), if and only if some other subvolume reported a successful read later in the file, and zero-fill ourselves.  This saves the overhead and complexity of issuing the truncate and then reading zeroes over the wire, at some cost in additional complexity when reads hit this (rare) case.
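
As a schematic of that zero-fill alternative (all types and names below are hypothetical illustrations, not the stripe translator's actual code):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* One slice of a striped read: how much was asked of a subvolume
 * and how much it actually returned. */
struct chunk {
    char   *buf;       /* destination slice of the user's buffer */
    size_t  requested; /* bytes asked of this subvolume */
    ssize_t returned;  /* bytes the subvolume returned (or -1 on error) */
};

/* Zero-fill a short read on chunk i if and only if some later chunk
 * returned data, i.e. the gap is a hole rather than a true EOF. */
static void fill_holes(struct chunk *chunks, int nchunks)
{
    for (int i = 0; i < nchunks; i++) {
        if (chunks[i].returned >= (ssize_t)chunks[i].requested)
            continue;
        for (int j = i + 1; j < nchunks; j++) {
            if (chunks[j].returned > 0) {
                size_t got = chunks[i].returned > 0 ?
                             (size_t)chunks[i].returned : 0;
                memset(chunks[i].buf + got, 0, chunks[i].requested - got);
                chunks[i].returned = (ssize_t)chunks[i].requested;
                break;
            }
        }
    }
}

int main(void)
{
    char a[8], b[8];
    memset(b, 'D', sizeof(b));
    struct chunk c[2] = {
        { a, sizeof(a), 0 },         /* short read: subvolume saw EOF */
        { b, sizeof(b), sizeof(b) }, /* later subvolume returned data */
    };
    fill_holes(c, 2);
    printf("chunk 0 now reports %zd bytes (zero-filled hole)\n", c[0].returned);
    return 0;
}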

Comment 2 Amar Tumballi 2010-02-19 23:52:54 UTC
> 
> Alternatively, we could assume that a hole exists on any subvolume file
> reporting a read past EOF (which must be unambiguously signaled as such), if
> and only if some other subvolume reported a successful read later in the file,
> and zero-fill ourselves.  This saves the overhead and complexity of issuing the
> truncate and then reading zeroes over the wire, at some cost in additional
> complexity when reads hit this (rare) case.

Hi Jeff,

Thanks for the suggestions. This is easier and cleaner for sure. Let me check the other scenarios in this case and fix this bug.

Regards,
Amar

Comment 3 Anand Avati 2010-03-01 14:20:56 UTC
PATCH: http://patches.gluster.com/patch/2851 in master (stripe read fix (when read() is done on a sparse file over glusterfs))

Comment 4 Anand Avati 2010-03-01 14:21:00 UTC
PATCH: http://patches.gluster.com/patch/2852 in release-3.0 (stripe read fix (when read() is done on a sparse file over glusterfs))

Comment 5 Jeff Darcy 2010-03-01 19:32:04 UTC
Besides the limitation noted in the code that it won't work for a hole that spans two stripe segments (not nodes), it can also fail when there are multiple holes.  More seriously, I think it can fail in a pretty catastrophic way.  Imagine the following sequence.

- seek(131072)
- write(1000)
- seek(262144)
- write(1000)

That leaves two holes, one from 0-131072 and one from 132072-262144, and a file size of 263144.  Now someone opens the file and tries to read 300000.  This generates three stripe reads - 0-131072, 131072-262144, 262144-300000 - and it's possible that the middle one could complete before the first (the third is irrelevant).  When that middle read completes with 1000, corresponding to the first write, it will set readv_pendingsize to local->readv_size-op_ret=299000.  When the first read subsequently completes, it will attempt to zero-fill.  As part of that effort, it will allocate a 128KB buffer and then use memset to clear readv_pendingsize=299000 bytes, overrunning that 128KB buffer.
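
To make the failure concrete, the arithmetic in isolation (sizes taken from the sequence above; readv_pendingsize mirrors the name in the comment, but the snippet is only illustrative):

#include <stdio.h>

int main(void)
{
    size_t buffer_size = 131072;              /* the 128KB zero-fill buffer */
    size_t readv_pendingsize = 300000 - 1000; /* local->readv_size - op_ret */

    /* memset(buf, 0, readv_pendingsize) would write far past the buffer */
    printf("overrun: %zu bytes past a %zu-byte buffer\n",
           readv_pendingsize - buffer_size, buffer_size);
    return 0;
}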

Comment 6 Amar Tumballi 2010-03-01 22:59:23 UTC
(In reply to comment #5)

Hi Jeff,

I thought of these limitations, and this surely falls under the caveat in the comment there (which says the patch currently doesn't solve the issue of a read spanning more than two nodes).

The situation you mentioned won't happen with today's GlusterFS, as the read (300000, 0) won't reach GlusterFS at all; instead it will arrive as 3 reads, like below:

read (131072, 0);
read (131072, 131072);
read (37856, 262144);

So, all these cases pass. And as long as users set stripe size >= 128KB, there is no chance of a read call spanning more than 2 nodes. That is the intention behind marking this bug as fixed.

-Amar
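
For illustration, the splitting described here as a short sketch (FUSE_MAX_READ is an illustrative constant for the 128KB cap, not an actual kernel symbol); its output matches the three reads listed above:

#include <stdio.h>

#define FUSE_MAX_READ 131072  /* the 128KB cap */

int main(void)
{
    size_t size = 300000, offset = 0;  /* the single application read */

    /* the kernel splits it into requests of at most FUSE_MAX_READ bytes */
    while (size > 0) {
        size_t n = size < FUSE_MAX_READ ? size : FUSE_MAX_READ;
        printf("read (%zu, %zu);\n", n, offset);
        offset += n;
        size -= n;
    }
    return 0;
}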

Comment 7 Jeff Darcy 2010-03-02 10:46:42 UTC
(In reply to comment #6)

Thanks for the explanation, Amar.  I always appreciate the chance to learn more about how GlusterFS works.  What is it that prevents the read of 300000 from reaching GlusterFS, or did you mean to say that it wouldn't reach the stripe translator?

Comment 8 Raghavendra Bhat 2010-03-29 03:10:42 UTC
The bug has reappeared in glusterfs-3.0.4rc2 and master.

Comment 9 Amar Tumballi 2010-03-29 04:26:38 UTC
(In reply to comment #7)
> (In reply to comment #6)
> 
> Thanks for the explanation, Amar.  I always appreciate the chance to learn more
> about how GlusterFS works.  What is it that prevents the read of 300000 from
> reaching GlusterFS, or did you mean to say that it wouldn't reach the stripe
> translator?


Sorry for the delay. Currently, the fuse kernel module breaks the read into the block sizes it supports (currently we fix it at 128KB). Hence none of the GlusterFS translators gets a read() call bigger than 128KB.

Regards,

Comment 10 Anand Avati 2010-03-31 04:08:24 UTC
PATCH: http://patches.gluster.com/patch/3054 in master (stripe readv: proper validation of 'op_ret'.)

Comment 11 Anand Avati 2010-03-31 04:08:29 UTC
PATCH: http://patches.gluster.com/patch/3053 in release-3.0 (stripe readv: proper 'op_ret' validation)

