| Summary: | stripe: truncated copy (ENOSPC) inconsistent with source file | ||
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Brian Foster <bfoster> |
| Component: | glusterfs | Assignee: | Brian Foster <bfoster> |
| Status: | CLOSED UPSTREAM | QA Contact: | shylesh <shmohan> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 2.0 | CC: | aavati, amarts, gluster-bugs, sdharane, surs, vagarwal, vbellur |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-10-03 09:26:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Brian Foster
2012-04-04 19:02:39 UTC
Brian, does your stripe coalesce patch fix this issue? If not, what do you think can fix it? Assigning to you for the time being.

Hi Amar, if I remember right, I ran into this issue when first writing the independent coalesce translator. I initially thought it was something broken in my implementation, but I tested and reproduced with the default stripe graph, so I filed this bug to make sure it wasn't lost. I don't believe it is affected one way or another by coalesce, but I'll try it again on 3.3 when I get a chance...

Well, this is an interesting bug. I can still reproduce it as originally documented, and the issue appears to be generally due to the lack of error propagation with write-behind. With a bit more thorough testing, I can actually reproduce this situation without ever hitting a client-visible ENOSPC at all. That makes the bug a bit more alarming, because ENOSPC at least indicates to the user that the file wasn't copied.

The situation I'm testing is a 3-brick striped volume (single VM, bricks are independent LVM volumes with XFS filesystems). I fill up the majority of the space on the volume with large files, and through some trial and error I'm able to create a situation where the volume reports 1152 (1k) blocks free and copying a 1MB file of random data succeeds according to the client, but hits an unreported short write on the backend. Prior to copying this file, the 3 bricks have slightly differing free block counts. The first stripe is copied successfully across the bricks; the second and third stripes copy successfully to brick indexes 1 and 2, but only 4k is written to the index 0 chunks of stripes 2 and 3. This short write is reported back to cluster/stripe, but the user-generated I/O has already been turned around with a 128k return value at this point. When reading the file back, these unwritten regions are zero-filled and result in the inconsistency with the original file. Without write-behind in the mix, the 4k write bubbles back up to cp, at which point it attempts to write the remaining portion of the buffer, receives ENOSPC and fails accordingly.

Also note that while I don't think the missing-data behavior is relevant for the non-striped case, this unreported short write makes me curious whether the undetected short write/ENOSPC condition is reproducible on a normal DHT volume. I'm also a bit curious whether the stripe issue is still reproducible without write-behind due to the parallelism involved (i.e., what if a large, non-chunk-aligned write spans two bricks and one or both result in short writes or ENOSPC?).

Interestingly enough, I can actually reproduce the inconsistent file situation with a single-brick DHT volume. The test is similar to the one described in comment #4. I can (reportedly) successfully copy a 320k file to a volume with 128k of free space, but unexpectedly, the portion of the file that is copied still contains zeroed-out ranges. This is because the last 128k write returns short, but a subsequent couple of writes return 4k, extending the file. This appears to be related to how XFS handles ENOSPC conditions. Because XFS does a significant amount of preallocation, some of the allocation paths run a flush-then-retry pattern to free up blocks and hopefully satisfy the request. I am clearly hitting the 'xfs_delalloc_enospc' condition in this test, though I can't say exactly why the internal retry presumably fails yet a subsequent write finds a block (and another after that). This may or may not be expected in XFS, but I don't think it is necessarily invalid behavior.
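For context on why the swallowed short write matters: an application like cp relies on the return value of write() to retry the unwritten remainder and to surface ENOSPC. Below is a minimal sketch of that standard loop (illustrative POSIX code under stated assumptions, not gluster or coreutils source); if write-behind acknowledges the full buffer before the backend short write happens, this loop never gets the chance to retry or fail.

```c
#include <errno.h>
#include <unistd.h>

/* Write the entire buffer, retrying after short writes.  Returns 0 on
 * success or -1 with errno set (e.g. ENOSPC).  This is the behavior an
 * application depends on: it only works if the short return value from
 * the backend actually reaches the caller. */
static int write_all(int fd, const char *buf, size_t len)
{
        while (len > 0) {
                ssize_t n = write(fd, buf, len);
                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, retry */
                        return -1;              /* real error, e.g. ENOSPC */
                }
                buf += n;
                len -= (size_t)n;               /* short write: retry the remainder */
        }
        return 0;
}
```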
It took a bit of instrumentation (i.e., instrumenting the glusterfsd writev response order with gdb), but I was able to reproduce a funny situation where a non-aligned 128k write returns -1 (ENOSPC) on node 1 and 64k on node 2, resulting in a 64k-1 return value back up from stripe. This occurs because we crudely sum up the return values from each stripe chunk. It also suggests some strangeness would occur were the first write actually short. E.g., a 128k write broken up into two 64k writes to nodes 1 and 2, where node 1 takes a 32k short write and node 2 writes the full 64k, returns 96k. That would probably cause an application to attempt to rewrite the last 32k rather than the 32k-64k range of the original 128k I/O that wasn't written.

Good point. First, we need to make stripe return the number of contiguous bytes successfully written up to the first failure in logical offset order. This should be treated differently if the fd is opened with O_DIRECT: return a failure instead of a short write. Second, we need to detect short write responses in write-behind and flag an error on the fd, which is propagated on close().

Thanks Avati. That's what I was planning with regard to stripe. It's pretty clear we need to fix up our return handling / error detection there, if nothing else. We still might allow weird situations where a short extending write increases the size of the file beyond what was expected (so even if the app did nothing, a cmp would still fail), but I think fixing that would require issuing such broken-up writes serially. Is that problem worth the trouble? With regard to write-behind, I was initially thinking about issuing a retry, but storing an error for close() sounds more like a standards-compliant thing to do. The manpage says close() can return EIO, so is it sufficient in this case to return EIO and make it the application's responsibility to check/rewrite or otherwise understand that _some_ portion of the file wasn't written?

One fix is posted for review: http://review.gluster.com/3700 This addresses the stripe writev problem. It passes my cp ENOSPC test and unaligned write test (without write-behind enabled). As a general sanity test, it also survives a couple of kernel compiles with a default graph and stripe-coalesce enabled and disabled.

A fix for write-behind is posted for review: http://review.gluster.com/3712 This checks for short writes and pends an EIO error for return on the next fop (write or flush/close). The combination of this change and the one described in comment #9 should ensure that short writes are no longer silent, though it is still possible for extending writes to create holes in a file if the overall request does not succeed. I had a discussion with Avati on IRC and he pointed out that this might also be technically possible without stripe involved (i.e., with write-behind involved, io-threads on the server side can potentially re-order write requests and race with a server crash to cause an inconsistent file). More investigation is required to determine how much of a problem this is.

CHANGE: http://review.gluster.com/3712 (performance/write-behind: detect short writes and pend an EIO error) merged in master by Anand Avati (avati)

CHANGE: http://review.gluster.com/3700 (cluster/stripe: handle short writes and errors in writev callback) merged in master by Anand Avati (avati)
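To illustrate the return-value semantics discussed above, the idea is to report only the contiguous prefix of the original request that was written, stopping at the first failed or short chunk, rather than summing per-chunk return values. This is a hypothetical standalone sketch with made-up types; the actual fix in http://review.gluster.com/3700 operates on gluster's internal stripe/writev structures.

```c
#include <stddef.h>
#include <sys/types.h>

/* Result of one per-brick chunk of a striped writev, listed in logical
 * offset order and assumed contiguous within the original request. */
struct chunk_result {
        size_t  requested;   /* bytes this chunk was asked to write    */
        ssize_t written;     /* bytes actually written, or -1 on error */
};

/* Return the number of contiguous bytes written from the start of the
 * request up to the first error or short write. */
static ssize_t coalesce_stripe_result(const struct chunk_result *res, int nchunks)
{
        ssize_t total = 0;

        for (int i = 0; i < nchunks; i++) {
                if (res[i].written < 0)
                        return total ? total : -1;  /* nothing written: propagate error */
                total += res[i].written;
                if ((size_t)res[i].written < res[i].requested)
                        break;                      /* short write: stop counting here */
        }
        return total;
}
```

With the two examples from the discussion, chunks of {-1, 64k} yield -1 and {32k short, 64k} yield 32k, so an application retrying from the reported offset rewrites the correct range instead of the wrong one.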
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Per discussion with Sudhir/Amar, there is no roadmap for stripe, hence closing it.