| Summary: | stripe: truncated copy (ENOSPC) inconsistent with source file | ||
|---|---|---|---|
| Product: | Red Hat Gluster Storage | Reporter: | Brian Foster <bfoster> |
| Component: | glusterfs | Assignee: | Brian Foster <bfoster> |
| Status: | CLOSED UPSTREAM | QA Contact: | shylesh <shmohan> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 2.0 | CC: | aavati, amarts, gluster-bugs, sdharane, surs, vagarwal, vbellur |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-10-03 09:26:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Brian Foster
2012-04-04 19:02:39 UTC
Brian, does your stripe coalesce patch fix this issue? If not, what do you think can fix it? Assigning to you for the time being.

Hi Amar, if I remember right, I ran into this issue when first writing the independent coalesce translator. I initially thought it was something broken in my implementation, but I tested and reproduced with the default stripe graph, so I filed this bug to make sure it wasn't lost. I don't believe it is affected one way or another by coalesce, but I'll try it again on 3.3 when I get a chance...

Well, this is an interesting bug. I can still reproduce it as originally documented, and the issue appears to be generally due to the lack of error propagation with write-behind. With a bit more thorough testing, I can actually reproduce this situation without ever hitting a client-visible ENOSPC at all. That makes the bug a bit more alarming, because ENOSPC at least indicates to the user that the file wasn't copied.

The situation I'm testing is a 3-brick striped volume (single VM, bricks are independent LVM volumes with XFS filesystems). I fill up the majority of the space on the volume with large files, and through some trial and error I'm able to create a situation where the volume reports 1152 (1k) blocks free and copying a 1MB file of random data succeeds according to the client, but hits an unreported short write on the backend. Prior to copying this file, the 3 bricks have slightly differing free block counts. The first stripe is copied successfully across the bricks; the second and third stripes copy successfully to brick indexes 1 and 2, but only 4k is written to the index 0 chunks of stripes 2 and 3. This short write is reported back to cluster/stripe, but the user-generated I/O has already been turned around with a 128k return value at this point. When reading the file back, these unwritten regions are zero-filled and result in the inconsistency with the original file. Without write-behind in the mix, the 4k write bubbles back up to cp, at which point it attempts to write the remaining portion of the buffer, receives ENOSPC and fails accordingly.

Also note that while I don't think the missing-data behavior is relevant for the non-striped case, this unreported short write makes me curious whether the undetected short write/ENOSPC condition is reproducible on a normal DHT volume. I'm also a bit curious whether the stripe issue is still reproducible without write-behind due to the parallelism involved (i.e., what if a large, non-chunk-aligned write spans two bricks and one or both result in short writes or ENOSPC?).

Interestingly enough, I can actually reproduce the inconsistent file situation with a single-brick DHT volume. The test is similar to the one described in comment #4. I can (reportedly) successfully copy a 320k file to a volume with 128k of free space, but unexpectedly, the portion of the file that is copied still contains zeroed-out ranges. This is because the last 128k write returns short, but a subsequent couple of writes return 4k, extending the file. This appears to be related to how XFS handles ENOSPC conditions. Because XFS does a significant amount of preallocation, some of the allocation paths run a flush-then-retry pattern to free up blocks and hopefully satisfy the request. I am clearly hitting the 'xfs_delalloc_enospc' condition in this test, though I can't say exactly why the internal retry presumably fails yet a subsequent write finds a block (and another after that). This may or may not be expected in XFS, but I don't think it is necessarily invalid behavior.
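For context on why the swallowed short write matters: an application like cp relies on the return value of write() to retry the unwritten remainder and to surface ENOSPC. Below is a minimal sketch of that standard loop (illustrative POSIX code under stated assumptions, not gluster or coreutils source); if write-behind acknowledges the full buffer before the backend short write happens, this loop never gets the chance to retry or fail.

```c
#include <errno.h>
#include <unistd.h>

/* Write the entire buffer, retrying after short writes.  Returns 0 on
 * success or -1 with errno set (e.g. ENOSPC).  This is the behavior an
 * application depends on: it only works if the short return value from
 * the backend actually reaches the caller. */
static int write_all(int fd, const char *buf, size_t len)
{
        while (len > 0) {
                ssize_t n = write(fd, buf, len);
                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, retry */
                        return -1;              /* real error, e.g. ENOSPC */
                }
                buf += n;
                len -= (size_t)n;               /* short write: retry the remainder */
        }
        return 0;
}
```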
It took a bit of instrumentation (i.e., instrumenting the glusterfsd writev response order with gdb), but I was able to reproduce a funny situation where a non-aligned 128k write returns -1 (ENOSPC) on node 1 and 64k on node 2, resulting in a 64k-1 return value back up from stripe. This occurs because we crudely sum up the return values from each stripe chunk. It also suggests some strangeness would occur were the first write actually short. E.g., a 128k write broken up into two 64k writes to nodes 1 and 2, where node 1 takes a 32k short write and node 2 writes the full 64k, returns 96k. That would probably cause an application to attempt to rewrite the last 32k rather than the 32k-64k range of the original 128k I/O that wasn't written.

Good point. First, we need to make stripe return the number of contiguous bytes successfully written up to the first failure in logical offset order. This should be treated differently if the fd is opened with O_DIRECT: return a failure instead of a short write. Second, we need to detect short write responses in write-behind and flag an error on the fd, which is propagated on close().

Thanks Avati. That's what I was planning with regard to stripe. It's pretty clear we need to fix up our return handling / error detection there, if nothing else. We still might allow weird situations where a short extending write increases the size of the file beyond what was expected (so even if the app did nothing, a cmp would still fail), but I think fixing that would require issuing such broken-up writes serially. Is that problem worth the trouble? With regard to write-behind, I was initially thinking about issuing a retry, but storing an error for close() sounds more like a standards-compliant thing to do. The manpage says close() can return EIO, so is it sufficient in this case to return EIO and make it the application's responsibility to check/rewrite or otherwise understand that _some_ portion of the file wasn't written?

One fix is posted for review: http://review.gluster.com/3700 This addresses the stripe writev problem. It passes my cp ENOSPC test and unaligned write test (without write-behind enabled). As a general sanity test, it also survives a couple of kernel compiles with a default graph and stripe-coalesce enabled and disabled.

A fix for write-behind is posted for review: http://review.gluster.com/3712 This checks for short writes and pends an EIO error for return on the next fop (write or flush/close). The combination of this change and the one described in comment #9 should ensure that short writes are no longer silent, though it is still possible for extending writes to create holes in a file if the overall request does not succeed. I had a discussion with Avati on IRC and he pointed out that this might also be technically possible without stripe involved (i.e., with write-behind involved, io-threads on the server side can potentially re-order write requests and race with a server crash to cause an inconsistent file). More investigation is required to determine how much of a problem this is.

CHANGE: http://review.gluster.com/3712 (performance/write-behind: detect short writes and pend an EIO error) merged in master by Anand Avati (avati)

CHANGE: http://review.gluster.com/3700 (cluster/stripe: handle short writes and errors in writev callback) merged in master by Anand Avati (avati)
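To illustrate the return-value semantics discussed above, the idea is to report only the contiguous prefix of the original request that was written, stopping at the first failed or short chunk, rather than summing per-chunk return values. This is a hypothetical standalone sketch with made-up types; the actual fix in http://review.gluster.com/3700 operates on gluster's internal stripe/writev structures.

```c
#include <stddef.h>
#include <sys/types.h>

/* Result of one per-brick chunk of a striped writev, listed in logical
 * offset order and assumed contiguous within the original request. */
struct chunk_result {
        size_t  requested;   /* bytes this chunk was asked to write    */
        ssize_t written;     /* bytes actually written, or -1 on error */
};

/* Return the number of contiguous bytes written from the start of the
 * request up to the first error or short write. */
static ssize_t coalesce_stripe_result(const struct chunk_result *res, int nchunks)
{
        ssize_t total = 0;

        for (int i = 0; i < nchunks; i++) {
                if (res[i].written < 0)
                        return total ? total : -1;  /* nothing written: propagate error */
                total += res[i].written;
                if ((size_t)res[i].written < res[i].requested)
                        break;                      /* short write: stop counting here */
        }
        return total;
}
```

With the two examples from the discussion, chunks of {-1, 64k} yield -1 and {32k short, 64k} yield 32k, so an application retrying from the reported offset rewrites the correct range instead of the wrong one.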
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Per discussion with Sudhir/Amar, there is no roadmap for stripe, hence closing it.