Bug 853690
Summary: | Having bricks with different sizes can truncate files. | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Vidya Sakar <vinaraya> |
Component: | glusterfs | Assignee: | Pranith Kumar K <pkarampu> |
Status: | CLOSED ERRATA | QA Contact: | Lalatendu Mohanty <lmohanty> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 2.0 | CC: | bfoster, gluster-bugs, jeff.shaw, lmohanty, mailbox, rfortier, rhs-bugs, vbellur, vijay |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | | |
Fixed In Version: | glusterfs-3.4.0qa4-1.el6rhs | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | ---
Clone Of: | GLUSTER-3750 | Environment: | |
Last Closed: | 2013-09-23 22:33:16 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Bug Depends On: | 765482 | ||
Bug Blocks: | | |
Description
Vidya Sakar
2012-09-02 07:12:36 UTC
I reproduce this on a gluster 3.2.4 installation. I create a replica-2 volume across a 90MB brick and a 110MB brick. The size of the mount is reported correctly, but I am able to copy a 100MB file without any indication of error. I can read the file back correctly unless I kill the glusterfsd that stores the complete file, at which point the file differs from the original. Running the same test against the latest upstream code produces an error on copy, which I believe is the expected behavior.

I did some brief testing to fill the volume and validate our error semantics, and they appear to be correct. If I issue the minimal number of 4k I/Os needed to fill up the volume, I cannot reproduce a situation where I receive no error yet the files on the individual bricks do not match.

Further investigation: the first upstream commit that reports an error for this test is c903de38da917239fe905fc6efa1f413d120fc04 (write-behind: implement causal ordering and other cleanup), which makes some sense given that we have seen this kind of issue with write-behind before (on the other hand, how it detects this condition is not immediately clear to me). But this also calls into question the correctness of AFR; i.e., we are relying on a higher-level translator for an error that should probably be contained within AFR itself. If I remove write-behind from the client graph, I can reproduce the original problem on upstream code. It appears we might require a fix here after all.

My initial thought was to bubble up short writes and any errors from an underlying brick on writev_cbk through afr. I had a brief IRC conversation with Jeff about maintaining state in the event of errors (e.g., what is the right thing to do if a write succeeds on one brick and we get EINVAL/EIO on another?). A traditional raid1-like approach might be to attempt a recovery (retry the write) and boot out the child if the retry fails. This would not necessarily result in an I/O error returned to the client, but perhaps in some kind of administrative notification that a recovery is required. This appears to be similar to the state that currently results in the event of an error:

```
-rw-r--r-- 2 root root 188416 Oct 26 13:58 b1/file
-rw-r--r-- 2 root root 196608 Oct 26 13:58 b2/file

# file: b1/file
trusted.afr.test-client-0=0x000000010000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x4e6ed5181bf4486c9a9246d3d32958de

# file: b2/file
trusted.afr.test-client-0=0x000000010000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x4e6ed5181bf4486c9a9246d3d32958de
```

The ENOSPC and short-write conditions are more unique to the filesystem layer in this regard. I think it is somewhat reasonable to return the minimum return value across the set of writes, since existing, sensible utilities (cp, dd) will retry the write. There is some concern about leaving the replicas in a non-synced state here, but this is effectively what already happens today.
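To make the "minimum return value" idea concrete, here is a minimal sketch (illustrative only, not actual GlusterFS code; `writev_reply` and `aggregate_min` are hypothetical names): fold the per-replica writev replies into a single result, where any hard error wins outright and otherwise the shortest successful write is reported, so that well-behaved clients such as cp and dd retry the remainder.

```c
/* Hypothetical sketch of the "minimum return value" policy: aggregate
 * writev replies from each replica and report the smallest byte count
 * (or the first error) back to the client. The names below are
 * illustrative, not GlusterFS APIs. */
#include <stddef.h>
#include <sys/types.h>

struct writev_reply {
    ssize_t op_ret;   /* bytes written, or -1 on error */
    int     op_errno; /* errno when op_ret == -1 */
};

/* Fold per-replica replies into a single result for the client. */
struct writev_reply
aggregate_min(const struct writev_reply *replies, size_t nreplies)
{
    struct writev_reply out = { .op_ret = -1, .op_errno = 0 };

    for (size_t i = 0; i < nreplies; i++) {
        if (replies[i].op_ret < 0) {
            /* Any hard error (e.g. ENOSPC) wins outright. */
            return replies[i];
        }
        if (out.op_ret < 0 || replies[i].op_ret < out.op_ret)
            out = replies[i]; /* keep the shortest successful write */
    }
    return out;
}
```

Under this policy, a short write on the smaller brick surfaces to the client even when the other replica wrote the full request.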
To illustrate the non-synced state: I can reproduce the following if the short write returns first:

```
-rw-r--r-- 2 root root  4096 Oct 26 15:39 b1/file
-rw-r--r-- 2 root root 32768 Oct 26 15:39 b2/file

# file: b1/file
trusted.afr.test-client-0=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x7cec26da62cd45f4b02b8b3705b078a7

# file: b2/file
trusted.afr.test-client-0=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x7cec26da62cd45f4b02b8b3705b078a7
```

I'm getting the ball rolling here with a small change that bubbles up ENOSPC and short writes specifically (excluding other errors). The same test as presented in comment #4 results in an ENOSPC error and the following state:

```
-rw-r--r-- 2 root root 24576 Oct 26 16:44 b1/file
-rw-r--r-- 2 root root 32768 Oct 26 16:44 b2/file

# file: b1/file
trusted.afr.test-client-0=0x000000010000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0xa89ba41a3e0a4673b55d6a75653462f5

# file: b2/file
trusted.afr.test-client-0=0x000000010000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0xa89ba41a3e0a4673b55d6a75653462f5
```

I think this is an improvement over the current situation, but it may not be comprehensive. If a client did not actually retry a short write, we would still be left in the same, potentially corrupted situation. The right approach might actually be to treat a short write as an error (which would be unfortunate if the client does retry the write and it succeeds), but I need to delve into that code a bit more. In the meantime: http://review.gluster.org/4133

After some discussion in the review of the previous proposal, I decided to forgo the short-write propagation approach in favour of a best-case result approach (a rough sketch follows at the end of this comment). This returns the best-case result across the replicas and marks shorter writes as failed, such that a heal is required. An error or a short write relative to the original request is still returned if it is the best-case result. http://review.gluster.org/4144

Test development for this fix uncovered a few additional issues/dependencies:

1. The ability to generate short writes on demand does not currently exist.
2. The ability to use a single graph (i.e., no separate client and server graphs) is broken.
3. Self-heal within a single graph is broken.

Issue 1 is handled by an enhancement to the error-gen translator to support a short write "pseudo-error": http://review.gluster.org/#change,4148

Issue 2 already has a pending fix: http://review.gluster.org/4114

Issue 3 is resolved with a fix to afr's data self-heal fxattrop fop: http://review.gluster.org/#change,4149

... and a repost of the short-write fix described in comment #6, with a test included: http://review.gluster.org/#change,4150

CHANGE: http://review.gluster.org/4148 (debug/error-gen: add the short write pseudo-error) merged in master by Vijay Bellur (vbellur)

CHANGE: http://review.gluster.org/4150 (afr: handle short writes in afr_writev_wind and self-heal to avoid corruption) merged in master by Vijay Bellur (vbellur)
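For contrast with the minimum-result sketch earlier, here is a rough illustration of the best-case approach described above (again hypothetical names, not the code from http://review.gluster.org/4144, which marks failures via pending changelog xattrs rather than a boolean flag): return the largest result across the replicas, and treat every replica that returned less, whether an error or a short write, as failed so that it is flagged for self-heal.

```c
/* Rough illustration of the "best-case result" policy: return the
 * largest successful write across replicas and flag any replica that
 * wrote less (or errored) as out of sync. Illustrative only. */
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

struct replica_reply {
    ssize_t op_ret;     /* bytes written, or -1 on error */
    int     op_errno;
    bool    needs_heal; /* a real translator would set pending xattrs */
};

ssize_t
aggregate_best_case(struct replica_reply *replies, size_t n, int *op_errno)
{
    ssize_t best = -1;
    *op_errno = 0;

    for (size_t i = 0; i < n; i++) {
        if (replies[i].op_ret < 0 && *op_errno == 0)
            *op_errno = replies[i].op_errno; /* remembered in case all fail */
        if (replies[i].op_ret > best)
            best = replies[i].op_ret;
    }
    if (best >= 0)
        *op_errno = 0; /* at least one replica succeeded */

    /* Any replica that returned less than the best case (including
     * errors and short writes) is out of sync and must be healed. */
    for (size_t i = 0; i < n; i++)
        replies[i].needs_heal = (replies[i].op_ret != best);

    return best;
}
```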
I created two bricks: /brick1 with 982MB free on rhsTestNode-1 and /brick2 with 93MB free on rhsTestNode-2, and created a volume "test-volume" using these bricks. I then tried copying a file bigger than the free space available on /brick2. The copy gave an error as expected, and I can see the relevant information in the log files. Hence I am marking this as verified.

Below are the steps I performed. There are two storage nodes, rhsTestNode-1 and rhsTestNode-2, and one client ("root@unused").

```
[root@rhsTestNode-1 ~]# uname -a
Linux rhsTestNode-1 2.6.32-220.30.1.el6.x86_64 #1 SMP Sun Nov 18 15:00:27 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@rhsTestNode-1 ~]# gluster --version
glusterfs 3.4.0qa5 built on Dec 17 2012 04:36:17

[root@rhsTestNode-2 brick2]# uname -a
Linux rhsTestNode-2 2.6.32-279.19.1.el6.x86_64 #1 SMP Sat Nov 24 14:35:28 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@rhsTestNode-2 brick2]# gluster --version
glusterfs 3.4.0qa5 built on Dec 17 2012 04:36:17

[root@rhsTestNode-1 ~]# df -kh /brick1
Filesystem                              Size  Used Avail Use% Mounted on
/dev/mapper/1ATA_QEMU_HARDDISK_QM00001 1014M   33M  982M   4% /brick1

[root@rhsTestNode-2 ~]# df -kh /brick2
Filesystem                              Size  Used Avail Use% Mounted on
/dev/mapper/1ATA_QEMU_HARDDISK_QM00002   98M  5.4M   93M   6% /brick2

[root@rhsTestNode-1 ~]# gluster peer probe 10.70.35.80
peer probe: success
[root@rhsTestNode-1 ~]# gluster peer status
Number of Peers: 1

Hostname: 10.70.35.80
Port: 24007
Uuid: 7c42aa25-32e3-4ec2-bc6e-ecc57ff003f1
State: Peer in Cluster (Connected)

[root@rhsTestNode-1 ~]# gluster volume create test-volume replica 2 10.70.35.63:/brick1 10.70.35.80:/brick2
volume create: test-volume: success: please start the volume to access data
[root@rhsTestNode-1 ~]# gluster volume start test-volume
volume start: test-volume: success
[root@rhsTestNode-1 ~]# gluster volume list
test-volume
```

On the client, mount the volume and attempt the copy:

```
mount -t glusterfs 10.70.35.63:/test-volume /mnt/gluster_test_volume/

[root@unused lalatendu]# df -kh /mnt/gluster_test_volume/
Filesystem                Size  Used Avail Use% Mounted on
10.70.35.63:/test-volume   98M  5.4M   93M   6% /mnt/gluster_test_volume

[root@unused lalatendu]# cp /home/lalatendu/Downloads/debian-6.0.6-amd64-CD-16.iso /mnt/gluster_test_volume/
cp: writing `/mnt/gluster_test_volume/debian-6.0.6-amd64-CD-16.iso': Input/output error
cp: failed to extend `/mnt/gluster_test_volume/debian-6.0.6-amd64-CD-16.iso': Input/output error
```

The client-side log, /var/log/glusterfs/mnt-gluster_test_volume.log, shows the corresponding errors:

```
[2013-01-03 13:00:30.735699] I [afr-common.c:1976:afr_set_root_inode_on_first_lookup] 0-test-volume-replicate-0: added root inode
[2013-01-03 14:17:01.284263] W [fuse-bridge.c:2025:fuse_writev_cbk] 0-glusterfs-fuse: 1512: WRITE => -1 (Input/output error)
[2013-01-03 14:17:02.184562] W [client3_1-fops.c:879:client3_1_writev_cbk] 0-test-volume-client-1: remote operation failed: No space left on device
[2013-01-03 14:17:02.185760] W [client3_1-fops.c:879:client3_1_writev_cbk] 0-test-volume-client-1: remote operation failed: No space left on device
```

File sizes on the mount and on each brick:

```
[root@unused gluster_test_volume]# pwd
/mnt/gluster_test_volume
[root@unused gluster_test_volume]# du -ch debian-6.0.6-amd64-CD-16.iso
94M     debian-6.0.6-amd64-CD-16.iso
94M     total

[root@rhsTestNode-1 ~]# cd /brick1
[root@rhsTestNode-1 brick1]# ls
debian-6.0.6-amd64-CD-16.iso
[root@rhsTestNode-1 brick1]# du -ch debian-6.0.6-amd64-CD-16.iso
94M     debian-6.0.6-amd64-CD-16.iso
94M     total

[root@rhsTestNode-2 brick2]# pwd
/brick2
[root@rhsTestNode-2 brick2]# du -ch debian-6.0.6-amd64-CD-16.iso
93M     debian-6.0.6-amd64-CD-16.iso
93M     total
```
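As a side note on interpreting the trusted.afr values shown in the earlier comments: assuming the AFR changelog layout of three network-byte-order 32-bit counters (data, metadata, and entry pending operations), which matches the 12-byte values dumped above, a small standalone program could decode them. This is hypothetical illustration code, not part of GlusterFS; the path and xattr name below follow this verification setup and would need adjusting for another volume's client IDs.

```c
/* Hedged example: read a brick file's trusted.afr changelog xattr and
 * decode it, assuming the layout of three network-byte-order 32-bit
 * counters (data, metadata, entry pending operations). */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/xattr.h>

int main(void)
{
    const char *path = "/brick1/debian-6.0.6-amd64-CD-16.iso";
    const char *name = "trusted.afr.test-volume-client-1";
    unsigned char buf[12];

    ssize_t len = getxattr(path, name, buf, sizeof(buf));
    if (len != (ssize_t)sizeof(buf)) {
        perror("getxattr");
        return 1;
    }

    uint32_t data, metadata, entry;
    memcpy(&data,     buf,     4);
    memcpy(&metadata, buf + 4, 4);
    memcpy(&entry,    buf + 8, 4);

    /* A non-zero data counter means writes are pending against the
     * named client (i.e. the peer brick), so a self-heal is required. */
    printf("data=%u metadata=%u entry=%u\n",
           ntohl(data), ntohl(metadata), ntohl(entry));
    return 0;
}
```

For example, the 0x000000010000000000000000 values in the earlier dumps decode to data=1, metadata=0, entry=0: one pending data operation, hence a required heal.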
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2013-1262.html