Bug 853690

Summary: Having bricks with different sizes can truncate files.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vidya Sakar <vinaraya>
Component: glusterfs
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: Lalatendu Mohanty <lmohanty>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.0
CC: bfoster, gluster-bugs, jeff.shaw, lmohanty, mailbox, rfortier, rhs-bugs, vbellur, vijay
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.4.0qa4-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: GLUSTER-3750
Environment:
Last Closed: 2013-09-23 22:33:16 UTC
Type: ---
Bug Depends On: 765482    
Bug Blocks:    

Description Vidya Sakar 2012-09-02 07:12:36 UTC
+++ This bug was initially created as a clone of Bug #765482 +++

This is a data corruption bug for Gluster on CentOS 6.

If one or more bricks in a replication group are smaller than the others, a file written to the replication group can overflow the smaller bricks. If the file is later read from one of the smaller bricks, the truncated copy is returned with no read error reported to the user; because the missing bytes are unavailable, the read should instead fail and the copy should report an error. The truncated copy is also reported as having the length the file would have had if it were not corrupt.

Here is a real-world example of some commands I ran, with the mount point's log attached. I have a file on two file servers under /brick0; the volume built from those bricks is mounted on gluster0-gw0 as /mnt/test. I've already copied a file to the gluster volume that is too big for gluster0-member0:/brick0.

[root@gluster0-member0 ~]# df -h /brick0
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_gluster0member0-lv_brick0
                      245M  245M     0 100% /brick0
[root@gluster0-member0 ~]# ls -lh /brick0
total 239M
-rwxr--r-- 1 jeff.shaw domain users 332M Sep  9 16:46 debian-live-508-amd64-rescue.iso

[root@gluster0-member1 ~]# df -h /brick0
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_gluster0member1-lv_brick0
                      485M  346M  114M  76% /brick0
[root@gluster0-member1 ~]# ls -lh /brick0
total 332M
-rwxr--r-- 1 15004 15007 332M Sep  9 16:46 debian-live-508-amd64-rescue.iso

[root@gluster0-member1 ~]# umount /brick0

[root@gluster0-gw0 ~]# ls -lh /mnt/test
total 332M
-rwxr--r-- 1 jeff.shaw domain users 332M Sep  9 16:46 debian-live-508-amd64-rescue.iso
[root@gluster0-gw0 ~]# cp /mnt/test/debian-live-508-amd64-rescue.iso .
[root@gluster0-gw0 ~]# ls -lh .
-rwxr--r--  1 root root 332M Oct 21 09:55 debian-live-508-amd64-rescue.iso

Considering that I unmounted the only brick that stored the entire contents of debian-live-508-amd64-rescue.iso, I don't see how this copy can possibly succeed. The gluster file system should fail to read the file.

[root@gluster0-gw0 ~]# md5sum debian-live-508-amd64-rescue.iso
33ff3a930892fcd8df3bebb244a1e99d  debian-live-508-amd64-rescue.iso

[root@gluster0-member1 ~]# mount /brick0
[root@gluster0-member1 ~]# md5sum /brick0/debian-live-508-amd64-rescue.iso
512d97b6da025da413f730a5be7231ef  /brick0/debian-live-508-amd64-rescue.iso

Now, since I've mounted gluster0-member1:/brick0, which has the only good copy of the file, I would hope that the replication translator (or whatever handles this) would read only from that copy. After running md5sum a few times on the file, it appears to be doing what I expect.

I used a known good copy of the file to verify that the correct md5sum is 512d97b6da025da413f730a5be7231ef.
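
For reference, a mismatched-brick setup like this can also be reproduced with loop-backed bricks of different sizes; this is a rough sketch only (sizes, paths and the volume name are illustrative, and any local filesystem will do for the bricks):

# on gluster0-member0
truncate -s 250M /var/tmp/brick0.img
mkfs.xfs /var/tmp/brick0.img
mkdir -p /brick0 && mount -o loop /var/tmp/brick0.img /brick0

# on gluster0-member1
truncate -s 500M /var/tmp/brick0.img
mkfs.xfs /var/tmp/brick0.img
mkdir -p /brick0 && mount -o loop /var/tmp/brick0.img /brick0

# on either server
gluster volume create test replica 2 gluster0-member0:/brick0 gluster0-member1:/brick0
gluster volume start test

# on the client
mount -t glusterfs gluster0-member0:/test /mnt/test

Copying a file larger than the smaller brick into /mnt/test should then exercise the same path as above.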

# gluster --version
glusterfs 3.2.4 built on Sep 30 2011 07:17:57

...

# uname -a
Linux gluster0-group0-brick1 2.6.32-71.el6.x86_64 #1 SMP Fri May 20 03:51:51 BST 2011 x86_64 x86_64 x86_64 GNU/Linux

Comment 2 Brian Foster 2012-10-25 17:51:35 UTC
I can reproduce this on a gluster 3.2.4 installation. I created a rep2 volume across a 90MB brick and a 110MB brick. The size of the mount is reported correctly, but I am able to copy a 100MB file without any indication of error. I can read the file back correctly unless I kill the glusterfsd which stores the complete file, at which point the file differs from the original.

Running the same test against the latest upstream code produces an error on the copy, which I believe is the expected behavior.

I did some brief testing to fill the volume and validate our error semantics, and they appear to be correct. If I fill the volume with the minimal amount of 4k I/Os, I cannot reproduce a situation where I receive no error while the files on the individual bricks do not match.
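
For reference, a fill test of this sort can be driven from the client mount with a single dd issuing 4k writes until the volume runs out of space (path illustrative):

dd if=/dev/zero of=/mnt/test/fill bs=4k oflag=sync

dd keeps writing 4k blocks until a write fails, at which point it should report "No space left on device"; the interesting question for this bug is whether that error actually makes it back to dd.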

Comment 3 Brian Foster 2012-10-25 19:31:14 UTC
Further investigation... the first upstream commit that dumps an error for this test is: 

c903de38da917239fe905fc6efa1f413d120fc04 write-behind: implement causal ordering and other cleanup

... which makes some sense given we've seen this kind of issue with write-behind before (on the other hand, how it detects this condition is not immediately clear to me). But this also calls into question the correctness of AFR (i.e., we're relying on a higher-level translator to surface an error that should probably be handled within AFR itself). If I remove write-behind from the client graph, I can reproduce the original problem on upstream code. It appears we might require a fix here after all...
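
(For reference, on a test volume write-behind can also be taken out of the client graph without hand-editing volfiles by turning the option off, e.g. for a volume named "test":

gluster volume set test performance.write-behind off

which makes it easy to compare behaviour with and without the translator in place.)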

Comment 4 Brian Foster 2012-10-26 20:36:30 UTC
My initial thought here was to bubble up short writes and any errors from an underlying brick on writev_cbk through afr. I had a brief IRC conversation with Jeff with regard to maintaining state in the event of errors (e.g., what's the right thing to do if a write is successful on one brick and we get EINVAL/EIO on another?). A traditional raid1-like approach might be to attempt a recovery (retry the write) and boot out the child if it fails. This would not necessarily result in an I/O error returned to the client, but perhaps some kind of administrative notification that a recovery is required. This appears similar to the state that currently results in the event of an error:

-rw-r--r-- 2 root root 188416 Oct 26 13:58 b1/file
-rw-r--r-- 2 root root 196608 Oct 26 13:58 b2/file
# file: b1/file
trusted.afr.test-client-0=0x000000010000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x4e6ed5181bf4486c9a9246d3d32958de

# file: b2/file
trusted.afr.test-client-0=0x000000010000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x4e6ed5181bf4486c9a9246d3d32958de

The ENOSPC and short write conditions are more specific to the filesystem layer in this regard. I think it is somewhat reasonable to return the minimum return value across the set of writes, since existing, sensible utilities (cp, dd) will retry the write. There is some concern over leaving the replicas in a non-synced state here, but this is effectively what already happens today. IOW, I can reproduce the following if the short write returns first:

-rw-r--r-- 2 root root  4096 Oct 26 15:39 b1/file
-rw-r--r-- 2 root root 32768 Oct 26 15:39 b2/file
# file: b1/file
trusted.afr.test-client-0=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x7cec26da62cd45f4b02b8b3705b078a7

# file: b2/file
trusted.afr.test-client-0=0x000000000000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0x7cec26da62cd45f4b02b8b3705b078a7
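
(For reference, the per-file xattr state shown above is the kind of output getfattr produces when run directly against the backend bricks, e.g.:

getfattr -d -m . -e hex b1/file b2/file

The trusted.afr.test-client-N keys are AFR's pending changelog counters; non-zero counts indicate operations that self-heal still needs to reconcile.)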

Comment 5 Brian Foster 2012-10-26 21:08:43 UTC
I'm getting the ball rolling here with a small change to bubble up ENOSPC and short writes specifically (excluding other errors). The same test as presented in comment #4 results in an ENOSPC error and the following state:

-rw-r--r-- 2 root root 24576 Oct 26 16:44 b1/file
-rw-r--r-- 2 root root 32768 Oct 26 16:44 b2/file
# file: b1/file
trusted.afr.test-client-0=0x000000010000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0xa89ba41a3e0a4673b55d6a75653462f5

# file: b2/file
trusted.afr.test-client-0=0x000000010000000000000000
trusted.afr.test-client-1=0x000000000000000000000000
trusted.gfid=0xa89ba41a3e0a4673b55d6a75653462f5

I think this is an improvement over the current situation, but it might not be comprehensive. If a client did not actually retry a short write, we'd still be in the same, potentially corrupted situation. The right approach might actually be to treat a short write as if an error occurred (which would be unfortunate if the client does retry the write and it succeeds), but I need to delve into that code a bit more. In the meantime:

http://review.gluster.org/4133

Comment 6 Brian Foster 2012-10-30 17:45:08 UTC
After some discussion in the review of the previous proposal, I decided to forgo the short-write propagation approach in favour of a best-case result approach. This returns the best-case result across the replicas and marks shorter writes as failed, such that a heal is required. An error or short write relative to the original request should still be returned if it is the best-case result.

http://review.gluster.org/4144
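
With this approach the replica that took the short write should be left marked as needing heal; on a test volume that state can be inspected, and a heal kicked off, with the standard commands (volume name illustrative):

gluster volume heal test info
gluster volume heal test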

Comment 7 Brian Foster 2012-11-01 15:19:58 UTC
The test development for this fix introduced a few additional issues/dependencies:

1.) The ability to generate short writes on demand does not currently exist.
2.) The ability to use a single graph (i.e., no separate client and server graphs) is broken.
3.) Self-heal within a single graph is broken.

Issue 1 is handled by an enhancement to the error-gen translator to support a short write "pseudo-error":

http://review.gluster.org/#change,4148

Issue 2 already has a pending fix:

http://review.gluster.org/4114

Issue 3 is resolved with a fix to afr's data self-heal fxattrop fop:

http://review.gluster.org/#change,4149

... and a repost of the short-write fix described in comment #6 with a test included:

http://review.gluster.org/#change,4150

Comment 8 Vijay Bellur 2012-11-20 07:12:40 UTC
CHANGE: http://review.gluster.org/4148 (debug/error-gen: add the short write pseudo-error) merged in master by Vijay Bellur (vbellur)

Comment 9 Vijay Bellur 2012-11-29 17:00:41 UTC
CHANGE: http://review.gluster.org/4150 (afr: handle short writes in afr_writev_wind and self-heal to avoid corruption) merged in master by Vijay Bellur (vbellur)

Comment 10 Lalatendu Mohanty 2013-01-03 10:29:10 UTC
I created two bricks: /brick1 with 982MB free on rhsTestNode-1 and /brick2 with 93MB free on rhsTestNode-2.

Created a volume "test-volume" using these bricks. 

Tried copying a file that is bigger than the free space available on /brick2.

While copying, it gave an error as expected, and I can see the relevant information in the log files.

Hence marking it as verified.
##########################################################################

Below are the steps I performed. 

# There are two storage nodes, rhsTestNode-1 and rhsTestNode-2, and one client, "root@unused".

[root@rhsTestNode-1 ~]# uname -a
Linux rhsTestNode-1 2.6.32-220.30.1.el6.x86_64 #1 SMP Sun Nov 18 15:00:27 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

[root@rhsTestNode-1 ~]# gluster --version
glusterfs 3.4.0qa5 built on Dec 17 2012 04:36:17

[root@rhsTestNode-2 brick2]# uname -a
Linux rhsTestNode-2 2.6.32-279.19.1.el6.x86_64 #1 SMP Sat Nov 24 14:35:28 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

[root@rhsTestNode-2 brick2]# gluster --version
glusterfs 3.4.0qa5 built on Dec 17 2012 04:36:17

[root@rhsTestNode-1 ~]# df -kh /brick1
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/1ATA_QEMU_HARDDISK_QM00001
                     1014M   33M  982M   4% /brick1

[root@rhsTestNode-2 ~]# df -kh /brick2
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/1ATA_QEMU_HARDDISK_QM00002
                       98M  5.4M   93M   6% /brick2


[root@rhsTestNode-1 ~]# gluster peer probe 10.70.35.80
peer probe: success
[root@rhsTestNode-1 ~]# gluster peer status
Number of Peers: 1

Hostname: 10.70.35.80
Port: 24007
Uuid: 7c42aa25-32e3-4ec2-bc6e-ecc57ff003f1
State: Peer in Cluster (Connected)
[root@rhsTestNode-1 ~]# gluster volume create test-volume replica 2 10.70.35.63:/brick1 10.70.35.80:/brick2
volume create: test-volume: success: please start the volume to access data
[root@rhsTestNode-1 ~]# gluster volume start test-volume
volume start: test-volume: success
[root@rhsTestNode-1 ~]# gluster volume list
test-volume

####################################################################

mount -t glusterfs 10.70.35.63:/test-volume /mnt/gluster_test_volume/

[root@unused lalatendu]# df -kh /mnt/gluster_test_volume/
Filesystem                Size  Used Avail Use% Mounted on
10.70.35.63:/test-volume   98M  5.4M   93M   6% /mnt/gluster_test_volume


[root@unused lalatendu]# cp /home/lalatendu/Downloads/debian-6.0.6-amd64-CD-16.iso /mnt/gluster_test_volume/
cp: writing `/mnt/gluster_test_volume/debian-6.0.6-amd64-CD-16.iso': Input/output error
cp: failed to extend `/mnt/gluster_test_volume/debian-6.0.6-amd64-CD-16.iso': Input/output error


On the client side, in /var/log/glusterfs/mnt-gluster_test_volume.log:

[2013-01-03 13:00:30.735699] I [afr-common.c:1976:afr_set_root_inode_on_first_lookup] 0-test-volume-replicate-0: added root inode
[2013-01-03 14:17:01.284263] W [fuse-bridge.c:2025:fuse_writev_cbk] 0-glusterfs-fuse: 1512: WRITE => -1 (Input/output error)
[2013-01-03 14:17:02.184562] W [client3_1-fops.c:879:client3_1_writev_cbk] 0-test-volume-client-1: remote operation failed: No space left on device
[2013-01-03 14:17:02.185760] W [client3_1-fops.c:879:client3_1_writev_cbk] 0-test-volume-client-1: remote operation failed: No space left on device


[root@unused gluster_test_volume]# pwd
/mnt/gluster_test_volume
[root@unused gluster_test_volume]# du -ch debian-6.0.6-amd64-CD-16.iso 
94M	debian-6.0.6-amd64-CD-16.iso
94M	total


#########################

[root@rhsTestNode-1 ~]# cd /brick1
[root@rhsTestNode-1 brick1]# ls
debian-6.0.6-amd64-CD-16.iso
[root@rhsTestNode-1 brick1]# du -ch debian-6.0.6-amd64-CD-16.iso 
94M	debian-6.0.6-amd64-CD-16.iso
94M	total


[root@rhsTestNode-2 brick2]# pwd
/brick2
[root@rhsTestNode-2 brick2]# du -ch debian-6.0.6-amd64-CD-16.iso 
93M	debian-6.0.6-amd64-CD-16.iso
93M	total
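
Since the copy was aborted on the error, the on-brick copies above are only partial; the important part is that the failure was reported to the client. As an additional cross-check, the brick copies can be compared against the source and the pending-heal state inspected (commands for this setup, run on the hosts noted):

md5sum /home/lalatendu/Downloads/debian-6.0.6-amd64-CD-16.iso    # on the client
md5sum /brick1/debian-6.0.6-amd64-CD-16.iso                      # on rhsTestNode-1
md5sum /brick2/debian-6.0.6-amd64-CD-16.iso                      # on rhsTestNode-2
gluster volume heal test-volume info                             # on either storage node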

Comment 13 Scott Haines 2013-09-23 22:33:16 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html