Bug 762666 (GLUSTER-934) - md5sum mismatch when files are transferred using vsftpd
Summary: md5sum mismatch when files are transferred using vsftpd
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-934
Product: GlusterFS
Classification: Community
Component: write-behind
Version: 3.0.4
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Raghavendra G
QA Contact:
URL:
Whiteboard:
Duplicates: GLUSTER-963 GLUSTER-1060
Depends On:
Blocks:
Reported: 2010-05-19 23:57 UTC by Vikas Gorur
Modified: 2015-12-01 16:45 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Vikas Gorur 2010-05-19 20:59:38 UTC
Configuration generated with:

glusterfs-volgen -r 1 host1:/export host2:/export

Attaching to the quickread translator below write-behind makes the problem go away.

Comment 1 Vikas Gorur 2010-05-19 21:14:10 UTC
Appears to be a race:

[root@brick5 934]# cmp test.4M test.4M.mnt.run3 
test.4M test.4M.mnt.run3 differ: byte 1143921, line 4504

[root@brick5 934]# cmp test.4M test.4M.mnt.run2

[root@brick5 934]# cmp test.4M test.4M.mnt.run1
test.4M test.4M.mnt.run1 differ: byte 2088017, line 8232

test.4M is the original source file, and run[123] are the files that were transferred during 3 successive FTP uploads.

Comment 2 Vikas Gorur 2010-05-19 21:33:03 UTC
For run3:

[root@brick5 934]# cmp test.4M test.4M.mnt.run3 
test.4M test.4M.mnt.run3 differ: byte 1143921, line 4504

From the server trace:

[2010-05-19 17:09:33] N [trace.c:1642:trace_writev] brick1: 776: (*fd=0x2aaaac000fe0, *vector=0x7fff1e47a770, count=1, offset=1143920)

For run5:

[root@brick5 trendmicro]# cmp test.4M.run5 test.4M
test.4M.run5 test.4M differ: byte 315665, line 1211

[2010-05-19 17:28:56] N [trace.c:1642:trace_writev] brick1: 1037: (*fd=0x2aaaac002120, *vector=0x7fff903695b0, count=1, offset=315664)

Off-by-1?
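
A note on the numbers above: cmp numbers bytes starting from 1, whereas the offset in the writev trace is a 0-based file offset. So "differ: byte 1143921" and "offset=1143920" may well refer to the same position, meaning the first bad byte sits exactly at the start of a traced write rather than one byte past it. A minimal Python sketch of the two numbering schemes (first_difference is a hypothetical helper, not part of any tool used in this report; it only compares the common prefix of the two inputs):

```python
def first_difference(a: bytes, b: bytes):
    """Return (zero_based_offset, cmp_style_byte_number) of the first
    mismatching byte in the common prefix of a and b, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            # cmp reports i + 1; a writev/pwrite offset would be i
            return i, i + 1
    return None


if __name__ == "__main__":
    print(first_difference(b"abcd", b"abXd"))
```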

Comment 3 Vikas Gorur 2010-05-19 23:57:42 UTC
When a file is transferred using ftp (ftp server is vsftpd version 2.0.5 on CentOS 5.2) with write-behind loaded, the md5sum on the mountpoint does not match the md5sum on the source.

It appears that the mismatch only happens for files that are larger than 4MB (window-size in this test is 4MB).

c48c9a0c7a162516df5d16c17bd73d78  test.3.99M
c48c9a0c7a162516df5d16c17bd73d78  test.3.99M.mnt
4b5232ac6e400872da52f4af71f4159c  test.4M
cb542550cfb2a72d8343e0f35caaf126  test.4M.mnt

Comment 4 Raghavendra G 2010-05-23 08:41:52 UTC
The minimum configuration required to reproduce this bug is:
fuse->write-behind->replicate->client->server->locks->posix.

The following factors combine to cause the bug:

1. In afr, a write is a transaction rather than a single operation. Hence, if two writes are sent to afr one after another, their order may have changed by the time they leave afr.
2. The maximum size of a write from write-behind is 128KB. Hence, for a window size > 128KB, write-behind may issue more than one write (one after another).
3. For files opened with O_APPEND, a file with holes cannot be created, since writes always happen at the end of the file.
4. vsftpd always opens files with O_APPEND.
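
To put numbers on point 2: with the 4MB window used in this test and a 128KB maximum write size, one window's worth of data turns into 32 separate writes, each of which goes through afr as its own transaction. A minimal sketch of that arithmetic in Python (split_window is a hypothetical helper; how write-behind actually splits its window is not shown in this report, so the even splitting here is an assumption):

```python
MAX_WRITE = 128 * 1024  # maximum write size quoted in point 2


def split_window(start, length, max_write=MAX_WRITE):
    """Return the (offset, size) pairs needed to cover
    [start, start + length) with writes of at most max_write bytes."""
    writes = []
    offset = start
    end = start + length
    while offset < end:
        size = min(max_write, end - offset)
        writes.append((offset, size))
        offset += size
    return writes


if __name__ == "__main__":
    writes = split_window(0, 4 * 1024 * 1024)
    print(len(writes), writes[:2])
```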

Since writes can arrive out of order, files with holes are created along the way (by the time vsftpd finishes writing the file these holes would be filled, since they exist only because of the out-of-order writes from afr). But with posix opening the file with O_APPEND (as passed by vsftpd), each write lands at the current end of file instead of at its intended offset, which corrupts the file.

As a fix, we should remove O_APPEND from the flags passed to open/creat. O_APPEND is redundant anyway, since the offsets are always sent by fuse and we do an lseek before every read/write.
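
The O_APPEND effect described in this comment can be reproduced outside GlusterFS. The following is a minimal sketch in Python, not the posix translator's code: it writes two chunks in reverse order, seeking to each chunk's offset before writing, once without and once with O_APPEND. The helper name write_out_of_order and the 4-byte chunks are made up for illustration.

```python
import os
import tempfile


def write_out_of_order(extra_flags):
    """Write two 4-byte chunks in reverse offset order, doing an lseek to
    the intended offset before each write, and return the file contents."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    fd = os.open(path, os.O_WRONLY | extra_flags)
    try:
        for offset, data in [(4, b"BBBB"), (0, b"AAAA")]:
            os.lseek(fd, offset, os.SEEK_SET)
            # With O_APPEND the kernel moves to end of file before each
            # write, so the lseek above has no effect on where data lands.
            os.write(fd, data)
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        result = f.read()
    os.unlink(path)
    return result


if __name__ == "__main__":
    print(write_out_of_order(0))            # offsets honoured
    print(write_out_of_order(os.O_APPEND))  # chunks land in arrival order
```

Without O_APPEND the second write fills the hole left by the first and the file comes out in offset order; with O_APPEND the chunks end up in arrival order, which is the corruption pattern described above.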

Comment 5 Raghavendra G 2010-05-25 00:39:35 UTC
After probing further into the bug, we found the problem to be in posix-locks. posix-locks uses the address of frame->root (on the server side) as the 'lock-owner'. Once a lock is granted, the request frame is unwound and freed, so the same address can be reused for frame->root in a new request. A new lock request can then carry the same lock-owner (frame->root) as one of the currently held locks and be granted (this is because posix-locks grants inode locks for requests having the same lock-owner as one of the currently granted locks). If other lock requests are issued between the time a lock is granted and the time its frame->root address is reused, afr can issue writes out of order: the lock requests for writes at lower offsets are still waiting, while the lock request with the reused frame->root address is granted.

As a fix, posix-locks should use as the lock-owner some value that is guaranteed to be unique across lock requests.

Comment 6 Raghavendra G 2010-05-25 00:45:51 UTC
Correction:

As a fix, posix-locks should use as the lock-owner some value that is guaranteed to be unique across lock requests, unless the issuer of the lock request really wants the same lock-owner value for different lock requests.

Comment 7 Raghavendra G 2010-05-25 01:00:45 UTC
Another correction. The statement

posix-locks grants inode locks for requests having same lock-owner as that of one of currently granted locks.

should read:

posix-locks MAY or MAY NOT grant inode lock requests having the same lock-owner as one of the currently granted locks, but if a lock request and one of the already granted locks have the same lock-owner, they do not conflict with each other.

In this particular case, there can be only one lock on the file (we are locking the entire file, since afr locks the entire file for files opened with O_APPEND), and its lock-owner is the same as that of the new request, so the new request is granted.
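
The sequence described in comments 5 to 7 can be modelled with a toy simulation. This is a sketch in Python under stated assumptions, not the posix-locks code: FrameAllocator stands in for a memory allocator that hands freed addresses back out, WholeFileLock treats a request as non-conflicting when its owner equals the current holder's owner, and the "unique owner" variant uses a simple counter purely for illustration (the actual patches use the fuse-supplied lock owner instead). All class, function, and label names are hypothetical.

```python
import itertools


class FrameAllocator:
    """Toy allocator that reuses the most recently freed address first."""

    def __init__(self):
        self._next_addr = 0x1000
        self._free = []

    def alloc(self):
        if self._free:
            return self._free.pop()
        addr = self._next_addr
        self._next_addr += 0x100
        return addr

    def free(self, addr):
        self._free.append(addr)


class WholeFileLock:
    """Single whole-file lock; a request with the holder's owner never conflicts."""

    def __init__(self):
        self.holder_owner = None

    def request(self, owner):
        if self.holder_owner is None or self.holder_owner == owner:
            self.holder_owner = owner
            return True
        return False  # a real implementation would queue the request


def simulate(unique_owner):
    """Return the labels of the lock requests that are granted immediately."""
    frames = FrameAllocator()
    serial = itertools.count(1)
    lock = WholeFileLock()
    granted = []

    def try_lock(frame_addr, label):
        owner = next(serial) if unique_owner else frame_addr
        if lock.request(owner):
            granted.append(label)

    frame_a = frames.alloc()
    frame_b = frames.alloc()
    try_lock(frame_a, "write@0")      # granted, lock stays held
    frames.free(frame_a)              # request frame unwound and freed
    try_lock(frame_b, "write@128K")   # conflicts with the held lock
    frame_c = frames.alloc()          # gets frame_a's old address back
    try_lock(frame_c, "write@256K")   # same "owner" as the holder if address-based
    return granted


if __name__ == "__main__":
    print(simulate(unique_owner=False))
    print(simulate(unique_owner=True))
```

With address-based owners the write at 256K is granted ahead of the write at 128K, which is the out-of-order behaviour described above; with owners that are unique per request it waits like any other conflicting request.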

Comment 8 Anand Avati 2010-05-26 04:40:56 UTC
PATCH: http://patches.gluster.com/patch/3307 in master (features/locks: Use fuse supplied lock owner even for internal locks.)

Comment 9 Anand Avati 2010-05-26 04:41:00 UTC
PATCH: http://patches.gluster.com/patch/3306 in release-3.0 (features/locks: Use fuse supplied lock owner even for internal locks.)

Comment 10 Anand Avati 2010-05-26 08:40:12 UTC
PATCH: http://patches.gluster.com/patch/3318 in master (performance/write-behind: explicitly enforce ordering of overlapping writes.)

Comment 11 Anand Avati 2010-05-26 08:49:02 UTC
PATCH: http://patches.gluster.com/patch/3319 in release-3.0 (performance/write-behind: explicitly enforce ordering of overlapping writes.)

Comment 12 Raghavendra G 2010-05-31 08:47:21 UTC
*** Bug 963 has been marked as a duplicate of this bug. ***

Comment 13 Raghavendra G 2010-05-31 08:52:54 UTC
Bug #762695 surfaced because of the patches to write-behind that were supposed to fix this bug. Hence marking #963 as a duplicate and reopening this bug.

Comment 14 Vijay Bellur 2010-07-14 03:09:02 UTC
*** Bug 1060 has been marked as a duplicate of this bug. ***

Comment 15 Raghavendra G 2010-08-20 08:18:32 UTC
A kernel compile on the latest git pull of release-3.0 succeeds. I think we can close this bug.

Comment 16 Anand Avati 2011-01-27 17:18:21 UTC
PATCH: http://patches.gluster.com/patch/6000 in release-3.0 (performance/write-behind: backport write-behind from 3.1)

