Bug 762666 (GLUSTER-934)

Summary:	md5sum mismatch when files are transferred using vsftpd
Product:	[Community] GlusterFS	Reporter:	Vikas Gorur <vikas>
Component:	write-behind	Assignee:	Raghavendra G <raghavendra>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	high	Docs Contact:
Priority:	low
Version:	3.0.4	CC:	gluster-bugs, pavan, rabhat, vijay
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:		Type:	---

Description Vikas Gorur 2010-05-19 20:59:38 UTC

Configuration generated with:

glusterfs-volgen -r 1 host1:/export host2:/export

Attaching to the quickread translator below write-behind makes the problem go away.

Comment 1 Vikas Gorur 2010-05-19 21:14:10 UTC

Appears to be a race:

[root@brick5 934]# cmp test.4M test.4M.mnt.run3 
test.4M test.4M.mnt.run3 differ: byte 1143921, line 4504

[root@brick5 934]# cmp test.4M test.4M.mnt.run2

[root@brick5 934]# cmp test.4M test.4M.mnt.run1
test.4M test.4M.mnt.run1 differ: byte 2088017, line 8232

test.4M is the original source file, and run[123] are the files that were transferred during 3 successive FTP uploads.

Comment 2 Vikas Gorur 2010-05-19 21:33:03 UTC

For run3:

[root@brick5 934]# cmp test.4M test.4M.mnt.run3 
test.4M test.4M.mnt.run3 differ: byte 1143921, line 4504

From the server trace:

[2010-05-19 17:09:33] N [trace.c:1642:trace_writev] brick1: 776: (*fd=0x2aaaac000fe0, *vector=0x7fff1e47a770, count=1, offset=1143920)

For run5:

[root@brick5 trendmicro]# cmp test.4M.run5 test.4M
test.4M.run5 test.4M differ: byte 315665, line 1211

[2010-05-19 17:28:56] N [trace.c:1642:trace_writev] brick1: 1037: (*fd=0x2aaaac002120, *vector=0x7fff903695b0, count=1, offset=315664)

Off-by-1?
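
One possible reading of the numbers above: cmp numbers bytes starting from 1, while the offset in the trace is a 0-based file offset. Under that reading there is no off-by-one, and each mismatch begins exactly at the first byte of a traced writev. A minimal check of the arithmetic (the helper name `cmp_byte_to_offset` is illustrative only):

```python
def cmp_byte_to_offset(byte_number):
    """Convert a 1-based byte number reported by cmp to a 0-based file offset."""
    return byte_number - 1

# (byte reported by cmp, offset seen in the server trace) for run3 and run5
pairs = [(1143921, 1143920), (315665, 315664)]

# True for a pair when the first differing byte is the first byte of that writev
matches = [cmp_byte_to_offset(byte) == offset for byte, offset in pairs]
print(matches)
```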

Comment 3 Vikas Gorur 2010-05-19 23:57:42 UTC

When a file is transferred using FTP (the FTP server is vsftpd version 2.0.5 on CentOS 5.2) with write-behind loaded, the md5sum on the mountpoint does not match the md5sum of the source.

It appears that the mismatch only happens for files that are larger than 4MB (window-size in this test is 4MB).

c48c9a0c7a162516df5d16c17bd73d78  test.3.99M
c48c9a0c7a162516df5d16c17bd73d78  test.3.99M.mnt
4b5232ac6e400872da52f4af71f4159c  test.4M
cb542550cfb2a72d8343e0f35caaf126  test.4M.mnt

Comment 4 Raghavendra G 2010-05-23 08:41:52 UTC

The minimum configuration required to reproduce this bug is:
fuse->write-behind->replicate->client->server->locks->posix.

Following are the causes for the bug:

1. In afr, a write is a transaction rather than a single operation. Hence, if two writes are sent to afr one after another, their order may change by the time they leave afr.
2. The maximum size of a write from write-behind is 128KB. Hence, for a window size > 128KB, write-behind may issue more than one write (one after another).
3. For files opened with O_APPEND, a file with holes cannot be created, since writes always happen at the end of file.
4. vsftpd always opens files with O_APPEND.

Since writes can happen out of order, files with holes are created. (By the time vsftpd finishes writing the file, these holes will have been filled, since they exist only because of out-of-order writes from afr.) However, because posix opens the file with O_APPEND (as passed by vsftpd), each write lands at the current end of file instead of at its intended offset, which corrupts the file.

As a fix, we should remove O_APPEND from the flags passed to open/creat. O_APPEND is redundant in any case, since the offsets are always sent by fuse and we do an lseek before each read/write.
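
The effect of O_APPEND on out-of-order writes can be reproduced outside GlusterFS. The following is a minimal sketch, not GlusterFS code: the helper `write_chunks` and the 4-byte chunks are assumptions made for illustration. It seeks to each offset before writing, as the comment above says posix does, once with and once without O_APPEND.

```python
import os
import tempfile

def write_chunks(path, chunks, extra_flags=0):
    """Write (offset, data) pairs in the given order, seeking before each write."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags
    fd = os.open(path, flags, 0o600)
    try:
        for offset, data in chunks:
            os.lseek(fd, offset, os.SEEK_SET)
            os.write(fd, data)
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read()

# Two writes arriving out of order: the second half of the file comes first.
out_of_order = [(4, b"BBBB"), (0, b"AAAA")]

with tempfile.TemporaryDirectory() as tmp:
    # Without O_APPEND the seek is honoured: a hole is created, then filled.
    plain = write_chunks(os.path.join(tmp, "plain"), out_of_order)
    # With O_APPEND every write goes to the current end of file (POSIX semantics),
    # so the data ends up in arrival order rather than offset order.
    appended = write_chunks(os.path.join(tmp, "append"), out_of_order, os.O_APPEND)

print(plain)     # b'AAAABBBB'
print(appended)  # b'BBBBAAAA'
```

Both files have the same length, which is consistent with the symptom reported here: the size looks right but the checksum does not match.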

Comment 5 Raghavendra G 2010-05-25 00:39:35 UTC

After probing further into the bug, we found the problem to be in posix-locks. posix-locks uses the address of frame->root (on the server side) as the 'lock-owner'. Once a lock is granted, the request frame is unwound and freed, so the same address can be reused for frame->root in a new request. A new lock request can then carry the same lock-owner (frame->root) as one of the currently held locks and be granted (this is because posix-locks grants inode locks for requests having the same lock-owner as one of the currently granted locks). If other lock requests are issued between the time a lock is granted and the time its frame->root address is reused, afr can issue out-of-order writes: the lock requests for writes at lower offsets are still waiting, but the lock request with the reused frame->root address is granted.

As a fix, posix-locks should use, as lock-owner, some value that is guaranteed to be unique across lock requests.

Comment 6 Raghavendra G 2010-05-25 00:45:51 UTC

Correction:

As a fix, posix-locks should use, as lock-owner, some value that is guaranteed to be unique across lock requests, unless the issuer of a lock request really wants the same lock-owner value for different lock requests.

Comment 7 Raghavendra G 2010-05-25 01:00:45 UTC

Another correction. The earlier statement that posix-locks grants inode locks for requests having the same lock-owner as one of the currently granted locks should read:

posix-locks MAY or MAY NOT grant inode lock requests having the same lock-owner as one of the currently granted locks, but if a lock request and one of the already granted locks have the same lock-owner, they do not conflict with each other.

In this particular case, only one lock can be held on the file (afr locks the entire file for files opened with O_APPEND), and its lock-owner is the same as that of the new request, so the new request is granted.
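
The lock-owner reuse described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the posix-locks implementation: `WholeFileLock`, `simulate`, and the addresses 0x1000 and 0x2000 are hypothetical. The only rule modelled is that a request never conflicts with a held lock that has the same lock-owner.

```python
import itertools

class WholeFileLock:
    """Toy whole-file lock: a request with the holder's owner value never conflicts."""

    def __init__(self):
        self.holder = None

    def try_lock(self, owner):
        if self.holder is None or self.holder == owner:
            self.holder = owner
            return True
        return False  # a real implementation would queue the request

def simulate(owners):
    """Issue lock requests in order and record which are granted immediately."""
    lock = WholeFileLock()
    return [lock.try_lock(owner) for owner in owners]

# Owner = frame address. The first frame (0x1000) is freed after its lock is
# granted, and the third request happens to be allocated at the same address.
reused = simulate([0x1000, 0x2000, 0x1000])

# Owner = a value that is unique per request.
ids = itertools.count(1)
unique = simulate([next(ids) for _ in range(3)])

print(reused)  # [True, False, True]: the third request overtakes the second
print(unique)  # [True, False, False]: later requests wait behind the first
```

In the first run the third request is granted while the second is still waiting, which is the reordering that lets a write at a higher offset go ahead of a write at a lower one.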

Comment 8 Anand Avati 2010-05-26 04:40:56 UTC

PATCH: http://patches.gluster.com/patch/3307 in master (features/locks: Use fuse supplied lock owner even for internal locks.)

Comment 9 Anand Avati 2010-05-26 04:41:00 UTC

PATCH: http://patches.gluster.com/patch/3306 in release-3.0 (features/locks: Use fuse supplied lock owner even for internal locks.)

Comment 10 Anand Avati 2010-05-26 08:40:12 UTC

PATCH: http://patches.gluster.com/patch/3318 in master (performance/write-behind: explicitly enforce ordering of overlapping writes.)

Comment 11 Anand Avati 2010-05-26 08:49:02 UTC

PATCH: http://patches.gluster.com/patch/3319 in release-3.0 (performance/write-behind: explicitly enforce ordering of overlapping writes.)

Comment 12 Raghavendra G 2010-05-31 08:47:21 UTC

*** Bug 963 has been marked as a duplicate of this bug. ***

Comment 13 Raghavendra G 2010-05-31 08:52:54 UTC

Bug #762695 surfaced because of the write-behind patches that were supposed to fix this bug. Hence marking #963 as a duplicate and reopening this bug.

Comment 14 Vijay Bellur 2010-07-14 03:09:02 UTC

*** Bug 1060 has been marked as a duplicate of this bug. ***

Comment 15 Raghavendra G 2010-08-20 08:18:32 UTC

A kernel compile on the latest git pull of release-3.0 succeeds. I think we can close this bug.

Comment 16 Anand Avati 2011-01-27 17:18:21 UTC

PATCH: http://patches.gluster.com/patch/6000 in release-3.0 (performance/write-behind: backport write-behind from 3.1)