Bug 329431 - CIFS with MS-Windows Fileserver: Data loss caused by ignoring quota
Summary: CIFS with MS-Windows Fileserver: Data loss caused by ignoring quota
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
high
high
Target Milestone: ---
: ---
Assignee: Jeff Layton
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2007-10-12 14:39 UTC by Issue Tracker
Modified: 2008-05-21 14:58 UTC (History)
5 users (show)

Fixed In Version: RHBA-2008-0314
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-21 14:58:03 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
'script' output of test run (3.16 KB, application/octet-stream)
2007-10-12 14:45 UTC, Vince Worthington
no flags Details
syslog from running tests (906 bytes, text/plain)
2007-10-12 14:46 UTC, Vince Worthington
no flags Details
strace of cp operation (103.36 KB, text/plain)
2007-10-12 14:48 UTC, Vince Worthington
no flags Details
strace of mv operation (103.98 KB, text/plain)
2007-10-12 14:50 UTC, Vince Worthington
no flags Details
tcpdump of cifs session (4.47 MB, application/octet-stream)
2007-10-12 14:52 UTC, Vince Worthington
no flags Details
sosreport (1.31 MB, application/x-bzip)
2007-10-12 14:53 UTC, Vince Worthington
no flags Details
reproducer program (877 bytes, patch)
2007-10-12 17:39 UTC, Jeff Layton
no flags Details | Diff
upstream patch -- only sync out the file on size changes in cifs_setattr (791 bytes, patch)
2007-10-12 20:44 UTC, Jeff Layton
no flags Details | Diff
proposed upstream patch 2 (5.61 KB, patch)
2007-10-16 12:13 UTC, Jeff Layton
no flags Details | Diff
proposed upstream patch 3 (6.97 KB, patch)
2007-10-25 19:54 UTC, Jeff Layton
no flags Details | Diff
patch -- better handle writeback errors (8.46 KB, patch)
2007-11-15 12:55 UTC, Jeff Layton
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0314 0 normal SHIPPED_LIVE Updated kernel packages for Red Hat Enterprise Linux 5.2 2008-05-20 18:43:34 UTC

Description Issue Tracker 2007-10-12 14:39:03 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2007-10-12 14:39:04 UTC
Description of problem:
If a file is moved to a CIFS share which exceeds the quota on the server, there's no errormessage and after the mv
only a part of the file is on the share. The local file is deleted!

How reproducible:
100%

Steps to Reproduce:
dd if=/dev/urandom of=testfile.dat bs=1M count=500
logger "=== Start =========================="
df -P
ls -l testfile.dat
mv testfile.dat /net/m525017/J/temp/
ls -l testfile.dat
ls -l /net/m525017/J/temp/testfile.dat
df -P
logger "=== End =========================="


Actual results:
Only part of the moved file gets to the cifs share. The other part is lost. No error occurs.

Expected results:
Command should fail with something like "no space left on
device", no data should be lost.

Additional info:
The fileserver is a standard MS-Windows Server 2003 SP1 with no modification of fileservices.

Client mount-options are:
acl,intr,nocase,nouser_xattr,rsize=8192,wsize=8192,soft,uid=$(id -u),gid=$(id -g),dirmode=0770,filemode=0660,credentials=${CREDENTIALS}

This event sent from IssueTracker by vincew  [SEG - Storage]
 issue 134794

Comment 2 Issue Tracker 2007-10-12 14:39:05 UTC
Dominik,

first of all, thank you very much for reporting this. I am trying to
locate the resources in-house to try reproducing the problem. Meanwhile I
do this, do you think you could provide me a tcpdump trace of the network
traffic when this problem is happening ?

It might be worth lowering the quota so that we don't end up collecting
too much network traffic.

I am interested in capturing all SMB/CIFS traffic with the following
command:

  tcpdump -i any -w /tmp/rhsupport-cifs.pcap -s 0 port 139 or port 445

I realize that CIFS is giving write errors which should be enough to make
"mv" suspicious about the success of the operation, but I would like to
see what messages are returned from the Windows server in regards to the
success of the operation.

Thank you very much.

-- Navid

Internal Status set to 'Waiting on Customer'
Status set to: Waiting on Client

This event sent from IssueTracker by vincew  [SEG - Storage]
 issue 134794

Comment 3 Issue Tracker 2007-10-12 14:39:06 UTC
Dominik,

I quickly look at the trace you have sent and it appears clear that the
Windows file-server is returning an NT status "STATUS_DISK_FULL"
starting on frame 2736. The CIFS client continues to send write requests
with an incremental offset instead of retrying, which leads me to think
that it simply is ignoring the error status previously returned.

I will try to reproduce this in our labs and let you know.

Thank you very much.

-- Navid


This event sent from IssueTracker by vincew  [SEG - Storage]
 issue 134794

Comment 4 Issue Tracker 2007-10-12 14:39:07 UTC
Dominik,

the problem appears to be already fixed in FC6 as of kernel
2.6.22.7-57.fc6. I am currently looking at the changelogs and diffs to see
if I can locate where the fix was introduced.

Regards,

-- Navid


This event sent from IssueTracker by vincew  [SEG - Storage]
 issue 134794

Comment 5 Issue Tracker 2007-10-12 14:39:09 UTC
Navid,

this is a severe bug. It should be fixed as soon as possible. If this
doesn't get
in 5.1 we need a intermediate solution.

-- Dominik


This event sent from IssueTracker by vincew  [SEG - Storage]
 issue 134794

Comment 6 Issue Tracker 2007-10-12 14:39:10 UTC
I have done some further testing using the latest available kernel for
RHEL5. Interestingly while "mv" thinks the move was successful and
removes the original file (causing data loss), "cp" is behaving
correctly in my environment.

I have attached both straces for "mv" and "cp" to comapare.

"cp" does the following:

open("./navid.dat", O_WRONLY|O_CREAT|O_LARGEFILE, 0100644) = 4
read(3,
"\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0"...,
16384) = 16384
write(4,
"\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0"...,
16384) = 16384
[...]
read(3, "", 16384)                      = 0
close(4)                                = -1 ENOSPC (No space left on
device)

As opposed to "mv", which doesn't return an error when closing the fd
for the destination file:

open("./navid.dat", O_WRONLY|O_CREAT|O_LARGEFILE, 0100644) = 4
read(3,
"\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0"...,
16384) = 16384
write(4,
"\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0\\0"...,
16384) = 16384
[...]
utimes("/proc/self/fd/4", {1192187215, 0}) = 0
fchown32(4, 0, 0)                       = 0
getxattr("/tmp/navid.dat", "system.posix_acl_access", 0xbfe9aa20, 132)
= -1 ENODATA (No data available)
stat64("/tmp/navid.dat", {st_mode=S_IFREG|0644, st_size=9437184, ...}) =
0
setxattr("./navid.dat", "system.posix_acl_access",
"\\x02\\x00\\x00\\x00\\x01\\x00\\x06\\x00\\xff\\xff\\xff\\xff\\x04\\x00\\x04\\x00\\xff\\xff\\xff\\xff
\\x00\\x04\\x00\\xff\\xff\\xff\\xff", 28, 0) = -1 EOPNOTSUPP (Operation
not supported)
chmod("./navid.dat", 0100644)           = 0
read(3, "", 16384)                      = 0
close(4)                                = 0

The FD is opened the same way in both cases, the only obvious difference
is those ACL related calls (getxattr and setxattrs) which are in between
the last read() and close() of the destination file (which returns 0).

I will do further testing and let you know.

Regards,

-- Navid


This event sent from IssueTracker by vincew  [SEG - Storage]
 issue 134794

Comment 7 Issue Tracker 2007-10-12 14:39:11 UTC
Yes, "cp" works here too.

If the quota is reached on the share (disk is "full") a "mv" gives an
error
message too.

It seems, that mv ignores disk-full messages after it starts writing!?

-- Dominik



This event sent from IssueTracker by vincew  [SEG - Storage]
 issue 134794

Comment 8 Issue Tracker 2007-10-12 14:39:12 UTC
Short additonally note: not only "mv" but also moving files with nautilus
results in data loss.

-- Dominik



This event sent from IssueTracker by vincew  [SEG - Storage]
 issue 134794

Comment 9 Issue Tracker 2007-10-12 14:39:13 UTC
Internal Status set to: 'Waiting on SEG'
Data corruption on RHEL5 using latest kernel, beta and even jlayton's
test kernels.

Very easy to reproduce:

1. Windows box sharing a folder with quotas enforced
2. RHEL5 mounting the share using CIFS
3. dd if=/dev/zero of=/tmp/navid.dat bs=1M count=100
4. mv /tmp/navid.dat /mnt/windows/share

When Windows is returning STATUS_DISK_FULL and CIFS error messages are
visible in the kernel logs, even if the quota limit has been reached the
command will exit successfully and remove the source file, even if the
destination contents are actually truncated.

This doesn't seem to affect "cp", which for some reason correctly
received a ENOSPC when doing a close() on the destination file.

FC6 with 2.6.22.7-57.fc6 doesn't seem to be affected either.

-- Navid


Issue escalated to Support Engineering Group by: navid.


This event sent from IssueTracker by vincew  [SEG - Storage]
 issue 134794

Comment 10 Jeff Layton 2007-10-12 14:44:18 UTC
Ok, I set up a win2k3 export with quotas turned on. I successfully get ENOSPC
errors back for both mv and cp. On a mv, the original file is not unlinked. I'm
testing:

2.6.18-52.el5.jtltest.8xen

...which is pretty much the same CIFS that's going into 5.1 (a couple of extra
patches, but nothing that should affect this).


Comment 11 Vince Worthington 2007-10-12 14:45:39 UTC
Created attachment 225561 [details]
'script' output of test run

Comment 12 Vince Worthington 2007-10-12 14:46:42 UTC
Created attachment 225571 [details]
syslog from running tests

Comment 13 Vince Worthington 2007-10-12 14:48:47 UTC
Created attachment 225581 [details]
strace of cp operation

Comment 14 Jeff Layton 2007-10-12 14:49:36 UTC
For reference, from 2.6.18-52.el5.jtltest.8xen:

write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
read(3, "", 16384)                      = 0
utimes("/proc/self/fd/4", {1192199452, 0}) = 0
fchown(4, 0, 0)                         = 0
getxattr("/tmp/bigfile.txt", "system.posix_acl_access", 0x7ffff3a72a20, 132) =
-1 ENODATA (No data available)
stat("/tmp/bigfile.txt", {st_mode=S_IFREG|0644, st_size=125829120, ...}) = 0
setxattr("/mnt/cifs/bigfile.txt", "system.posix_acl_access",
"\x02\x00\x00\x00\x01\x00\x06\x00\xff\xff\xff\xff\x04\x00\x04\x00\xff\xff\xff\xff
\x00\x04\x00\xff\xff\xff\xff", 28, 0) = -1 EOPNOTSUPP (Operation not supported)
chmod("/mnt/cifs/bigfile.txt", 0100644) = 0
close(4)                                = -1 ENOSPC (No space left on device)
open("/usr/share/locale/locale.alias", O_RDONLY) = 4


...that's from trying to mv a 120M file to a volume with a 100M quota.

Comment 15 Vince Worthington 2007-10-12 14:50:02 UTC
Created attachment 225591 [details]
strace of mv operation

Comment 16 Vince Worthington 2007-10-12 14:52:07 UTC
Created attachment 225601 [details]
tcpdump of cifs session

Comment 17 Vince Worthington 2007-10-12 14:53:46 UTC
Created attachment 225611 [details]
sosreport

Comment 18 Jeff Layton 2007-10-12 15:50:01 UTC
The return code for the close() is actually governed by the return value from
the flush operation. There's a difference in your box versus mine in this situation:

Yours:
 fs/cifs/file.c: CIFS VFS: in cifs_writepages as Xid: 18339 with uid: 0
 fs/cifs/file.c: CIFS VFS: leaving cifs_writepages (xid = 18339) rc = 0
 fs/cifs/file.c: Flush inode e24d71e0 file dfa07d40 rc 0

Mine:
 fs/cifs/file.c: CIFS VFS: in cifs_writepage as Xid: 741191 with uid: 0
 fs/cifs/file.c: write 4096 bytes to offset 21209088 of bigfile.txt
 fs/cifs/file.c: CIFS VFS: in cifs_write as Xid: 741192 with uid: 0
 fs/cifs/transport.c: For smb_command 47
 fs/cifs/transport.c: Sending smb of length 4160
 fs/cifs/connect.c: rfc1002 length 0x36)
 fs/cifs/misc.c: Calculated size 51 vs length 54 mismatch for mid 16707
Status code returned 0xc000007f NT_STATUS_DISK_FULL
 fs/cifs/netmisc.c:  !!Mapping smb error code 112 to POSIX err -28 !!
 fs/cifs/cifssmb.c: Send error in write = -28
 fs/cifs/file.c: CIFS VFS: leaving cifs_write (xid = 741192) rc = -28
 fs/cifs/file.c: CIFS VFS: leaving cifs_writepage (xid = 741191) rc = -28
 fs/cifs/file.c: Flush inode ffff880015c13ac8 file ffff88001e6806c0 rc -28

...so on your machine, it looks like there aren't any actual pages queue up to
be flushed by the time the flush call is done. Since there's nothing queued up,
then the flush doesn't seem to fail.

There are some differences between my test rig and yours. Mine's x86_64, for
instance...


Comment 19 Issue Tracker 2007-10-12 16:36:49 UTC
I tested this using 2.6.18-20.el5 (plus 2 non-related ipsec patches) on an
x86 vmware rhel5 guest.

I was denied once i exceeded quota in the mv operation:

[root@rhel5-test1 ~]# dd if=/dev/urandom of=testfile.dat bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 68.0668 seconds, 3.1 MB/s
[root@rhel5-test1 ~]# ls *dat
-rw-r--r-- 1 root root 209715200 Oct 12 12:30 testfile.dat
[root@rhel5-test1 ~]# mv testfile.dat /mnt/tmp2/
mv: closing `/mnt/tmp2/testfile.dat': No space left on device
[root@rhel5-test1 ~]# ls *dat
-rw-r--r-- 1 root root 209715200 Oct 12 12:30 testfile.dat
[root@rhel5-test1 ~]# ls /mnt/tmp2/*dat
-rw-r--r-- 1 root root 104857600 Oct 12 12:31 /mnt/tmp2/testfile.dat
[root@rhel5-test1 ~]#




This event sent from IssueTracker by vincew 
 issue 134794

Comment 20 Issue Tracker 2007-10-12 16:41:35 UTC
note - /mnt/tmp2 is where the remote w2k3sp1 share was mounted..  I did a
200MB dd file to speed up reproduction (or not) after a few 500MB tries in
which the mv succeeded (because of not having the deny storage to users
with exceeded quota option in w2k3sp1).

Jeff believes we may be seeing a race between flushing of pages and the
close() - to explain why we don't always see this happening.  

--vince


This event sent from IssueTracker by vincew 
 issue 134794

Comment 21 Jeff Layton 2007-10-12 17:39:50 UTC
Created attachment 225861 [details]
reproducer program

I'm pretty sure that this has something to do with a writepages completing
before the close is called, but I'm having a very hard time reproducing this. I
wrote a small test program that writes out some data to a file, does a fsync, a
variable length sleep and then a close. On a normal box:

fsync(3)				= -1 ENOSPC (No space left on device)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x2aaaaaaac000
write(1, "fsync returned -1 errno 28\n", 27fsync returned -1 errno 28
) = 27
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})		= 0
close(3)				= -1 ENOSPC (No space left on device)
write(1, "close returned -1 errno 28\n", 27close returned -1 errno 28
) = 27
exit_group(-1)				= ?

...on navid's machine:

fsync(3)				= -1 ENOSPC (No space left on device)
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0xb7f13000
write(1, "fsync returned -1 errno 28\n", 27fsync returned -1 errno 28
) = 27
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, {1, 0})		= 0
close(3)				= 0
write(1, "close returned 0 errno 28\n", 26close returned 0 errno 28
) = 26
exit_group(0)				= ?

...so on his machine the fsync returns -1 (with errno set to -ENOSPC), but the
close returns 0. On every other machine I've tested, both the fsync and the
close return error. I'm not sure what makes his machine "special" here.

Comment 22 Jeff Layton 2007-10-12 19:35:27 UTC
Ok, I think I've been able to reproduce it. I had to get the share very close to
quota and change the program to only write a little bit more than the quota
allows. So for instance I got the share to quota - 64k, and then changed the
program to only write 128k. With that I was able to see it. Still not certain
why navid's box can reproduce it so easily, but I think there is some raciness
involved and perhaps vmware makes it easier to reproduce.

Now, I think the problem is this (presuming I'm following the code correctly):

When cifs_writepages is called to flush the pagecache for the file, it does this:

                                rc = CIFSSMBWrite2(xid, cifs_sb->tcon,
                                                   open_file->netfid,
                                                   bytes_to_write, offset,
                                                   &bytes_written, iov, n_iov,
                                                   1);
                                atomic_dec(&open_file->wrtPending);
                                if (rc || bytes_written < bytes_to_write) {
                                        cERROR(1,("Write2 ret %d, written = %d",
                                                  rc, bytes_written));
                                        /* BB what if continued retry is
                                           requested via mount flags? */
                                        set_bit(AS_EIO, &mapping->flags);


...and then later it frees up the pages that it attempts to flush. So if there
is any error at all in the write, we toss out the page (yikes!). When there are
errors in the write, cifs_writepages (correctly) returns error, but there is no
guarantee that the writepages being done is for the close() call. If this
writepages is being done for something else, and the entire pagevec for the file
is flushed then the next pass through cifs_writepages will end up with no
outstanding pages to be written and will return 0.

It seems like we should keep track in the cifsFileInfo for the file that a
previous cifs_writepages failed and have this function return an error until it
closes. Then again, I'm not certain what the correct behavior is for a close()
that follows an unsuccessful fsync().



Comment 23 Jeff Layton 2007-10-12 20:44:37 UTC
Created attachment 226081 [details]
upstream patch -- only sync out the file on size changes in cifs_setattr

This patch is against cifs-2.6 git tree. Sent to upstream maintainer for
comment. I tested a variant of this on RHEL5 and it seems to fix the problem.
This just prevents the cifs client from syncing out pages when setattr isn't
going to change the file size. It also has cifs_setattr check the return code
of filemap_write_and_wait and error out if it fails.

Awaiting upstream comment at this point...

Comment 27 Jeff Layton 2007-10-16 12:13:05 UTC
Created attachment 228691 [details]
proposed upstream patch 2

A second proposed upstream patch. I'm not certain this is the right approach
but it seems like it to me currently.

The idea here is that we want to separate "unconscious" flushes from
"conscious" ones. By unconscious, I mean when the kernel needs to free up
memory, or when the flush occurs as a side effect of another operation -- like
setattr. Conscious flushes would be ones that occur due to a fsync or close.

The idea is that when we do any sort of filemap_fdatawait operation that we
make sure we store the return code from it on an unconscious call. Then when we
fsync or flush, we make sure that we return any stored errors that we have if
the existing one doesn't return an error.

I'm not sure I have all of the semantics right, so this is really an RFC at
this point.

Comment 30 RHEL Program Management 2007-10-19 08:55:16 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 31 Jeff Layton 2007-10-25 19:54:06 UTC
Created attachment 237941 [details]
proposed upstream patch 3

Another proposed patch -- this one removes the changes to the AIO operation
(which I don't think is correct). Still awaiting comment at this point...

Comment 33 Jeff Layton 2007-11-15 12:55:16 UTC
Created attachment 259751 [details]
patch -- better handle writeback errors

Proposed patch sent to upstream lists this morning:

This fixes a bug in the earlier patches (write_behind_rc wasn't being properly
initialized).

I have some kernels on my people page with this patch:

http://people.redhat.com/jlayton

...if the customer has a non-critical place to do so, it would be nice if they
could test these and let me know if they fix the issue for them (or if they
notice any problems with them).

Comment 34 Issue Tracker 2007-11-20 11:41:46 UTC
Jeff,
The customer has tested you test kernels containing the patch, mv now
works as expected. However, they have highlighted another issue, not sure
whether this is related.

---
But downloading with Firefox
does not work:

While the download is in progress I do a ls on the file:

$ls -latr Fedora-8-i386-DVD.iso
-rw-r--r-- 1 m525017 bin 180226800 20. Nov 11:11 Fedora-8-i386-DVD.iso

-> 180MB downloaded that's exactly what Firefox's download dialog says.
But on the windows fileserver there are only 68MB in this file.

And dmesg gives tons of:

CIFS VFS: Write2 ret -28, written = 0
---

Cheers,
Paul on behalf of Navid

Internal Status set to 'Waiting on Engineering'

This event sent from IssueTracker by pdwyer 
 issue 134794

Comment 35 Jeff Layton 2007-11-20 12:05:52 UTC
Possibly...

#define ENOSPC          28      /* No space left on device */

...are they attempting to exceed the quota (or the amount of disk space) when
they copy this file to the share?

Comment 36 Jeff Layton 2007-11-20 12:42:28 UTC
I'm not seeing this when I 'cp' or 'dd' files to a share. Once it hits the quota
it eventually the program returns ENOSPC and the file shows the correct size on
the cifs mount. Exactly which kernel did they test and against what sort of
server? I wonder if firefox is holding the file open for some reason? Could you
have them do this test and then run:

# lsof /path/to/file/on/cifs/share

and see if firefox has it open?


Comment 37 Steve French 2007-11-20 16:35:15 UTC
Has anyone measured whether this patch affects performance much? It seems
excessive to flush all pages, rather than simply the last page, when extending
the file size.

Comment 38 Jeff Layton 2007-11-20 16:54:13 UTC
No, I've not done any real performance testing with this at all. I figured we'd
shoot for correctness first and then worry about performance afterward. That
said, I don't think I've changed the semantics of what gets flushed when the
file size changes, have I?



Comment 39 Steve French 2007-11-20 18:09:37 UTC
I misread that part of the patch - you are right, it looks like it should not
affect performance.  I like the idea of returning the writebehind rc earlier in
some cases (than the close) - but there are some interesting questions that this
brings up e.g. whether we should behave differently on "hard" vs. "soft" mounts
when the original write error occurred - and what happens to those pages in the
pagecache that we tried to write out and failed to?

But I am leaning toward merging this now and pushing for it to go in 2.6.24

Comment 40 Jeff Layton 2007-11-20 19:42:33 UTC
Steve,

I've made comment #34 public here too which you should probably be aware of.
I've not been able to replicate those results and the situation where they saw
this isn't very clear. If there's time then we may want to try to get
clarification of that before merging.

> I like the idea of returning the writebehind rc earlier in some
> cases (than the close)

I think fsync() dictates that we have to return any writeback errors at that
time, so I think we want something like this patch.

> but there are some interesting questions that this brings up
> e.g. whether we should behave differently on "hard" vs. "soft"
> mounts

Should we retry this writeback on a hard mount?

Good question. The mount.cifs manpage is a little vague on what hard mounts
mean, but the nfs manpage says this:

       hard           If an NFS file operation has a major timeout then report
                      "server  not  responding"  on  the  console and continue
                      retrying indefinitely.  This is the default.

This seems to imply that soft/hard really only comes into play when the server
isn't responding. In the case of quota overrun, the server is responding and
returned an error, so I think failing the operation is appropriate.

> and what happens to those pages in the pagecache that we tried
> to write out and failed to?

Currently (both with and without this patch), they just get tossed out, which is
why we want to make sure we return error when that occurs.


Comment 41 Steve French 2007-11-20 19:44:34 UTC
Should not hurt the performance of setattr (actually could help it) but it could
hurt both flush and sync which now with your ptach do filemap_write_and_wait()
instead of filemap_fdatawrite so could be slower.  Are you sure that we have to
do this for flush in order to fix the bug - that change seems drastic.  I don't
mind the part about reporting the writebehind rc earlier and saving the
writebehind rc in more places of course.

Comment 42 Jeff Layton 2007-11-20 20:03:16 UTC
Yes, I think we do have to do that. If I understand correctly,
filemap_fdatawrite does:

__filemap_fdatawrite(mapping, WB_SYNC_ALL);

...this basically means:

"Initiate writeback on all pages for the mapping, and if the page is busy, wait
for it so we can initiate writeback"

I don't think that waits for the writeback to actually complete, however. For
that you're supposed to use filemap_fdatawait. For CIFS though, I don't think it
really matters, because we do this in cifs_writepages:

       if (wbc->sync_mode != WB_SYNC_NONE)
             wait_on_page_writeback(page);

...so we're waiting on writeback to complete anyway.

We still need filemap_fdatawait I think though. When a page writeback fails, we
set AS_EIO (or AS_ENOSPC with the patch) in the mapping flags and continue. If
writeback of one page fails, and then the next one succeeds we can end up
returning "0" from cifs_writepages. Also we'd have never clear the bits out of
the mapping flags once we did report the error.

I wonder if we shouldn't get rid of write_behind_rc altogether and try to handle
this with mapping->flags directly, but that seems like it would be a larger
change...


Comment 43 Jeff Layton 2007-11-20 20:04:47 UTC
So to sum up the earlier comment, I don't think filemap_fdatawait will make a
big difference in performance since we wait on each page to be written out anyway.

What might help performance though is to eventually change cifs_writepages to
initiate writeback on all of the pages w/o waiting for them, and then go back
and wait on any that aren't complete (I think that's what NFS does).


Comment 44 Jeff Layton 2007-11-20 20:15:27 UTC
My understanding of the filemap_fdatawait and filemap_fdatawrite semantics could
be wrong here though, so feel free to correct me if that's the case :-)


Comment 45 Steve French 2007-11-20 22:41:07 UTC
Why isn't the writebehind rc returned on sync?

Comment 46 Steve French 2007-11-20 22:44:42 UTC
Nevermind - the writebehind rc is returned on sync in your patch - I was looking
at the wrong copy of the file for a second

Comment 47 Steve French 2007-11-20 23:22:02 UTC
Merged into mainline with the minor one line change (from Jeff's patch attached
here) discussed earlier - ie not doing a wait on the flush path (since that
changes affect on performance is unknown and it does not appear necessary to fix
this problem). Otherwise the same patch.

Comment 48 Navid Sheikhol-Eslami 2007-11-21 10:46:42 UTC
(In reply to comment #36)
> I'm not seeing this when I 'cp' or 'dd' files to a share. Once it hits the quota
> it eventually the program returns ENOSPC and the file shows the correct size on
> the cifs mount. Exactly which kernel did they test and against what sort of
> server? I wonder if firefox is holding the file open for some reason? Could you
> have them do this test and then run:
> 
> # lsof /path/to/file/on/cifs/share
> 
> and see if firefox has it open?

Customer provided the output of `lsof` just after the quota was exceeded:

# lsof /net/J/temp/Fedora-8-i386-DVD.iso
COMMAND    PID    USER   FD   TYPE DEVICE     SIZE  NODE NAME
firefox-b 6108 dominik   35w   REG   0,24 68896800 34767
/net/m525017/J/temp/Fedora-8-i386-DVD.iso

-- Navid

Comment 49 Jeff Layton 2007-11-29 19:58:14 UTC
I'm not sure that that tells me much. If they checked lsof just after the quota
was exceeded, then I assume that was before the file had been written back. It's
possible that there are outstanding pages still undergoing writeback when they
checked this.

I suspect what they're seeing is that the kernel hasn't attempted to write back
the last page in this file at the time that they're checking the file size.
Since that page is still active in the mapping, the file's size reflects that
page's position in the file and they see it showing up at 180M. The file is
sparse, however since some intervening pages aren't written out.

If they kill off firefox after downloading this file, does the file size show up
appropriately? If not, if they then do a "sync" does the file size start showing
up correctly?


Comment 50 Jeff Layton 2007-11-29 20:43:10 UTC
Ok, I think I see what may be happening...

First, Fedora-8-i386-DVD.iso totals in at ~3G, not 180M. So I presume that
they're looking at the file when it's still in the process of being downloaded
by firefox (and hence, is the file is still open).

Firefox just does a simple open() and writes to the file with a standard
synchronous write() call. These calls are generally returning successful, even
when the partition is at quota.

From the write(2) manpage:

       A  successful return from write() does not make any guarantee that data
       has been committed to disk.  In fact, on some buggy implementations, it
       does  not  even guarantee that space has successfully been reserved for
       the data.  The only way to be sure is to call fsync(2)  after  you  are
       done writing all your data.

...on a networked filesystem in particular, it's tricky to manage write errors
of this nature. We want good performance when writing which means allowing the
kernel to buffer things up without waiting for the server to respond. But, if we
allow the kernel to buffer the write and return success before actually sending
the data off to the server then we lose the ability to report that error to
userspace for the write. So we can only report that error when the file is
closed, or a fsync() is done.

A contributing factor here is that firefox deletes the file after you cancel the
download, so it doesn't bother to report that there were problems writing out
the file.

I'm going to argue that this behavior is expected (even if it isn't ideal).


Comment 52 Jeff Layton 2007-12-10 11:21:47 UTC
From the original description:

> Additional info:
> The fileserver is a standard MS-Windows Server 2003 SP1 with no modification
of > fileservices.

They've changed the server configuration since reporting this, and the new
problem they see sounds like a different issue. I suggest opening a new BZ for
that and leaving this one to handle the issue that this patch is known to fix.


Comment 53 Jeff Layton 2007-12-19 16:45:53 UTC
I've proposed an update of CIFS to a newer upstream rev for 5.2, so I'm going to
fold this BZ into that one.

*** This bug has been marked as a duplicate of 417961 ***

Comment 54 Jeff Layton 2007-12-20 20:23:57 UTC
Unduping bug since I posted a patch with this BZ number.

Comment 56 Don Zickus 2007-12-21 20:18:08 UTC
in 2.6.18-62.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 57 Navid Sheikhol-Eslami 2007-12-27 13:43:26 UTC
(In reply to comment #52)
> From the original description:
> 
> > Additional info:
> > The fileserver is a standard MS-Windows Server 2003 SP1 with no modification
> of > fileservices.
> 
> They've changed the server configuration since reporting this, and the new
> problem they see sounds like a different issue. I suggest opening a new BZ for
> that and leaving this one to handle the issue that this patch is known to fix.

I had a word with the customer and there may have been a misunderstanding: they
used MS-Windows Server 2003 R2 (also because R1 didn't have share-level quotas)
during the whole time, they simply omitted that detail when they first reported
the issue.

So, from customer's perspective the issue is only partly solved. Please let me
know if we shall still open a separate BZ.

-- Navid

Comment 58 Jeff Layton 2007-12-27 20:35:40 UTC
> So, from customer's perspective the issue is only partly solved. Please let me
> know if we shall still open a separate BZ.

Please do, since this is a different problem than what was originally understood
and fixed by the patch in this BZ.


Comment 61 errata-xmlrpc 2008-05-21 14:58:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html



Note You need to log in before you can comment on or make changes to this bug.