Created attachment 322418 [details]
suggested patch

POSIX requires that ctime and mtime, as reported by the stat(2) call, reflect the activity of the most recent write(2). To that end, nfs_getattr() flushes pending dirty writes to a file before doing a GETATTR to allow the NFS server to set the file's size, ctime, and mtime properly.

However, nfs_getattr() can be starved when a constant stream of application writes to a file prevents nfs_sync_inode_wait() from completing. This usually results in hangs of programs doing a stat against an NFS file that is being written. "ls -l" is a common victim of this behavior.

[root@node3 ~]# uname -a
Linux node3 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:12 EDT 2008 i686 i686 i386 GNU/Linux
[root@node3 ~]# ls -lh /tmp/Test1G
-rw-r--r-- 1 root root 1000M Oct 24 12:37 /tmp/Test1G

(from /proc/mounts):
cluster1:/usr/local /usr/local nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=cluster1

[root@node3 tmp]# ls /usr/local/tmp
locking
[root@node3 tmp]# time cp Test1G /usr/local/tmp &
[1] 25030
[root@node3 tmp]# time ls -lh /usr/local/tmp

real    2m13.872s
user    0m0.230s
sys     0m7.897s
total 1001M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 1000M Oct 28 09:39 Test1G
[1]+  Done                    time cp -i Test1G /usr/local/tmp

real    2m2.414s
user    0m0.230s
sys     0m8.062s

To prevent starvation, hold the file's i_mutex in nfs_getattr() to freeze application writes temporarily so the client can more quickly obtain clean values for a file's size, mtime, and ctime.
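The effect of the proposed fix can be illustrated with a toy, single-threaded model (a hypothetical user-space sketch, not the kernel code): the flusher retires one dirty page per "tick" while an unthrottled writer dirties a new page per tick, so the dirty set never drains; freezing the writer (as holding i_mutex does) makes the flush bounded.

```python
def flush(dirty, writer_running, max_ticks=1000):
    """Return the number of ticks needed to drain `dirty`, or None if
    the flush never completes within max_ticks (i.e. it is starved)."""
    ticks = 0
    while dirty:
        if ticks >= max_ticks:
            return None          # starved: the writer keeps refilling the set
        dirty.pop()              # write one dirty page back to the server
        if writer_running:
            dirty.add(object())  # the application dirties another page
        ticks += 1
    return ticks

# Without holding i_mutex, the writer keeps going during the flush:
assert flush(set(range(10)), writer_running=True) is None

# Holding i_mutex freezes application writes, so the flush is bounded:
assert flush(set(range(10)), writer_running=False) == 10
```

The model deliberately ignores timing and batching; it only captures the fairness problem: a getattr that must wait for a clean file can be deferred indefinitely unless writers are briefly excluded.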
Below is the upstream patch fixing this issue:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=28c494c5c8d425e15b7b82571e4df6d6bc34594d

Another interesting patch to be applied:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=634707388baa440d9c9d082cfc4c950500c8952b

They were backported to -92.EL; see the feedback below:

----
After booting the test kernel with that change, I was able to slip a few "ls" commands in during the write:

[root@node3 tmp]# time cp Test1G /usr/local/tmp &
[1] 10142
[root@node3 tmp]# time ls -lh /usr/local/tmp
total 68M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root  68M Nov  3 10:44 Test1G

real    0m6.090s
user    0m0.001s
sys     0m0.054s
[root@node3 tmp]# time ls -lh /usr/local/tmp
total 114M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 114M Nov  3 10:44 Test1G

real    0m5.185s
user    0m0.005s
sys     0m0.016s
[root@node3 tmp]# time ls -lh /usr/local/tmp
total 148M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 148M Nov  3 10:44 Test1G

real    0m3.077s
user    0m0.000s
sys     0m0.030s
[root@node3 tmp]#
real    2m26.065s
user    0m0.211s
sys     0m8.042s

The bonus is that the impact to the write seemed to be minimal.
---

Attaching backported patch for RHEL-5 -92.el5
I'm not an NFS expert (nor do I play one on TV), but would it be possible to simply send the first (and only the first) dirty page (if at least one dirty page was present) for a given NFS inode any time a local stat is made on an NFS client? Wouldn't a one-page update force mtime on the host?

This event sent from IssueTracker by jwest issue 234012
Sending one page would make things look approximately correct for that one stat call. Then what happens? The rest of the pages eventually get flushed. You do another stat call and... WTF? The file has changed? Who wrote to it?

The bottom line is that on a getattr we need to sync the state of the file out to the server, and that means flushing all pending writes. (Peter, correct me if I'm wrong here)
This is correct, Jeff. Perhaps I oversimplified when describing the semantics that are required when handling the mtime.
*** Bug 464251 has been marked as a duplicate of this bug. ***
*** Bug 467374 has been marked as a duplicate of this bug. ***
Peter, How goes the alternative patch/solution? IBM is willing to help test the next solution we come up with. --jwest
The customer is providing access to this NFS mount to multiple clients via ftp. As such, they have a large mix of ftp-related directory read activity combined with the potential for large file writes. One remote client doing a large upload effectively stalls access for the other clients.

We have already set expectations that they will still see performance issues while a large write is happening. They have migrated from a RHEL4 environment (where they did not see this severe a stall), and would be happy to simply get the responsiveness closer to what it was in RHEL4. Our testing showed that the previously proposed patch accomplished this goal.

This event sent from IssueTracker by jwest issue 234012
Thank you for the description of the environment and the situation. This helps me to much better understand the situation. Yes, the previously proposed patch would make things generally better. There might still be very large delays, but they would mostly be finite in nature. If I can't get my patch accepted upstream in a reasonable amount of time, then we will go with the proposed patch for the time being and then I will continue to pursue the better option.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Updating PM score.
in kernel-2.6.18-132.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
The x86_64 version of the 2.6.18-132 kernel has not made any difference in behavior:

Before:
# time cp /tmp/bigfile /usr/local/tmp

real    1m44.479s
user    0m0.015s
sys     0m0.246s

# time ls /usr/local/tmp
a  b  bigfile  c  d

real    1m38.688s
user    0m0.000s
sys     0m0.009s

After:
# time cp /tmp/bigfile /usr/local/tmp

real    1m46.442s
user    0m0.017s
sys     0m0.222s

# time ls /usr/local/tmp
a  b  bigfile  c  d

real    1m44.069s
user    0m0.000s
sys     0m0.012s

(Note: the ls was manually run in a second window once the cp was started, so a 1-2 second variation in test times would be expected)

This event sent from IssueTracker by jwest issue 234012
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency, sometimes up to 30 seconds.
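The stat/write/stat sequence the note describes can be sketched in user space (a hedged illustration; on a local filesystem the second stat reflects the write immediately, whereas on NFS the client must first transmit the data to the server, since only the server maintains the file times):

```python
import os
import tempfile
import time

# Create a scratch file, stat it, write to it, then stat it again.
fd, path = tempfile.mkstemp()
try:
    before = os.stat(path).st_mtime_ns
    time.sleep(1.1)                # allow for coarse (1s) mtime granularity
    os.write(fd, b"payload")
    os.fsync(fd)                   # on NFS, this forces the WRITE to the server
    after = os.stat(path).st_mtime_ns
    # The second stat must report a later mtime than the first.
    assert after > before
finally:
    os.close(fd)
    os.remove(path)
```

On NFS, the flush that happens between the write and the second stat is exactly the nfs_sync_inode_wait() work discussed in this bug, which is why a heavily written file makes stat slow.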
This looks fine, except for the very last clause. Please remove everything after the last comma in the paragraph. The latencies can be easily much higher than that.
Removed ", sometimes up to 30 seconds".
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency, sometimes up to 30 seconds.
+The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency.
This looks fine to me.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
*** Bug 491121 has been marked as a duplicate of this bug. ***
(In reply to comment #37) The original author of comment #37 subsequently made these observations in IssueTracker: "I just executed an additional test with a much larger file, and did see that the behavior was very dependent upon how long I let the write go before trying the ls. I suspect that I had delayed long enough in my previous test to allow the system to buffer 100% of the file before I issued the ls request." "The customer fell into the same trap that I did on their initial test (too small of a file used for the test resulted in the entire file being buffered before the ls command == no change in behavior as the new code still waits for the buffered data to be flushed). Using a larger file did indeed demonstrate a positive change in behavior."
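The testing pitfall above (a file small enough to be buffered entirely before ls runs shows no change) suggests sizing the test file relative to physical RAM. A hypothetical helper for picking such a size (the function name and the 2x multiple are illustrative assumptions):

```python
import os

def starvation_test_file_size(multiple=2):
    """Return a test-file size (bytes) of `multiple` times physical RAM,
    so the writer cannot buffer the whole file in the page cache before
    the concurrent stat/ls is issued."""
    ram_bytes = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    return ram_bytes * multiple

size = starvation_test_file_size()
print("suggested test file size: %d bytes" % size)
```

A file of at least this size keeps the copy writing (and flushing) for the whole test window, so the ls latency actually exercises the getattr path being fixed.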