Bug 469848

Summary: [RHEL5.2] nfs_getattr() hangs during heavy write workloads
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: rc
Keywords: ZStream
Reporter: Flavio Leitner <fleitner>
Assignee: Peter Staubach <staubach>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: adestefano, dhoward, dmair, emcnabb, harshula, james.leddy, jlayton, jpirko, kurt.orita, masanari_iida, rlerch, rprice, rwheeler, skylar2, steved, tao, tbaumann, terry.johnson, wnguyen
Doc Type: Bug Fix
Doc Text:
The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency.
Last Closed: 2009-09-02 08:11:54 UTC
Bug Blocks: 468301, 483701, 485920, 486926, 546233
Attachments: suggested patch

Description Flavio Leitner 2008-11-04 13:15:41 UTC
Created attachment 322418 [details]
suggested patch

POSIX requires that ctime and mtime, as reported by the stat(2) call,
reflect the activity of the most recent write(2).  To that end, nfs_getattr()
flushes pending dirty writes to a file before doing a GETATTR to allow the
NFS server to set the file's size, ctime, and mtime properly.

However, nfs_getattr() can be starved when a constant stream of application
writes to a file prevents nfs_sync_inode_wait() from completing.  This usually
results in hangs of programs doing a stat against an NFS file that is being
written.  "ls -l" is a common victim of this behavior.
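On a local filesystem the required behavior is easy to observe. Below is a minimal Python sketch of the stat, write, stat pattern (temporary file on a local filesystem only; on NFS the second stat would first have to flush the pagecache and reach the server, which is exactly where the starvation occurs):

```python
import os
import tempfile
import time

# POSIX: a stat taken after a write(2) must report a newer mtime
# than a stat taken before it.
fd, path = tempfile.mkstemp()
os.write(fd, b"first")
before = os.stat(path).st_mtime_ns

time.sleep(0.05)             # let the clock tick past timestamp granularity
os.write(fd, b"second")      # the write(2) that must bump mtime
after = os.stat(path).st_mtime_ns

os.close(fd)
os.unlink(path)
assert after > before        # local filesystems update mtime on the write itself
```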

[root@node3 ~]# uname -a
Linux node3 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:12 EDT 2008 i686 i686 i386 GNU/Linux

[root@node3 ~]# ls -lh /tmp/Test1G
-rw-r--r-- 1 root root 1000M Oct 24 12:37 /tmp/Test1G

(from /proc/mounts):
cluster1:/usr/local /usr/local nfs
rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=cluster1

[root@node3 tmp]# ls /usr/local/tmp
locking
[root@node3 tmp]# time cp Test1G /usr/local/tmp &
[1] 25030
[root@node3 tmp]# time ls -lh /usr/local/tmp

real    2m13.872s
user    0m0.230s
sys     0m7.897s
total 1001M
drwxr-xr-x 2 root root  4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 1000M Oct 28 09:39 Test1G
[1]+  Done                    time cp -i Test1G /usr/local/tmp

real    2m2.414s
user    0m0.230s
sys     0m8.062s


To prevent starvation, hold the file's i_mutex in nfs_getattr() to
freeze applications writes temporarily so the client can more quickly obtain
clean values for a file's size, mtime, and ctime.
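In userspace terms, the approach can be sketched as follows: a flush that waits for "no dirty pages" can be starved by a steady writer, while taking the same lock the writer takes turns the flush into a bounded amount of work. A toy Python model of that idea (invented names, not the kernel code):

```python
import threading

class InodeModel:
    """Toy stand-in for an NFS inode with dirty pages and an i_mutex."""

    def __init__(self):
        self.dirty = 0
        self.i_mutex = threading.Lock()

    def write_page(self):
        # Application write path: takes i_mutex, dirties one page.
        with self.i_mutex:
            self.dirty += 1

    def getattr_flush(self):
        # Patched nfs_getattr(): hold i_mutex so writers are frozen,
        # then flush the (now fixed) set of dirty pages.  Without the
        # lock, a steady writer could keep 'dirty' nonzero indefinitely.
        with self.i_mutex:
            flushed = self.dirty
            self.dirty = 0
            return flushed

inode = InodeModel()
for _ in range(3):
    inode.write_page()
print(inode.getattr_flush())   # 3: the flush sees a bounded amount of work
print(inode.dirty)             # 0: clean size/mtime/ctime can now be fetched
```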

Below is the upstream patch fixing this issue:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=28c494c5c8d425e15b7b82571e4df6d6bc34594d

Another interesting patch to be applied:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=634707388baa440d9c9d082cfc4c950500c8952b

They were backported to the -92.el5 kernel; see the feedback below:
----
After booting the test kernel with that change, I was able to slip a few "ls" commands in during the write:

[root@node3 tmp]# time cp Test1G /usr/local/tmp &
[1] 10142
[root@node3 tmp]# time ls -lh /usr/local/tmp
total 68M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root  68M Nov  3 10:44 Test1G

real    0m6.090s
user    0m0.001s
sys     0m0.054s
[root@node3 tmp]# time ls -lh /usr/local/tmp
total 114M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 114M Nov  3 10:44 Test1G

real    0m5.185s
user    0m0.005s
sys     0m0.016s
[root@node3 tmp]# time ls -lh /usr/local/tmp
total 148M
drwxr-xr-x 2 root root 4.0K Jun 23 12:51 locking
-rw-r--r-- 1 root root 148M Nov  3 10:44 Test1G

real    0m3.077s
user    0m0.000s
sys     0m0.030s
[root@node3 tmp]#
real    2m26.065s
user    0m0.211s
sys     0m8.042s

The bonus is that the impact on the write seemed to be minimal.
---

Attaching backported patch for RHEL-5 -92.el5

Comment 6 Issue Tracker 2008-11-14 21:14:21 UTC
I'm not an NFS expert (nor do I play one on TV), but would it be possible
to simply send the first (and only the first) dirty page (if at least one
dirty page is present) for a given NFS inode any time a local stat is made
on an NFS client? Wouldn't a one-page update force the mtime on the host?


This event sent from IssueTracker by jwest 
 issue 234012

Comment 7 Jeff Layton 2008-11-14 21:21:06 UTC
Sending one page would make the attributes look close to correct for that one stat call. Then what happens?

The rest of the pages eventually get flushed. You do another stat call and...WTF? The file has changed? Who wrote to it?

The bottom line is that on a getattr we need to sync the state of the file out to the server, and that means flushing all pending writes.

(Peter, correct me if I'm wrong here)
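The point above can be modeled in a few lines: if GETATTR flushes only one page, the remaining pages drain later, and a subsequent stat sees the file "change" with no intervening local write. A toy Python sketch (invented names, not the NFS client code):

```python
class Server:
    """Toy NFS server: bumps mtime on every WRITE operation received."""
    def __init__(self):
        self.mtime = 0

    def write(self, page):
        self.mtime += 1

class Client:
    """Toy NFS client with a pagecache of pending dirty pages."""
    def __init__(self, server):
        self.server = server
        self.pending = []

    def buffered_write(self, page):
        self.pending.append(page)       # dirties the pagecache only

    def flush(self, n=None):
        n = len(self.pending) if n is None else n
        for _ in range(n):
            self.server.write(self.pending.pop(0))

    def stat(self):
        return self.server.mtime        # GETATTR returns the server's view

srv = Server()
cli = Client(srv)
for page in ("a", "b", "c"):
    cli.buffered_write(page)

cli.flush(1)        # flush a single page before the GETATTR
t1 = cli.stat()
cli.flush()         # the rest drains in the background later
t2 = cli.stat()
assert t2 > t1      # the file "changed" with no local write in between
```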

Comment 8 Peter Staubach 2008-11-14 21:31:30 UTC
This is correct, Jeff.

Perhaps I oversimplified when describing the semantics that are
required when handling the mtime.

Comment 9 Jeff Layton 2008-11-18 01:53:52 UTC
*** Bug 464251 has been marked as a duplicate of this bug. ***

Comment 17 Steve Dickson 2008-12-09 21:03:32 UTC
*** Bug 467374 has been marked as a duplicate of this bug. ***

Comment 18 Jeremy West 2008-12-17 21:38:49 UTC
Peter,

How goes the alternative patch/solution?  IBM is willing to help test the next solution we come up with.

--jwest

Comment 24 Issue Tracker 2009-01-14 17:36:09 UTC
The customer is providing access to this NFS mount to multiple clients via
ftp.  As such, they have a large mix of ftp related directory read
activity combined with the potential for large file writes.  One remote
client doing a large upload effectively stalls the access for other
clients.

We have already set expectations that they will still see performance
issues if a large write is happening.  They have migrated from a RHEL4
environment (where they do not see this severe a stall), and would be
happy to simply get the responsiveness closer to what it was in RHEL4. 
Our testing showed that the previous proposed patch accomplished this
goal.


This event sent from IssueTracker by jwest 
 issue 234012

Comment 25 Peter Staubach 2009-01-14 18:39:48 UTC
Thank you for the description of the environment; it helps me understand
the situation much better.

Yes, the previously proposed patch would make things generally better.
There might still be very large delays, but they would at least be finite.

If I can't get my patch accepted upstream in a reasonable amount of time,
then we will go with the proposed patch for the time being and then I
will continue to pursue the better option.

Comment 30 RHEL Program Management 2009-02-04 19:15:57 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 34 RHEL Program Management 2009-02-16 15:10:29 UTC
Updating PM score.

Comment 36 Don Zickus 2009-02-23 20:02:21 UTC
in kernel-2.6.18-132.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 37 Issue Tracker 2009-02-25 05:28:35 UTC
The x86_64 build of the 2.6.18-132 kernel has not made any difference in
behavior:

Before:

# time cp /tmp/bigfile /usr/local/tmp

real    1m44.479s
user    0m0.015s
sys     0m0.246s

# time ls /usr/local/tmp
a  b  bigfile  c  d

real    1m38.688s
user    0m0.000s
sys     0m0.009s

After:

# time cp /tmp/bigfile /usr/local/tmp

real    1m46.442s
user    0m0.017s
sys     0m0.222s

# time ls /usr/local/tmp
a  b  bigfile  c  d

real    1m44.069s
user    0m0.000s
sys     0m0.012s

(Note: the ls was manually run in a second window once the cp was started,
so a 1-2 second variation in test times would be expected)


This event sent from IssueTracker by jwest 
 issue 234012

Comment 58 Harshula Jayasuriya 2009-06-26 09:58:24 UTC
Release note added. If any revisions are required, please set the 
"requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

New Contents:
The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency, sometimes up to 30 seconds.

Comment 59 Peter Staubach 2009-06-29 12:52:59 UTC
This looks fine, except for the very last clause.  Please remove everything
after the last comma in the paragraph.  The latencies can be easily much
higher than that.

Comment 60 Harshula Jayasuriya 2009-06-29 13:11:07 UTC
Removed ", sometimes up to 30 seconds".

Comment 61 Harshula Jayasuriya 2009-06-29 13:11:07 UTC
Release note updated. If any revisions are required, please set the 
"requires_release_notes"  flag to "?" and edit the "Release Notes" field accordingly.
All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency, sometimes up to 30 seconds.
+The required semantics indicate that a process which does stat, write, stat, should see a different mtime on the file in the results from the second stat call compared to the mtime in the results from the first stat call. File times in NFS are maintained strictly by the server, so the file mtime will not be updated until the data has been transmitted to the server via the WRITE NFS protocol operation. Simply copying data into the pagecache is not sufficient to cause the mtime to be updated. This is one place where NFS differs from local file systems. Therefore, an NFS filesystem which is under a heavy write workload may result in stat calls having a high latency.

Comment 62 Peter Staubach 2009-06-29 13:29:05 UTC
This looks fine to me.

Comment 65 errata-xmlrpc 2009-09-02 08:11:54 UTC
An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Comment 67 Jeremy West 2009-09-02 19:30:55 UTC
*** Bug 491121 has been marked as a duplicate of this bug. ***

Comment 68 Harshula Jayasuriya 2009-12-03 05:15:23 UTC
(In reply to comment #37)

The original author of comment #37 subsequently made these observations in IssueTracker:

"I just executed an additional test with a much larger file, and did see that the behavior was very dependent upon how long I let the write go before trying the ls. I suspect that I had delayed long enough in my previous test to allow the system to buffer 100% of the file before I issued the ls request."

"The customer fell into the same trap that I did on their initial test (too small of a file used for the test resulted in the entire file being buffered before the ls command == no change in behavior as the new code still waits for the buffered data to be flushed).  Using a larger file did indeed demonstrate a positive change in behavior."