Bug 688232

Summary: problem with nfs ls latency during high IO
Product: Fedora
Component: kernel
Version: 14
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED WONTFIX
Reporter: Judith Flo <jflo>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: edgar.hoch, gansalmon, harshula, itamar, jonathan, kernel-maint, madhu.chinakonda
Doc Type: Bug Fix
Last Closed: 2012-08-16 13:47:35 UTC

Description Judith Flo 2011-03-16 15:51:54 UTC
Description of problem:
While performing a very intensive I/O operation (copying a 36 GB file) to an NFS directory, an "ls -l" command hangs for MINUTES.
The client is a Fedora 14 x86_64 machine (kernel 2.6.35.6) mounting an HP IBRIX X9300 (Red Hat Enterprise Linux 5.3 with kernel 2.6.18-128).
I tested the same workload from a RHEL 6 client with a 2.6.32 kernel; there the delays were a few seconds, never minutes.

A previous bug [1] discusses this problem and carries a kernel patch, but it seems that the patch was lost between the .32 and .35 kernel versions.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=469848

Version-Release number of selected component (if applicable):
Fedora 14, x86_64
nfs-utils-1.2.3-0.fc14.x86_64
nfs-utils-lib-1.1.5-2.fc14.x86_64

How reproducible:
Create a 36 GB file, copy it to an NFS volume, and run "time ls -l" against it while the copy is in progress; the ls hangs.

Steps to Reproduce:
1. dd if=/dev/zero of=file36G bs=1M count=36864
2. time cp file36G /nfs_server/
3. [another terminal] time ls -l /nfs_server/file36G
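For convenience, the stalls can also be sampled continuously. Below is a minimal sketch (not part of the original report; it assumes bash, GNU coreutils, and bc, plus the file name from the steps above). Run it in a second terminal while the copy from step 2 is in progress:
-----------------------------------
#!/bin/bash
# Sample the stat latency of the in-flight file once per second while
# the copy runs in another terminal.
FILE=/nfs_server/file36G
while true; do
    start=$(date +%s.%N)
    stat "$FILE" > /dev/null        # same syscall path as ls -l (lstat)
    end=$(date +%s.%N)
    echo "$(date +%T) stat latency: $(echo "$end - $start" | bc) s"
    sleep 1
done
-----------------------------------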

If you issue the same command from another client, the file is perfectly visible and you get the answer fast.
  
Actual results:
the time for ls -l varies from a few seconds up to 3-4 minutes of delay

Expected results:
the time for ls -l never exceeds tens of seconds

Additional info:
It seems that the bug was fixed and the patch backported at the time, but the fix is apparently no longer present in later kernels, among them the one shipped with Fedora 14.

Comment 1 Harshula Jayasuriya 2011-03-17 02:18:14 UTC
RHEL 6.0 released with RH kernel 2.6.32-71 and it *does* contain the
commit acdc53b2 "NFS: Replace __nfs_write_mapping with sync_inode()"
backported to 2.6.32. So, I doubt that the reported bad Fedora
performance is due to that commit.

Comment 2 Judith Flo 2011-03-17 09:11:02 UTC
(In reply to comment #1)
> RHEL 6.0 released with RH kernel 2.6.32-71 and it *does* contain the
> commit acdc53b2 "NFS: Replace __nfs_write_mapping with sync_inode()"
> backported to 2.6.32. So, I doubt that the reported bad Fedora
> performance is due to that commit.

I have no problem with RHEL 6; the problem is with Fedora.
RHEL 6 does contain the change, but apparently Fedora 14 (kernel 2.6.35) does not.
The code of the inode.c function is definitely different between the two kernels; I don't know whether that is the issue, or whether the code is simply written differently but does the same thing.
What is certain is that the behavior of the ls command against the same server is completely different between the two distributions.

Comment 3 Harshula Jayasuriya 2011-03-17 10:00:09 UTC
(In reply to comment #2)

> RHEL 6 does contain the change, but apparently Fedora 14 (kernel 2.6.35) does
> not.
> The code of the inode.c function is definitely different between the two
> kernels ...

Both Fedora 14 and RHEL 6.0 have commit acdc53b2 "NFS: Replace __nfs_write_mapping with sync_inode()":

kernel-2.6.35.11-83.fc14.src.rpm:
------------------------------------------
 495 int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 496 {
 497         struct inode *inode = dentry->d_inode;
 498         int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
 499         int err;
 500 
 501         /* Flush out writes to the server in order to update c/mtime.  */
 502         if (S_ISREG(inode->i_mode)) {
 503                 err = filemap_write_and_wait(inode->i_mapping);
 504                 if (err)
 505                         goto out;
 506         }
------------------------------------------

2.6.32-71.el6:
------------------------------------------
 494 int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 495 {
 496         struct inode *inode = dentry->d_inode;
 497         int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
 498         int err;
 499 
 500         /* Flush out writes to the server in order to update c/mtime.  */
 501         if (S_ISREG(inode->i_mode)) {
 502                 err = filemap_write_and_wait(inode->i_mapping);
 503                 if (err)
 504                         goto out;
 505         }
------------------------------------------

Comment 4 Harshula Jayasuriya 2011-03-17 11:21:09 UTC
Can you run strace on the ls -l whilst reproducing the problem and attach the strace here? i.e.
$ strace -T -tt -f -v -q -s 4096 -o <STRACE.log> ls -l /nfs_server/file36G
The size of the file only needs to be slightly larger than the amount of RAM on the system. How much RAM do you have on the system?

Start the tcpdump capture just before you do the ls -l and stop it (Ctrl-C) just after the ls -l completes.
e.g. tcpdump -n -s 0 -i <INTERFACE> -w <OUTPUT.log> host <NFS-SERVER-IP>
The file is probably going to be very large, so you might have to put it somewhere I can download it from.
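For reference, the two captures can be combined in a single wrapper, e.g. (a sketch, not part of the original comment; <INTERFACE> and <NFS-SERVER-IP> are the same placeholders as above):
-----------------------------------
#!/bin/bash
# Start tcpdump in the background, run the traced ls -l, then stop the
# capture (SIGINT is the same as pressing Ctrl-C).
tcpdump -n -s 0 -i <INTERFACE> -w /tmp/tcpdump.log host <NFS-SERVER-IP> &
TCPDUMP_PID=$!
sleep 2    # give tcpdump time to start capturing
strace -T -tt -f -v -q -s 4096 -o /tmp/strace.log ls -l /nfs_server/file36G
kill -INT "$TCPDUMP_PID"
wait "$TCPDUMP_PID"
-----------------------------------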

Comment 5 Judith Flo 2011-03-17 12:14:03 UTC
(In reply to comment #4)
> Can you run strace on the ls -l whilst reproducing the problem and attach the
> strace here?  i.e. $ strace -T -tt -f -v -q -s 4096 -o <STRACE.log> ls -l
> /nfs_server/file36G
> The size of the file only needs to be slightly larger than the amount of RAM on
> the system. How much RAM do you have on the system?

This client has 64 GB of RAM (the NFS server has 48 GB); that's why I picked that file size. With a 12 GB file the times are not as long and ls doesn't get stuck as badly, but the truth is that most of the files this NFS server will deal with are big, so 36 GB is a "reasonable" example.
> 
> Start the tcpdump capture just before you do the ls -l and stop it (Ctrl-C)
> just after the ls -l completes.
> e.g. tcpdump -n -s 0 -i <INTERFACE> -w <OUTPUT.log> host <NFS-SERVER-IP>
> The file is probably going to be very large, so you might have to put it
> somewhere I can download it from.

Command lines issued:
tcpdump -n -s 0 -i bond0 -w /home/judith/tcpdump.log host xxxxxxx.43
strace -T -tt -f -v -q -s 4096 -o /home/judith/strace.log ls -l /ibrix_sas/110317/prova.tar

You can download the file from here [1]; please let me know when you have it so I can remove it. The tgz file, containing tcpdump.log and strace.log, is 12 GB.

If you want, I can also provide the same information without the vm.dirty_ratio value changed; it was 20 and is now 10 (times for ls increase from 1-2 to 3-4.5...).

Thanks

[1] http://gattaca.imppc.org/userspace/judith/logs_test_ls_nfs.tgz

Comment 6 Harshula Jayasuriya 2011-03-17 16:28:28 UTC
The lstat() is stalled for ~ 100 seconds:
-----------------------------------
10080 12:39:04.194023 lstat("/ibrix_sas/110317/prova.tar", {st_dev=makedev(0, 22), st_ino=383974370, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=32768, st_blocks=21558200, st_size=12082348032, st_atime=2011/03/17-12:38:57, st_mtime=2011/03/17-12:40:47, st_ctime=2011/03/17-12:40:47}) = 0 <103.767545>
-----------------------------------
There were no GETATTRs in the tcpdump. I would expect to have seen at least one.

If you have 64 GiB of RAM, 10% of that is still 6.4 GiB. That's a lot of dirty pages to push over a network to the NFS server. Try setting /proc/sys/vm/dirty_bytes to 500000000 (500 MB). See if that improves the ls -l latency.
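For example (a sketch; note that writing vm.dirty_bytes sets vm.dirty_ratio to 0, as the two settings are mutually exclusive):
-----------------------------------
# apply immediately:
echo 500000000 > /proc/sys/vm/dirty_bytes
# or equivalently:
sysctl -w vm.dirty_bytes=500000000

# to persist across reboots, add to /etc/sysctl.conf:
#   vm.dirty_bytes = 500000000
-----------------------------------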

Comment 7 Judith Flo 2011-03-18 12:23:37 UTC
(In reply to comment #6)
> The lstat() is stalled for ~ 100 seconds:
> -----------------------------------
> 10080 12:39:04.194023 lstat("/ibrix_sas/110317/prova.tar", {st_dev=makedev(0,
> 22), st_ino=383974370, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0,
> st_blksize=32768, st_blocks=21558200, st_size=12082348032,
> st_atime=2011/03/17-12:38:57, st_mtime=2011/03/17-12:40:47,
> st_ctime=2011/03/17-12:40:47}) = 0 <103.767545>
> -----------------------------------
> There were no GETATTRs in the tcpdump. I would expect to have seen at least
> one.
> 
> If you have 64 GiB of RAM, 10% of that is still 6.4 GiB. That's a lot of
> dirty pages to push over a network to the NFS server. Try setting
> /proc/sys/vm/dirty_bytes to 500000000 (500 MB). See if that improves the
> ls -l latency.

Hello, thanks a lot for your suggestion, and forgive my late answer.
That change definitely changed the behavior of the ls command; it now runs smoothly.
The point now is that the copy takes more time; is this expected?
I need to read more about the vm.dirty* parameters to understand the implications of these changes for the system.

For a machine with 8 GB of RAM, what values should I use for dirty_ratio and dirty_bytes? Proportional to the new ones, or should I keep the defaults? The waiting time for ls is smaller, but if the whole ls -l issue was about these values, why do I get ls latencies when using this 8 GB machine and I don't have them with

Comment 8 Harshula Jayasuriya 2011-03-22 02:40:39 UTC
(In reply to comment #7)

> That change definitely changed the behavior of the ls command; it now runs
> smoothly.
> The point now is that the copy takes more time; is this expected?

Yes. When you copy the file to the NFS server, the data is copied into RAM first. By reducing /proc/sys/vm/dirty_bytes, you reduce the amount of RAM available to the copy, hence triggering the dirty pages to be flushed to the NFS server earlier. When cp closes the file, it does not guarantee that the dirty pages have been flushed over the network to the NFS server. So although the copy may have looked like it finished earlier in your first test, it is likely that more dirty pages were still in RAM compared to the second test.
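One way to see this (a sketch, not part of the original comment): time the cp alone, then force a flush and time that too; the second number is the write-back that was still pending when cp returned.
-----------------------------------
# cp returns once the data is in the local page cache:
time cp file36G /nfs_server/
# flushing shows how much work was still outstanding:
time sync
-----------------------------------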

> For a machine with 8Gb of ram, what are the values that I should use for the
> dirty_ratio and the dirty_bytes?

You would need to tune that by doing a few experiments. There is no magic value; it's a compromise between write latency and stat latency.
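A sketch of such an experiment (assumptions: the file and mount point from earlier in this report, GNU time at /usr/bin/time, and arbitrary candidate values):
-----------------------------------
#!/bin/bash
# For each candidate dirty_bytes value, start a large copy, wait for
# dirty pages to build up, then measure how long ls -l takes.
for BYTES in 100000000 250000000 500000000 1000000000; do
    sysctl -w vm.dirty_bytes=$BYTES > /dev/null
    cp file36G /nfs_server/ &
    CP_PID=$!
    sleep 10
    /usr/bin/time -f "dirty_bytes=$BYTES: ls -l took %e s" \
        ls -l /nfs_server/file36G > /dev/null
    wait "$CP_PID"
    rm -f /nfs_server/file36G
done
-----------------------------------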

Comment 9 Edgar Hoch 2011-05-19 15:16:44 UTC
I'm not sure if I'm right, but I would expect the NFS client code to look in the VM cache for information about a file first, before asking the NFS server about it. If it doesn't, then this is a source of systematically wrong information about NFS files. No matter how low you set dirty_ratio, only zero guarantees correct file information, which means not using the VM cache for NFS files at all.

Is my reasoning correct?

Comment 10 Harshula Jayasuriya 2011-08-10 08:04:17 UTC
(In reply to comment #9)
Please see the Technical Notes of Bug 469848, "[RHEL5.2] nfs_getattr() hangs during heavy write workloads", for a complete explanation.

Comment 11 Fedora End Of Life 2012-08-16 13:47:39 UTC
This message is a notice that Fedora 14 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 14. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '14' have been closed as WONTFIX.

(Please note: Our normal process is to give advance warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 14 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping