Description of problem:
When an NFS shared directory contains many files (roughly 200,000 or more), it begins to show duplicate entries for some files. ls, find, cpio et cetera will all list the file twice. File name and inode number are exactly the same. Duplicate entries do not appear on local filesystems or on the NFS server, i.e. the problem appears only on NFS clients.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Configure and mount an NFS share (this seems to happen with both NFSv3 and NFSv4)
2. Create several hundred thousand files on the share:
for x in $(seq 1 300000); do touch $x; done
3. Check for duplicate entries with ls:
ls -1 | wc -l
ls -1 | sort | uniq | wc -l
Some files are listed twice.
Each file should be listed exactly once.
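The two ls pipelines in step 3 compare the raw entry count with the de-duplicated count. A rough C equivalent, which reads the directory with readdir(3) directly instead of going through ls, might look like the sketch below. It is not part of the original report; `count_repeats` and `dir_repeats` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_names(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Sort `names` in place and count entries equal to the one before them,
 * i.e. `ls -1 | wc -l` minus `ls -1 | sort | uniq | wc -l`. */
static size_t count_repeats(char **names, size_t n)
{
    size_t repeats = 0;
    if (n > 1)
        qsort(names, n, sizeof *names, cmp_names);
    for (size_t i = 1; i < n; i++)
        if (strcmp(names[i - 1], names[i]) == 0)
            repeats++;
    return repeats;
}

/* Read `path` with readdir(3), skipping "." and ".." as ls does, and return
 * the number of repeated names, or -1 if the directory cannot be read to
 * the end (a readdir loop on the client shows up as such a failure). */
static long dir_repeats(const char *path)
{
    DIR *d = opendir(path);
    if (!d)
        return -1;
    size_t n = 0, cap = 1024;
    char **names = malloc(cap * sizeof *names);
    long result = -1;
    if (!names)
        goto out;
    for (;;) {
        errno = 0;               /* readdir returns NULL both at end and on error */
        struct dirent *e = readdir(d);
        if (!e) {
            if (errno != 0)
                goto out;
            break;
        }
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        if (n == cap) {          /* grow the name array geometrically */
            char **tmp = realloc(names, 2 * cap * sizeof *names);
            if (!tmp)
                goto out;
            names = tmp;
            cap *= 2;
        }
        if ((names[n] = strdup(e->d_name)) == NULL)
            goto out;
        n++;
    }
    result = (long)count_repeats(names, n);
out:
    for (size_t i = 0; i < n; i++)
        free(names[i]);
    free(names);
    closedir(d);
    return result;
}
```

Run inside the NFS-mounted directory, `dir_repeats(".")` should return 0 on a healthy mount; on an affected client it would return a positive count, or -1 if the listing fails partway through.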
This problem does not happen in Fedora 14.
Bug 623902 sounds similar, but on RHEL 4.
I am also seeing another problem, although it is slightly harder to reproduce and I think it only happens with NFSv3.
Occasionally, when creating several hundred thousand files as above and periodically listing them with ls during creation, I begin to get the following error:
ls: reading directory .: Too many levels of symbolic links
even though the directory contains no symlinks. /var/log/messages also shows the following error when this happens:
kernel: [288734.195484] NFS: directory idfah/td contains a readdir loop.Please contact your server vendor. Offending cookie: 176586904
I'm not sure if these problems are related or not.
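The ls message is simply the strerror() text for ELOOP on Linux, so the client is evidently failing the directory read with that errno rather than reporting a real symlink. A toy sketch of the kind of check that could trigger it is below: remember the cookies already used to resume a listing, and fail with ELOOP when the server hands one back a second time. This is not the kernel's actual logic; `struct cookie_history` and `note_cookie` are hypothetical names, and the fixed 64-slot history is purely for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of the cookies a client has already used to resume
 * one directory listing. Fixed-size purely for illustration. */
struct cookie_history {
    uint64_t seen[64];
    size_t n;
};

/* Record `cookie` as the next resume point. Returns 0 if it is new, or
 * -ELOOP if it was already used: resuming from it again would re-read the
 * same part of the directory, so the listing could never finish.
 * Once the history is full, new cookies are no longer recorded. */
static int note_cookie(struct cookie_history *h, uint64_t cookie)
{
    for (size_t i = 0; i < h->n; i++)
        if (h->seen[i] == cookie)
            return -ELOOP;
    if (h->n < sizeof h->seen / sizeof h->seen[0])
        h->seen[h->n++] = cookie;
    return 0;
}
```

With the cookie from the log line above, a second `note_cookie(&h, 176586904)` returns `-ELOOP`.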
Except for the "does not happen in Fedora 14" part, that sounds a lot like the problem which these patches address:
It would be worth trying them, if possible. They'll also be included in 3.2.
What filesystem are you exporting, and are you using the same kernel version on client and server?
This is almost certainly a server-side issue. Basically, the server is sending the same cookie for multiple files, and that confuses the client when the next READDIR call tries to pick up where the last one left off. This can also occur when the client is listing a directory that is changing frequently (files being added and removed).
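To make the cookie explanation concrete, here is a toy model (hypothetical types and function names, not ext3's or the kernel's real code). The server resumes a listing just after the first entry whose cookie matches the one the client sends, and the client always sends the cookie of the last entry it received. When two entries share a cookie, the second one is sent twice; with a small enough batch the listing never ends, which is the readdir-loop case.

```c
#include <stddef.h>
#include <stdint.h>

/* One directory entry as a toy server sees it. */
struct entry {
    const char *name;
    uint64_t cookie;
};

/* Toy server: position a listing just after the first entry whose cookie
 * equals `cookie`; cookie 0 means "start of directory". An unknown cookie
 * ends the listing. */
static size_t resume_index(const struct entry *dir, size_t n, uint64_t cookie)
{
    if (cookie == 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (dir[i].cookie == cookie)
            return i + 1;
    return n;
}

/* Toy client: fetch up to `batch` entries per call, resuming each call from
 * the cookie of the last entry received. Stores names in out[] (at most `cap`)
 * and returns how many it stored; `cap` also stops a looping listing. */
static size_t list_dir(const struct entry *dir, size_t n, size_t batch,
                       const char **out, size_t cap)
{
    size_t got = 0;
    uint64_t cookie = 0;
    if (batch == 0)
        return 0;
    while (got < cap) {
        size_t i = resume_index(dir, n, cookie);
        if (i >= n)
            break;                       /* server has nothing more to send */
        size_t end = (n - i < batch) ? n : i + batch;
        for (; i < end && got < cap; i++)
            out[got++] = dir[i].name;
        cookie = dir[end - 1].cookie;    /* resume point for the next call */
    }
    return got;
}
```

With entries A(1), B(2), C(2), D(3) and a batch size of 3, the client receives A, B, C, then resumes after B and receives C, D, so it sees C twice; with unique cookies each name appears once. With a batch size of 1, the same directory keeps returning C until `cap` is reached.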
What kernel are you using on the client?
What sort of filesystem is the underlying directory here?
Interesting, I do get some hits about this on Google. It looks like there may have been some recent changes regarding the readdir loop problem?
I am able to reproduce this with a RHEL 5 NFS server as well as a Fedora 15 NFS server. I am unable to reproduce either problem with a Fedora 14 NFS client.
Client kernel is: 220.127.116.11-5.fc15.x86_64
Underlying filesystem is ext3
Oh, I just noticed Bruce's comment. Yes, that was the patch I noticed too. I will try a Fedora 14 client again to double-check that the problem doesn't happen there. When I tried a Fedora 15 client and server, they both had the same kernel.
What is the status of this issue? Is it being tracked elsewhere? This is affecting a good number of my users and I don't see any updates since September.
As mentioned in comment 2, it would be worth testing after applying to the server's kernel the patches posted at http://marc.info/?l=linux-nfs&m=131281788003178&w=2.
This message is a notice that Fedora 15 is now at end of life. Fedora
has stopped maintaining and issuing updates for Fedora 15. It is
Fedora's policy to close all bug reports from releases that are no
longer maintained. At this time, all open bugs with a Fedora 'version'
of '15' have been closed as WONTFIX.
(Please note: Our normal process is to give advanced warning of this
occurring, but we forgot to do that. A thousand apologies.)
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen
this bug and simply change the 'version' to a later Fedora version.
Bug Reporter: Thank you for reporting this issue and we are sorry that
we were unable to fix it before Fedora 15 reached end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to click on
"Clone This Bug" (top right of this page) and open it against that
version of Fedora.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
The process we are following is described here:
This bug appears to be fixed in Fedora 16, thank you!
It is still present, however, in RHEL 6. I have re-opened it and transferred it to RHEL 6.
Created attachment 603346
Program to reproduce bug
This C program reproduces the bug by creating 300,000 files in ./td. While it runs, monitor the directory with:
watch -n 10 'ls -1 ./td | sort | uniq | wc -l; ls -1 ./td | wc -l'
and you will eventually see duplicate files and get the error:
ls: reading directory ./td: Too many levels of symbolic links
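The attachment itself is not reproduced here. Below is a minimal C sketch of a program along the lines described; it is not the attached program, and `create_files` is a hypothetical name. It creates files named 1 through N in a directory, mirroring the `touch` loop from the original steps.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create empty files named 1..count inside `dir` (created if missing), like
 * `for x in $(seq 1 300000); do touch $x; done`.
 * Returns 0 on success, -1 on the first failure. */
static int create_files(const char *dir, long count)
{
    char path[4096];
    if (mkdir(dir, 0755) != 0 && errno != EEXIST)
        return -1;
    for (long i = 1; i <= count; i++) {
        int len = snprintf(path, sizeof path, "%s/%ld", dir, i);
        if (len < 0 || (size_t)len >= sizeof path)
            return -1;                   /* path would be truncated */
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0 || close(fd) != 0)
            return -1;
    }
    return 0;
}
```

Calling `create_files("./td", 300000)` from a small `main()` in the NFS-mounted directory, with the watch command running in another terminal, approximates the reproduction described.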
It is unclear to me which versions of EL have this fix applied. I am expecting to install a new OS on our NFS server running Fedora 14 (long overdue, I know, but if it works... well now it doesn't) to correct this issue, but I need to be sure the OS I install has the necessary patches. If I understand correctly, installing RHEL6 will NOT correct the problem? Or does "It is still present [...] in RHEL 6" refer to the client side so RHEL6 on the server would do the trick? How about RHEL5? Or am I stuck with Fedora? If so, which version?
I feel like this might be the wrong place to make this request, but it seems to me that this information should be recorded here. I apologize in advance if I am violating the convention. Please correct me if I have.
Alan, I have tried the following configurations:
Server | Client | NFS Bug
RHEL 6 | RHEL 6 | Yes
RHEL 6 | Fedora 16 | Yes
Fedora 16 | Fedora 16 | No
Note, however, that I only see the problem in directories with lots of files (say 200,000) so it may not be noticeable with typical use.
Yes, we have had the same problem with a directory containing almost 160K files, while the thousands of other directories in the same share have had no issues so far. Our server is Fedora 14, and the clients I have reproduced the issue on include Fedora 14, CentOS 6, and Ubuntu 11.10. Of course, our RHEL 6 clients should behave the same, but I don't think I directly tested them. This is consistent with your table, which suggests a server-side patch would fix it.
Things have calmed down on this issue recently, so I'm hoping a patch for RHEL 6 comes out before people start getting antsy again. =)
I came across this problem and searched the web until I found this bug. I tried to rsync a directory containing thousands of files, and it gave me the same error you reported.
My client is CentOS 6.3, kernel 2.6.32-279.11.1.el6.x86_64, ext4.
My NFS server is CentOS 5.6, kernel 2.6.32, ext3.
I rsynced the same directory successfully before, but at that time NFS was mounted over UDP rather than TCP as it is now. Will this make any difference? In any case, I will try rsyncing this directory again over UDP, or try cp instead of rsync, to see whether that works.
*** This bug has been marked as a duplicate of bug 813070 ***
I am told that I am not authorized to access bug 813070. How can I track progress on this issue?
If you have a RHEL subscription, you should work with your RH support people to track when it will land in your supported kernel.
Of course, we always push fixes upstream as well if you are self-supporting.
This is very similar to a bug I've found in the latest CentOS 5.9 kernel. Since I don't know where else to post it, I'll try here. Please check out this discussion thread:
Kernel 2.6.18-348.el5 always reproduces this horrendous bug. Previous kernels do not appear to have this bug.
Please move this comment to wherever it makes sense.
Me too. I just ran into a similar problem. I don't know if it is the same one, but with kernel 2.6.18-348 on the 5.9 server, LTSP-4 clients (running kernel 2.4.26) immediately fail to mount from it: ls commands return empty output, and X11 fails with "doesn't exist" when looking for modules to load.
Reverting to kernel 2.6.18-308.24.1.el5 brings back the old functionality.
I can confirm this behaviour as well with a RHEL 5.9 server and a Fedora 18 client. Like Rudolf, reverting to kernel 2.6.18-308.16.1 fixes the issue.
I meant to comment on the main bug instead of this duplicate, but I don't have access to it.