Bug 739222

Summary: Duplicate files on NFS shares
Product: Red Hat Enterprise Linux 6 Reporter: Elliott Forney <elliott.forney>
Component: kernel Assignee: nfs-maint
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 6.3 CC: alan, baumanmo, bfields, brian.p.stamper, f_a_f12001, info, jlayton, mozilla_bugs, redhat, Rudolf.Kollien, rwheeler, steved, toracat, ykawada
Target Milestone: rc Keywords: Reopened
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=784191
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-11-14 09:25:59 EST Type: ---
Attachments:
Description Flags
Program to reproduce bug none

Description Elliott Forney 2011-09-16 17:06:15 EDT
Description of problem:

When an NFS-shared directory contains many files (roughly 200,000 or more), it begins to show duplicate entries for some files.  ls, find, cpio, etc. all list the file twice.  The file name and inode number are exactly the same.  Duplicate entries do not appear on local filesystems or on the NFS server; the problem appears to occur only on NFS clients.

Version-Release number of selected component (if applicable):

nfs-utils-1.2.4-1.fc15.x86_64

How reproducible:

Always.

Steps to Reproduce:
1.  Configure and mount an NFS share (seems to happen on both nfs3 and nfs4)
2.  Create several hundred thousand files on the share:
    for x in $(seq 1 300000); do touch $x; done
3.  Check for duplicate entries with ls:
    ls -1 | wc -l
    ls -1 | sort | uniq | wc -l
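
The two ls counts above only show that duplicates exist, not which names are affected. A minimal Python sketch that reports the duplicated names is given here; the function names duplicate_names and duplicate_entries are made up for illustration and are not part of any tool mentioned in this report.

```python
import os
from collections import Counter


def duplicate_names(names):
    """Return {name: count} for every name that occurs more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


def duplicate_entries(path="."):
    """List `path` once and report names the listing returned more than once.

    On a healthy filesystem this is always empty; a non-empty result means
    the directory listing itself contains repeated entries.
    """
    with os.scandir(path) as it:
        return duplicate_names(entry.name for entry in it)
```

Calling duplicate_entries(".") inside the mounted directory should return an empty dict when the listing is consistent, and a dict of the repeated file names when the problem described here occurs.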
  
Actual results:

Some files are listed twice.

Expected results:

There should be a unique listing for each file.

Additional info:

This problem does not happen in Fedora 14.

Bug 623902 sounds similar, but it was reported against RHEL 4.
Comment 1 Elliott Forney 2011-09-16 18:36:15 EDT
I am also seeing another problem although it is slightly less reproducible and I think it only happens with nfs3.

Occasionally, when creating several hundred thousand files as above and periodically listing them with ls during creation, I begin to get the following error:

ls: reading directory .: Too many levels of symbolic links

even though the directory contains no symlinks.  /var/log/messages also shows the following error when this happens:

kernel: [288734.195484] NFS: directory idfah/td contains a readdir loop.Please contact your server vendor.  Offending cookie: 176586904

I'm not sure if these problems are related or not.
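
"Too many levels of symbolic links" is the standard message text for the ELOOP errno on Linux, so a script can tell this failure apart from other listing errors by checking the errno rather than the message. The sketch below assumes the failed listing surfaces as ELOOP, which is an inference from the ls error text above and not something confirmed in this report; the function name list_directory_checked is hypothetical.

```python
import errno
import os


def list_directory_checked(path="."):
    """List `path` and return (names, loop_reported).

    loop_reported is True when the listing fails with ELOOP. Any other
    error is re-raised unchanged.
    """
    try:
        return os.listdir(path), False
    except OSError as exc:
        if exc.errno == errno.ELOOP:
            # Assumption: this is how the readdir loop shows up to userspace.
            return [], True
        raise
```

Run periodically while the files are being created, this gives a simple yes/no signal for the error without parsing ls output.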
Comment 2 J. Bruce Fields 2011-09-16 18:46:49 EDT
Except for the "does not happen in Fedora 14" part, that sounds a lot like the problem which these patches address:

http://marc.info/?l=linux-nfs&m=131281788003178&w=2

It would be worth trying them, if possible.  They'll also be included in 3.2.

What filesystem are you exporting, and are you using the same kernel version on client and server?
Comment 3 Jeff Layton 2011-09-16 18:50:48 EDT
This is almost certainly a server-side issue. Basically, the server is sending the same cookie for multiple files, and that causes confusion when the next READDIR call wants to pick up where the last one left off. This can also occur when the client is trying to ls a directory that is changing frequently (files being added and removed).

What kernel are you using on the client?

What sort of filesystem is the underlying directory here?
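
The cookie confusion described in this comment can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the "server" resumes after the first entry whose cookie matches, and the "client" just pages through in fixed-size batches. It is not the kernel's actual READDIR logic, and readdir and list_directory here are hypothetical names.

```python
def readdir(entries, cookie, count):
    """Toy server: return up to `count` (cookie, name) pairs that follow
    the FIRST entry whose cookie equals `cookie` (None means start)."""
    start = 0
    if cookie is not None:
        for i, (c, _name) in enumerate(entries):
            if c == cookie:
                start = i + 1
                break
    return entries[start:start + count]


def list_directory(entries, count=2, max_calls=20):
    """Toy client: page through the directory, resuming from the cookie of
    the last entry received. Returns (names, looped), where looped is True
    if the listing had not finished after `max_calls` calls."""
    names, cookie = [], None
    for _ in range(max_calls):
        batch = readdir(entries, cookie, count)
        if not batch:
            return names, False
        names.extend(name for _c, name in batch)
        cookie = batch[-1][0]
    return names, True


unique = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
shared = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]  # "b" and "c" share cookie 2
stuck = [(1, "a"), (2, "b"), (2, "c")]             # shared cookie on the last entry
```

With unique cookies, list_directory(unique) returns each name once. With shared and count=3, the second call resumes after "b" instead of after "c", so "c" is listed twice. With stuck and count=3, every call after the first returns "c" again and the listing never finishes, which is the toy equivalent of a readdir loop.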
Comment 4 Elliott Forney 2011-09-16 19:11:55 EDT
Interesting, I do get some hits about this on google.  Looks like maybe there have been some recent changes regarding the readdir loop problem?

I am able to reproduce this with a RHEL 5 NFS server as well as a Fedora 15 NFS server.  I am unable to reproduce either problem with a Fedora 14 NFS client.

Client kernel is:  2.6.40.4-5.fc15.x86_64
Underlying filesystem is ext3
Comment 5 Elliott Forney 2011-09-16 19:17:32 EDT
Oh, I just noticed Bruce's comment.  Yes, this was the patch I noticed too.  I will try a Fedora 14 client again to double check that it doesn't happen there.  When I tried a Fedora 15 client and server they did both have the same kernel.
Comment 6 Brian 2011-11-14 18:59:04 EST
What is the status of this issue?  Is it being tracked elsewhere?  This is affecting a good number of my users and I don't see any updates since September.
Comment 7 J. Bruce Fields 2011-11-15 16:52:45 EST
As mentioned in comment 2, it would be worth testing after applying to the server's kernel the patches posted at http://marc.info/?l=linux-nfs&m=131281788003178&w=2.
Comment 9 Fedora End Of Life 2012-08-06 16:06:07 EDT
This message is a notice that Fedora 15 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 15. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained.  At this time, all open bugs with a Fedora 'version'
of '15' have been closed as WONTFIX.

(Please note: Our normal process is to give advanced warning of this 
occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, feel free to reopen 
this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we were unable to fix it before Fedora 15 reached end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" (top right of this page) and open it against that 
version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 11 Elliott Forney 2012-08-09 16:45:15 EDT
This bug appears to be fixed in Fedora 16, thank you!

It is still present, however, in RHEL 6.  I have re-opened it and transferred to RHEL 6.
Comment 12 Elliott Forney 2012-08-09 16:50:09 EDT
Created attachment 603346
Program to reproduce bug

This C program will reproduce the bug by creating 300,000 files in ./td.

Run

./many_files &
watch -n 10 'ls -1 ./td | sort | uniq | wc -l; ls -1 ./td | wc -l'

and you will eventually see duplicate files and get the error:

ls: reading directory ./td: Too many levels of symbolic links
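
The attachment itself is not included in this text. For readers without access to it, a rough Python equivalent might look like the sketch below. It assumes the program does nothing more than create empty, numbered files in ./td, as the description says; the function name create_files and the file naming are guesses, not taken from the attachment.

```python
import os


def create_files(directory="td", count=300_000):
    """Create `count` empty files named 1..count inside `directory`."""
    os.makedirs(directory, exist_ok=True)
    for i in range(1, count + 1):
        # Create the file and close it right away, like `touch` on a new name.
        fd = os.open(os.path.join(directory, str(i)),
                     os.O_CREAT | os.O_WRONLY, 0o644)
        os.close(fd)
    return count
```

Calling create_files() from a script started in the background on the NFS mount, with the watch command above running alongside it, should approximate the original test.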
Comment 14 Alan Johnson 2012-09-20 17:29:33 EDT
It is unclear to me which versions of EL have this fix applied.  I am expecting to install a new OS on our NFS server running Fedora 14 (long overdue, I know, but if it works... well now it doesn't) to correct this issue, but I need to be sure the OS I install has the necessary patches.  If I understand correctly, installing RHEL6 will NOT correct the problem?  Or does "It is still present [...] in RHEL 6" refer to the client side so RHEL6 on the server would do the trick?  How about RHEL5?  Or am I stuck with Fedora?  If so, which version?

I feel like this might be the wrong place to make this request, but it seems to me that this information should be recorded here.  I apologize in advance if I am violating the convention.  Please correct me if I have.
Comment 15 Elliott Forney 2012-10-04 18:39:21 EDT
Alan, I have tried the following configurations:

Server    | Client    | NFS Bug
----------+-----------+--------
RHEL 6    | RHEL 6    | Yes
RHEL 6    | Fedora 16 | Yes
Fedora 16 | Fedora 16 | No

Note, however, that I only see the problem in directories with lots of files (say 200,000) so it may not be noticeable with typical use.
Comment 16 Alan Johnson 2012-10-08 12:13:57 EDT
Yes, we have had the same problem with a directory containing almost 160K files, while the thousands of other directories in the same share have had no issues so far.  Our server is Fedora 14 and the clients I have reproduced the issue on include Fedora 14, CentOS 6, and Ubuntu 11.10.  Of course, our RHEL 6 clients should behave the same, but I don't think I directly tested them.  This is consistent with your table, which suggests a server-side patch would fix it.

Things have calmed down on this issue recently, so I'm hoping a patch for RHEL 6 comes out before people start getting antsy again. =)

Thanks!
Comment 17 f_a_f12001 2012-11-12 07:40:52 EST
Hi,
  I came across this problem and searched the web until I ended up here. I tried to rsync a directory that has thousands of files and got the error you reported.
My client is CentOS 6.3, kernel 2.6.32-279.11.1.el6.x86_64, EXT4.
My NFS server is CentOS 5.6, kernel 2.6.32, EXT3.
I rsynced the same directory successfully before, but at that time the NFS share was mounted over UDP rather than TCP as it is now. Will this make any difference? In any case, I will try to rsync this directory again over UDP, or try the cp command instead of rsync, in case that works.
Comment 18 Jeff Layton 2012-11-14 09:25:59 EST

*** This bug has been marked as a duplicate of bug 813070 ***
Comment 19 Alan Johnson 2012-11-14 11:09:59 EST
I am told that I am not authorized to access bug 813070.  How can I track progress on this issue?
Comment 20 Ric Wheeler 2012-11-14 14:59:02 EST
Hi Alan,

If you have a RHEL subscription, you should work with your RH support people to track when it will land in your supported kernel.

Of course, we always push fixes upstream as well if you are self-supporting.
Comment 21 redhat 2013-01-24 12:01:09 EST
This is very similar to a bug I've discovered in the latest CentOS 5.9 kernel. Not knowing where else to post the problem, I'll try here. Please check out this discussion thread:

https://www.centos.org/modules/newbb/viewtopic.php?topic_id=41070&forum=37

Kernel 2.6.18-348.el5 always reproduces this horrendous bug. Previous kernels do not appear to have this bug.

Please move this comment to wherever it makes sense.

Thanks!
Comment 22 Rudolf 2013-01-31 05:55:27 EST
Me too. I just ran into a similar problem. I don't know if it is the same, but with kernel 2.6.18-348 I immediately can no longer mount from LTSP-4 clients (using kernel 2.4.26) to the 5.9 server. An ls command returns empty, and X11 fails when looking for modules to load, reporting "doesn't exist".

Reverting to kernel 2.6.18-308.24.1.el5 brings back the old functionality.
Comment 23 info@kobaltwit.be 2013-02-17 14:41:19 EST
I can confirm this behaviour as well with a RHEL 5.9 server and a Fedora 18 client. Like Rudolf, reverting to kernel 2.6.18-308.16.1 fixes the issue.

I meant to comment on the main bug instead of this duplicate, but I don't have access to it.