Bug 1299169

Summary: [abrt] find explicitly aborts suspiciously enumerating nfs-ganesha NFS mount
Product: [Fedora] Fedora
Reporter: Matt Benjamin (redhat) <mbenjamin>
Component: findutils
Assignee: Kamil Dudka <kdudka>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 23
CC: kdudka, mbenjamin, zaitcev
Hardware: x86_64
OS: Linux
Fixed In Version: findutils-4.5.14-7.fc22 findutils-4.5.16-4.fc23 findutils-4.6.0-7.fc24 findutils-4.6.0-8.fc26 findutils-4.6.0-8.fc25
Doc Type: Bug Fix
Last Closed: 2016-09-20 17:05:16 UTC
Type: Bug
Bug Blocks: 1252549

Description Matt Benjamin (redhat) 2016-01-16 21:42:08 UTC
Description of problem:
find crashes with an explicit abort() at approximately line 138 of fts-cycle.c while attempting to clean up a directory cycle check record. This happens when running find on an NFSv4.1 mount served by an nfs-ganesha NFS server.

The server is a custom one under development, but I have some evidence that the file system objects it produces are legitimate.

1. It doesn't happen on Fedora 22.
2. Other tools (e.g., tar, ls -R) can enumerate the mount.
3. I instrumented the find code (installed the source with dnf and built it with debug symbols) and found that the proximate cause is find's cycle check failing to remove a cycle check record (matching an inode in some earlier part of the traversal) from an internal hash table, even though it has never attempted to insert a matching record. I am not an expert in the overall code, and I have not attempted to deduce whether a record should have been added. However, based on the output of my instrumented code, it seems possible that the code is incorrectly performing the hash_delete BEFORE attempting the hash_insert (below).

I have some output from my instrumented find, which prints the dev_t and ino_t hash key each time it adds a cycle check record, and also prints the key of the record it has failed to delete--which in the shipping find, triggers the abort():

cycle check hash_insert dev 46 ino 15995756253669371374
/nfs41/bdirs1
cycle check hash_insert dev 46 ino 15995756253669371374
cycle check hash_insert dev 46 ino 12660543856672316061
/nfs41/bdirs1/dir_0
cycle check hash_insert dev 46 ino 12660543856672316061
/nfs41/bdirs1/dir_0/sfile_1
cycle check hash_insert dev 46 ino 14671467442116281020
/nfs41/bdirs1/dir_0/sdir_1
cycle check hash_insert dev 46 ino 14671467442116281020
/nfs41/bdirs1/dir_0/sdir_0
failed deleting dev 46 ino 13558495606471022821 <------ abort() here
cycle check hash_insert dev 46 ino 13558495606471022821 <------ susp. insert
/nfs41/bdirs1/dir_0/sfile_2
/nfs41/bdirs1/dir_0/sfile_0
/nfs41/bdirs1/dir_1
failed deleting dev 46 ino 18173809456036378840
cycle check hash_insert dev 46 ino 18173809456036378840
/nfs41/bdirs1/dir_1/sfile_1
cycle check hash_insert dev 46 ino 6788543906858460145
/nfs41/bdirs1/dir_1/sdir_1
cycle check hash_insert dev 46 ino 6788543906858460145
/nfs41/bdirs1/dir_1/sdir_0
failed deleting dev 46 ino 6551903446543277581
cycle check hash_insert dev 46 ino 6551903446543277581
/nfs41/bdirs1/dir_1/sfile_2
/nfs41/bdirs1/dir_1/sfile_0
/nfs41/bdirs1/dir_2
...
(more snipped)
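
The bookkeeping described above can be modeled in a few lines. This is only a sketch in Python of the pattern as the reporter describes it (insert a (dev, ino) key when a directory is entered, delete it when the directory is left, and treat a failed delete as fatal). The class and method names are hypothetical and are not taken from findutils or gnulib, and an exception stands in for abort().

```python
class ActiveDirs:
    """Hypothetical model of a cycle check table keyed on (dev, ino)."""

    def __init__(self):
        self._keys = set()

    def enter(self, dev, ino):
        """Record a directory on entry; return False if it is already active (a cycle)."""
        key = (dev, ino)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True

    def leave(self, dev, ino):
        """Remove a directory on exit; a missing key means the bookkeeping is inconsistent."""
        key = (dev, ino)
        if key not in self._keys:
            # Stand-in for the abort() in the shipping find.
            raise RuntimeError(f"failed deleting dev {dev} ino {ino}")
        self._keys.remove(key)
```

In this model, calling leave() for a key that was never passed to enter() fails immediately, which is the same shape as the "failed deleting" lines in the output above.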

In general, I am suspicious of explicit aborts like this.

Some special properties of -ALL- ganesha nfsd servers (i.e., all the different backing file systems they support) currently:

1. inode numbers use full 64-bit space
2. ?

A key property that WE (the Red Hat Ceph RGW developers) expect based on experience with many Red Hat Linux and other Unix versions:

* I expect find not to misbehave catastrophically if a directory has more or fewer dirents than reported in its link count
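
The traditional Unix convention at issue here is that a directory's link count equals 2 plus the number of its subdirectories. The following is a minimal diagnostic sketch, assuming Python 3 on the client, that lists directories on a mount where this convention does not hold. The function names are hypothetical and nothing here is taken from findutils.

```python
import os


def count_subdirs(path):
    """Count the immediate subdirectories of path without following symlinks."""
    with os.scandir(path) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))


def link_count_mismatches(root):
    """Yield (directory, st_nlink, subdir_count) wherever st_nlink != subdir_count + 2."""
    for dirpath, _dirnames, _filenames in os.walk(root):
        nlink = os.lstat(dirpath).st_nlink
        subdirs = count_subdirs(dirpath)
        if nlink != subdirs + 2:
            yield dirpath, nlink, subdirs
```

Running `for row in link_count_mismatches("/nfs41"): print(row)` would print every directory on which a traversal that trusts the link count could be misled. Some local file systems also do not follow this convention, so a mismatch by itself does not indicate a server bug.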

Version-Release number of selected component (if applicable):
4.5.16 (this failure does not reproduce on Fedora 22).

How reproducible:
100%, at least with my development NFS server


Steps to Reproduce:
1. mount /nfs41
2. run find /nfs41
3. crash

Comment 1 Kamil Dudka 2016-01-18 09:24:16 UTC
Could you please re-test this with the -noleaf option of find?

Please report also the exact NVRs (Name Version Release) of findutils for both Fedora 22 and Fedora 23:

$ rpm -q findutils

Comment 2 Matt Benjamin (redhat) 2016-01-18 15:18:48 UTC
with -noleaf, there is no abort;  the nfs mount contains no symlinks

findutils-4.5.16-1.fc23.x86_64
findutils-4.5.14-3.fc22.x86_64

Comment 3 Kamil Dudka 2016-01-18 16:02:30 UTC
(In reply to Matt Benjamin (redhat) from comment #0)
> * I expect find not to misbehave catastrophically if a directory has more or
> less dirents than reported in it's link count

You have basically described the cause -- find relies on something that your file system does not guarantee to hold.

(In reply to Matt Benjamin (redhat) from comment #2)
> with -noleaf, there is no abort;  the nfs mount contains no symlinks

Thanks for the confirmation!  This means that the issue is caused by the leaf optimization, which was recently enabled in find to increase performance when traversing large NFS directories.  See bug 1252549 for details.

> findutils-4.5.16-1.fc23.x86_64
> findutils-4.5.14-3.fc22.x86_64

findutils-4.5.14-3.fc22.x86_64 does not enable the leaf optimization for NFS but, if you update to findutils-4.5.14-6.fc22, which is already stable, it will behave similarly.

Could you please check whether oldfind(1) deals with this situation any better?  

IIRC, oldfind enables leaf optimization by default but implements a fallback for file systems that are not compatible with that optimization...

Comment 4 Matt Benjamin (redhat) 2016-01-18 16:11:06 UTC
I'll be able to try it this evening.

Comment 5 Kamil Dudka 2016-01-18 16:49:00 UTC
reported upstream:

http://thread.gmane.org/gmane.comp.lib.gnulib.bugs/35539/focus=35650

Comment 6 Matt Benjamin (redhat) 2016-01-19 16:20:07 UTC
Thanks, Kamil!

Comment 7 Kamil Dudka 2016-01-19 16:29:34 UTC
Matt, have you had any time to check how oldfind(1) deals with that situation?

Is it easy to set up such a file system for experimenting locally?

Comment 8 Matt Benjamin (redhat) 2016-01-19 16:40:06 UTC
I forgot to run oldfind.  I might be able to do it during the day (US Eastern); I promise to do it this evening at the latest.

It's all upstream first, but you need to set up a development Ceph cluster with the RGW service, and a development nfs-ganesha above that.  If you have the chops to set up a full Ceph cluster and try nfs-ganesha, both built from source, then yes...

Comment 9 Kamil Dudka 2016-01-26 09:04:24 UTC
(In reply to Matt Benjamin (redhat) from comment #8)
> I forgot to run oldfind.  I might be able to do it during the day us
> eastern, I promise to do it this evening at latest.

So it is evening :-)  Have you had any luck with oldfind?

> It's all upstream first, but you need to set up a development Ceph cluster
> with the RGW service, and a development nfs-ganesha above that.  I you have
> the chops to set up a full ceph cluster and try nfs-ganesha, both built from
> source, then yes...

I have no experience with Ceph at all, so this would be difficult for me.  Could you please grant me SSH access to a machine where the bug occurs?

Comment 10 Kamil Dudka 2016-06-24 11:01:04 UTC
Another instance of this bug was reported to me privately yesterday.  I have decided to unconditionally disable the leaf optimization for NFS in stable releases of Fedora until we have a better solution.

Comment 11 Fedora Update System 2016-06-24 12:38:58 UTC
findutils-4.5.16-4.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-6fc93cb14c

Comment 12 Fedora Update System 2016-06-24 12:39:06 UTC
findutils-4.6.0-7.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-fa6026a2e2

Comment 13 Fedora Update System 2016-06-24 12:39:11 UTC
findutils-4.5.14-7.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-4204f66082

Comment 14 Fedora Update System 2016-06-25 00:25:24 UTC
findutils-4.5.14-7.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-4204f66082

Comment 15 Fedora Update System 2016-06-25 00:27:27 UTC
findutils-4.5.16-4.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-6fc93cb14c

Comment 16 Fedora Update System 2016-06-25 00:31:05 UTC
findutils-4.6.0-7.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-fa6026a2e2

Comment 17 Fedora Update System 2016-06-26 20:54:26 UTC
findutils-4.6.0-7.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 18 Fedora Update System 2016-06-28 14:23:34 UTC
findutils-4.5.16-4.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 19 Fedora Update System 2016-07-12 02:22:14 UTC
findutils-4.5.14-7.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

Comment 20 Kamil Dudka 2016-09-16 07:38:54 UTC
Nobody has been interested in debugging (or at least providing some debugging environment for) this bug since January.  Upstream and some Linux distros have already disabled the optimization that triggered the bug.  I am disabling the optimization in Fedora now...

Comment 21 Fedora Update System 2016-09-16 07:52:47 UTC
findutils-4.6.0-8.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-97dba33593

Comment 22 Fedora Update System 2016-09-17 20:56:25 UTC
findutils-4.6.0-8.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-97dba33593

Comment 23 Fedora Update System 2016-09-20 17:05:12 UTC
findutils-4.6.0-8.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.