Bug 1248140 - RHEL6: NFS files inside larger directories return d_type = 0 = DT_UNKNOWN for getdents calls, leading to poor performance of 'find' commands, regression in RHEL6.4 kernels and above [NEEDINFO]
RHEL6: NFS files inside larger directories return d_type = 0 = DT_UNKNOWN for...
Status: CLOSED DEFERRED
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel (Show other bugs)
6.4
Unspecified Unspecified
high Severity high
: rc
: ---
Assigned To: Benjamin Coddington
Yin.JianHong
: TestCaseProvided
Depends On:
Blocks: 1374441 1461138 1507140
  Show dependency treegraph
 
Reported: 2015-07-29 13:07 EDT by Dave Wysochanski
Modified: 2017-12-19 15:39 EST (History)
17 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1442272 (view as bug list)
Environment:
Last Closed: 2017-12-19 15:39:33 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
dwysocha: needinfo? (dwysocha)
swhiteho: needinfo? (bcodding)


Attachments (Terms of Use)
test case to repro this bug which takes 2 inputs: 1) directory, 2) number of files (1.71 KB, application/octet-stream)
2015-07-29 13:08 EDT, Dave Wysochanski
no flags Details
Reworked test case to add a t0_bz1248140_prep.sh and t0_bz1248140_run.sh scripts which separates the creation of the files from the mount + find (4.54 KB, application/octet-stream)
2015-09-24 12:07 EDT, Dave Wysochanski
no flags Details
Updated testcase which includes various fixups including running of tcpdump, stap, strace, and summarizing results (4.92 KB, application/octet-stream)
2015-09-24 16:20 EDT, Dave Wysochanski
no flags Details
rhel5u10 (kernel-2.6.18-371.11.1.el5) test results (8.48 KB, text/plain)
2015-09-24 16:50 EDT, Dave Wysochanski
no flags Details
rhel6u3 (kernel-2.6.32-279.11.1.el6.x86_64) test results (8.89 KB, text/plain)
2015-09-24 16:52 EDT, Dave Wysochanski
no flags Details
rhel6u4 (kernel-2.6.32-358.el6.x86_64) test results (9.01 KB, text/plain)
2015-09-24 16:59 EDT, Dave Wysochanski
no flags Details
rhel6u6 (kernel-2.6.32-504.8.1.el6.x86_64) test results (9.09 KB, text/plain)
2015-09-24 17:01 EDT, Dave Wysochanski
no flags Details
rhel7u1 (kernel-3.10.0-229.el7.x86_64) test results (9.27 KB, text/plain)
2015-09-24 17:01 EDT, Dave Wysochanski
no flags Details
fedora21 (kernel-3.17.4-301.fc21.x86_64) test results (9.04 KB, text/plain)
2015-09-24 17:02 EDT, Dave Wysochanski
no flags Details
upstream (kernel-4.2.0-0.rc3.14.el7.aarch64) test results (9.12 KB, text/plain)
2015-09-24 17:03 EDT, Dave Wysochanski
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 1539583 None None None Never

  None (edit)
Description Dave Wysochanski 2015-07-29 13:07:04 EDT
Description of problem:
Customer reported bad 'find' performance of a simple command in an NFS3 mounted directory, which was a regression from RHEL5.  Customer showed the same exact export mounted with exactly the same options, and showed:
- RHEL5.11: find takes < 1 minute
- RHEL6.6: find takes over 30 minutes

Customer reported excessive 'newfstatat' system calls and it sounded like this matched a 'find' bug here
https://bugzilla.redhat.com/show_bug.cgi?id=883285

However, upon investigation of the kernels returning of d_type for the 'getdents' call, I see kernels after RHEL6.3 returning 'd_type = 0 == DT_UNKNOWN' for files larger than around 1000 in my reproducer.  This causes problems for find and requires an extra system call for each file in a directory leading to extremely poor performance.

I have a trivial reproducer I will attach and it clearly shows we're getting extra 'newfstatat' system calls when the find is done.  In addition, I've got a simple 1-line stap which can be run to see the d_type value and name of the file, so it's pretty easy to see what is going on.



Version-Release number of selected component (if applicable):
kernels above 2.6.32-279.*


How reproducible:
Every time

Steps to Reproduce:
Run the reproducer script and systemtap as follows:

1. mount -t nfs -overs=3 nfs-server:/exports/nfs3 /mnt/nfs3

2. nohup stap -v -e 'probe kernel.function("filldir") { printf("%s name = %s d_type = %d\n", pp(), text_strn(kernel_string($name),$namlen,0), $d_type) }' &

3. ./t1_bz883285.sh /mnt/nfs3/test 2000


Actual results:
Test script shows TEST FAIL, and there are excessive 'fstatat' system calls.
# ./t1_bz883285.sh /mnt/nfs3/test 2000
Setting FIND_DIR = /mnt/nfs3/test, FILE_COUNT = 2000
Testing https://bugzilla.redhat.com/show_bug.cgi?id=883285
Creating files in /mnt/nfs3/test/883285
Filesystem           1K-blocks    Used Available Use% Mounted on
192.168.122.37:/exports/nfs3
                       8910848 3299840   5611008  38% /mnt/nfs3
strace -c -o /tmp/test-883285-strace.txt find /mnt/nfs3/test -name *.pdf
Found 2000 files
Found 986 fstatat calls in /tmp/test-883285-strace.txt
Found 6 getdents calls in /tmp/test-883285-strace.txt
Full strace file:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.66    0.005000         833         6           getdents
 20.20    0.001268           1       986           newfstatat
  0.14    0.000009           1        13           mprotect
  0.00    0.000000           0         8           read
  0.00    0.000000           0        18           write
  0.00    0.000000           0        21         6 open
  0.00    0.000000           0        17           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0        12           fstat
  0.00    0.000000           0        25           mmap
  0.00    0.000000           0         3           munmap
  0.00    0.000000           0         8           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         2         1 ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         7           fchdir
  0.00    0.000000           0         1           gettimeofday
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         2           statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         3         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.006277                  1144         9 total
TEST FAIL: on findutils-4.4.2-6.2.test.bz883285.el6_7.x86_64 kernel-2.6.32-504.8.1.el6.x86_64 - excessive newfstatat system calls found


The 'nohup.out' file from systemtap shows d_type initially returning an '8' value, then once we get to the higher file numbers, it starts returning 0.
kernel.function("filldir@fs/readdir.c:149") name = file-1009.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1010.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1011.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1012.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1013.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1014.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1015.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1016.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1017.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1018.pdf d_type = 8
kernel.function("filldir@fs/readdir.c:149") name = file-1019.pdf d_type = 0
kernel.function("filldir@fs/readdir.c:149") name = file-1020.pdf d_type = 0
kernel.function("filldir@fs/readdir.c:149") name = file-1021.pdf d_type = 0
kernel.function("filldir@fs/readdir.c:149") name = file-1022.pdf d_type = 0
kernel.function("filldir@fs/readdir.c:149") name = file-1023.pdf d_type = 0
kernel.function("filldir@fs/readdir.c:149") name = file-1024.pdf d_type = 0



Expected results:
Since the test script creates files, then does a 'find', the kernel should return the type of the directory entry in 'getdents' - it should not be returning DT_UNKNOWN.  As a result, there should be no need to get the 'type' again from a fstatat system call on every file.


Additional info:
I can show RHEL5 kernel returns a proper d_type value for all files, and that earlier RHEL6 kernels did not have this problem.
Comment 1 Dave Wysochanski 2015-07-29 13:08:45 EDT
Created attachment 1057370 [details]
test case to repro this bug which takes 2 inputs: 1) directory, 2) number of files
Comment 2 Dave Wysochanski 2015-07-29 13:10:03 EDT
I just saw this bug as well, which may be related or something different:
https://bugzilla.redhat.com/show_bug.cgi?id=1236630

Also I have not tested upstream, fedora, or RHEL6.7 yet but should be fairly simple
Comment 5 Dave Wysochanski 2015-07-29 15:20:04 EDT
Passes on Fedora21 and RHEL7.1
TEST PASS: on findutils-4.5.12-7.fc21.x86_64 kernel-3.17.4-301.fc21.x86_64TEST TEST PASS: on findutils-4.5.11-3.el7.x86_64 kernel-3.10.0-229.el7.x86_64

Passes on RHEL6.3:
TEST PASS: on findutils-4.4.2-6.el6.x86_64 kernel-2.6.32-279.11.1.el6.x86_64
Comment 6 Dave Wysochanski 2015-07-29 15:27:18 EDT
Talking with Scott Mayhew and he's pretty sure it's this commit:
commit 082e5eb50ad9aec8b783372daff0479fe7ac5be1
Author: Steve Dickson <SteveD@redhat.com>
Date:   Thu May 10 14:52:47 2012 -0400

    [fs] nfs: Adapt readdirplus to application usage patterns
Comment 7 Dave Wysochanski 2015-07-29 15:47:27 EDT
TEST FAIL: on findutils-4.4.2-6.el6.x86_64 kernel-2.6.32-358.el6.x86_64
TEST FAIL: on findutils-4.4.2-6.2.test.bz883285.el6_7.x86_64 kernel-2.6.32-572.el6.sfdc01465846.x86_64

So it looks like this is a separate issue from https://bugzilla.redhat.com/show_bug.cgi?id=1236630
Comment 8 Dave Wysochanski 2015-08-03 08:00:35 EDT
(In reply to Dave Wysochanski from comment #0)
> 
> Expected results:
> Since the test script creates files, then does a 'find', the kernel should
> return the type of the directory entry in 'getdents' - it should not be
> returning DT_UNKNOWN.  As a result, there should be no need to get the
> 'type' again from a fstatat system call on every file.
> 
Unfortunately, it does not look like the above is possible.  Even though the test script is creating all the files in the directory, and so the kernel should already know the type, we don't cache the type anywhere (for example, it's not cached in the dentry).  Instead any 'getdents' call goes through the vfs to the specific filesystem (such as NFS) where the directory contents is obtained.  This all makes sense since the directory could have changed.  However, one could certainly envision designing an optimization which would cache the type after a create, perhaps at the vfs layer, though I'm not sure how reasonable / useful this is and seems it would be an RFE, probably upstream.  Also this isn't the focus of the original reported bad perf, but a side issue I noticed.
Comment 9 Dave Wysochanski 2015-09-24 12:07:05 EDT
Created attachment 1076603 [details]
Reworked test case to add a t0_bz1248140_prep.sh and t0_bz1248140_run.sh scripts which separates the creation of the files from the mount + find
Comment 10 Dave Wysochanski 2015-09-24 16:20:59 EDT
Created attachment 1076695 [details]
Updated testcase which includes various fixups including running of tcpdump, stap, strace, and summarizing results
Comment 11 Dave Wysochanski 2015-09-24 16:23:30 EDT
Comment on attachment 1076695 [details]
Updated testcase which includes various fixups including running of tcpdump, stap, strace, and summarizing results

Prep script:
# ./t0_bz1248140_prep.sh 
Usage: t0_bz1248140_prep.sh nfs-server-ip-address [file_count]

Then for actual runs:
# ./t0_bz1248140_run.sh
Usage: t0_bz1248140_run.sh nfs-server-ip-address [file_count] [0 == no remount | 1 == remount]

Example:
]# time ./t0_bz1248140_run.sh 192.168.122.37 30000 0

Will attach output for various kernels.
Comment 12 Dave Wysochanski 2015-09-24 16:48:43 EDT
Ok, I was wrong earlier when I claimed this problem does not affect RHEL7 or Fedora / upstream - https://bugzilla.redhat.com/show_bug.cgi?id=1248140#c5

Now with a revised test, I am seeing fedora / upstream fail with similar numbers as RHEL6.  I'll attach some output from test runs on various NFS clients.

It looks to me the main problem is that after RHEL6.3, we no longer cache d_type in the kernel.  So regardless of how many times you run 'find', a call to getdents will never return d_type = 8 anymore, and every entry find is forced to do a 'stat' on every entry, which translates to a GETATTR for every entry in a directory.  I am not sure why on RHEL6.3 when we were using READDIRPLUS, subsequent runs also have d_type = 8 and we don't need to do any further GETATTRs.  But after RHEL6.4 kernels all runs seem to have getdents calls returning d_type = 0 - despite the fact that we issue GETATTRs on each entry.

This goes beyond RHEL5's wrong assumption but rather I think there is probably a bug in the kernel where we're not caching d_type anymore.  It may come down to a different NFS op / system call, but it seems like we are caching the d_type in one case, and in another case we have to retrieve it each time.

At any rate, the looks like it's a problem upstream as well, so fixing it won't be as easy of just looking for patches to backport.
Comment 13 Dave Wysochanski 2015-09-24 16:50 EDT
Created attachment 1076713 [details]
rhel5u10 (kernel-2.6.18-371.11.1.el5) test results
Comment 14 Dave Wysochanski 2015-09-24 16:52 EDT
Created attachment 1076714 [details]
rhel6u3 (kernel-2.6.32-279.11.1.el6.x86_64) test results

The RHEL6.3 kernel is the last kernel which performed very close to RHEL5 for this test case.
Comment 15 Dave Wysochanski 2015-09-24 16:59 EDT
Created attachment 1076716 [details]
rhel6u4 (kernel-2.6.32-358.el6.x86_64) test results

This is the kernel were the performance regression shows up.

In the case where we're doing a fresh unmount / mount, then the find:
[root@rhel6u4-node1 01478764]# time  ./t0_bz1248140_run.sh 192.168.122.37 30000 1
...
real    0m24.180s
user    0m12.129s
sys     0m5.899s

This compares with 

RHEL6u3
[root@rhel6u3-node1 01478764]# time  ./t0_bz1248140_run.sh 192.168.122.37 30000 1
real    0m8.468s
user    0m3.968s
sys     0m1.308s

RHEL5u10
[root@rhel5u10-node1 01478764]# time ./t0_bz1248140_run.sh 192.168.122.37 30000 1
real    0m4.499s
user    0m1.549s
sys     0m0.603s


Now the case where we don't umount / remount, which is the most important

RHEL6u4
[root@rhel6u4-node1 01478764]# time  ./t0_bz1248140_run.sh 192.168.122.37 30000 0
...
real    0m20.010s
user    0m10.112s
sys     0m5.207s

And RHEL6u3
[root@rhel6u3-node1 01478764]# time  ./t0_bz1248140_run.sh 192.168.122.37 30000 0
real    0m4.356s
user    0m1.876s
sys     0m0.598s

RHEL5u10
[root@rhel5u10-node1 01478764]# time ./t0_bz1248140_run.sh 192.168.122.37 30000 0
real    0m3.065s
user    0m1.324s
sys     0m0.526s
Comment 16 Dave Wysochanski 2015-09-24 17:01 EDT
Created attachment 1076718 [details]
rhel6u6 (kernel-2.6.32-504.8.1.el6.x86_64) test results
Comment 17 Dave Wysochanski 2015-09-24 17:01 EDT
Created attachment 1076719 [details]
rhel7u1 (kernel-3.10.0-229.el7.x86_64) test results
Comment 18 Dave Wysochanski 2015-09-24 17:02 EDT
Created attachment 1076720 [details]
fedora21 (kernel-3.17.4-301.fc21.x86_64) test results
Comment 19 Dave Wysochanski 2015-09-24 17:03 EDT
Created attachment 1076721 [details]
upstream (kernel-4.2.0-0.rc3.14.el7.aarch64) test results
Comment 20 Dave Wysochanski 2015-09-24 17:11:37 EDT
Summary of test results for the non-remount scenario with 30,000 files.
This is a significant performance drop which does not seem to go away on subsequent runs (caching did not seem to help).

[root@rhel5u10-node1 01478764]# time ./t0_bz1248140_run.sh 192.168.122.37 30000 0
TEST PASS: on findutils-4.2.27-6.el5 kernel-2.6.18-371.11.1.el5 - did not find excessive newfstatat system calls
real    0m3.065s
--
[root@rhel6u3-node1 01478764]# time  ./t0_bz1248140_run.sh 192.168.122.37 30000 0
TEST PASS: on findutils-4.4.2-6.el6.x86_64 kernel-2.6.32-279.11.1.el6.x86_64 - did not find excessive newfstatat system calls
real    0m4.356s
--
[root@rhel6u4-node1 01478764]# time  ./t0_bz1248140_run.sh 192.168.122.37 30000 0
TEST FAIL: on findutils-4.4.2-6.el6.x86_64 kernel-2.6.32-358.el6.x86_64 - excessive newfstatat system calls found
real    0m20.010s
--
[root@rhel6u6-node1 01478764]# time ./t0_bz1248140_run.sh 192.168.122.37 30000 0
TEST FAIL: on findutils-4.4.2-6.2.test.bz883285.el6_7.x86_64 kernel-2.6.32-504.8.1.el6.x86_64 - excessive newfstatat system calls found
real    0m18.290s
--
[root@rhel7u1-node1 01478764]# time ./t0_bz1248140_run.sh 192.168.122.37 30000 0
TEST FAIL: on findutils-4.5.11-3.el7.x86_64 kernel-3.10.0-229.el7.x86_64 - excessive newfstatat system calls found
real    0m19.691s
--
TEST FAIL: on findutils-4.5.12-7.fc21.x86_64 kernel-3.17.4-301.fc21.x86_64 - excessive newfstatat system calls found
real    0m18.648s
--
[root@core-aarch64-01 01478764]# time ./t0_bz1248140_run.sh 10.12.211.246 30000 0
TEST FAIL: on findutils-4.5.11-5.el7.aarch64 kernel-4.2.0-0.rc3.14.el7.aarch64 - excessive newfstatat system calls found
real    0m17.976s
Comment 21 Dave Wysochanski 2015-10-06 21:38:27 EDT
From the other bug, this is the upstream commit which changed the behavior.
d69ee9b NFS: Adapt readdirplus to application usage patterns
Comment 26 Dave Wysochanski 2015-10-28 13:26:33 EDT
I just did a test where I reverted "d69ee9b NFS: Adapt readdirplus to application usage patterns" in RHEL6.4 GA kernel.  This did not fix the problem so there's something else going on.

I'm going to do one more test on the latest RHEL6 kernel with that patch reverted to be sure, then I may start a bisect if I can't get a better idea of where the regression occurred.
Comment 27 Dave Wysochanski 2015-10-28 14:51:27 EDT
Ok, I'm making some progress.  There was this patch added in RHEL6.7:
    [fs] nfs: Be more aggressive in using readdirplus for 'ls -l' situations

Interestingly, on a RHEL6.7 VM, if you run the testcase, you'll see it fail as usual.  But then if you do a 'ls -l' in the test directory with all the files, this triggers readdirplus on the directory, and the d_type field gets filled in for all the directory entries (d_type = 8)..  After this point, the problem no longer occurs due to the fact that 'getdents' after this point will always return 'd_type = 8' for all the entries in the directory.  This is exactly the behavior of the testcase in RHEL5 and in RHEL6, prior to RHEL6.4

So it seems if we get into the situation where readdirplus is not used on a directory, then each time the attribute cache expires, the 'stat' of each entry in the directory will trigger another GETATTR.  But these GETATTRs never fill in the 'd_type' field in the readdir cache - apparently these caches are separate and it does not seem like the readdir cache ever expires.
Comment 28 Kamil Dudka 2015-10-29 06:30:15 EDT
Comment on attachment 1076695 [details]
Updated testcase which includes various fixups including running of tcpdump, stap, strace, and summarizing results

Note there are typos in a verbose message slightly complicating the debugging:

--- a/t0_bz1248140.sh
+++ b/t0_bz1248140.sh
@@ -55,7 +55,7 @@ sleep 1
 # FIXME: What if tcpdump fails for some reason?

 # 2. Mount nfs server
-echo "`date`: Step 2. On NFS client: mount -t nfs4 $NFS_SERVER:$EXP1 $MNT1"
+echo "`date`: Step 2. On NFS client: mount -t nfs $NFS_SERVER:$EXP $MNT"
 mount -t nfs -overs=3 $NFS_SERVER:$EXP $MNT
 check_nfs_mount $MNT $NFS_SERVER:$EXP
Comment 29 Dave Wysochanski 2015-10-30 14:03:32 EDT
(In reply to Kamil Dudka from comment #28)
> Comment on attachment 1076695 [details]
> Updated testcase which includes various fixups including running of tcpdump,
> stap, strace, and summarizing results
> 
> Note there are typos in a verbose message slightly complicating the
> debugging:

Yeah sorry I've been updating the testcase quite a bit but didn't upload a new tarball recently.  Once I get something more stable I'll upload it again.
Comment 31 RHEL Product and Program Management 2015-11-01 19:13:37 EST
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.
Comment 32 Benjamin Coddington 2015-11-11 09:58:15 EST
The point at which this starts happening is dependent upon the length of the filename.. and after a bit the client stops sending count3 in the request, so the server starts sending much larger responses.  Why is that?
Comment 33 Benjamin Coddington 2015-11-11 13:54:18 EST
(In reply to Benjamin Coddington from comment #32)
> The point at which this starts happening is dependent upon the length of the
> filename.. and after a bit the client stops sending count3 in the request,
> so the server starts sending much larger responses.  Why is that?

The count changes because we stop doing readdirplus..  but why are we doing readdir so many times?  Shouldn't we be doing it once - then caching pages on the directory inode?

I think there are a couple problems here.  This upstream commit stopped incrementing the page_index in readdir_search_pagecache():

47c716c NFS: Readdir cleanups

Then, this upstream patch caused us to only perform readdirplus on the very first readdir:

d69ee9b NFS: Adapt readdirplus to application usage patterns

But now that the cache is broken, we have to continue to issue readdirs to find entries that don't fit on the first page.  At least.. that is my theory.. testing of that theory to follow.
Comment 34 Benjamin Coddington 2015-11-13 09:31:14 EST
So, it looks like NFS readdir code is working properly, but glibc is doing getdents(2) with a 32k buffer, so after the first call to getdents, READDIRPLUS is disabled (since the optimization added in d69ee9b).  I think glibc should be setting that buffer rather larger based on the size of the directory's st_blksize -- which on my system is reporting to be 262144..

I'm now trying to figure out why that doesn't seem to be happening..
Comment 35 Benjamin Coddington 2015-11-13 10:20:58 EST
Looks like glibc-2.12 is skipping the stat() that would set the directory's allocation size if the open with O_DIRECTORY worked..  I think that upstream glibc won't behave this way, so it might be worth finding where that behavior changed..

Looks like glibc stopped doing that here:
62f63c4 * sysdeps/unix/opendir.c (__alloc_dir): We have a lot more memory

Here's where Roland McGrath fixed glibc to use st_blksize again:
https://sourceware.org/ml/libc-alpha/2015-05/msg00231.html
in commit:
46f894d Refactor opendir.

But that fix is so recent as to not be in any released glibc yet.

We could at this point punt this off to the glibc guys, or try to find another fix in NFS.. Dave, since this one is yours, I'll let you decide.
Comment 36 Dave Wysochanski 2015-11-21 10:27:11 EST
(In reply to Benjamin Coddington from comment #35)
> Looks like glibc-2.12 is skipping the stat() that would set the directory's
> allocation size if the open with O_DIRECTORY worked..  I think that upstream
> glibc won't behave this way, so it might be worth finding where that
> behavior changed..
> 
> Looks like glibc stopped doing that here:
> 62f63c4 * sysdeps/unix/opendir.c (__alloc_dir): We have a lot more memory
> 
> Here's where Roland McGrath fixed glibc to use st_blksize again:
> https://sourceware.org/ml/libc-alpha/2015-05/msg00231.html
> in commit:
> 46f894d Refactor opendir.
> 
> But that fix is so recent as to not be in any released glibc yet.
> 
> We could at this point punt this off to the glibc guys, or try to find
> another fix in NFS.. Dave, since this one is yours, I'll let you decide.

I don't think this will solve the problem.  It sounds like it'll just defer the problem to some larger size, and on top of that, make the problem potentially harder to understand as it will change based on st_blksize.  Am I missing something?

Also I'm concerned about this other commit you flagged, and then alluded to the cache being broken:
47c716c NFS: Readdir cleanups

In that commit header Trond says 'No functional changes, but clarify the code.' and yet you point out the commit "stopped incrementing the page_index in readdir_search_pagecache()"  So is that comment wrong and that commit actually broke the cache?

I'm just getting back to this and trying to figure out where I left off in comment #27.  You seem to take a different view on it or coming at it from a different angle.  I'm not sure this is a solvable problem but I'm hoping it is.  This seems like fragile code to me.
Comment 37 Benjamin Coddington 2015-11-22 05:54:03 EST
(In reply to Dave Wysochanski from comment #36)
> I don't think this will solve the problem.  It sounds like it'll just defer
> the problem to some larger size, and on top of that, make the problem
> potentially harder to understand as it will change based on st_blksize.  Am
> I missing something?

I don't think you're missing anything.. but using the proper st_blksize for getdents is going to improve things tremendously for the customer.  I don't think there's a problem in the NFS bits right now.  NFS is working as designed.

When you say "problem" what behavior do you believe is problematic?  Is it that NFS does not always use readdirplus?  The "use readdirplus on the first readdir" behavior is an optimization specifically for reading directories with a lot of entries.

> In that commit header Trond says 'No functional changes, but clarify the
> code.' and yet you point out the commit "stopped incrementing the page_index
> in readdir_search_pagecache()"  So is that comment wrong and that commit
> actually broke the cache?

My fault - 47c716c didn't break behavior.  I had missed that the page_index increment was moved to nfs_readdir_search_array().  I should have explained that better.

> I'm just getting back to this and trying to figure out where I left off in
> comment #27.  You seem to take a different view on it or coming at it from a
> different angle.  I'm not sure this is a solvable problem but I'm hoping it
> is.  This seems like fragile code to me.

The readdir cache you're describing in comment 27 are cached pages on the directory we're reading.  Just like cached pages on a regular file, unless we do something to invalidate those pages (like update the timestamps on the directory) the NFS client will serve getdents() from the cached pages it has, and those entries have d_type set or not based on how they were retrieved from the server.
Comment 38 Dave Wysochanski 2015-11-22 08:17:56 EST
(In reply to Benjamin Coddington from comment #37)
> (In reply to Dave Wysochanski from comment #36)
> > I don't think this will solve the problem.  It sounds like it'll just defer
> > the problem to some larger size, and on top of that, make the problem
> > potentially harder to understand as it will change based on st_blksize.  Am
> > I missing something?
> 
> I don't think you're missing anything.. but using the proper st_blksize for
> getdents is going to improve things tremendously for the customer.  I don't
> think there's a problem in the NFS bits right now.  NFS is working as
> designed.
> 

You said this:
"So, it looks like NFS readdir code is working properly, but glibc is doing getdents(2) with a 32k buffer, so after the first call to getdents, READDIRPLUS is disabled (since the optimization added in d69ee9b).  I think glibc should be setting that buffer rather larger based on the size of the directory's st_blksize -- which on my system is reporting to be 262144.."

Isn't st_blksize of the directory set based on rsize / wsize?  In a lot of cases this is recommended by the NFS server vendor, and I don't think it's quite that large.  Even if I grant that you're right, and we go from 32k to 256k, we're talking about at best getting 8x's the entries we have now.  In my testcase wouldn't that translate to approximately 8,000 entries out of 30,000? So we'll have good performance on approximately 8k entries, then we disable readdirplus never to enable it again, even if you run find over and over and the directory never changes.  I think the customers comparing with RHEL5 won't think performance is acceptable.


> When you say "problem" what behavior do you believe is problematic?  Is it
> that NFS does not always use readdirplus?  The "use readdirplus on the first
> readdir" behavior is an optimization specifically for reading directories
> with a lot of entries.
> 
Please see https://bugzilla.redhat.com/show_bug.cgi?id=1248140#c0

We can't regress performance this bad and expect it to be ok, unless we can show it's impossible to get the performance back because of some other performance tradeoff.


> > In that commit header Trond says 'No functional changes, but clarify the
> > code.' and yet you point out the commit "stopped incrementing the page_index
> > in readdir_search_pagecache()"  So is that comment wrong and that commit
> > actually broke the cache?
> 
> My fault - 47c716c didn't break behavior.  I had missed that the page_index
> increment was moved to nfs_readdir_search_array().  I should have explained
> that better.
> 
Ok


> > I'm just getting back to this and trying to figure out where I left off in
> > comment #27.  You seem to take a different view on it or coming at it from a
> > different angle.  I'm not sure this is a solvable problem but I'm hoping it
> > is.  This seems like fragile code to me.
> 
> The readdir cache you're describing in comment 27 are cached pages on the
> directory we're reading.  Just like cached pages on a regular file, unless
> we do something to invalidate those pages (like update the timestamps on the
> directory) the NFS client will serve getdents() from the cached pages it
> has, and those entries have d_type set or not based on how they were
> retrieved from the server.

Well, this is my whole point.  If you run my find testcase over and over on RHEL6 after RHEL6.4, the performance will never get any better since we're never filling in the page cache properly.  Subject to the attribute cache settings, we will effectively always have to re-ask the server for the same information, because we're not smart enough to store it in the page cache, even if nothing changes on the directory.  The only way we fill it in properly is if you do an 'ls -l' or something else which causes us to store the d_type.  This is specifically what I think is broken and was not the case prior to RHEL6.4

I will try to make more progress on this today.  I had intended to spend more time on this yesterday then got pulled into a sev 1 system down case after the customer upgraded from RHEL7.1 to RHEL7.2 and couldn't even boot previous kernels.
Comment 39 Dave Wysochanski 2015-11-22 12:28:46 EST
Well I've been trying to trace this but I think I may need to recompile the kernel to figure this out since it seems nearly impossible to trace when NFS_INO_ADVISE_RDPLUS gets set / cleared on the nfs_inode.flags due to static / inlines.
Comment 40 Dave Wysochanski 2015-11-22 13:05:43 EST
Ok so Trond's patch:
    NFS: Adapt readdirplus to application usage patterns

In the header he talks about selectively disabling READDIRPLUS but I think he means selectively ENABLING it, since the LOOKUP is what triggers the re-enable - by default we're disabling it (except the first one) unless some specific 'hint' is given.

    This patch tracks when lookups are attempted on the directory,
    and uses that information to selectively disable READDIRPLUS
    on that directory.
    The first 'readdir' call is always served using READDIRPLUS.
    Subsequent calls only use READDIRPLUS if there was a successful
    lookup or revalidation on a child in the mean time.

Maybe I'm misunderstanding the intent of the patch, or something else isn't working the way it should.  But unfortunately, find doesn't work the way the patch describes, and I'm guessing a number of other applications would not as well.  The only 'hint' given is that all attributes of all children are obtained at some point after the readdir / gendents calls, and oddly the d_type isn't stored.  So the NFS logic turns on readdirplus for the next time, but doesn't leave it on for the full directory.  So then the rest of the getdents just uses readdir and then we have to obtain the attributes again with LOOKUPs, we don't cache this in the page cache, etc.  The logic seems incomplete and/or flawed for this use case.
Comment 41 Dave Wysochanski 2015-11-22 13:32:09 EST
Ok here's an attempt at a bit more complicated logic, though I'm not sure it's feasible or side effects.

First approach:
- if we get the hint of the LOOKUP, store the current size of the directory (?), and set the flag
- inside nfs_use_readdirplus, don't clear the bit until we've read the stored size of the directory entries, or reach end of the directory

Second approach:
- update the page cache based on the attributes we get in the GETATTR / LOOKUP

Another possibility I thought of though not sure it's feasible either - could we have the 'find' implementation do a 'stat' on one of the files after each 'getdents' call?  That seems like a band-aid for an NFS problem though.
Comment 42 Benjamin Coddington 2015-11-23 08:33:22 EST
(In reply to Dave Wysochanski from comment #38)
> You said this:
> "So, it looks like NFS readdir code is working properly, but glibc is doing
> getdents(2) with a 32k buffer, so after the first call to getdents,
> READDIRPLUS is disabled (since the optimization added in d69ee9b).  I think
> glibc should be setting that buffer rather larger based on the size of the
> directory's st_blksize -- which on my system is reporting to be 262144.."
> 
> Isn't st_blksize of the directory set based on rsize / wsize?

Yes, right now it is.

> In a lot of cases this is recommended by the NFS server vendor, and I don't think it's
> quite that large.  Even if I grant that you're right, and we go from 32k to
> 256k, we're talking about at best getting 8x's the entries we have now.

Use `stat --printf=%o`

> In my testcase wouldn't that translate to approximately 8,000 entries out of
> 30,000? So we'll have good performance on approximately 8k entries, then we
> disable readdirplus never to enable it again, even if you run find over and
> over and the directory never changes. I think the customers comparing with
> RHEL5 won't think performance is acceptable.

That's because you're equating the customer's acceptability to how well your testcase performs.  That might be a fair comparison for this particular customer if your testcase is exactly the real-world conditions they are using.

(In reply to Dave Wysochanski from comment #38)
> (In reply to Benjamin Coddington from comment #37)
> > When you say "problem" what behavior do you believe is problematic?  Is it
> > that NFS does not always use readdirplus?  The "use readdirplus on the first
> > readdir" behavior is an optimization specifically for reading directories
> > with a lot of entries.
> > 
> Please see https://bugzilla.redhat.com/show_bug.cgi?id=1248140#c0

OK..

> We can't regress performance this bad and expect it to be ok, unless we can
> show it's impossible to get the performance back because of some other
> performance tradeoff.

This performance regression is indeed a performance tradeoff created by upstream d69ee9b.  The discussion around that fix is here:

http://thread.gmane.org/gmane.linux.nfs/39392/focus=39392

To sum: the use of readdirplus was causing performance regressions reading very large directories when just the list of entries was required.  Your customer is running into this tradeoff, only greatly exacerbated by the current glibc behavior of only using 32k for a buffer size (glibc used to use the st_blksize - I'm not sure in which RHEL where that logic was ripped out..)

(In reply to Dave Wysochanski from comment #41)
> Ok here's an attempt at a bit more complicated logic, though I'm not sure
> it's feasible or side effects.
> 
> First approach:
> - if we get the hint of the LOOKUP, store the current size of the directory
> (?), and set the flag
> - inside nfs_use_readdirplus, don't clear the bit until we've read the
> stored size of the directory entries, or reach end of the directory

This effectively disables the selective optimization added in d69ee9b.  You might well then have other customers complaining that getdents() is very slow on large directories because their NFS server is overloaded pulling attributes on every entry, or the wire-protocol is bogged down transferring all these attributes they don't need.

> Second approach:
> - update the page cache based on the attributes we get in the GETATTR /
> LOOKUP

Maybe we could do this?  Might get very tricky to go hunting through directory cache pages for specific entries to update them..  This won't solve the initial performance problem -- it'll only be good for a second pass if the cache pages haven't been invalidated yet.

> Another possibility I thought of though not sure it's feasible either -
> could we have the 'find' implementation do a 'stat' on one of the files
> after each 'getdents' call?  That seems like a band-aid for an NFS problem
> though.

I think an ltrace of find will show it is using readdir().. the place you want to look at modifying things is within glibc.  Once you dig into that, you'll find the st_blksize issue.

I think the best fix would be to get glibc to use the st_blksize for getdents -- which is the "preferred" blocksize for efficient filesystem I/O.  Then in NFS we can look at moving st_blksize to be a representative multiple of the number of entries within a particular directory, or make it more tuneable via mount options (dsize?).

Another 'hit it with a hammer' fix might be to make the selective readdirplus behavior tunable via mount, sysctl, or other.  Then the customers that need fast getdents w/o attributes can retain the current behavior, and customers that want fast readdir(), + d_type can turn off the optimization added in d69ee9b.
Comment 43 Dave Wysochanski 2015-11-23 13:14:53 EST
(In reply to Benjamin Coddington from comment #42)
> (In reply to Dave Wysochanski from comment #38)
> > You said this:
> > "So, it looks like NFS readdir code is working properly, but glibc is doing
> > getdents(2) with a 32k buffer, so after the first call to getdents,
> > READDIRPLUS is disabled (since the optimization added in d69ee9b).  I think
> > glibc should be setting that buffer rather larger based on the size of the
> > directory's st_blksize -- which on my system is reporting to be 262144.."
> > 
> > Isn't st_blksize of the directory set based on rsize / wsize?
> 
> Yes, right now it is.
> 
> > In a lot of cases this is recommended by the NFS server vendor, and I don't think it's
> > quite that large.  Even if I grant that you're right, and we go from 32k to
> > 256k, we're talking about at best getting 8x's the entries we have now.
> 
> Use `stat --printf=%o`
> 
> > In my testcase wouldn't that translate to approximately 8,000 entries out of
> > 30,000? So we'll have good performance on approximately 8k entries, then we
> > disable readdirplus never to enable it again, even if you run find over and
> > over and the directory never changes. I think the customers comparing with
> > RHEL5 won't think performance is acceptable.
> 
> That's because you're equating the customer's acceptability to how well your
> testcase performs.  That might be a fair comparison for this particular
> customer if your testcase is exactly the real-world conditions they are
> using.
> 
My testcase was based on all the data I got from the customer (strace, tcpdump, etc), including the commandline and filenames.
I did reduce the number of files though, customers filenames were longer and more random, and I'm sure their NFS server had load on it.  So they'll likely see much worse performance than my benchmark suggests.  However the underlying strace and tcpdump data shows the same thing and I think I understand the problem pretty well.


> (In reply to Dave Wysochanski from comment #38)
> > (In reply to Benjamin Coddington from comment #37)
> > > When you say "problem" what behavior do you believe is problematic?  Is it
> > > that NFS does not always use readdirplus?  The "use readdirplus on the first
> > > readdir" behavior is an optimization specifically for reading directories
> > > with a lot of entries.
> > > 
> > Please see https://bugzilla.redhat.com/show_bug.cgi?id=1248140#c0
> 
> OK..
> 
> > We can't regress performance this bad and expect it to be ok, unless we can
> > show it's impossible to get the performance back because of some other
> > performance tradeoff.
> 
> This performance regression is indeed a performance tradeoff created by
> upstream d69ee9b.  The discussion around that fix is here:
> 
> http://thread.gmane.org/gmane.linux.nfs/39392/focus=39392
> 
> To sum: the use of readdirplus was causing performance regressions reading
> very large directories when just the list of entries was required.  Your
> customer is running into this tradeoff, only greatly exacerbated by the
> current glibc behavior of only using 32k for a buffer size (glibc used to
> use the st_blksize - I'm not sure in which RHEL where that logic was ripped
> out..)
> 

Well, I'm starting to think the optimization was not done properly.  If you read that thread, in the initial reply Peter Staubach says
http://article.gmane.org/gmane.linux.nfs/39397
"Perhaps the use of a heuristic that enables readdirplus only after the application has shown that it is
interested in the attributes for each entry in the directory?  Thus, if the application does
readdir()/stat()/stat()/stat()/readdir()/... then the NFS client could use readdirplus to fill the
caches.  If the application is just reading the directory and looking at the names, then the client could
just use readdir."

My read of that is this.  The original suggestion was to do what my first suggestion was - fill the cache, not a tiny arbitrary part of it, after the heuristic returns true.  The problem with what was implemented is that it only fills part of the cache based on the hint, and that how much is arbitrary.
Now one might argue that maybe one LOOKUP shouldn't be enough - maybe it should be more than that to turn back on READDIRPLUS (Note the 3 stat's Peter has in his example).  

> (In reply to Dave Wysochanski from comment #41)
> > Ok here's an attempt at a bit more complicated logic, though I'm not sure
> > it's feasible or side effects.
> > 
> > First approach:
> > - if we get the hint of the LOOKUP, store the current size of the directory
> > (?), and set the flag
> > - inside nfs_use_readdirplus, don't clear the bit until we've read the
> > stored size of the directory entries, or reach end of the directory
> 
> This effectively disables the selective optimization added in d69ee9b.  You
> might well then have other customers complaining that getdents() is very
> slow on large directories because their NFS server is overloaded pulling
> attributes on every entry, or the wire-protocol is bogged down transferring
> all these attributes they don't need.
> 
Right.  My guess is the reason why this regression occurred is that people were not thinking about enough scenarios / commands which are run on typical NFS clients to traverse directories.  It's important to note that 'find' is not mentioned in that thread anywhere that I could tell, though Neil did a good job of various other commands here: https://wiki.linux-nfs.org/wiki/index.php/Readdir_performance_results

So any fix I suggest here should also consider a few of the other commands to see how they perform (see below benchmarks), and look at NFSv4 in addition to make sure nothing else regresses with any "fix".  Also I've not yet considered how changing directories would perform either so there is probably a lot to consider.  Here's what I've got so far:
  benchmark: [0 == 'find -name' (default) | 1 == 'ls --color=never' | 2 == 'ls --color=never -l']

Now that I think about it I'll probably need "find -type f" and a few other variants to be sure since that showed up in the other bug.

> > Second approach:
> > - update the page cache based on the attributes we get in the GETATTR /
> > LOOKUP
> 
> Maybe we could do this?  Might get very tricky to go hunting through
> directory cache pages for specific entries to update them..  This won't
> solve the initial performance problem -- it'll only be good for a second
> pass if the cache pages haven't been invalidated yet.
> 
Yeah it may not be feasible.  I also find this "readdir cache with a d_type that might be there" a bit odd.  It seems out of sync with the other attributes so I think it's worth at least looking into.  I wonder for example what happens if you remove a file and create a directory of the same name - do we handle that right?


> > Another possibility I thought of though not sure it's feasible either -
> > could we have the 'find' implementation do a 'stat' on one of the files
> > after each 'getdents' call?  That seems like a band-aid for an NFS problem
> > though.
> 
> I think an ltrace of find will show it is using readdir().. the place you
> want to look at modifying things is within glibc.  Once you dig into that,
> you'll find the st_blksize issue.
> 
> I think the best fix would be to get glibc to use the st_blksize for
> getdents -- which is the "preferred" blocksize for efficient filesystem I/O.
> Then in NFS we can look at moving st_blksize to be a representative multiple
> of the number of entries within a particular directory, or make it more
> tuneable via mount options (dsize?).
> 
I will look into it but my original objection still holds that the NFS logic is wrong.  Put another way, why would NFS client think that a single readdir (which translates to getdents) with N entries in a directory of 100*N entries wouldn't be followed 99 other readdir calls with N entries, so it should treat the first one different than the others?  Why would NFS get the d_type in the first call but not the subsequent ones?  I may be misunderstanding the patch, not sure.

A larger st_blksize will make a very tiny dent in this problem - it's not a large contributor, because normally we're talking about tens of thousands or hundreds of thousands of files in a directory we're traversing.  I can try a backport of that and see how it affects things, then file a separate bug for it but I'm not going to move this bug to glibc unless there's something subtle there that would trick NFS into doing the right thing.


> Another 'hit it with a hammer' fix might be to make the selective
> readdirplus behavior tunable via mount, sysctl, or other.  Then the
> customers that need fast getdents w/o attributes can retain the current
> behavior, and customers that want fast readdir(), + d_type can turn off the
> optimization added in d69ee9b.

It would be nice if a mount option would fix this but my guess is we're going to run into people that don't want to choose but asks "can't NFS be smarter"?  ;-)  My read of that thread Trond was that pointed at the nordirplus to be used and initially wasn't keen on the new logic but then was convinced over time.
Comment 45 Dave Wysochanski 2015-11-25 15:18:19 EST
I just realized my test case is not using a '-type ' predicate, and yet find seems to be needing the type for some reason.  I'm not sure why this is now, and I'm wondering about a 'find' bug.  Kamil, can you comment please?  Why does the following find command need anything more than names?  It seems like if readdir just returns names and some d_type = UNKNOWN it should not matter to the outcome of the find command, so it should be issuing a 'stat' call regardless of any NFS caching type bug where readdirplus is used or not depending on heuristics:
find /mnt/bz1248140-export/test -name "*.pdf"
Comment 46 Dave Wysochanski 2015-11-25 15:57:16 EST
BTW, the test does not improve with findutils-4.4.2-7.el6 which fixes https://bugzilla.redhat.com/show_bug.cgi?id=883285

Ok I think my question may have an obvious answer.  There's no way to traverse subdirectories with names only - i.e. you have to know if a directory entry is a file, directory, etc.  Most likely this is the reason that find will always need the d_type field to be valid to do its job.  Is that right Kamil?
Comment 47 Kamil Dudka 2015-11-26 05:05:55 EST
Thanks to leaf optimization, find can traverse a directory tree even without having valid d_type field.  The optimization relies on the link count previously returned for the directory being traversed and, if all its subdirectories were already traversed, all other files are assumed to be non-directories.

The optimization was enabled by default on RHEL-5.  RHEL-6 findutils does not implement this optimization.  RHEL-7 findutils implements it but it is enabled only for reiserfs.  Bug #1252549 is about enabling it for nfs, too.  The problem is that I was not able to reproduce the performance drop with RHEL-7 findutils.  See bug #1252549 comment #1 for details.  If we have a reproducer that works on a stable RHEL-7 system, I will propose the patch (attachment #1087477 [details]) upstream.
Comment 56 Dave Wysochanski 2017-04-13 17:28:51 EDT
(In reply to Steve Whitehouse from comment #55)
> Yes, definitely clone to 7.4. It is important to make sure that we don't see
> regressions like this. Is there something we can do to improve performance
> testing of various different readdir scenarios? It seems that there have
> been a few issues recently related to this area.
> 
Cloned to https://bugzilla.redhat.com/show_bug.cgi?id=1442272

I don't think anyone has complained about find in RHEL7 though it may just be a matter of time.

> I'm also moving this to high priority. It is too late for 6.9, but it would
> be good to have this resolved for 7.4 at least, since it is likely to affect
> a number of customers and there are already a number of customer cases
> attached here (even though most are closed)
> 
> What is the latest status here?

I am not sure anything has changed - I've not looked at this in a while as there's been other higher priorities.

Regarding RHEL7, I am not sure if the situation is identical to RHEL6 or not.  I don't think we have customers complaining about 'find' perf on RHEL7 but I could be mistaken.  It may just be that people don't notice 'find' being slow as much as they notice 'ls' or a general directory listing since 'find' is often run from a batch job.  So the problem probably still exists everywhere but needs a re-assessment.

I can take a look but it won't be until April 24th at the earliest.

Note You need to log in before you can comment on or make changes to this bug.