Bug 1371611

Summary: Multiple processes listing a directory over NFS concurrently take a very long time
Product: Red Hat Enterprise Linux 6
Component: kernel
Kernel sub component: NFS
Version: 6.8
Keywords: Patch
Reporter: Dave Wysochanski <dwysocha>
Assignee: nfs-maint
QA Contact: JianHong Yin <jiyin>
Docs Contact:
CC: bcodding, bfields, cww, dwysocha, eguan, fsorenso, fs-qe, jiyin, kcleveng, nfs-maint, qguo, swhiteho, yoyang
Status: CLOSED DUPLICATE
Severity: medium
Priority: medium
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1371587
Environment:
Last Closed: 2017-10-30 11:25:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On:
Bug Blocks: 1324930, 1374441, 1461138, 1507140

Description Dave Wysochanski 2016-08-30 15:42:35 UTC
+++ This bug was initially created as a clone of Bug #1371587 +++

Description of problem:

When multiple processes are listing the contents of a directory simultaneously, calls to getdents() can take a very long time.  This occurs even if the contents of the directory are unchanging.

As the number of directory entries increases, and as the number of processes listing the directory concurrently increases, the time required to list a directory can become huge.



Version-Release number of selected component (if applicable):

RHEL 7 kernels
	kernel 3.10.0-327.10.1.el7.x86_64
	kernel 3.10.0-229.el7.x86_64

RHEL 6 kernels
	2.6.32-573.8.1.el6.x86_64
	2.6.32-642.el6.x86_64

upstream
	4.8.0-rc2+


How reproducible:

	fairly reliably


Steps to Reproduce:

	start multiple 'ls -fl' commands on an nfs-mounted directory simultaneously

	simple reproducer attached


Actual results:

For a directory with 200,000 files, each having a 23-byte filename (times in seconds, measured from the start of the first process to the finish of the last):

 1 process:                   13.250018110
20 processes, no delay:       53.123712762
20 processes, 1 second delay: 58.065284049
20 processes, 2 second delay: 98.601665008
20 processes, 3 second delay: 96.790224244
20 processes, 4 second delay: 97.487171089


Expected results:

Some performance loss due to contention between concurrent readers is expected, but the observed listing times are far longer than that can account for.


Additional info:

Using 30 processes simultaneously listing a directory containing 1.8 million 23-byte filenames (an arbitrary size, but one seen in a customer case), I have reproduced a 39-hour hang, after which all 30 processes completed within a second of each other.


I have additional results, debugging, and analysis, which I will add/attach


parallel_ls reproducer script can take up to 3 arguments, with default values if not provided:

# parallel_ls <threads> <path> <delay>
	threads: # of processes to start simultaneously (default: 2)
	path: directory to list (default: /mnt/tmp)
	delay: time to sleep between starting each process (default: no delay)
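
The attached parallel_ls is a shell script and is not reproduced here; as a rough sketch of the same access pattern (this C program is only an illustration, not the attachment), something like the following forks <threads> children, each of which walks <path> with readdir()/getdents(), optionally sleeping <delay> seconds between forks:

/*
 * Illustrative C equivalent of the parallel_ls reproducer (not the
 * attached script): fork <threads> children, each listing <path>,
 * with an optional <delay> seconds between forks.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void list_dir(const char *path, int id)
{
        double start = now();
        long entries = 0;
        DIR *dir = opendir(path);

        if (!dir) {
                perror("opendir");
                _exit(1);
        }
        while (readdir(dir) != NULL)    /* each batch is one getdents() call */
                entries++;
        closedir(dir);
        printf("child %d: %ld entries in %.3f seconds\n",
               id, entries, now() - start);
}

int main(int argc, char **argv)
{
        int threads = argc > 1 ? atoi(argv[1]) : 2;
        const char *path = argc > 2 ? argv[2] : "/mnt/tmp";
        int delay = argc > 3 ? atoi(argv[3]) : 0;
        double start = now();

        for (int i = 0; i < threads; i++) {
                if (fork() == 0) {
                        list_dir(path, i);
                        _exit(0);
                }
                if (delay && i < threads - 1)
                        sleep(delay);
        }
        while (wait(NULL) > 0)
                ;
        printf("total: %.3f seconds from first fork to last exit\n",
               now() - start);
        return 0;
}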

--- Additional comment from Frank Sorenson on 2016-08-30 11:17:31 EDT ---

After doing some debugging, I believe that much of the issue may be related to the following (cut down):

int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
{
        ...
        ret = nfs_readdir_xdr_to_array(desc, page, inode);
        ...
        if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
                /* Should never happen */
                nfs_zap_mapping(inode, inode->i_mapping);
        }
        ...
}

Here, every time a page of the directory is filled by nfs_readdir, we invalidate all the remaining pages of the directory inode's mapping, including any pages just filled by another process further along in the directory listing.  As the number of processes calling nfs_readdir increases, they keep invalidating each other's pages, so each process ends up having to make its own READDIR/READDIRPLUS calls.

(Note: I have not found evidence of the invalidate failing and nfs_zap_mapping() being called, but it is possible that occurs as well.)
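
To illustrate the effect, here is a toy userspace model (an illustration only, not NFS code): a reader that (re)fills directory page N drops every cached page after N, and a reader that arrives while the directory has open files starts with the mapping zapped.  With staggered readers, the page cache stops being shared and each reader ends up issuing close to its own full set of "READDIR" fetches:

/*
 * Toy model of the interaction described above (an assumption for
 * illustration, not kernel code): filling page N invalidates all later
 * cached pages, and each newly arriving reader zaps the whole cache.
 */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NPAGES 512                      /* cached directory pages */

static int cached[NPAGES];
static long fetches;                    /* stand-in for READDIR round trips */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *reader(void *arg)
{
        (void)arg;

        /* model nfs_zap_mapping() for a newly arriving reader */
        pthread_mutex_lock(&lock);
        for (int i = 0; i < NPAGES; i++)
                cached[i] = 0;
        pthread_mutex_unlock(&lock);

        for (int pg = 0; pg < NPAGES; pg++) {
                pthread_mutex_lock(&lock);
                if (!cached[pg]) {
                        fetches++;      /* had to go to the server */
                        cached[pg] = 1;
                        /* model invalidate_inode_pages2_range(mapping, pg + 1, -1) */
                        for (int i = pg + 1; i < NPAGES; i++)
                                cached[i] = 0;
                }
                pthread_mutex_unlock(&lock);
                sched_yield();          /* let the other readers interleave */
        }
        return NULL;
}

int main(int argc, char **argv)
{
        int nreaders = argc > 1 ? atoi(argv[1]) : 20;
        pthread_t tid[nreaders];

        for (int i = 0; i < nreaders; i++) {
                pthread_create(&tid[i], NULL, reader, NULL);
                usleep(10000);          /* stagger the readers, like the delay runs above */
        }
        for (int i = 0; i < nreaders; i++)
                pthread_join(tid[i], NULL);

        printf("%d staggered readers over %d pages: %ld fetches "
               "(a single reader needs %d)\n", nreaders, NPAGES, fetches, NPAGES);
        return 0;
}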



Also, calling nfs_zap_mapping every time we decide to force readdirplus seems suspicious:

void nfs_force_use_readdirplus(struct inode *dir)
{
        if (!list_empty(&NFS_I(dir)->open_files)) {
                nfs_advise_use_readdirplus(dir);
                nfs_zap_mapping(dir, dir->i_mapping);
        }
}

Wouldn't something like this make more sense?
void nfs_force_use_readdirplus(struct inode *dir)
{
        if (!list_empty(&NFS_I(dir)->open_files)) {
                if (!test_and_set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                        nfs_zap_mapping(dir, dir->i_mapping);
        }
}


Also, it seems non-intuitive that 'nfs_use_readdirplus', while returning the status of the NFS_INO_ADVISE_RDPLUS flag, also clears that flag:

bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
        if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
                return false;
        if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                return true;
        if (ctx->pos == 0)
                return true;
        return false;
}

What if this were to only test the bit:
bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
        if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
                return false;
        if (test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                return true;
        if (ctx->pos == 0)
                return true;
        return false;
}

and have nfs_readdir clear the bit at the bottom if and only if nfs_force_use_readdirplus() hasn't been called to set it again since the last time nfs_readdir was called (at least by this process?).  That obviously requires a bit more logic to get right.
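
One possible shape for that extra logic, sketched in plain userspace C (the struct, field names, and helpers below are invented for illustration and ignore the locking a real in-kernel version would need): the force path bumps a generation counter as well as setting the advise bit, and the readdir path clears the bit at the end only if the counter is unchanged, i.e. only if nobody asked for readdirplus again while the listing was running.

/*
 * Sketch of the "a bit more logic" mentioned above.  The state, field
 * names, and helpers are illustrative assumptions, not existing NFS code.
 */
#include <stdbool.h>
#include <stdio.h>

struct dir_state {
        bool            advise_rdplus;  /* stands in for NFS_INO_ADVISE_RDPLUS */
        unsigned long   force_gen;      /* bumped on every "force" request */
};

static void force_use_readdirplus(struct dir_state *dir)
{
        dir->advise_rdplus = true;
        dir->force_gen++;               /* remember that a new request arrived */
}

static bool use_readdirplus(const struct dir_state *dir)
{
        return dir->advise_rdplus;      /* test only; do not clear here */
}

/* called once at the end of a readdir pass */
static void readdir_done(struct dir_state *dir, unsigned long gen_at_start)
{
        if (dir->force_gen == gen_at_start)
                dir->advise_rdplus = false;     /* no new force since we started */
}

static void readdir_pass(struct dir_state *dir, bool force_arrives_midway)
{
        unsigned long gen = dir->force_gen;

        printf("pass starts: use readdirplus? %s\n",
               use_readdirplus(dir) ? "yes" : "no");
        if (force_arrives_midway)
                force_use_readdirplus(dir);     /* e.g. another process opens a file */
        readdir_done(dir, gen);
        printf("pass ends:   advise bit %s\n\n",
               dir->advise_rdplus ? "still set" : "cleared");
}

int main(void)
{
        struct dir_state dir = { 0 };

        force_use_readdirplus(&dir);
        readdir_pass(&dir, false);      /* quiet pass: bit is consumed at the end */

        force_use_readdirplus(&dir);
        readdir_pass(&dir, true);       /* force raced in: bit survives for the next pass */
        return 0;
}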

--- Additional comment from Benjamin Coddington on 2016-08-30 11:23:18 EDT ---

Hi Frank, we've been working on a very similar problem in bug 1325766.

I think things should work better with this fix:
http://marc.info/?l=linux-nfs&m=147222590709214&w=2

Comment 9 Dave Wysochanski 2016-09-28 01:29:45 UTC
Please note the following items were found in the vmcores for this issue:
- the directory was not being modified when the listing was being done, so this is not https://bugzilla.redhat.com/show_bug.cgi?id=1325766
- the directory contained over 1 million entries
- multiple processes were listing the directory at the same time
- the process listing the directory and holding the directory mutex was waiting on a readdirplus response
- there were 43 waiters on the directory mutex
- the oldest waiter had been waiting for almost 48 minutes, and the youngest for 40 minutes

Comment 15 Dave Wysochanski 2017-10-30 11:25:30 UTC
This bug is actually a duplicate of bug 1325766.  The 3 patches for this bug are already in RHEL 6 kernels starting with kernel-2.6.32-682.el6, so it is fixed in RHEL 6.9 (2.6.32-696.el6) and later.

*** This bug has been marked as a duplicate of bug 1325766 ***