Bug 1371611 - Multiple processes listing a directory over nfs concurrently takes a very long time
Summary: Multiple processes listing a directory over nfs concurrently takes a very long time
Keywords:
Status: CLOSED DUPLICATE of bug 1325766
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.8
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: nfs-maint
QA Contact: JianHong Yin
URL:
Whiteboard:
Depends On:
Blocks: 1324930 1374441 1461138 1507140
 
Reported: 2016-08-30 15:42 UTC by Dave Wysochanski
Modified: 2022-03-13 14:05 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1371587
Environment:
Last Closed: 2017-10-30 11:25:30 UTC
Target Upstream Version:




Links
Red Hat Bugzilla 1325766 (high, CLOSED): RHEL6.7: NFSv3 client performance regression where ls -l takes too long with "aggressive readdirplus" commit (last updated 2022-03-13 14:01:50 UTC)
Red Hat Knowledge Base (Solution) 2592711 (last updated 2016-08-30 17:06:52 UTC)

Internal Links: 1325766

Description Dave Wysochanski 2016-08-30 15:42:35 UTC
+++ This bug was initially created as a clone of Bug #1371587 +++

Description of problem:

When multiple processes are listing the contents of a directory simultaneously, calls to getdents() can take a very long time.  This occurs even if the contents of the directory are unchanging.

As the number of directory entries increases, and as the number of processes listing the directory concurrently increases, the time required to list a directory can become huge.



Version-Release number of selected component (if applicable):

RHEL 7 kernels
	kernel 3.10.0-327.10.1.el7.x86_64
	kernel 3.10.0-229.el7.x86_64

RHEL 6 kernels
	2.6.32-573.8.1.el6.x86_64
	2.6.32-642.el6.x86_64

upstream
	4.8.0-rc2+


How reproducible:

	fairly reliably


Steps to Reproduce:

	start multiple 'ls -fl' commands on an nfs-mounted directory simultaneously

	simple reproducer attached
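
	The results below use a directory of 200,000 files, each with a 23-byte filename. For reference, here is a minimal sketch of how such a directory could be populated; the "testfile_" naming scheme and the program itself are illustrative only, not taken from the actual test:

/* Illustrative helper: creates <count> empty files with 23-character
 * names ("testfile_" plus a zero-padded 14-digit counter). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *dir = argc > 1 ? argv[1] : "/mnt/tmp";
        long count = argc > 2 ? atol(argv[2]) : 200000;
        char path[4096];

        for (long i = 0; i < count; i++) {
                snprintf(path, sizeof(path), "%s/testfile_%014ld", dir, i);
                int fd = open(path, O_CREAT | O_WRONLY, 0644);
                if (fd < 0) {
                        perror(path);
                        return 1;
                }
                close(fd);
        }
        return 0;
}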


Actual results:

for a directory with 200,000 files, each having a 23-byte filename:

1 process: 13.250018110
20 processes, started simultaneously: 53.123712762
20 processes, 1 second delay: 58.065284049
20 processes, 2 second delay: 98.601665008
20 processes, 3 second delay: 96.790224244
20 processes, 4 second delay: 97.487171089
(times in seconds, from start of first thread to finish of last thread)


Expected results:

Some performance loss due to scalability problems is expected, but the listing times observed are far longer than that would explain.


Additional info:

Using 30 processes simultaneously listing a directory with 1.8 million files with 23-byte filenames (arbitrary, but matching a customer case), I have reproduced a 39-hour hang (after which all 30 processes completed within a second of each other).


I have additional results, debugging, and analysis, which I will add/attach.


The parallel_ls reproducer script takes up to 3 arguments, with default values used if none are provided:

# parallel_ls <threads> <path> <delay>
	threads: # of processes to start simultaneously (default: 2)
	path: directory to list (default: /mnt/tmp)
	delay: time to sleep between starting each process (default: no delay)
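
For reference, a rough user-space sketch of what such a parallel listing could look like (the attached parallel_ls script itself is not reproduced here; the argument order and defaults follow the description above, and each child stat()s every entry to roughly mimic 'ls -fl'):

/* Rough sketch only -- not the attached parallel_ls script.  Forks
 * <threads> children; each walks <path> with readdir() and stat()s
 * every entry, then the parent reports total wall-clock time. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

static void list_dir(const char *path)
{
        char buf[4352];
        struct stat st;
        struct dirent *de;
        DIR *d = opendir(path);

        if (!d) {
                perror(path);
                _exit(1);
        }
        while ((de = readdir(d)) != NULL) {
                snprintf(buf, sizeof(buf), "%s/%s", path, de->d_name);
                stat(buf, &st);                 /* the -l part of 'ls -fl' */
        }
        closedir(d);
}

int main(int argc, char **argv)
{
        int threads = argc > 1 ? atoi(argv[1]) : 2;         /* default: 2 */
        const char *path = argc > 2 ? argv[2] : "/mnt/tmp"; /* default: /mnt/tmp */
        int delay = argc > 3 ? atoi(argv[3]) : 0;           /* default: no delay */
        struct timeval start, end;

        gettimeofday(&start, NULL);
        for (int i = 0; i < threads; i++) {
                if (fork() == 0) {
                        list_dir(path);
                        _exit(0);
                }
                if (delay)
                        sleep(delay);
        }
        while (wait(NULL) > 0)
                ;
        gettimeofday(&end, NULL);
        printf("%d processes: %.3f seconds\n", threads,
               (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1e6);
        return 0;
}

For example, running it with arguments '20 /mnt/tmp 2' would correspond to the 2-second-delay case in the results above.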

--- Additional comment from Frank Sorenson on 2016-08-30 11:17:31 EDT ---

After doing some debugging, I believe that much of the issue may be related to the following (cut down):

int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
{
        ret = nfs_readdir_xdr_to_array(desc, page, inode);
 
        if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
                /* Should never happen */
                nfs_zap_mapping(inode, inode->i_mapping);
        }

Here, every time we finish an nfs_readdir, we invalidate all the remaining pages of the directory inode mapping, including any pages just filled by another process further along in the directory listing.  As the number of processes calling nfs_readdir increases, each process invalidates pages the others have filled, so each is forced to make its own READDIR/READDIRPLUS calls.

(Note: I have not found evidence of the invalidate failing and nfs_zap_mapping being called, but it's possible that occurs as well.)



Also, calling nfs_zap_mapping every time we decide to force readdirplus seems suspicious:

void nfs_force_use_readdirplus(struct inode *dir)
{
        if (!list_empty(&NFS_I(dir)->open_files)) {
                nfs_advise_use_readdirplus(dir);
                nfs_zap_mapping(dir, dir->i_mapping);

Wouldn't something like this make more sense?
void nfs_force_use_readdirplus(struct inode *dir)
{
        int status;

        if (!list_empty(&NFS_I(dir)->open_files)) {
                if (! test_and_set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                        nfs_zap_mapping(dir, dir->i_mapping);



Also, it seems non-intuitive that nfs_use_readdirplus(), while returning the status of the NFS_INO_ADVISE_RDPLUS flag, also clears that flag:

bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
        if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
                return false;
        if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                return true;
        if (ctx->pos == 0)
                return true;
        return false;
}

What if this were to only test the bit:
bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
        if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
                return false;
        if (test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                return true;
        if (ctx->pos == 0)
                return true;
        return false;
}

and have nfs_readdir clear the bit at the bottom IFF nfs_force_use_readdirplus() hasn't been called to set it again since the last time nfs_readdir was called (at least by this process)?  (Obviously this requires a bit more logic to get right.)
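
For illustration only, one shape that extra logic could take, using a hypothetical per-inode generation counter.  The rdplus_gen field and both helpers below do not exist in the kernel; this is a sketch of the idea, not a proposed patch:

/* Sketch only: rdplus_gen is a hypothetical atomic_t in struct nfs_inode. */
void nfs_force_use_readdirplus(struct inode *dir)
{
        struct nfs_inode *nfsi = NFS_I(dir);

        if (!list_empty(&nfsi->open_files)) {
                atomic_inc(&nfsi->rdplus_gen);          /* hypothetical field */
                if (!test_and_set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags))
                        nfs_zap_mapping(dir, dir->i_mapping);
        }
}

/* nfs_readdir() would sample the counter when it starts ... */
static int nfs_rdplus_gen_snapshot(struct inode *dir)
{
        return atomic_read(&NFS_I(dir)->rdplus_gen);    /* hypothetical field */
}

/* ... and at the bottom only clear the advise bit if nobody forced
 * readdirplus again while this listing was in progress. */
static void nfs_rdplus_maybe_clear(struct inode *dir, int snapshot)
{
        if (atomic_read(&NFS_I(dir)->rdplus_gen) == snapshot)
                clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags);
}

The point is just that the hint set by nfs_force_use_readdirplus() would survive a concurrent listing instead of being consumed by whichever process happens to test the flag first.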

--- Additional comment from Benjamin Coddington on 2016-08-30 11:23:18 EDT ---

Hi Frank, we've been working on a very similar problem in bug 1325766.

I think things should work better with this fix:
http://marc.info/?l=linux-nfs&m=147222590709214&w=2

Comment 9 Dave Wysochanski 2016-09-28 01:29:45 UTC
Please note the following items were found in the vmcores for this issue:
- the directory was not being modified when the listing was being done, so this is not https://bugzilla.redhat.com/show_bug.cgi?id=1325766
- the directory contained over 1 million entries
- multiple processes were listing the directory at the same time
- the process listing the directory and holding the directory mutex was waiting on a readdirplus response
- there were 43 waiters on the directory mutex
- the oldest waiter had been waiting almost 48 minutes, and the youngest about 40 minutes

Comment 15 Dave Wysochanski 2017-10-30 11:25:30 UTC
This bug is actually a duplicate of bug 1325766.  The 3 patches for this bug are already in RHEL 6 kernels starting with kernel-2.6.32-682.el6, so it is fixed in RHEL 6.9 (kernel-2.6.32-696.el6) and above.

*** This bug has been marked as a duplicate of bug 1325766 ***

