Created attachment 1195937 [details]
simple script to list directory

Description of problem:
When multiple processes are listing the contents of a directory simultaneously, calls to getdents() can take a very long time. This occurs even if the contents of the directory are unchanging. As the number of directory entries increases, and as the number of processes listing the directory concurrently increases, the time required to list a directory can become huge.

Version-Release number of selected component (if applicable):
RHEL 7 kernels:
    kernel 3.10.0-327.10.1.el7.x86_64
    kernel 3.10.0-229.el7.x86_64
RHEL 6 kernels:
    2.6.32-573.8.1.el6.x86_64
    2.6.32-642.el6.x86_64
upstream:
    4.8.0-rc2+

How reproducible:
Fairly reliably.

Steps to Reproduce:
Start multiple 'ls -fl' commands on an nfs-mounted directory simultaneously. A simple reproducer is attached.

Actual results:
For a directory with 200,000 files, each having a 23-byte filename:
    1 process:                     13.250018110
    20 simultaneously:             53.123712762
    20 processes, 1 second delay:  58.065284049
    20 processes, 2 second delay:  98.601665008
    20 processes, 3 second delay:  96.790224244
    20 processes, 4 second delay:  97.487171089
(time from start of first process to finish of last process)

Expected results:
Some performance loss due to scalability problems is expected, but these listing times are far longer than expected.

Additional info:
Using 30 processes simultaneously listing a directory with 1.8 million 23-byte files (arbitrary, but seen in a customer case), I have reproduced a 39-hour hang (after which all 30 processes completed within a second of each other).

I have additional results, debugging, and analysis, which I will add/attach.

The parallel_ls reproducer script can take up to 3 arguments, with default values if not provided:
    # parallel_ls <threads> <path> <delay>
    threads: number of processes to start simultaneously (default: 2)
    path:    directory to list (default: /mnt/tmp)
    delay:   time to sleep between starting each process (default: no delay)
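For reference, here is a minimal, illustrative C reproducer along the same lines. This is not the attached parallel_ls script; the argument order and defaults are taken from the description above, and the per-entry stat that 'ls -l' performs is approximated with fstatat().

/*
 * parallel_getdents.c -- illustrative reproducer sketch only.
 * Starts <threads> child processes; each walks <path> once with readdir()
 * (which drives getdents()) and stats each entry, roughly like 'ls -fl'.
 *
 * Build: gcc -o parallel_getdents parallel_getdents.c
 * Usage: ./parallel_getdents [threads] [path] [delay]
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void list_dir(const char *path)
{
        DIR *dir = opendir(path);
        struct dirent *de;
        struct stat st;
        unsigned long count = 0;

        if (!dir) {
                perror("opendir");
                _exit(1);
        }
        while ((de = readdir(dir)) != NULL) {
                fstatat(dirfd(dir), de->d_name, &st, 0);  /* per-entry stat, like ls -l */
                count++;
        }
        closedir(dir);
        printf("pid %d listed %lu entries\n", (int)getpid(), count);
}

int main(int argc, char *argv[])
{
        int nproc = argc > 1 ? atoi(argv[1]) : 2;            /* default: 2 processes */
        const char *path = argc > 2 ? argv[2] : "/mnt/tmp";  /* default: /mnt/tmp */
        int delay = argc > 3 ? atoi(argv[3]) : 0;            /* default: no delay */
        struct timespec start, end;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < nproc; i++) {
                if (fork() == 0) {
                        list_dir(path);
                        _exit(0);
                }
                if (delay)
                        sleep(delay);
        }
        while (wait(NULL) > 0)
                ;       /* time from start of first child to finish of last */
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("elapsed: %.9f seconds\n",
               (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9);
        return 0;
}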
After doing some debugging, I believe that much of the issue may be related to the following (cut down):

int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
{
        ret = nfs_readdir_xdr_to_array(desc, page, inode);

        if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
                /* Should never happen */
                nfs_zap_mapping(inode, inode->i_mapping);
        }

Here, every time we finish an nfs_readdir, we invalidate all of the remaining pages of the directory inode's mapping. This includes any pages just filled by another process further along in the directory listing. As the number of processes calling nfs_readdir increases, the processes each invalidate pages, so each is required to make its own READDIR/READDIRPLUS calls. (Note: I have not found evidence of the invalidate failing and nfs_zap_mapping being called, but it's possible that may occur as well.)

Also, calling nfs_zap_mapping every time we decide to force readdirplus seems suspicious:

void nfs_force_use_readdirplus(struct inode *dir)
{
        if (!list_empty(&NFS_I(dir)->open_files)) {
                nfs_advise_use_readdirplus(dir);
                nfs_zap_mapping(dir, dir->i_mapping);

Wouldn't something like this make more sense?

void nfs_force_use_readdirplus(struct inode *dir)
{
        if (!list_empty(&NFS_I(dir)->open_files)) {
                if (!test_and_set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                        nfs_zap_mapping(dir, dir->i_mapping);

It also seems non-intuitive that nfs_use_readdirplus, while returning the status of the NFS_INO_ADVISE_RDPLUS flag, also clears that flag:

bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
        if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
                return false;
        if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                return true;
        if (ctx->pos == 0)
                return true;
        return false;

What if this were to only test the bit:

bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
        if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
                return false;
        if (test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                return true;
        if (ctx->pos == 0)
                return true;
        return false;

and have nfs_readdir clear the bit at the bottom IFF nfs_force_use_readdirplus() hasn't been called to set it again since the last time nfs_readdir was called (at least by this process)? (This obviously requires a bit more logic to get right.)
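To make that last suggestion slightly more concrete, here is a purely illustrative sketch, not a tested patch and not existing kernel code: it assumes a hypothetical per-inode counter, rdplus_gen, which does not exist in struct nfs_inode, so that nfs_readdir could tell whether readdirplus was re-forced while the listing was in flight.

/* Illustrative only: rdplus_gen is a hypothetical atomic_t that would have
 * to be added to struct nfs_inode; nothing below is real kernel code. */

void nfs_force_use_readdirplus(struct inode *dir)
{
        struct nfs_inode *nfsi = NFS_I(dir);

        if (!list_empty(&nfsi->open_files)) {
                atomic_inc(&nfsi->rdplus_gen);  /* record that readdirplus was (re-)forced */
                if (!test_and_set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags))
                        nfs_zap_mapping(dir, dir->i_mapping);
        }
}

/* ...and in nfs_readdir(), sampled before the listing loop: */
        int gen = atomic_read(&NFS_I(inode)->rdplus_gen);

/* ...then after the listing loop: only clear the advice if nobody
 * re-forced readdirplus while we were reading the directory */
        if (atomic_read(&NFS_I(inode)->rdplus_gen) == gen)
                clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);

Whether clearing the advice this way is actually correct with respect to what is already in the page cache is part of the "more logic" that would need working out.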
Hi Frank, we've been working on a very similar problem in bug 1325766. I think things should work better with this fix: http://marc.info/?l=linux-nfs&m=147222590709214&w=2
(In reply to Benjamin Coddington from comment #2)

> I think things should work better with this fix:
> http://marc.info/?l=linux-nfs&m=147222590709214&w=2

Actually, it may not help this problem:

(In reply to Frank Sorenson from comment #1)

> After doing some debugging, I believe that much of the issue may be related
> to the following (cut down):
>
> int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
> {
>         ret = nfs_readdir_xdr_to_array(desc, page, inode);
>
>         if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1,
>                                           -1) < 0) {
>                 /* Should never happen */
>                 nfs_zap_mapping(inode, inode->i_mapping);
>         }

...

> Also, calling nfs_zap_mapping every time we decide to force readdirplus
> seems suspicious:

If the pagecache doesn't currently have the extra attributes we would get from a readdirplus, then we zap the mapping. Forcing readdirplus is a signal from nfs_getattr() that we might be doing an 'ls -l' on the directory and the entry attributes have expired or need updating; readdirplus saves us from sending a GETATTR for every single entry, so we should drop the cached pages.

> void nfs_force_use_readdirplus(struct inode *dir)
> {
>         if (!list_empty(&NFS_I(dir)->open_files)) {
>                 nfs_advise_use_readdirplus(dir);
>                 nfs_zap_mapping(dir, dir->i_mapping);
>
> Wouldn't something like this make more sense?
>
> void nfs_force_use_readdirplus(struct inode *dir)
> {
>         if (!list_empty(&NFS_I(dir)->open_files)) {
>                 if (!test_and_set_bit(NFS_INO_ADVISE_RDPLUS,
>                                       &NFS_I(dir)->flags))
>                         nfs_zap_mapping(dir, dir->i_mapping);

No, I think that just because NFS_INO_ADVISE_RDPLUS is not currently set doesn't mean we don't have readdirplus cache pages: NFS_INO_ADVISE_RDPLUS is cleared by nfs_use_readdirplus(), so it is not an indicator of the type of entries in the page cache.

> It also seems non-intuitive that nfs_use_readdirplus, while returning the
> status of the NFS_INO_ADVISE_RDPLUS flag, also clears that flag:
>
> bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> {
>         if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
>                 return false;
>         if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
>                 return true;
>         if (ctx->pos == 0)
>                 return true;
>         return false;

Right! We usually don't want to use readdirplus, only when we need the extra attributes.

> What if this were to only test the bit:
>
> bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> {
>         if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
>                 return false;
>         if (test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
>                 return true;
>         if (ctx->pos == 0)
>                 return true;
>         return false;
>
> and have nfs_readdir clear the bit at the bottom IFF nfs_force_use_readdirplus()
> hasn't been called to set it again since the last time nfs_readdir was
> called (at least by this process)? (This obviously requires a bit more
> logic to get right.)

There are three optimizations (that I know about) that we should maintain:

1 - don't use READDIRPLUS unnecessarily

2 - if we're doing 'ls -l', detect this so that the second batch of entries returned by nfs_readdir() has been retrieved with READDIRPLUS, to avoid a GETATTR storm

3 - the one broken in bug 1325766: don't invalidate the page cache every time the directory is modified if we are in the middle of a listing; at least wait until the attributes time out
(In reply to Frank Sorenson from comment #1)

> After doing some debugging, I believe that much of the issue may be related
> to the following (cut down):
>
> int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
> {
>         ret = nfs_readdir_xdr_to_array(desc, page, inode);
>
>         if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1,
>                                           -1) < 0) {
>                 /* Should never happen */
>                 nfs_zap_mapping(inode, inode->i_mapping);
>         }
>
> Here, every time we finish an nfs_readdir, we invalidate all of the
> remaining pages of the directory inode's mapping. This includes any pages
> just filled by another process further along in the directory listing. As
> the number of processes calling nfs_readdir increases, the processes each
> invalidate pages, so each is required to make its own READDIR/READDIRPLUS
> calls.

This can't be the problem, since we'd only have gotten here if the page wasn't found in the cache, so processes that are "behind" never get here and cut off pages for processes that are "ahead".

However, new processes starting in nfs_readdir() with ctx->pos == 0 are going to nfs_revalidate_mapping(). If the dircache has timed out and we keep spawning new processes, it seems like we could be stuck not making forward progress... but I don't think that could be the case with the reported 39-hour hang with only 30 processes. Even if the 30 processes were started 4 seconds apart, they should converge and move forward at roughly the same pace as a single process after 2 minutes.

I've been playing with this today. Frank, do you remember if you were using "noacl" on your mounts when testing this? I found the extra ACL RPCs to be a significant slowdown. I'm going to work without ACL from here on.
Short description of the problem upstream:
http://marc.info/?l=linux-nfs&m=148111734626658&w=2

Proposed fix:
http://marc.info/?l=linux-nfs&m=148111783326790&w=2