Created attachment 1195937 [details]
simple script to list directory

Description of problem:
When multiple processes are listing the contents of a directory simultaneously, calls to getdents() can take a very long time. This occurs even if the contents of the directory are unchanging. As the number of directory entries increases, and as the number of processes listing the directory concurrently increases, the time required to list a directory can become huge.

Version-Release number of selected component (if applicable):
RHEL 7 kernels:
    kernel 3.10.0-327.10.1.el7.x86_64
    kernel 3.10.0-229.el7.x86_64
RHEL 6 kernels:
    2.6.32-573.8.1.el6.x86_64
    2.6.32-642.el6.x86_64
upstream:
    4.8.0-rc2+

How reproducible:
Fairly reliably.

Steps to Reproduce:
Start multiple 'ls -fl' commands on an nfs-mounted directory simultaneously. A simple reproducer is attached.

Actual results:
For a directory with 200,000 files, each having a 23-byte filename:
    1 process:                     13.250018110
    20 simultaneously:             53.123712762
    20 processes, 1 second delay:  58.065284049
    20 processes, 2 second delay:  98.601665008
    20 processes, 3 second delay:  96.790224244
    20 processes, 4 second delay:  97.487171089
(time from start of first process to finish of last process)

Expected results:
Some performance loss due to scalability problems is expected, but these listing times are far longer than expected.

Additional info:
Using 30 processes simultaneously listing a directory with 1.8 million 23-byte files (arbitrary, but seen in a customer case), I have reproduced a 39-hour hang (after which all 30 processes completed within a second of each other).

I have additional results, debugging, and analysis, which I will add/attach.

The parallel_ls reproducer script can take up to 3 arguments, with default values if not provided:
    # parallel_ls <threads> <path> <delay>
    threads: number of processes to start simultaneously (default: 2)
    path:    directory to list (default: /mnt/tmp)
    delay:   time to sleep between starting each process (default: no delay)
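For reference, here is a minimal, illustrative C reproducer along the same lines. This is not the attached parallel_ls script; the argument order and defaults are taken from the description above, and the per-entry stat that 'ls -l' performs is approximated with fstatat().

/*
 * parallel_getdents.c -- illustrative reproducer sketch only.
 * Starts <threads> child processes; each walks <path> once with readdir()
 * (which drives getdents()) and stats each entry, roughly like 'ls -fl'.
 *
 * Build: gcc -o parallel_getdents parallel_getdents.c
 * Usage: ./parallel_getdents [threads] [path] [delay]
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void list_dir(const char *path)
{
        DIR *dir = opendir(path);
        struct dirent *de;
        struct stat st;
        unsigned long count = 0;

        if (!dir) {
                perror("opendir");
                _exit(1);
        }
        while ((de = readdir(dir)) != NULL) {
                fstatat(dirfd(dir), de->d_name, &st, 0);  /* per-entry stat, like ls -l */
                count++;
        }
        closedir(dir);
        printf("pid %d listed %lu entries\n", (int)getpid(), count);
}

int main(int argc, char *argv[])
{
        int nproc = argc > 1 ? atoi(argv[1]) : 2;            /* default: 2 processes */
        const char *path = argc > 2 ? argv[2] : "/mnt/tmp";  /* default: /mnt/tmp */
        int delay = argc > 3 ? atoi(argv[3]) : 0;            /* default: no delay */
        struct timespec start, end;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < nproc; i++) {
                if (fork() == 0) {
                        list_dir(path);
                        _exit(0);
                }
                if (delay)
                        sleep(delay);
        }
        while (wait(NULL) > 0)
                ;       /* time from start of first child to finish of last */
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("elapsed: %.9f seconds\n",
               (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9);
        return 0;
}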
After doing some debugging, I believe that much of the issue may be related to the following (cut down):

int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
{
        ret = nfs_readdir_xdr_to_array(desc, page, inode);

        if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1, -1) < 0) {
                /* Should never happen */
                nfs_zap_mapping(inode, inode->i_mapping);
        }

Here, every time we finish an nfs_readdir, we invalidate all of the remaining pages of the directory inode's mapping. This includes any pages just filled by another process further along in the directory listing. As the number of processes calling nfs_readdir increases, the processes each invalidate pages, so each is required to make its own READDIR/READDIRPLUS calls. (Note: I have not found evidence of the invalidate failing and nfs_zap_mapping being called, but it's possible that may occur as well.)

Also, calling nfs_zap_mapping every time we decide to force readdirplus seems suspicious:

void nfs_force_use_readdirplus(struct inode *dir)
{
        if (!list_empty(&NFS_I(dir)->open_files)) {
                nfs_advise_use_readdirplus(dir);
                nfs_zap_mapping(dir, dir->i_mapping);

Wouldn't something like this make more sense?

void nfs_force_use_readdirplus(struct inode *dir)
{
        if (!list_empty(&NFS_I(dir)->open_files)) {
                if (!test_and_set_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                        nfs_zap_mapping(dir, dir->i_mapping);

It also seems non-intuitive that nfs_use_readdirplus, while returning the status of the NFS_INO_ADVISE_RDPLUS flag, also clears that flag:

bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
        if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
                return false;
        if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                return true;
        if (ctx->pos == 0)
                return true;
        return false;

What if this were to only test the bit:

bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
{
        if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
                return false;
        if (test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
                return true;
        if (ctx->pos == 0)
                return true;
        return false;

and have nfs_readdir clear the bit at the bottom IFF nfs_force_use_readdirplus() hasn't been called to set it again since the last time nfs_readdir was called (at least by this process)? (This obviously requires a bit more logic to get right.)
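To make that last suggestion slightly more concrete, here is a purely illustrative sketch, not a tested patch and not existing kernel code: it assumes a hypothetical per-inode counter, rdplus_gen, which does not exist in struct nfs_inode, so that nfs_readdir could tell whether readdirplus was re-forced while the listing was in flight.

/* Illustrative only: rdplus_gen is a hypothetical atomic_t that would have
 * to be added to struct nfs_inode; nothing below is real kernel code. */

void nfs_force_use_readdirplus(struct inode *dir)
{
        struct nfs_inode *nfsi = NFS_I(dir);

        if (!list_empty(&nfsi->open_files)) {
                atomic_inc(&nfsi->rdplus_gen);  /* record that readdirplus was (re-)forced */
                if (!test_and_set_bit(NFS_INO_ADVISE_RDPLUS, &nfsi->flags))
                        nfs_zap_mapping(dir, dir->i_mapping);
        }
}

/* ...and in nfs_readdir(), sampled before the listing loop: */
        int gen = atomic_read(&NFS_I(inode)->rdplus_gen);

/* ...then after the listing loop: only clear the advice if nobody
 * re-forced readdirplus while we were reading the directory */
        if (atomic_read(&NFS_I(inode)->rdplus_gen) == gen)
                clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inode)->flags);

Whether clearing the advice this way is actually correct with respect to what is already in the page cache is part of the "more logic" that would need working out.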
Hi Frank, we've been working on a very similar problem in bug 1325766. I think things should work better with this fix: http://marc.info/?l=linux-nfs&m=147222590709214&w=2
(In reply to Benjamin Coddington from comment #2)

> I think things should work better with this fix:
> http://marc.info/?l=linux-nfs&m=147222590709214&w=2

Actually, it may not help this problem:

(In reply to Frank Sorenson from comment #1)

> After doing some debugging, I believe that much of the issue may be related
> to the following (cut down):
>
> int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
> {
>         ret = nfs_readdir_xdr_to_array(desc, page, inode);
>
>         if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1,
>                                           -1) < 0) {
>                 /* Should never happen */
>                 nfs_zap_mapping(inode, inode->i_mapping);
>         }

...

> Also, calling nfs_zap_mapping every time we decide to force readdirplus
> seems suspicious:

If the pagecache doesn't currently have the extra attributes we would get from a readdirplus, then we zap the mapping. Forcing readdirplus is a signal from nfs_getattr() that we might be doing an 'ls -l' on the directory and the entry attributes have expired or need updating; readdirplus saves us from sending a GETATTR for every single entry, so we should drop the cached pages.

> void nfs_force_use_readdirplus(struct inode *dir)
> {
>         if (!list_empty(&NFS_I(dir)->open_files)) {
>                 nfs_advise_use_readdirplus(dir);
>                 nfs_zap_mapping(dir, dir->i_mapping);
>
> Wouldn't something like this make more sense?
>
> void nfs_force_use_readdirplus(struct inode *dir)
> {
>         if (!list_empty(&NFS_I(dir)->open_files)) {
>                 if (!test_and_set_bit(NFS_INO_ADVISE_RDPLUS,
>                                       &NFS_I(dir)->flags))
>                         nfs_zap_mapping(dir, dir->i_mapping);

No, I think that just because NFS_INO_ADVISE_RDPLUS is not currently set doesn't mean we don't have readdirplus cache pages: NFS_INO_ADVISE_RDPLUS is cleared by nfs_use_readdirplus(), so it is not an indicator of the type of entries in the page cache.

> It also seems non-intuitive that nfs_use_readdirplus, while returning the
> status of the NFS_INO_ADVISE_RDPLUS flag, also clears that flag:
>
> bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> {
>         if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
>                 return false;
>         if (test_and_clear_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
>                 return true;
>         if (ctx->pos == 0)
>                 return true;
>         return false;

Right! We usually don't want to use readdirplus, only when we need the extra attributes.

> What if this were to only test the bit:
>
> bool nfs_use_readdirplus(struct inode *dir, struct dir_context *ctx)
> {
>         if (!nfs_server_capable(dir, NFS_CAP_READDIRPLUS))
>                 return false;
>         if (test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(dir)->flags))
>                 return true;
>         if (ctx->pos == 0)
>                 return true;
>         return false;
>
> and have nfs_readdir clear the bit at the bottom IFF nfs_force_use_readdirplus()
> hasn't been called to set it again since the last time nfs_readdir was
> called (at least by this process)? (This obviously requires a bit more
> logic to get right.)

There are three optimizations (that I know about) that we should maintain:

1 - don't use READDIRPLUS unnecessarily

2 - if we're doing 'ls -l', detect this so that the second batch of entries returned by nfs_readdir() has been retrieved with READDIRPLUS, to avoid a GETATTR storm

3 - the one broken in bug 1325766: don't invalidate the page cache every time the directory is modified if we are in the middle of a listing; at least wait until the attributes time out
(In reply to Frank Sorenson from comment #1)

> After doing some debugging, I believe that much of the issue may be related
> to the following (cut down):
>
> int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
> {
>         ret = nfs_readdir_xdr_to_array(desc, page, inode);
>
>         if (invalidate_inode_pages2_range(inode->i_mapping, page->index + 1,
>                                           -1) < 0) {
>                 /* Should never happen */
>                 nfs_zap_mapping(inode, inode->i_mapping);
>         }
>
> Here, every time we finish an nfs_readdir, we invalidate all of the
> remaining pages of the directory inode's mapping. This includes any pages
> just filled by another process further along in the directory listing. As
> the number of processes calling nfs_readdir increases, the processes each
> invalidate pages, so each is required to make its own READDIR/READDIRPLUS
> calls.

This can't be the problem, since we'd only have gotten here if the page wasn't found in the cache, so processes that are "behind" never get here and cut off pages for processes that are "ahead".

However, new processes starting in nfs_readdir() with ctx->pos == 0 are going to nfs_revalidate_mapping(). If the dircache has timed out and we keep spawning new processes, it seems like we could be stuck not making forward progress... but I don't think that could be the case with the reported 39-hour hang with only 30 processes. Even if the 30 processes were started 4 seconds apart, they should converge and move forward at roughly the same pace as a single process after 2 minutes.

I've been playing with this today. Frank, do you remember if you were using "noacl" on your mounts when testing this? I found the extra ACL RPCs to be a significant slowdown. I'm going to work without ACL from here on.
Short description of the problem upstream:
http://marc.info/?l=linux-nfs&m=148111734626658&w=2

Proposed fix:
http://marc.info/?l=linux-nfs&m=148111783326790&w=2