1583150 – [Ganesha] ls -laRt on Ganesha mount is taking ~3.6 Hrs for around 11,00,000 files

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	nfs-ganesha
Sub Component:
Version:	rhgs-3.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.4.z Batch Update 3
Assignee:	Kaleb KEITHLEY
QA Contact:	Jilju Joy
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-05-28 10:28 UTC by Manisha Saini
Modified:	2019-02-06 12:23 UTC
CC List:	13 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-02-04 07:34:12 UTC
Embargoed:
Dependent Products:
Flags:	jijoy: needinfo- jijoy: needinfo- jijoy: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0260	0	None	None	None	2019-02-04 07:34:15 UTC

Description Manisha Saini 2018-05-28 10:28:05 UTC

Description of problem:

Running ls -laRt on a Ganesha mount point takes ~3.6 hours for about 11 lakh (1.1 million) files.

Version-Release number of selected component (if applicable):

nfs-ganesha-gluster-2.5.5-7.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.5.5-7.el7rhgs.x86_64
nfs-ganesha-2.5.5-7.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-11.el7rhgs.x86_64


How reproducible:

1/1

Steps to Reproduce:

1. Create a 6-node Ganesha cluster.
2. Create a 4 x 3 = 12 Distributed-Replicate volume.
3. Export the volume via Ganesha.
4. Mount the volume on 4 different clients via NFSv4.0.
5. Run the following workload from the 4 clients:

Client 1: Create 2 lakh (200,000) files using the touch command
Client 2: Run bonnie
Client 3, Client 4: Create files in a loop using the dd command (for i in {1..1000000};do dd if=/dev/urandom of=stressc3$i conv=fdatasync bs=100 count=10000;done)

Stop the I/O after some time (once the I/O on Client 1 and Client 2 has finished).

6. Perform ls -laRt from a single client.


Actual results:

ls -laRt took ~3.6 hours for about 11 lakh (1.1 million) files:

-----------------
4 directories, 1129061 files



real    218m24.789s
user    0m9.004s
sys     0m51.261s
-----------------

Expected results:

ls -laRt should not take this long.


Additional info:


When ls -laRt was run on a FUSE client, with the same volume and the same data mounted via glusterfs, it took just under 3 minutes:

----------
real    2m53.605s
user    0m10.317s
sys     0m19.533s
---------
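
To put the two timings side by side, here is a minimal Python sketch that converts the `real` values reported above into seconds and derives the slowdown and listing rate. The helper name `parse_real` is hypothetical; the only inputs are the numbers already quoted in this report.

```python
import re

def parse_real(text):
    """Convert a `time` value such as '218m24.789s' into seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", text.strip())
    if m is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# Figures copied from the report: file count from the Ganesha run,
# `real` times from the Ganesha and FUSE runs on the same data.
FILES = 1129061
ganesha_s = parse_real("218m24.789s")
fuse_s = parse_real("2m53.605s")

slowdown = ganesha_s / fuse_s
print(f"Ganesha: {ganesha_s / 3600:.2f} h, {FILES / ganesha_s:.0f} files/s")
print(f"FUSE:    {fuse_s / 60:.2f} min, {FILES / fuse_s:.0f} files/s")
print(f"Ganesha/FUSE wall-clock ratio: {slowdown:.1f}x")
```

With these inputs the wall-clock ratio comes out at roughly 75x.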

Comment 5 Daniel Gryniewicz 2018-05-30 14:41:35 UTC

The after-reboot is the same directory tree?  So after reboot, Ganesha is *faster* than FUSE?  That's very surprising...

I wonder if this is creations causing a degeneration in readdir chunks... Frank?

Comment 6 Frank Filz 2018-05-30 17:27:43 UTC

So you do a bunch of creates, then when all done with creates, do ls -laRt?

The creates will fill the dirent cache (up to the limit of Dir_Chunk * Detached_Mult). The dirents will be in the "unattached" chunk. Assuming your mdcache is big enough, all of the fsal_obj_handles will be cached with attributes.

Hmm, is the ls -laRt from the same client that did the creates? If so, and it has cached the dirents client side, it MAY just do a LOOKUP and/or GETATTR not a READDIR...

Assuming the client DOES do a READDIR, I'm not sure what the implications are of the readdir callback passing an fsal_obj_handle that duplicates one already in the mdcache.

Comment 7 Manisha Saini 2018-05-31 16:21:28 UTC

(In reply to Daniel Gryniewicz from comment #5)
> The after-reboot is the same directory tree?  So after reboot, Ganesha is
> *faster* than FUSE?  That's very surprising...
> 
> I wonder if this is creations causing a degeneration in readdir chunks...
> Frank?

Yes. After restarting the Ganesha service, ls -laRt was performed on the same data set (1 lakh files).

After writing the data set on the mount, ls -laRt took around 15 minutes. But after restarting the Ganesha service, it took far less time, i.e. ~28 seconds.

Note: The same client was used in both iterations.

Comment 9 Frank Filz 2018-05-31 19:37:33 UTC

chunks_hwmark is 5000; that's the number of chunks, so 640,000 dirents, which means your entire directory can live in the dirent cache.

Since nothing is changing any attributes, the directory is not being invalidated (and the activity keeps the mdcache entry for the directory from being reaped).

We need to be setting entries_hwmark to something larger than chunks_hwmark * dir_chunk. Ideally we should have a large enough dirent cache for the largest directory and thus enough mdcache entries also. But we do still need to have a somewhat reasonable performance for a directory larger than that.

Looking through why mdcache_readdir_chunked is calling getattrs, it's because the mdcache entry for the dirent was no longer present. Clearly Gluster's performance doing everything that is needed to re-instantiate an entry is far worse than doing a readdir plus. Given that, I would strongly suggest we set entries_hwmark to at least 10 * chunks_hwmark * dir_chunk to ensure that we almost always have the mdcache entries corresponding to a dirent cached.
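
The sizing arithmetic in this comment can be sketched in Python as follows. The function names are hypothetical, and the dir_chunk value of 128 is an assumption inferred from the figures above (640,000 dirents / 5000 chunks), not a value read from a configuration file.

```python
def dirent_capacity(chunks_hwmark, dir_chunk):
    """Number of dirents the dirent cache can hold."""
    return chunks_hwmark * dir_chunk

def suggested_entries_hwmark(chunks_hwmark, dir_chunk, factor=10):
    """Lower bound proposed in the comment: factor * chunks_hwmark * dir_chunk."""
    return factor * dirent_capacity(chunks_hwmark, dir_chunk)

CHUNKS_HWMARK = 5000  # value quoted in the comment
DIR_CHUNK = 128       # assumption: inferred from 640,000 / 5000

print("dirent cache capacity:", dirent_capacity(CHUNKS_HWMARK, DIR_CHUNK))
print("suggested entries_hwmark (at least):",
      suggested_entries_hwmark(CHUNKS_HWMARK, DIR_CHUNK))
```

Under these assumptions the dirent cache holds 640,000 dirents and the suggested lower bound for entries_hwmark is 6,400,000.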

Comment 10 Daniel Gryniewicz 2018-06-01 12:47:40 UTC

A dirent doesn't actually contain any information about the entry itself except the key to look it up.  So we can't recreate the entry directly from the dirent; instead, if the entry has been reaped, we need to re-create it with a lookup/getattrs pair.

You will always have this problem if your entries_hwmark is lower than the number of dirents in a directory.  You've created a situation where ganesha's cache cannot hold an entire directory, so listing the directory will have to do lookups every time.  We have the odd situation here where lookup/getattrs is much much slower than readdir_p, so the first readdir is faster, but the best solution is to have a large enough cache.
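
The effect described here can be illustrated with a toy model. The sketch below uses a strict LRU cache (an assumption made for illustration; Ganesha's actual mdcache reaping is more elaborate and is not reproduced here) and counts how many entries must be re-created on a second full listing of a directory. All names are hypothetical.

```python
from collections import OrderedDict

def list_directory(cache, capacity, n_dirents):
    """Walk every dirent once; return how many entries had to be re-created."""
    misses = 0
    for key in range(n_dirents):
        if key in cache:
            cache.move_to_end(key)          # mark as most recently used
        else:
            misses += 1                     # stands in for a lookup/getattrs pair
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used entry
    return misses

def second_pass_misses(capacity, n_dirents):
    """Misses on a second listing after one warm-up listing."""
    cache = OrderedDict()
    list_directory(cache, capacity, n_dirents)
    return list_directory(cache, capacity, n_dirents)

ENTRIES_HWMARK = 5000  # toy cache size, not a recommendation
for ratio in (1, 2, 4):
    n = ratio * ENTRIES_HWMARK
    print(f"dirents/cache = {ratio}: {second_pass_misses(ENTRIES_HWMARK, n)} of {n} re-created")
```

In this strict-LRU model a sequential listing evicts each entry before it is needed again, so any directory larger than the cache misses on every entry, while a directory that fits exactly misses on none. Real behaviour may degrade less sharply, which is why measured numbers would still be needed.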

It may be possible to have a "large directory" detection where we dump the dirent cache for very large directories, if the FSAL has a setting that wants it?  I'm not sure if it will help, but it could be tried.

It would be interesting to see performance of readdir for #dirents vs. entries_hwmark.  Is there a ratio of (dirents) / (entries_hwmark) at which it starts getting really bad?  Your example here is (20k) / (5k) = 4.  Obviously 1 is fine.  Is 2 bad?  How is 3?  It'd be nice to have some numbers we can recommend to customers.

Comment 12 Daniel Gryniewicz 2018-06-08 14:01:12 UTC

I don't think this should be a blocker.  If the handle cache is large enough, there's no problem; and if the directory is huge, it will be slow. But it won't be any slower than previous versions, which didn't have readdir_p.

In the medium term, I have a proposal to fix this.  It will complicate the readdir code to the point where I'm not comfortable squeezing it into a release that is about to come out, but it can go into the next one, once properly tested.

Comment 32 errata-xmlrpc 2019-02-04 07:34:12 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0260
