Bug 1369312

Summary: [RFE] DHT performance improvements for directory operations
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nithya Balachandran <nbalacha>
Component: distribute
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED ERRATA
QA Contact: Sachin P Mali <smali>
Severity: high
Docs Contact:
Priority: urgent
Version: rhgs-3.1
CC: amukherj, jahernan, ksandha, msaini, rcyriac, rgowdapp, rhinduja, rhs-bugs, sheggodu, storage-qa-internal
Target Milestone: ---
Keywords: FutureFeature, ZStream
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: rebase
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:29:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1558995
Bug Blocks: 1118770, 1301474, 1336766, 1345828, 1503132

Description Nithya Balachandran 2016-08-23 06:15:14 UTC
Description of problem:

The fixes for directory consistency use locks to prevent concurrent ops from tromping on each other. Taking these locks causes the performance to degrade.

This BZ has been opened to track potential improvements to this approach.


Comment 4 Raghavendra G 2017-08-23 10:16:35 UTC
Patch [1] has been merged in upstream v3.12. This patch aims to bring down performance penalty due to locking and also fixes other consistency issues. This fix is not present in rhgs-3.3.0 and can be targeted for rhgs-3.4.0

[1] https://review.gluster.org/15472

Comment 5 Raghavendra G 2017-08-23 11:10:04 UTC
(In reply to Raghavendra G from comment #4)
> Patch [1] has been merged in upstream v3.12. This patch aims to bring down
> performance penalty due to locking and also fixes other consistency issues.
> This fix is not present in rhgs-3.3.0 and can be targeted for rhgs-3.4.0
> 
> [1] https://review.gluster.org/15472

More detailed breakdown of impact of this patch on different fops:

* mkdir
Note that before [1] the locking penalty in the dht_mkdir codepath was always constant, irrespective of scale. In fact, [1] increases the number of serialized locks acquired by 1 (though the penalty remains constant and independent of scale). So not much improvement or regression can be expected in this codepath. However, if there is parallel access (lookups) to the directory, [1] is expected to improve performance significantly, because the directory creation phase previously acquired locks serially on all subvolumes (see "directory creation by selfheal during lookup" below).

* rmdir
Note that before [1] the performance penalty due to locking increased linearly with scale, because we used to acquire a lock on all subvolumes of dht. With [1] the penalty is constant and independent of scale. So [1] is expected to improve the performance of rmdir significantly, especially when the number of dht subvolumes is relatively large.

* renamedir
Note that before [1] the performance penalty due to locking increased linearly with scale, because we used to acquire a lock on all subvolumes of dht. With [1] the penalty is constant and independent of scale. So [1] is expected to improve the performance of renamedir significantly, especially when the number of dht subvolumes is relatively large.

* directory creation by selfheal during lookup
Before [1], directory creation during selfheal would acquire a lock on all subvols. This is also the same lock that the mkdir codepath acquires while setting the layout. So a parallel heal on a directory being created can add this locking latency to mkdir. [1] makes this locking penalty constant irrespective of scale.

* layout healing by selfheal during lookup
No change in locking algorithm is introduced by [1]

More details of the algorithm itself can be found at [2].

[2] https://github.com/gluster/glusterfs/blob/master/doc/developer-guide/dirops-transactions-in-dht.md
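
To make the scaling claims for rmdir and renamedir concrete, here is a minimal sketch of the lock counts described above. It is an illustrative model only, not GlusterFS code. The function names are hypothetical, and ASSUMED_CONSTANT_LOCKS is a placeholder: the breakdown above says only that the post-patch cost is constant, not what the constant is.

```python
# Illustrative model of the locking cost described in this comment.
# Not GlusterFS code: all names here are hypothetical.

# Placeholder value. The comment states only that the cost after [1] is
# constant and independent of scale; the actual constant is not given.
ASSUMED_CONSTANT_LOCKS = 1


def locks_before_patch(num_subvols: int) -> int:
    """rmdir/renamedir before [1]: a lock on every DHT subvolume."""
    return num_subvols


def locks_after_patch(num_subvols: int,
                      constant: int = ASSUMED_CONSTANT_LOCKS) -> int:
    """rmdir/renamedir after [1]: constant, regardless of num_subvols."""
    return constant


def lock_table(subvol_counts):
    """Return (subvols, locks_before, locks_after) for each count."""
    return [(n, locks_before_patch(n), locks_after_patch(n))
            for n in subvol_counts]


if __name__ == "__main__":
    for n, before, after in lock_table([2, 6, 24]):
        print(f"subvols={n:3d} locks_before={before:3d} locks_after={after}")
```

The model shows only the shape of the change: the pre-patch count grows with the number of subvolumes, while the post-patch count stays flat. It says nothing about absolute latency.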

Comment 6 Raghavendra G 2017-09-04 06:50:22 UTC
Performance for the following operations should improve when compared with rhgs-3.3.0:
* rmdir
* renamedir
* mkdir, when the directory is accessed in parallel to directory creation.
* directory healing (for cases where a few subvolumes were down during directory creation and the directory is accessed later, after those subvolumes are back up).

No performance improvement is expected for:
* standalone mkdir (no parallel access during directory creation)

Comment 7 Raghavendra G 2017-09-04 07:08:00 UTC
(In reply to Raghavendra G from comment #6)
> Performance for the following operations should improve when compared with
> rhgs-3.3.0:
> * rmdir

Note that bz 1330235, which is CLOSED WONTFIX, will be fixed as part of the current bz.

Comment 10 Ambarish 2018-03-26 11:51:07 UTC
Karan has found two massive perf regressions on the latest interim build, on mkdirs and rmdirs:

https://bugzilla.redhat.com/show_bug.cgi?id=1558995 - 30% regression on small-file rmdirs from 3.3.1

https://bugzilla.redhat.com/show_bug.cgi?id=1558994 -  47% regression in mkdir from 3.3.1


Note to self and other QEs:

Verification of this RFE would involve:

A) The above perf regressions being fixed.
B) A substantial perf improvement over the baseline (any RHGS build without these fixes).


Since I find mkdirs and rmdirs to be VERY slow on glusterfs-3.12.2-5 in the basic use case (Dist Rep + FUSE), and this particular RFE tracks perf improvements on directory operations, I cannot move this bug to Verified.

Comment 18 Raghavendra G 2018-07-23 11:14:55 UTC
From,

https://bugzilla.redhat.com/show_bug.cgi?id=1598424#c23

331 total time               | 340 (bmux off) total time
=========================================================
entrylk-total-time 10750     | entrylk-total-time 16746.1
getxattr-total-time 93.4708  |
opendir-total-time 3388.05   | opendir-total-time 4024.19
readdirp-total-time 3841.3   | readdirp-total-time 4394.42
inodelk-total-time 6131.27   | inodelk-total-time 341.158
finodelk-total-time 0.011168 |
rmdir-total-time 7849.47     | rmdir-total-time 9430.73
lookup-total-time 35545.3    | lookup-total-time 40633.7

331 total calls              | 340 (bmux off) total calls
==========================================================
entrylk-total-calls 69120612 | entrylk-total-calls 70560624
getxattr-total-calls 383820  |
opendir-total-calls 17280936 | opendir-total-calls 17280936
readdirp-total-calls 5760317 | readdirp-total-calls 5760213
inodelk-total-calls 34560300 | inodelk-total-calls 1440162
finodelk-total-calls 48      |
rmdir-total-calls 17280072   | rmdir-total-calls 17280072
lookup-total-calls 18604130  | lookup-total-calls 21468343


331 total time               | 340 (bmux on) total time
=========================================================
entrylk-total-time 10750     | entrylk-total-time 10090.8
getxattr-total-time 93.4708  | getxattr-total-time 0.117269
opendir-total-time 3388.05   | opendir-total-time 3252.09
readdirp-total-time 3841.3   | readdirp-total-time 3436.54
inodelk-total-time 6131.27   | inodelk-total-time 202.09
finodelk-total-time 0.011168 | finodelk-total-time 0.003563
rmdir-total-time 7849.47     | rmdir-total-time 6923.74
lookup-total-time 35545.3    | lookup-total-time 43867.4

331 total calls              | 340 (bmux on) total calls
==========================================================
entrylk-total-calls 69120612 | entrylk-total-calls 70560624
getxattr-total-calls 383820  | getxattr-total-calls 864
opendir-total-calls 17280936 | opendir-total-calls 17281152
readdirp-total-calls 5760317 | readdirp-total-calls 5760213
inodelk-total-calls 34560300 | inodelk-total-calls 1440162
finodelk-total-calls 48      | finodelk-total-calls 48
rmdir-total-calls 17280072   | rmdir-total-calls 17280072
lookup-total-calls 18604130  | lookup-total-calls 18201803

Observe the number of inodelks:

bmux on:
inodelk-total-calls 34560300 | inodelk-total-calls 1440162

bmux off:
inodelk-total-calls 34560300 | inodelk-total-calls 1440162


and the total time:

bmux on:
inodelk-total-time 6131.27   | inodelk-total-time 202.09

bmux off:
inodelk-total-time 6131.27   | inodelk-total-time 341.158

So, from the perspective of DHT, there is an improvement. Also, it is observed that with bmux off there is an improvement in the number of rmdirs in 3.4.0 with respect to 3.3.1.
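
As a quick check of the inodelk comparison, the sketch below computes the reduction factors from the figures quoted in the tables above. The numbers are copied from the inodelk rows of this comment. The helper names are hypothetical, and the time unit is whatever the profile reported, so only ratios are computed.

```python
# Figures copied from the inodelk rows of the profile tables in this comment.
# The time unit is whatever the profile reported; only ratios are computed.
INODELK = {
    "331": {"calls": 34560300, "time": 6131.27},
    "340 (bmux on)": {"calls": 1440162, "time": 202.09},
    "340 (bmux off)": {"calls": 1440162, "time": 341.158},
}


def reduction_factor(baseline: float, current: float) -> float:
    """How many times smaller `current` is than `baseline`."""
    return baseline / current


def inodelk_summary(baseline_key: str = "331"):
    """Return (run, calls_factor, time_factor) for each non-baseline run."""
    base = INODELK[baseline_key]
    rows = []
    for name, row in INODELK.items():
        if name == baseline_key:
            continue
        rows.append((name,
                     reduction_factor(base["calls"], row["calls"]),
                     reduction_factor(base["time"], row["time"])))
    return rows


if __name__ == "__main__":
    for name, calls_x, time_x in inodelk_summary():
        print(f"{name}: inodelk calls reduced {calls_x:.1f}x, "
              f"total time reduced {time_x:.1f}x")
```

In both 340 runs the number of inodelk calls drops by roughly a factor of 24 relative to 331. The total inodelk time drops in both runs as well, by a larger factor with bmux on than with bmux off.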

The gains from this RFE are offset by losses from bmux. As there are already separate bugs, bz 1598424 and bz 1598056, tracking regressions with respect to bmux, I propose to move this bug to ON_QA.

Also, I need data for renamedir, since rmdir and renamedir are the two operations that benefit from this improvement, as already noted in comment #5. @Karan, can you update the bug with perf numbers for renamedir?

Comment 19 Raghavendra G 2018-07-23 11:17:40 UTC
NOTE: The scope of this RFE is improvements in DHT.

Comment 25 errata-xmlrpc 2018-09-04 06:29:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607