Bug 1598056

Summary: [Perf] mkdirs are regressing 54% on 3 way replicated volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Karan Sandha <ksandha>
Component: core
Assignee: Mohit Agrawal <moagrawa>
Status: CLOSED DEFERRED
QA Contact: Prasanth <pprakash>
Severity: medium
Docs Contact:
Priority: medium
Version: rhgs-3.4
CC: bkunal, guillaume.pavese, jahernan, ksubrahm, moagrawa, mpillai, nchilaka, pprakash, rhinduja, rhs-bugs, sanandpa, sheggodu, storage-qa-internal, vdas
Target Milestone: ---
Keywords: Performance, Regression, Reopened
Target Release: ---
Hardware: All
OS: Linux
Whiteboard: brick-multiplexing
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-08-25 15:21:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1577750, 1610751, 1651050, 1651508
Attachments:
mkdir_331 (flags: none)
mkdir_340 (flags: none)
bmux-off-mkdir total time (flags: none)
bmux-on-mkdir-total-time (flags: none)
sorted mkdir latencies on brick stack with bmux on (flags: none)
sorted mkdir latencies on brick stack with bmux off (flags: none)

Comment 12 Karan Sandha 2018-07-09 08:46:52 UTC
Created attachment 1457407 [details]
mkdir_331

Comment 13 Karan Sandha 2018-07-09 08:47:22 UTC
Created attachment 1457408 [details]
mkdir_340

Comment 15 Raghavendra G 2018-07-11 06:31:48 UTC
Following is the condensed information from comment #12 and comment #13. Time is in seconds.

 total-times<331>                     | total-times<340>
=========================================================================
entrylk-total-time 5300.63            |entrylk-total-time 6675.02
lookup-total-time 5539.24             |lookup-total-time 6547.7
inodelk-total-time 5646.45            |inodelk-total-time 6791.9
finodelk-total-time 0.003457          |finodelk-total-time 0.003292
mkdir-total-time 23383                |mkdir-total-time 23599
setxattr-total-time 4341.38           |setxattr-total-time 5710.13
stat-total-time                       |stat-total-time 19.09
statfs-total-time 18.845              |statfs-total-time 42.8818

 total-calls<331>                     | total-calls<340>
 ========================================================================
entrylk-total-calls 34560612          |entrylk-total-calls 36000624
lookup-total-calls 18833403           |lookup-total-calls 18583905
inodelk-total-calls 37440300          |inodelk-total-calls 37440162
finodelk-total-calls 48               |finodelk-total-calls 48
mkdir-total-calls 17280072            |mkdir-total-calls 17280072
setxattr-total-calls 17280072         |setxattr-total-calls 17280072
stat-total-calls                      |stat-total-calls 122498
statfs-total-calls 145440             |statfs-total-calls 287064

Note the increase in total times of entrylk, lookup, inodelk, mkdir (a slight increase of around 220s), stat and statfs.

However:
* the number of entrylk calls has increased on the brick (expected, as DHT is adding two entrylks for mkdir)
* the number of lookup calls has actually _REDUCED_ (yet lookup latency on the brick has increased by 1008 seconds in 340)
* the number of inodelk calls has remained almost constant (yet inodelk latency has increased by 1146 seconds in 340)
* mkdir total calls are exactly the same, but there is a slight increase in latency (around 220s)
* the number of setxattr calls has remained exactly the same, but latency has increased by 1369 seconds
* there were no stats in 331 and there are a few in 340, but their latency is negligible
* the number of statfs calls has increased by 141624 in 340, but the latency increase is a negligible ~20s

Note that I've added up times without considering replica count, but that should give a rough overview. I'll study this data further before drawing any conclusions.
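
The regression is easier to read as per-call averages (total time / total calls) than as raw totals; for example, setxattr goes from roughly 251 us/call on 331 to roughly 330 us/call on 340. A minimal standalone sketch of that arithmetic, using only the figures from the tables above:

#include <stdio.h>

struct fop_stat {
    const char *name;
    double t331, c331; /* total time (s) and total calls on 331 */
    double t340, c340; /* total time (s) and total calls on 340 */
};

int main(void)
{
    /* Figures copied verbatim from the comparison tables above. */
    struct fop_stat stats[] = {
        { "lookup",    5539.24, 18833403,  6547.70, 18583905 },
        { "inodelk",   5646.45, 37440300,  6791.90, 37440162 },
        { "mkdir",    23383.00, 17280072, 23599.00, 17280072 },
        { "setxattr",  4341.38, 17280072,  5710.13, 17280072 },
    };

    for (size_t i = 0; i < sizeof(stats) / sizeof(stats[0]); i++) {
        double a331 = stats[i].t331 / stats[i].c331 * 1e6; /* us/call */
        double a340 = stats[i].t340 / stats[i].c340 * 1e6;
        printf("%-9s 331: %7.1f us/call  340: %7.1f us/call  (+%5.1f)\n",
               stats[i].name, a331, a340, a340 - a331);
    }
    return 0;
}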

Comment 18 Raghavendra G 2018-07-12 07:54:30 UTC
Due to the patch
https://review.gluster.org/#/c/19732

I would expect more lookups, since mkdir would initiate server-resolve with RESOLVE_NOT, but I am seeing fewer lookups. This is puzzling.
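
For context, a simplified sketch of the resolve types involved, assuming a condensed form of protocol/server's resolver (an illustrative definition, not the actual code):

/* Entry fops carry a resolve type that tells the brick how to resolve
 * the (parent-gfid, basename) pair before the fop runs. */
typedef enum {
    RESOLVE_MUST, /* entry must already exist (e.g. unlink)            */
    RESOLVE_NOT,  /* entry must NOT exist yet (e.g. mkdir, create);
                   * verifying that means looking up the parent and the
                   * basename on the brick, hence the extra lookups    */
    RESOLVE_MAY,  /* entry may or may not exist (e.g. lookup itself)   */
} resolve_type_t;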

Comment 22 Raghavendra G 2018-07-18 06:43:47 UTC
 331 total time                       |  340 total time (bmux off)
=============================================================
entrylk-total-time 5300.63            |entrylk-total-time 7294
lookup-total-time 5539.24             |lookup-total-time 6613.1
inodelk-total-time 5646.45            |inodelk-total-time 7007.72
finodelk-total-time 0.003457          |
mkdir-total-time 23383                |mkdir-total-time 24624.4
setxattr-total-time 4341.38           |setxattr-total-time 4333.94
stat-total-time                       |
statfs-total-time 18.845              |

  331 total calls                     | 340 total calls (bmux off)
============================================================
entrylk-total-calls 34560612          |entrylk-total-calls 36000624
lookup-total-calls 18833403           |lookup-total-calls 20836566
inodelk-total-calls 37440300          |inodelk-total-calls 37470204
finodelk-total-calls 48               |
mkdir-total-calls 17280072            |mkdir-total-calls 17280072
setxattr-total-calls 17280072         |setxattr-total-calls 17295093
stat-total-calls                      |


 total-times<331>                     | total-times<340> (bmux on)
=========================================================================
entrylk-total-time 5300.63            |entrylk-total-time 6675.02
lookup-total-time 5539.24             |lookup-total-time 6547.7
inodelk-total-time 5646.45            |inodelk-total-time 6791.9
finodelk-total-time 0.003457          |finodelk-total-time 0.003292
mkdir-total-time 23383                |mkdir-total-time 23599
setxattr-total-time 4341.38           |setxattr-total-time 5710.13
stat-total-time                       |stat-total-time 19.09
statfs-total-time 18.845              |statfs-total-time 42.8818

 total-calls<331>                     | total-calls<340> (bmux on)
 ========================================================================
entrylk-total-calls 34560612          |entrylk-total-calls 36000624
lookup-total-calls 18833403           |lookup-total-calls 18583905
inodelk-total-calls 37440300          |inodelk-total-calls 37440162
finodelk-total-calls 48               |finodelk-total-calls 48
mkdir-total-calls 17280072            |mkdir-total-calls 17280072
setxattr-total-calls 17280072         |setxattr-total-calls 17280072
stat-total-calls                      |stat-total-calls 122498
statfs-total-calls 145440             |statfs-total-calls 287064

Comment 26 Raghavendra G 2018-08-01 04:05:32 UTC
Created attachment 1471969 [details]
bmux-off-mkdir total time

Time is in seconds. Each line represents the total mkdir latency as seen at protocol/client in a setup with bmux off.

Comment 27 Raghavendra G 2018-08-01 04:06:51 UTC
Created attachment 1471970 [details]
bmux-on-mkdir-total-time

Time is in seconds. Each line represents the total mkdir latency as seen at protocol/client in a setup with bmux on.

Comment 31 Raghavendra G 2018-08-02 09:41:21 UTC
Created attachment 1472629 [details]
sorted mkdir latencies on brick stack with bmux on

Comment 32 Raghavendra G 2018-08-02 09:42:12 UTC
Created attachment 1472630 [details]
sorted mkdir latencies on brick stack with bmux off

Comment 56 Mohit Agrawal 2018-09-19 08:14:32 UTC
Hi,

I have tried to find out why there is a performance difference for mkdir between brick_mux and non-brick_mux.

I have found that the main root cause of this issue is lock contention in the io-threads xlator.

Here I set up a distribute-replicate environment (12x3).

In the non-brick-mux case, glusterd spawns 12 brick processes; each process spawns 4 poller threads to poll requests from the socket, and each brick process can scale up to 16 io-threads at peak.

In the brick_mux case, glusterd spawns a single brick process to handle all the bricks. By default that process spawns 1 poller thread per brick, and it also scales up to 16 io-threads per brick to handle fop requests. With 12 bricks, the single brick process can run up to 192 io-threads (16x12) at peak.


The main differences between the two (brick_mux vs non-brick-mux):

 1) poller threads (non-brick-mux: 4 poller threads per brick; brick_mux: 1 poller thread per brick)
 2) io-threads (non-brick-mux: 16 io-threads per process; brick_mux: up to 192 io-threads in a single process at peak)


Below is the current io-threads design:

1) We have 4 queues based on priority, and a single mutex for enqueue/dequeue and for signaling across all of the queues (see the sketch below).
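
A minimal sketch of that layout, with hypothetical names (an illustration, not the actual io-threads source):

#include <pthread.h>

#define IOT_PRI_MAX 4 /* four priority classes of queues */

struct iot_req {
    struct iot_req *next; /* queued fop; payload omitted for brevity */
};

/* Old layout: ONE mutex and ONE condvar guard all four queues, so every
 * enqueue, dequeue, and wakeup from every io-thread contends on the
 * same lock. */
struct iot_conf_old {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    struct iot_req *head[IOT_PRI_MAX];
    struct iot_req *tail[IOT_PRI_MAX];
};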

I have changed this as follows (a sketch of the reworked layout follows the list):

1) Introduce a separate lock for every queue.
2) Use atomic ops to know whether a queue has any data.
3) Use a separate lock for signaling.
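
A minimal sketch of the reworked layout under those three changes (again with hypothetical names, not the actual patch):

#include <pthread.h>
#include <stdatomic.h>

#define IOT_PRI_MAX 4

struct iot_req {
    struct iot_req *next;
};

/* Change 1: each priority queue carries its own lock.
 * Change 2: an atomic size lets workers test "any data?" without locking.
 * Change 3: a dedicated lock/condvar pair is used only for signaling. */
struct iot_queue {
    pthread_mutex_t lock;  /* protects head/tail of this queue only */
    struct iot_req *head, *tail;
    atomic_int      size;  /* lock-free emptiness check */
};

struct iot_conf_new {
    struct iot_queue queues[IOT_PRI_MAX];
    pthread_mutex_t  sleep_lock; /* taken only to signal idle workers */
    pthread_cond_t   sleep_cond;
};

static void iot_enqueue(struct iot_conf_new *conf, int pri,
                        struct iot_req *req)
{
    struct iot_queue *q = &conf->queues[pri];

    pthread_mutex_lock(&q->lock); /* contends only within this priority */
    req->next = NULL;
    if (q->tail)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
    atomic_fetch_add(&q->size, 1);
    pthread_mutex_unlock(&q->lock);

    /* Wake an idle worker; the signaling lock is never held while the
     * queue contents are being manipulated. */
    pthread_mutex_lock(&conf->sleep_lock);
    pthread_cond_signal(&conf->sleep_cond);
    pthread_mutex_unlock(&conf->sleep_lock);
}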


I executed the mkdir test case that QE was running, and I found that performance improved with the following configuration:

1) 4 io-threads
2) 37 server-event threads

I used 4 io-threads here because we have 4 queues, so every queue gets similar priority when multiple types of fops are requested. (In this test case, most of the fops belong to the same queue.)

37 server-event threads are needed because, together with the default poller thread per brick under brick_mux, that brings the process to 48 poller threads, similar to the non-brick-mux scenario. For now the server.event-threads option does not work perfectly with brick_mux, so I configured this number manually. (We need to fix that.)
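
For reference, these tunables map to standard volume options; a sketch of applying them, assuming a placeholder volume name "testvol" (and noting that, per the above, server.event-threads may not fully take effect under brick_mux yet):

# values are the ones used in this experiment
gluster volume set testvol performance.io-thread-count 4
gluster volume set testvol server.event-threads 37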

I think that to test this patch properly I need to generate a parallel workload with fops from multiple queues; otherwise it is difficult to measure the improvement.

For mkdir/rmdir with the smallfile tool, after applying the above configuration, I have not found a major regression on the current standard build.

Below are the numbers I got in my testing:

With the granular io-thread patch (with brick_mux):

1) 4 io-threads
   mkdir: 109 files/sec
   rmdir: 108 files/sec

2) 4 io-threads with 15 server threads
   mkdir: 118 files/sec
   rmdir: 114 files/sec

3) 4 io-threads with 37 server threads
   mkdir: 119 files/sec
   rmdir: 125 files/sec

Baseline (current 3.4 build, without brick_mux):
   mkdir: 122 files/sec
   rmdir: 136 files/sec

Currently available build (4 io-threads, 37 server threads, with brick_mux):
   mkdir: 118 files/sec
   rmdir: 117 files/sec

Before sharing any patch with QE, I have asked them to share the perf numbers with the above options on the current build.

Regards,
Mohit Agrawal

Comment 57 Nithya Balachandran 2018-10-08 07:35:57 UTC
As this regression is pronounced with brick mux enabled, I am moving it out of dht.

Comment 64 Sachin P Mali 2018-12-18 11:47:14 UTC
I am not aware of brick_mux being enabled by default in BU2. Regarding testing: there are some brick_mux-related fixes going into BU2, and because of that we have initiated a performance run with brick_mux enabled.

Comment 65 Amar Tumballi 2019-04-24 17:30:40 UTC
Sachin, I believe that with the 3.5.0 release this is no longer observed. I would like someone to revisit this bug and close it if the issue is not seen with 3.5.0 builds.

Comment 66 Sweta Anandpara 2019-04-25 10:50:44 UTC
Moving the needinfo. This will be revisited in the coming build(s) of RHGS 3.5.0. Will keep this bug updated with the progress.

Comment 89 Red Hat Bugzilla 2023-09-14 04:30:56 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days