Bug 1530519

Summary: disperse eager-lock degrades performance for file create workloads
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Ashish Pandey <aspandey>
Component: disperse    Assignee: Xavi Hernandez <jahernan>
Status: CLOSED ERRATA QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.3CC: amukherj, aspandey, asriram, bugs, jahernan, mpillai, nchilaka, pkarampu, rhinduja, rhs-bugs, sheggodu, srmukher, storage-qa-internal, ubansal
Target Milestone: ---    Keywords: Triaged
Target Release: RHGS 3.4.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.12.2-5 Doc Type: Enhancement
Doc Text:
Previously, the eager locking option provided good performance for file access, but directory access suffered in some use cases when eager locking was enabled. To overcome this problem, a new option 'other-eager-lock' is introduced. This option keeps eager locking enabled for regular files while disabling it for directory accesses. As a result, use cases where directories are accessed from multiple clients can benefit from disabling eager locking for directories without losing performance on file accesses.
Story Points: ---
Clone Of: 1502610 Environment:
Last Closed: 2018-09-04 06:40:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1502610    
Bug Blocks: 1502455, 1503137, 1512460    

Comment 14 Nag Pavan Chilakam 2018-03-26 12:37:28 UTC
I found that lookup seems to have regressed due to this fix.
When an ls is issued from multiple clients (4 in my case), it takes anywhere between 4-12 seconds to display the list (about 15 files).
This seems to be due to 'other-eager-lock'.

When I disabled "other-eager-lock", `ls` takes less than a second or two on average.
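
(For reference, toggling the option would look roughly like this; a sketch assuming the option is exposed as disperse.other-eager-lock, with <volname> standing in for the actual volume name.)

# gluster volume set <volname> disperse.other-eager-lock off   <-- disable (what I tested)
# gluster volume set <volname> disperse.other-eager-lock on    <-- back to the default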


---------- default, i.e. other-eager-lock is enabled ----------
[root@dhcp37-49 ecode]# time ls 
dir1                                      filett.dhcp37-49.lab.eng.blr.redhat.com
file.                                     filett.rhs-client18.lab.eng.blr.redhat.com
file1                                     filett.rhs-client19.lab.eng.blr.redhat.com
file.dhcp37-146.lab.eng.blr.redhat.com    glen
file.dhcp37-49.lab.eng.blr.redhat.com     tunc.dhcp37-146.lab.eng.blr.redhat.com
file.rhs-client18.lab.eng.blr.redhat.com  tunc.dhcp37-49.lab.eng.blr.redhat.com
file.rhs-client19.lab.eng.blr.redhat.com  tunc.rhs-client18.lab.eng.blr.redhat.com
filett.dhcp37-146.lab.eng.blr.redhat.com  tunc.rhs-client19.lab.eng.blr.redhat.com

real    0m7.944s
user    0m0.001s
sys    0m0.006s

---------- non-default, i.e. other-eager-lock is disabled ----------
[root@dhcp37-49 ecode]# time ls 
dir1                                      filett.dhcp37-49.lab.eng.blr.redhat.com
file.                                     filett.rhs-client18.lab.eng.blr.redhat.com
file1                                     filett.rhs-client19.lab.eng.blr.redhat.com
file.dhcp37-146.lab.eng.blr.redhat.com    glen
file.dhcp37-49.lab.eng.blr.redhat.com     tunc.dhcp37-146.lab.eng.blr.redhat.com
file.rhs-client18.lab.eng.blr.redhat.com  tunc.dhcp37-49.lab.eng.blr.redhat.com
file.rhs-client19.lab.eng.blr.redhat.com  tunc.rhs-client18.lab.eng.blr.redhat.com
filett.dhcp37-146.lab.eng.blr.redhat.com  tunc.rhs-client19.lab.eng.blr.redhat.com

real    0m0.146s
user    0m0.002s
sys    0m0.002s
[root@dhcp37-49 ecode]#

Comment 15 Xavi Hernandez 2018-03-26 12:59:17 UTC
The behavior with other-eager-lock disabled is expected. When it's enabled there shouldn't be any functional difference compared to the same version without the patch.

If I understand correctly, to execute this test you are running an 'ls' in an infinite loop from 4 clients and then checking the time taken by one of them (or another fifth ls executed manually). In this case each 'ls' can block the directory for up to 1 second, causing all other 'ls' to have to wait. Considering this, the time doesn't seem unexpected to me.
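
To double-check which configuration was active during each run, the current values can be queried, for example (a sketch; <volname> is a placeholder, and it assumes the new option is registered as disperse.other-eager-lock):

# gluster volume get <volname> disperse.eager-lock
# gluster volume get <volname> disperse.other-eager-lock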

Did this test take less time before the patch?

Comment 16 Ashish Pandey 2018-03-27 10:14:10 UTC
Xavi,

There were only 15 entries in the directory.
He opened 4 consoles and mounted the same volume on 4 different mount points.
Then he executed the "ls" command in that directory from all the clients at the same time (pressing enter once, using the terminal's broadcast feature).

It was not an infinite loop on any client.
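
Something like the following (a rough sketch; the mount paths are placeholders) should reproduce the same simultaneous access without terminal broadcast:

for m in /mnt/client1 /mnt/client2 /mnt/client3 /mnt/client4; do
    ( cd "$m/ecode" && time ls > /dev/null ) &   # start all four 'ls' at (almost) the same instant
done
wait                                             # collect the four timings from stderr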

I am not sure about the regression, i.e. whether the previous release without this patch was taking the same time or not.

In any case, I think it is too much time just to list 15 entries.
However, disabling other-eager-lock gives good performance.

Nag,
If you can also provide the numbers from the previous release without this patch, that would be great.

Comment 17 Xavi Hernandez 2018-03-27 10:27:26 UTC
If 'ls' is not executed in a loop, then anything beyond 4-5 seconds seems bad, but it should take the same time as before the patch, so there shouldn't be any regression.

Maybe self-heal is being triggered for some reason and is competing with regular clients. I'll need to check this.

Comment 18 Ashish Pandey 2018-03-27 10:58:55 UTC

In that case we should execute the same steps with the self-heal daemon (shd) disabled.
Profile info and other logs will also be helpful.
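
Something along these lines should do it (a sketch; <volname> is a placeholder for the volume name):

# gluster volume heal <volname> disable     <-- stop self-heal daemon activity for the duration of the test
# gluster volume profile <volname> start    <-- start collecting per-brick FOP statistics
  (re-run the simultaneous 'ls' test from all clients)
# gluster volume profile <volname> info     <-- dump latencies and call counts per brick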

Comment 19 Nag Pavan Chilakam 2018-03-27 11:46:48 UTC
(In reply to Ashish Pandey from comment #16)
> Xavi,
> 
> There were only 15 entries in the directory.
> He opened 4 consoles and mounted the same volume on 4 different mount
> points.
> Then he executed the "ls" command in that directory from all the clients
> at the same time (pressing enter once, using the terminal's broadcast
> feature).
> 
> It was not an infinite loop on any client.
> 
> I am not sure about the regression, i.e. whether the previous release
> without this patch was taking the same time or not.
> 
> In any case, I think it is too much time just to list 15 entries.
> However, disabling other-eager-lock gives good performance.
> 
> Nag,
> If you can also provide the numbers from the previous release without this
> patch, that would be great.


I checked in 3.3.1-async, i.e. 3.8.4-54-3, and when triggered from 2 clients simultaneously (I didn't have 4 clients), it took less than 1 sec there.

Comment 25 Nag Pavan Chilakam 2018-05-14 14:18:06 UTC
As discussed above and over email, I have raised a new bug for the parallel lookup performance degradation (1577750 - severe drop in response time of simultaneous lookups with other-eager-lock enabled).
Moving this to VERIFIED on 3.12.2-9, as the separate behaviour is now available in the form of the other-eager-lock option.

Comment 29 errata-xmlrpc 2018-09-04 06:40:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607