I found that lookup seems to have regressed due to this fix. When an 'ls' is issued from multiple clients (4 in my case), it takes anywhere between 4-12 seconds to display the list (about 15 files). This seems to be due to 'other-eager-lock': when I disabled other-eager-lock, 'ls' took less than a second or two on average.

---------- default, i.e. other-eager-lock is enabled ----------
[root@dhcp37-49 ecode]# time ls
dir1                                        filett.dhcp37-49.lab.eng.blr.redhat.com
file.                                       filett.rhs-client18.lab.eng.blr.redhat.com
file1                                       filett.rhs-client19.lab.eng.blr.redhat.com
file.dhcp37-146.lab.eng.blr.redhat.com      glen
file.dhcp37-49.lab.eng.blr.redhat.com       tunc.dhcp37-146.lab.eng.blr.redhat.com
file.rhs-client18.lab.eng.blr.redhat.com    tunc.dhcp37-49.lab.eng.blr.redhat.com
file.rhs-client19.lab.eng.blr.redhat.com    tunc.rhs-client18.lab.eng.blr.redhat.com
filett.dhcp37-146.lab.eng.blr.redhat.com    tunc.rhs-client19.lab.eng.blr.redhat.com

real    0m7.944s
user    0m0.001s
sys     0m0.006s

---------- other-eager-lock is disabled ----------
[root@dhcp37-49 ecode]# time ls
dir1                                        filett.dhcp37-49.lab.eng.blr.redhat.com
file.                                       filett.rhs-client18.lab.eng.blr.redhat.com
file1                                       filett.rhs-client19.lab.eng.blr.redhat.com
file.dhcp37-146.lab.eng.blr.redhat.com      glen
file.dhcp37-49.lab.eng.blr.redhat.com       tunc.dhcp37-146.lab.eng.blr.redhat.com
file.rhs-client18.lab.eng.blr.redhat.com    tunc.dhcp37-49.lab.eng.blr.redhat.com
file.rhs-client19.lab.eng.blr.redhat.com    tunc.rhs-client18.lab.eng.blr.redhat.com
filett.dhcp37-146.lab.eng.blr.redhat.com    tunc.rhs-client19.lab.eng.blr.redhat.com

real    0m0.146s
user    0m0.002s
sys     0m0.002s
[root@dhcp37-49 ecode]#
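For completeness, a minimal sketch of how the option can be toggled between the two runs; the volume name "ecvol" is an assumption, substitute the actual volume:

    # Volume name "ecvol" is hypothetical; replace with the real volume.
    gluster volume set ecvol disperse.other-eager-lock off   # disable for the second run
    gluster volume set ecvol disperse.other-eager-lock on    # restore the default
    gluster volume get ecvol disperse.other-eager-lock       # confirm the current value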
The behavior with other-eager-lock disabled is expected. When it's enabled there shouldn't be any functional difference compared to the same version without the patch. If I understand correctly, to execute this test you are running an 'ls' in an infinite loop from 4 clients and then checking the time taken by one of them (or by a fifth 'ls' executed manually). In this case each 'ls' can block the directory for up to 1 second, causing all the other 'ls' calls to wait. Considering this, the time doesn't seem unexpected to me. Did this test take less time before the patch?
Xavi,

There were only 15 entries in the directory. He opened 4 consoles and mounted the same volume on 4 different mount points. He then executed the "ls" command in that directory at the same time from all the clients (pressing Enter once, using the broadcast feature of the terminal). It was not an infinite loop on any client. A single-host approximation of this test is sketched below.

I am not sure about the regression, i.e. whether the previous release without this patch was taking the same time or not. In any case, I think it is too much time to list only 15 entries. However, disabling other-eager-lock gives good performance.

Nag,
If you can also provide the numbers without this patch on the previous release, that would be great.
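A minimal sketch of that simultaneous lookup from a single host, assuming the volume is mounted at the hypothetical mount points /mnt/ec1../mnt/ec4 and the directory is named "testdir" (both names are placeholders):

    # Four mounts stand in for the four broadcast clients.
    for m in /mnt/ec1 /mnt/ec2 /mnt/ec3 /mnt/ec4; do
        ( cd "$m/testdir" && time ls > /dev/null ) 2>> /tmp/ls-times.log &
    done
    wait                    # all four 'ls' calls run concurrently
    cat /tmp/ls-times.log   # one real/user/sys triple per client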
If 'ls' is not executed in a loop, then anything beyond 4-5 seconds seems bad, but it should be the same time it took before the patch, so there shouldn't be any regression. Maybe self-heal is being triggered for some reason and is competing with the regular clients. I'll need to check this.
In that case we should execute the same steps with the self-heal daemon (shd) disabled. Profile info and other logs would also be helpful.
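For reference, a sketch of the commands that would collect that data; the volume name "ecvol" is again an assumption:

    # Volume name "ecvol" is hypothetical.
    gluster volume heal ecvol disable      # keep shd from competing with clients
    gluster volume heal ecvol info         # check whether any heals are pending
    gluster volume profile ecvol start     # begin collecting per-brick FOP stats
    # ... run the concurrent 'ls' test here ...
    gluster volume profile ecvol info      # dump per-brick latency and FOP counts
    gluster volume profile ecvol stop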
(In reply to Ashish Pandey from comment #16)
> Nag,
> If you can also provide the numbers without this patch on the previous
> release, that would be great.

I checked on 3.3.1-async, i.e. 3.8.4-54-3, and when triggered from 2 clients simultaneously (I didn't have 4 clients), it took less than 1 sec there.
As discussed above and over email, I have raised a new bug for the parallel lookup performance degradation (1577750 - severe drop in response time of simultaneous lookups with other-eager-lock enabled).

Moving this to VERIFIED on 3.12.2-9, as the different behaviors are available in the form of the other-eager-lock option.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607