Bug 1530519 - disperse eager-lock degrades performance for file create workloads
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse (Show other bugs)
3.3
x86_64 Linux
unspecified Severity high
: ---
: RHGS 3.4.0
Assigned To: Xavi Hernandez
nchilaka
: Triaged
Depends On: 1502610
Blocks: 1502455 1503137 1512460
Reported: 2018-01-03 03:30 EST by Ashish Pandey
Modified: 2018-09-17 10:07 EDT (History)
14 users

See Also:
Fixed In Version: glusterfs-3.12.2-5
Doc Type: Enhancement
Doc Text:
Previously, the eager locking option provided good performance for file access; however, directory access suffered when eager-lock was enabled in some use cases. To overcome this problem, a new option 'other-eager-lock' is introduced. This option keeps eager-locking enabled for regular files but disabled for directory accesses. As a result, use cases where directories are accessed from multiple clients can benefit from disabling eager-locking for directories without losing performance on file accesses.
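The option can be toggled per volume through the standard gluster CLI. A sketch, assuming the upstream option key 'disperse.other-eager-lock' (the volume name 'ecvol' is hypothetical):

```shell
# Disable eager-locking for non-regular-file accesses (e.g. directories);
# eager-locking for regular files (disperse.eager-lock) stays enabled.
gluster volume set ecvol disperse.other-eager-lock off

# Check the current value of the option
gluster volume get ecvol disperse.other-eager-lock

# Re-enable (the default)
gluster volume set ecvol disperse.other-eager-lock on
```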
Story Points: ---
Clone Of: 1502610
Environment:
Last Closed: 2018-09-04 02:40:51 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 02:42 EDT

Comment 14 nchilaka 2018-03-26 08:37:28 EDT
I found that lookup seems to have regressed due to this fix.
When an `ls` is issued from multiple clients (4 in my case), it takes anywhere between 4-12 seconds to display the list (about 15 files).
This seems to be due to 'other-eager-lock'.

When I disabled "other-eager-lock", `ls` takes less than a second or two on average.


----------default, i.e. other-eager-lock is enabled---------------
[root@dhcp37-49 ecode]# time ls 
dir1                                      filett.dhcp37-49.lab.eng.blr.redhat.com
file.                                     filett.rhs-client18.lab.eng.blr.redhat.com
file1                                     filett.rhs-client19.lab.eng.blr.redhat.com
file.dhcp37-146.lab.eng.blr.redhat.com    glen
file.dhcp37-49.lab.eng.blr.redhat.com     tunc.dhcp37-146.lab.eng.blr.redhat.com
file.rhs-client18.lab.eng.blr.redhat.com  tunc.dhcp37-49.lab.eng.blr.redhat.com
file.rhs-client19.lab.eng.blr.redhat.com  tunc.rhs-client18.lab.eng.blr.redhat.com
filett.dhcp37-146.lab.eng.blr.redhat.com  tunc.rhs-client19.lab.eng.blr.redhat.com

real    0m7.944s
user    0m0.001s
sys    0m0.006s

----------other-eager-lock is disabled---------------
[root@dhcp37-49 ecode]# time ls 
dir1                                      filett.dhcp37-49.lab.eng.blr.redhat.com
file.                                     filett.rhs-client18.lab.eng.blr.redhat.com
file1                                     filett.rhs-client19.lab.eng.blr.redhat.com
file.dhcp37-146.lab.eng.blr.redhat.com    glen
file.dhcp37-49.lab.eng.blr.redhat.com     tunc.dhcp37-146.lab.eng.blr.redhat.com
file.rhs-client18.lab.eng.blr.redhat.com  tunc.dhcp37-49.lab.eng.blr.redhat.com
file.rhs-client19.lab.eng.blr.redhat.com  tunc.rhs-client18.lab.eng.blr.redhat.com
filett.dhcp37-146.lab.eng.blr.redhat.com  tunc.rhs-client19.lab.eng.blr.redhat.com

real    0m0.146s
user    0m0.002s
sys    0m0.002s
[root@dhcp37-49 ecode]#
Comment 15 Xavi Hernandez 2018-03-26 08:59:17 EDT
The behavior with other-eager-lock disabled is expected. When it's enabled there shouldn't be any functional difference compared to the same version without the patch.

If I understand correctly, to execute this test you are running an `ls` in an infinite loop from 4 clients and then checking the time taken by one of them (or by a fifth `ls` executed manually). In this case each `ls` can block the directory for up to 1 second, causing all the other `ls` invocations to wait. Considering this, the time doesn't seem unexpected to me.

Did this test take less time before the patch ?
Comment 16 Ashish Pandey 2018-03-27 06:14:10 EDT
Xavi,

There were only 15 entries in the directory.
He opened 4 consoles and mounted the same volume on 4 different mount points.
Then he executed the "ls" command in that directory at the same time from all the clients (pressed enter using the broadcast feature of the terminal).

It was not an infinite loop on any client.

I am not sure about the regression and if the previous release without this patch was taking the same time or not.

I think in any case that is too much time to list only 15 entries.
However, disabling other-eager-lock gives good performance.

Nag,
If you can also provide the number without this patch in previous release then that would be great.
Comment 17 Xavi Hernandez 2018-03-27 06:27:26 EDT
If `ls` is not executed in a loop, then anything beyond 4-5 seconds seems bad, but it should take the same time as before the patch, so there shouldn't be any regression.

Maybe self-heal is being triggered for some reason and is competing with regular clients. I'll need to check this.
Comment 18 Ashish Pandey 2018-03-27 06:58:55 EDT

In that case we should execute the same steps with shd (the self-heal daemon) disabled.
Profile info and other logs would also be helpful.
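The per-brick fop statistics mentioned above can be gathered with gluster's profiling commands; a sketch (the volume name 'ecvol' is hypothetical, and the exact option key for disabling the self-heal daemon may vary by version):

```shell
# Temporarily stop background healing for the test run
# (option key assumed; it may be cluster.disperse-self-heal-daemon
# on some versions)
gluster volume set ecvol self-heal-daemon off

# Start collecting per-brick latency/fop statistics
gluster volume profile ecvol start

# ... run the concurrent 'ls' test from all clients ...

# Dump cumulative and interval statistics, then stop profiling
gluster volume profile ecvol info
gluster volume profile ecvol stop

# Re-enable healing afterwards
gluster volume set ecvol self-heal-daemon on
```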
Comment 19 nchilaka 2018-03-27 07:46:48 EDT
(In reply to Ashish Pandey from comment #16)
> Xavi,
> 
> There were only 15 entries in a directory.
> He opened 4 console and mounted same volume on 4 different mount point.
> Then he executed the "ls" command in that directory at the same time
> (pressed enter and using broadcast feature of the terminal) from all the
> clients.
> 
> It was not an infinite loop on any client.
> 
> I am not sure about the regression and if the previous release without this
> patch was taking the same time or not.
> 
> I think in any case it is too much time to list 15 entries only.
> However, disabling other-eager-lock is giving good performance.
> 
> Nag,
> If you can also provide the number without this patch in previous release
> then that would be great.


I checked in 3.3.1-async, i.e. 3.8.4-54-3, and when triggered from 2 clients simultaneously (I didn't have 4 clients), it took less than 1 sec there.
Comment 25 nchilaka 2018-05-14 10:18:06 EDT
As discussed above and over email, I have raised a new bug for the parallel lookup perf degradation (1577750 - severe drop in response time of simultaneous lookups with other-eager-lock enabled).
Moving this to verified on 3.12.2-9, as the different options are available in the form of other-eager-lock.
Comment 29 errata-xmlrpc 2018-09-04 02:40:51 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607
