I found that lookup seems to have regressed due to this fix. When an 'ls' is issued from multiple clients (4 in my case), it takes anywhere between 4-12 seconds to display the list (about 15 files). This seems to be due to 'other-eager-lock': when I disabled other-eager-lock, 'ls' took less than a second or two on average.

---------- default, i.e. other-eager-lock is enabled ----------
[root@dhcp37-49 ecode]# time ls
dir1                                        filett.dhcp37-49.lab.eng.blr.redhat.com
file.                                       filett.rhs-client18.lab.eng.blr.redhat.com
file1                                       filett.rhs-client19.lab.eng.blr.redhat.com
file.dhcp37-146.lab.eng.blr.redhat.com      glen
file.dhcp37-49.lab.eng.blr.redhat.com       tunc.dhcp37-146.lab.eng.blr.redhat.com
file.rhs-client18.lab.eng.blr.redhat.com    tunc.dhcp37-49.lab.eng.blr.redhat.com
file.rhs-client19.lab.eng.blr.redhat.com    tunc.rhs-client18.lab.eng.blr.redhat.com
filett.dhcp37-146.lab.eng.blr.redhat.com    tunc.rhs-client19.lab.eng.blr.redhat.com

real    0m7.944s
user    0m0.001s
sys     0m0.006s

---------- other-eager-lock is disabled ----------
[root@dhcp37-49 ecode]# time ls
dir1                                        filett.dhcp37-49.lab.eng.blr.redhat.com
file.                                       filett.rhs-client18.lab.eng.blr.redhat.com
file1                                       filett.rhs-client19.lab.eng.blr.redhat.com
file.dhcp37-146.lab.eng.blr.redhat.com      glen
file.dhcp37-49.lab.eng.blr.redhat.com       tunc.dhcp37-146.lab.eng.blr.redhat.com
file.rhs-client18.lab.eng.blr.redhat.com    tunc.dhcp37-49.lab.eng.blr.redhat.com
file.rhs-client19.lab.eng.blr.redhat.com    tunc.rhs-client18.lab.eng.blr.redhat.com
filett.dhcp37-146.lab.eng.blr.redhat.com    tunc.rhs-client19.lab.eng.blr.redhat.com

real    0m0.146s
user    0m0.002s
sys     0m0.002s
[root@dhcp37-49 ecode]#
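For completeness, a minimal sketch of how the option can be toggled between the two runs; the volume name "ecvol" is an assumption, substitute the actual volume:

    # Volume name "ecvol" is hypothetical; replace with the real volume.
    gluster volume set ecvol disperse.other-eager-lock off   # disable for the second run
    gluster volume set ecvol disperse.other-eager-lock on    # restore the default
    gluster volume get ecvol disperse.other-eager-lock       # confirm the current value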
The behavior with other-eager-lock disabled is expected. When it's enabled there shouldn't be any functional difference compared to the same version without the patch. If I understand correctly, to execute this test you are running an 'ls' in an infinite loop from 4 clients and then checking the time taken by one of them (or by a fifth 'ls' executed manually). In this case each 'ls' can block the directory for up to 1 second, causing all the other 'ls' calls to wait. Considering this, the time doesn't seem unexpected to me. Did this test take less time before the patch?
Xavi,

There were only 15 entries in the directory. He opened 4 consoles and mounted the same volume on 4 different mount points. He then executed the "ls" command in that directory at the same time from all the clients (pressing Enter once, using the broadcast feature of the terminal). It was not an infinite loop on any client. A single-host approximation of this test is sketched below.

I am not sure about the regression, i.e. whether the previous release without this patch was taking the same time or not. In any case, I think it is too much time to list only 15 entries. However, disabling other-eager-lock gives good performance.

Nag,
If you can also provide the numbers without this patch on the previous release, that would be great.
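A minimal sketch of that simultaneous lookup from a single host, assuming the volume is mounted at the hypothetical mount points /mnt/ec1../mnt/ec4 and the directory is named "testdir" (both names are placeholders):

    # Four mounts stand in for the four broadcast clients.
    for m in /mnt/ec1 /mnt/ec2 /mnt/ec3 /mnt/ec4; do
        ( cd "$m/testdir" && time ls > /dev/null ) 2>> /tmp/ls-times.log &
    done
    wait                    # all four 'ls' calls run concurrently
    cat /tmp/ls-times.log   # one real/user/sys triple per client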
If 'ls' is not executed in a loop, then anything beyond 4-5 seconds seems bad, but it should be the same time it took before the patch, so there shouldn't be any regression. Maybe self-heal is being triggered for some reason and is competing with the regular clients. I'll need to check this.
In that case we should execute the same steps with the self-heal daemon (shd) disabled. Profile info and other logs would also be helpful.
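For reference, a sketch of the commands that would collect that data; the volume name "ecvol" is again an assumption:

    # Volume name "ecvol" is hypothetical.
    gluster volume heal ecvol disable      # keep shd from competing with clients
    gluster volume heal ecvol info         # check whether any heals are pending
    gluster volume profile ecvol start     # begin collecting per-brick FOP stats
    # ... run the concurrent 'ls' test here ...
    gluster volume profile ecvol info      # dump per-brick latency and FOP counts
    gluster volume profile ecvol stop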
(In reply to Ashish Pandey from comment #16)
> Nag,
> If you can also provide the numbers without this patch on the previous
> release, that would be great.

I checked on 3.3.1-async, i.e. 3.8.4-54-3, and when triggered from 2 clients simultaneously (I didn't have 4 clients), it took less than 1 sec there.
As discussed above and over email, I have raised a new bug for the parallel lookup performance degradation (1577750 - severe drop in response time of simultaneous lookups with other-eager-lock enabled).

Moving this to VERIFIED on 3.12.2-9, as the different behaviors are available in the form of the other-eager-lock option.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607