Description of problem:

On my RHHI setup, cluster.eager-lock is enabled by default on all the volumes. When I disabled it and recorded the numbers, I observed an increase of approximately 20-25%. Here is the data:

Database used for benchmarking: MySQL
Database size: 100GB
16 vcpu / 32G memory in the database VM
5 minute run - 8G bufferpool

                 eager-lock on (default)  |  eager-lock off (all volumes)
No. of clients   n_overall     opm        |  n_overall     opm
10               1640          161        |  2501          221
20               1828          180        |  2957          291
40               1831          112        |  2926          288
80               1991          196        |  2833          279
99               2763          272        |  2714          267

Link to the Gluster profiling data:
http://perf1.perf.lab.eng.bos.redhat.com/Nikhil/gluster_profile/EagerLockOFF/
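For reference, the eager-lock setting can be inspected and toggled per volume with the stock gluster CLI; a minimal sketch, with the volume name as a placeholder:

# show the current value of the option (volname is a placeholder)
gluster volume get volname cluster.eager-lock

# disable / re-enable it
gluster volume set volname cluster.eager-lock off
gluster volume set volname cluster.eager-lock on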
(In reply to Nikhil Chawla from comment #0)
> 16 vcpu / 32G memory in the database VM
> 5 minute run - 8G bufferpool
>
>                  eager-lock on (default)  |  eager-lock off (all volumes)
> No. of clients   n_overall     opm        |  n_overall     opm
> 10               1640          161        |  2501          221
> 20               1828          180        |  2957          291
> 40               1831          112        |  2926          288
> 80               1991          196        |  2833          279
> 99               2763          272        |  2714          267

Sorry, I don't understand this metric. What are n_overall and opm? Just to be clear, {1640, 161} is with eager-lock on and {2501, 221} is with eager-lock off?

> Link to the Gluster profiling data:
> http://perf1.perf.lab.eng.bos.redhat.com/Nikhil/gluster_profile/EagerLockOFF/

I see profile info for different numbers of clients, but is it for eager-lock off or on? We would need both to compare and contrast.
(In reply to Ravishankar N from comment #5)
> Sorry, I don't understand this metric. What are n_overall and opm?
> Just to be clear, {1640, 161} is with eager-lock on and {2501, 221} is with
> eager-lock off?

I am using a MySQL simulator which runs MySQL queries against the DB. n_overall is the total number of transactions done in the 5 minutes of run time, and "opm" is the average transactions per minute. Yes, the higher numbers were obtained after disabling eager-lock.

> I see profile info for different numbers of clients, but is it for eager-lock
> off or on? We would need both to compare and contrast.

This is for eager-lock OFF. I don't have the profiling data with eager-lock ON, and I can't generate it right away, because I would have to regenerate the database for that, which will take 3-4 days. In the meantime I am trying to flush the MySQL caches from all sources and regenerate the profiling data with eager-lock ON.
(In reply to Nikhil Chawla from comment #6)
> I am using a MySQL simulator which runs MySQL queries against the DB.
> n_overall is the total number of transactions done in the 5 minutes of run
> time, and "opm" is the average transactions per minute. Yes, the higher
> numbers were obtained after disabling eager-lock.

Thanks Nikhil. It is strange, though, that (n_overall divided by 5) does not seem to be anywhere near the opm values.

> This is for eager-lock OFF. I don't have the profiling data with eager-lock
> ON, and I can't generate it right away, because I would have to regenerate
> the database for that, which will take 3-4 days.

Cool, I will wait for the profile info from both runs. You could just run the test for '10 clients' with both eager-lock on and off. Maybe something like:

1. gluster volume set volname cluster.eager-lock off
2. gluster volume profile volname info clear
3. Run the test for 10 clients, let it complete.
4. Capture `gluster volume profile volname info`.
5. gluster volume set volname cluster.eager-lock on
6. Repeat steps 2 to 4.

> In the meantime I am trying to flush the MySQL caches from all sources and
> regenerate the profiling data with eager-lock ON.

Maybe restarting the VM and the gluster volumes on each run might help to remove caching.
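Putting the steps from comment #7 together, a minimal sketch of the full sequence. This assumes profiling is started once up front with `gluster volume profile volname start`; the volume name and the benchmark invocation (run_mysql_benchmark) are placeholders:

VOL=volname                                # placeholder volume name

gluster volume profile $VOL start          # enable profiling once

for mode in off on; do
    gluster volume set $VOL cluster.eager-lock $mode
    gluster volume profile $VOL info clear           # reset the counters
    run_mysql_benchmark --clients 10                 # hypothetical benchmark command
    gluster volume profile $VOL info > profile_eager_lock_$mode.txt
done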
(In reply to Ravishankar N from comment #7)
> Thanks Nikhil. It is strange, though, that (n_overall divided by 5) does not
> seem to be anywhere near the opm values.

Yup, that's strange; I am also observing it for the first time.

> Maybe restarting the VM and the gluster volumes on each run might help to
> remove caching.

Restarting the VM didn't help. So I created a new DB with 10 GB of data, and here are the performance numbers:

eager-lock ON:
n_overall : 18570
opm       : 1661

eager-lock OFF:
n_overall : 23516
opm       : 4634

The Gluster profile info for both cases is available here:
http://perf1.perf.lab.eng.bos.redhat.com/Nikhil/gluster_profile/
(In reply to Nikhil Chawla from comment #8)
> And the Gluster profile info for both cases is available here:
> http://perf1.perf.lab.eng.bos.redhat.com/Nikhil/gluster_profile/

Sorry for the delay, Nikhil. Comparing the profile info of eager-lock ON vs OFF, there is no significant difference in the *percentage* change in numbers for a given FOP between the two cases. However, in both cases FINODELK seems to account for the maximum % latency (around 77%), which should not be the case for a write-intensive workload (MySQL). Was the VM disk on which the test was run pre-allocated or thinly provisioned? If thinly provisioned, can you re-run the test with a pre-allocated disk and see if that makes a difference in the performance? I think the shard xlator sends fxattrops (involving FINODELKs) to update the size and block count when creating the individual shards, so preallocating the image beforehand could mitigate that behaviour.
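For reference, a fully pre-allocated image can be created up front with qemu-img; a minimal sketch, assuming a qcow2 (or raw) image, with path and size as placeholders:

# create a fully pre-allocated qcow2 image (path and size are placeholders)
qemu-img create -f qcow2 -o preallocation=full /rhgs/vmstore/db-disk.qcow2 100G

# for a raw image, falloc-based preallocation achieves a similar effect
qemu-img create -f raw -o preallocation=falloc /rhgs/vmstore/db-disk.img 100G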
http://perf1.perf.lab.eng.bos.redhat.com/Nikhil/gluster_profile/LATEST/
https://code.engineering.redhat.com/gerrit/133659 storage/posix: Add active-fd-count option in gluster
https://code.engineering.redhat.com/gerrit/133660 cluster/afr: Switch to active-fd-count for open-fd checks
https://code.engineering.redhat.com/gerrit/131944 cluster/afr: Remove unused code paths
https://code.engineering.redhat.com/gerrit/131945 cluster/afr: Make AFR eager-locking similar to EC
fio sequential write results:

from VM with eager-lock on:  write throughput: 2 MB/s
from VM with eager-lock off: write throughput: 351 MB/s
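The exact fio job is not recorded here; a representative sequential-write job that could reproduce this from inside the VM might look like the following (the file path, size, block size, and queue depth are assumptions):

# sequential write with direct I/O, to bypass the guest page cache;
# /data/fio.test is a placeholder path on the gluster-backed disk
fio --name=seqwrite --filename=/data/fio.test --rw=write \
    --bs=1M --size=10g --ioengine=libaio --iodepth=16 --direct=1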
Latest performance numbers with RHHI 1.1 and RHHI 2.0 (with the eager-lock fixes):

---------------
RHHI 1.1

Eager-lock ON : 2.037 MB/s
Eager-lock OFF: 107 MB/s
---------------
RHHI 2.0

Eager-lock ON : 291 MB/s
Eager-lock OFF: 34.0 MB/s
---------------

Iostat logs and fio results are available here:
http://file.blr.redhat.com/nichawla/work/RHHI_Seq_RUNS/
(In reply to Nikhil Chawla from comment #21)
> RHHI 1.1
> Eager-lock ON : 2.037 MB/s
> Eager-lock OFF: 107 MB/s
>
> RHHI 2.0
> Eager-lock ON : 291 MB/s
> Eager-lock OFF: 34.0 MB/s

Looks like there is a regression with eager-lock off? I suspect this is because of the cluster.full-lock option that we introduced to prevent split-brains. Could you do a test with RHHI 2.0 with eager-lock off and also cluster.full-lock off, and post the result?
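A sketch of the requested test configuration, assuming cluster.full-lock is a boolean volume option as in upstream AFR (the volume name is a placeholder):

gluster volume set volname cluster.eager-lock off
gluster volume set volname cluster.full-lock no    # assumed boolean no/yes values
# run the sequential-write test, then restore the defaults:
gluster volume set volname cluster.full-lock yes
gluster volume set volname cluster.eager-lock on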
Tested this BZ on 3.12.2.16, running sequential writes over a 3-way replicated volume.

gluster v set testvol cluster.eager-lock on
iteration 1: 1703305.84 KB/sec
iteration 2: 1693176.39 KB/sec

gluster v set testvol cluster.eager-lock off
iteration 1: 1715066.19 KB/sec
iteration 2: 1703044.18 KB/sec

Hence marking it verified.
(In reply to Karan Sandha from comment #28)
> Tested this BZ on 3.12.2.16, running sequential writes over a 3-way
> replicated volume.
> ...
> Hence marking it verified.

Was the test done in a VM on a RHHI setup?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607