Description of problem:
------------------------
Testbed : 12*(4+2), 6 servers, 6 workload-generating clients.
Benchmark : 3.1.3 with io-threads enabled. 3.2 testing was done with io-threads enabled and mdcache parameters set.

It looks like we have regressed with 3.2 on large-file writes/random writes:

****************** Sequential Writes ******************
3.1.3 : 2838601.16 kB/sec
3.2   : 2506687.55 kB/sec
Regression : ~12%

****************** Random Writes ******************
3.1.3 : 617384.17 kB/sec
3.2   : 480226.17 kB/sec
Regression : ~22%

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.8.4-10.el7rhgs.x86_64

How reproducible:
------------------
100%

Actual results:
----------------
Regressions on sequential and random large-file writes.

Expected results:
-----------------
Regression threshold is within +-10%.

Additional info:
----------------
> Server profile will be attached.
> Client and server OS : RHEL 7.3
> *Vol config* :

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 4a377ad9-0c87-4553-b45f-95ab0590c055
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas001.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas016.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick8: gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick9: gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick10: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick11: gqas001.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick12: gqas016.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick13: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick14: gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick15: gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick16: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick17: gqas001.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick18: gqas016.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick19: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick20: gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick21: gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick22: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick23: gqas001.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick24: gqas016.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick25: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick26: gqas009.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick27: gqas010.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick28: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick29: gqas001.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick30: gqas016.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick31: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick32: gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick33: gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick34: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick35: gqas001.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick36: gqas016.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick37: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick38: gqas009.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick39: gqas010.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick40: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick41: gqas001.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick42: gqas016.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick43: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick44: gqas009.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick45: gqas010.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick46: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick47: gqas001.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick48: gqas016.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick49: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick50: gqas009.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick51: gqas010.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick52: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick53: gqas001.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick54: gqas016.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick55: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick56: gqas009.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick57: gqas010.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick58: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick59: gqas001.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick60: gqas016.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick61: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick62: gqas009.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick63: gqas010.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas001.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas016.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick68: gqas009.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick69: gqas010.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas001.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas016.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.lookup-optimize: on
performance.client-io-threads: on
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
server.event-threads: 4
client.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-samba-metadata: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
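As a quick sanity check, the quoted regression percentages can be recomputed from the raw throughputs in the description (plain Python, just arithmetic on the numbers above):

```python
def regression_pct(baseline_kbs, new_kbs):
    """Percent throughput drop of the new build relative to the baseline."""
    return (baseline_kbs - new_kbs) / baseline_kbs * 100.0

seq_drop = regression_pct(2838601.16, 2506687.55)   # sequential writes, kB/sec
rand_drop = regression_pct(617384.17, 480226.17)    # random writes, kB/sec
print(f"sequential: {seq_drop:.1f}%  random: {rand_drop:.1f}%")
# sequential comes out at ~11.7% (reported as ~12%), random at ~22.2%
```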
After setting io-threads to off, I do not see any improvement (unlike on a *2 deployment).

Seq writes (io-threads off)    : 2581707.91 kB/sec
Random writes (io-threads off) : 485993.3 kB/sec
I tried on my laptop and could see a regression of around 22% for sequential writes using the dd command.

======================================
master -

[root@apandey testvol]# time dd if=/dev/urandom of=file count=1024000 bs=1024 conv=fdatasync
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 117.527 s, 8.9 MB/s

real    1m57.591s
user    0m0.680s
sys     0m18.235s
======================================
3.1.3 -

[root@apandey glusterfs]# cd /mnt/testvol/
[root@apandey testvol]# time dd if=/dev/urandom of=file count=1024000 bs=1024 conv=fdatasync
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 92.1061 s, 11.4 MB/s

real    1m32.116s
user    0m0.602s
sys     0m16.018s
======================================
profile for master -

   Block Size:    32768b+    65536b+
 No. of Reads:          0          0
No. of Writes:      19179          8

 %-latency   Avg-latency     Min-Latency     Max-Latency     No. of calls         Fop
 ---------   -----------     -----------     -----------     ------------        ----
      0.00       0.00 us         0.00 us         0.00 us                3      FORGET
      0.00       0.00 us         0.00 us         0.00 us               27     RELEASE
      0.00       0.00 us         0.00 us         0.00 us               50  RELEASEDIR
      0.00       2.17 us         2.00 us         3.00 us                6     OPENDIR
      0.00      31.00 us        31.00 us        31.00 us                1      STATFS
      0.00      42.67 us        28.00 us        58.00 us                3       FLUSH
      0.00     102.00 us        61.00 us       143.00 us                2     XATTROP
      0.00     270.00 us       270.00 us       270.00 us                1      CREATE
      0.00      30.08 us        23.00 us        45.00 us               12       FSTAT
      0.00     143.46 us        66.00 us       800.00 us               13        OPEN
      0.00      94.31 us        18.00 us       163.00 us               26    GETXATTR
      0.01     141.17 us        19.00 us       573.00 us               36      LOOKUP
      0.01     100.73 us        35.00 us       658.00 us               62    FXATTROP
      0.02     811.00 us       508.00 us      2052.00 us               12     READDIR
      0.10   58962.00 us     58962.00 us     58962.00 us                1    TRUNCATE
      0.63    2355.08 us        20.00 us     30162.00 us              158     INODELK
      5.82      51.46 us        31.00 us      5417.00 us            66437        READ
     18.37      51.48 us        28.00 us    433140.00 us           209666       WRITE
     75.03 22041646.50 us 4844215.00 us 39239078.00 us                  2       FSYNC

Profile for 3.1.3 -

   Block Size:    32768b+
 No. of Reads:          0
No. of Writes:         81

 %-latency   Avg-latency     Min-Latency     Max-Latency     No. of calls         Fop
 ---------   -----------     -----------     -----------     ------------        ----
      0.00       0.00 us         0.00 us         0.00 us                1     RELEASE
      0.00       0.00 us         0.00 us         0.00 us                3  RELEASEDIR
      0.00       2.00 us         2.00 us         2.00 us                2     OPENDIR
      0.00      35.00 us        35.00 us        35.00 us                1      STATFS
      0.00      52.00 us        52.00 us        52.00 us                1     XATTROP
      0.00      39.00 us        33.00 us        45.00 us                2       FLUSH
      0.00      42.75 us        27.00 us        51.00 us                4     INODELK
      0.00      88.50 us        40.00 us       137.00 us                2    FXATTROP
      0.00      48.25 us        19.00 us        96.00 us                4     READDIR
      0.00     349.00 us       349.00 us       349.00 us                1      CREATE
      0.00     103.25 us        25.00 us       238.00 us                4    GETXATTR
      0.00     102.45 us         9.00 us       291.00 us               11      LOOKUP
      5.51      46.41 us        28.00 us      5835.00 us            41210        READ
     23.69 8223891.00 us  8223891.00 us  8223891.00 us                  1       FSYNC
     70.79     196.77 us        25.00 us  19247646.00 us            124899       WRITE

Duration: 159 seconds
Data Read: 21098496 bytes
Data Written: 293858304 bytes

Clearly, the number of INODELK, FXATTROP and other fops has increased considerably on master.
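For reference, the ~22% figure follows directly from the two dd runs (1048576000 bytes copied in 117.527 s on master vs 92.1061 s on 3.1.3); a quick recomputation:

```python
BYTES_COPIED = 1048576000  # from the dd output above

master_mbs = BYTES_COPIED / 117.527 / 1e6   # dd reports decimal MB/s
v313_mbs = BYTES_COPIED / 92.1061 / 1e6
drop_pct = (v313_mbs - master_mbs) / v313_mbs * 100.0
print(f"master: {master_mbs:.1f} MB/s  3.1.3: {v313_mbs:.1f} MB/s  drop: {drop_pct:.1f}%")
# master lands at ~8.9 MB/s, 3.1.3 at ~11.4 MB/s, a drop of ~21.6%
```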
Possible RCA -

After implementing patch http://review.gluster.org/#/c/13733/, we set a dirty flag before writing to a file and remove it at the end. This creates an index entry in .glusterfs/indices/xattrop/ which remains there throughout the write fop. Every 60 seconds the shd wakes up, scans this entry and starts a heal. The heal in turn takes a lot of locks to find and heal the file, which raises the INODELK fop count and could be a possible culprit.

I disabled the shd and wrote a file -

time dd if=/dev/urandom of=a1 count=1024 bs=1M conv=fdatasync

The profile shows only 4 INODELK calls.

Brick: apandey:/brick/gluster/testvol-6
---------------------------------------
Cumulative Stats:
   Block Size:    32768b+    65536b+
 No. of Reads:          0          0
No. of Writes:       8188          2

 %-latency   Avg-latency     Min-Latency     Max-Latency     No. of calls         Fop
 ---------   -----------     -----------     -----------     ------------        ----
      0.00       0.00 us         0.00 us         0.00 us                1     RELEASE
      0.00      47.00 us        47.00 us        47.00 us                1      STATFS
      0.00      49.50 us        46.00 us        53.00 us                2       FLUSH
      0.00      38.00 us        26.00 us        52.00 us                4     INODELK
      0.00      92.50 us        85.00 us       100.00 us                2     XATTROP
      0.00     305.00 us       305.00 us       305.00 us                1      CREATE
      0.00     138.00 us        32.00 us       395.00 us                4    FXATTROP
      0.00     164.14 us       119.00 us       212.00 us                7      LOOKUP
      0.92      72.73 us        43.00 us      8431.00 us             8190       WRITE
     99.08 64142355.00 us 64142355.00 us 64142355.00 us                 1       FSYNC

With shd enabled it is around 54 -

Brick: apandey:/brick/gluster/testvol-1
---------------------------------------
Cumulative Stats:
   Block Size:    32768b+    65536b+
 No. of Reads:          0          0
No. of Writes:       8190          1

 %-latency   Avg-latency     Min-Latency     Max-Latency     No. of calls         Fop
 ---------   -----------     -----------     -----------     ------------        ----
      0.00       0.00 us         0.00 us         0.00 us                7     RELEASE
      0.00       0.00 us         0.00 us         0.00 us               21  RELEASEDIR
      0.00      30.00 us        30.00 us        30.00 us                1      STATFS
      0.00       5.76 us         2.00 us         9.00 us               21     OPENDIR
      0.00      64.50 us        30.00 us        99.00 us                2       FLUSH
      0.00      23.17 us        20.00 us        27.00 us                6       FSTAT
      0.00      95.50 us        89.00 us       102.00 us                2     XATTROP
      0.00     272.00 us       272.00 us       272.00 us                1      CREATE
      0.00      61.67 us        42.00 us        85.00 us                6        OPEN
      0.00      98.94 us        31.00 us       428.00 us               16    FXATTROP
      0.00      79.92 us        22.00 us       190.00 us               38      LOOKUP
      0.12    2379.48 us      1376.00 us      4600.00 us               42     READDIR
      0.74      74.70 us        42.00 us     49556.00 us             8191       WRITE
     10.29  163490.19 us        19.00 us   1405941.00 us               52     INODELK
     19.02  320668.04 us        26.00 us  15705174.00 us               49    GETXATTR
     69.83 57700430.00 us 57700430.00 us 57700430.00 us                 1       FSYNC
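The %-latency column in these profiles is simply each fop's share of the total brick-side latency (avg latency x call count). A minimal sketch over the four dominant fops from the master profile shown earlier; the shares come out slightly higher than the table's because the minor fops are omitted:

```python
# (avg latency in us, call count) for the dominant fops in the master profile
fops = {
    "READ": (51.46, 66437),
    "WRITE": (51.48, 209666),
    "INODELK": (2355.08, 158),
    "FSYNC": (22041646.50, 2),
}
total_us = sum(avg * n for avg, n in fops.values())
shares = {name: avg * n / total_us * 100.0 for name, (avg, n) in fops.items()}
for name, pct in shares.items():
    print(f"{name:8s} {pct:6.2f}%")   # FSYNC dominates at ~75%
```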
upstream patch : http://review.gluster.org/16377
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/95870/
Sequential write on 3.8.4-13 - 2709973.61 kB/sec, which is well within the perf regression threshold; verified.

Raised a separate BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1419549) for the random-write regression after discussion with Atin, Ashish and Rahul.
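Checking the verified build against the 3.1.3 baseline from the description confirms the drop is inside the +-10% threshold:

```python
baseline = 2838601.16   # 3.1.3 sequential writes, kB/sec
fixed = 2709973.61      # 3.8.4-13 sequential writes, kB/sec
drop_pct = (baseline - fixed) / baseline * 100.0
print(f"drop vs 3.1.3: {drop_pct:.2f}% (threshold: 10%)")
assert drop_pct < 10.0  # ~4.5%, well within threshold
```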
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html