Description of problem:
-----------------------
Smallfile appends are off target by 25% on the latest 3.2 bits:

*3.1.3*        : 1708 files/sec
*3.2/3.8.4-13* : 1265 files/sec
Regression     : ~-25%

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-server-3.8.4-13.el7rhgs.x86_64

How reproducible:
-----------------
Every which way I try.

Actual results:
---------------
Smallfile appends are 25% slower.

Expected results:
-----------------
Regression threshold: 25%

Additional info:
----------------
Volume Name: rep3
Type: Distributed-Replicate
Volume ID: 390622f1-9ff1-4d04-b342-5492254f25fe
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x 3 = 36
Transport-type: tcp
Bricks:
Brick1: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick5: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick6: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick7: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick8: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick9: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick10: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick11: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick12: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick13: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick14: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick15: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick16: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick17: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick18: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick19: gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick20: gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick21: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick22: gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick23: gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick24: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick25: gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick26: gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick27: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick28: gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick29: gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick30: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick31: gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick32: gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick33: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick34: gqas005.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick35: gqas006.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick36: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-samba-metadata: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
Created attachment 1245018 [details] 3.2_ServerProfile
*EXACT WORKLOAD*:

Drop kernel caches, then:

python /small-files/smallfile/smallfile_cli.py --operation append --threads 8 --file-size 64 --files 10000 --top <mount point> --host-set <comma separated host list>

The run does creates first, though, to build the data set:

python /small-files/smallfile/smallfile_cli.py --operation create --threads 8 --file-size 64 --files 10000 --top <mount point> --host-set <comma separated host list>
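For reference, the "drop kernel caches" step between runs can be scripted. Below is a minimal sketch; the `drop_caches` helper is hypothetical (not part of smallfile), needs root for the real `/proc` path, and is equivalent to `sync; echo 3 > /proc/sys/vm/drop_caches`:

```python
import os

def drop_caches(proc_path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then drop page cache, dentries and inodes.

    Hypothetical helper, not part of smallfile. The proc_path
    parameter exists only so the helper can be exercised against a
    scratch file; in real use it must be the /proc path and the
    caller must be root.
    """
    os.sync()                  # flush dirty pages to disk first
    with open(proc_path, "w") as f:
        f.write("3\n")         # 3 = pagecache + dentries + inodes
```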
Switching io-threads off made appends slightly worse: 1192 files/sec. [I thought I had already noted this in the description for some reason.]
I know this is not what was asked, but I could reproduce this on another setup as well, and I see high latencies on LOOKUP there too. I'll fetch the straces as soon as I can.
I see an increase in the number of GETXATTR calls between 3.1.3 and 3.2 - from 6/8 to almost 600 - but I am not sure whether it is big enough to cause a perf regression.
Ambarish, I analysed the strace files you gave. Based on my analysis, the number of syscalls is lower in 3.2.0, but the overall time it takes to execute them is higher in 3.2.0 than in 3.1.3. In particular, lgetxattr of trusted.gfid takes much longer in 3.2.0 than in 3.1.3. Here is the comparison:

On 3.1.3: trusted.gfid: 62.3767 seconds, 208291 calls
On 3.2.0: trusted.gfid: 136.881 seconds, 195520 calls

So for some reason, even though the number of calls is lower in 3.2.0, the time spent in getxattr is higher, on the same RHEL version. We should probably involve Brian as well?

Brian,
What should be debugged further to find the root cause of the extra delay? Could you advise?
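Per-xattr call counts and total times like the ones above can be extracted from `strace -T` output mechanically. A hedged sketch follows; the regex and the sample lines are illustrative only (real strace output varies with version and flags), and `xattr_times` is my own helper name:

```python
import re
from collections import defaultdict

# Matches getxattr/lgetxattr lines from `strace -T` output; illustrative
# pattern only -- real strace formatting may differ slightly by version.
PAT = re.compile(r'l?getxattr\("[^"]*", "([^"]+)".*<([\d.]+)>$')

def xattr_times(lines):
    """Return {xattr name: (call count, total seconds)} from strace -T lines."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        m = PAT.search(line.strip())
        if m:
            stats[m.group(1)][0] += 1
            stats[m.group(1)][1] += float(m.group(2))
    return {k: (c, round(t, 6)) for k, (c, t) in stats.items()}

# Fabricated sample lines, just to show the shape of the output:
sample = [
    'lgetxattr("/bricks1/brick/f1", "trusted.gfid", 0x7ffc, 16) = 16 <0.000420>',
    'lgetxattr("/bricks1/brick/f2", "trusted.gfid", 0x7ffc, 16) = 16 <0.000380>',
    'lgetxattr("/bricks1/brick/f1", "trusted.glusterfs.dht", 0x7ffc, 16) = 16 <0.000100>',
]
print(xattr_times(sample))
```

Running the same parse over the 3.1.3 and 3.2.0 capture files side by side would give the per-version totals directly.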
(In reply to Pranith Kumar K from comment #14)
...
> Brian,
> What should be debugged further to find RC as to why there is more
> delay? Could you advise?

If the same RHEL version is involved, I assume the tests use the exact same volume configuration as well..? E.g., there are no default volume options being added or removed between the two tests?

That aside, there are a few approaches you could use to narrow this down. One is to look at ftrace or perf data, though I'm not sure how much that would help if the difference is in userspace. You could also pull translators out of the newer configuration (or actually both) one at a time and compare, to isolate where the regression lives. Finally, if you can't get anywhere with that, I suspect a git bisect will point out where the regression was introduced, given that you can consistently reproduce it.
I ran a scaled-down test (--files=1000 instead of 10000) on Fedora VMs on my laptop and was able to see slightly better performance for rhgs 3.2 when compared to rhgs 3.1.3.

Volume = 2x3 dist-rep volume using 3 Fedora 22 VMs (with default volume options).
Client = fuse mount on a Fedora 24 VM.

Test ran:
1. Create the files:
# python smallfile_cli.py --operation create --threads 8 --file-size 64 --files 1000 --top /mnt/fuse_mount
2. umount, drop caches and remount.
3. Append to the files:
# python smallfile_cli.py --operation append --threads 8 --file-size 64 --files 1000 --top /mnt/fuse_mnt

3.1.3 results from 2 test runs for append:
------------------------------------------
100.00% of requested files processed, minimum is 90.00
159.136881 sec elapsed time, 50.271188 files/sec, 50.271188 IOPS, 3.141949 MB/sec
100.00% of requested files processed, minimum is 90.00
161.230330 sec elapsed time, 49.618456 files/sec, 49.618456 IOPS, 3.101153 MB/sec

3.2 results from 2 test runs for append:
----------------------------------------
100.00% of requested files processed, minimum is 90.00
137.299699 sec elapsed time, 58.266697 files/sec, 58.266697 IOPS, 3.641669 MB/sec
100.00% of requested files processed, minimum is 90.00
137.465230 sec elapsed time, 58.196535 files/sec, 58.196535 IOPS, 3.637283 MB/sec

So it is around 50 files/sec in 3.1.3 vs 58 files/sec on 3.2.
I ran the tests on the perf machines using a 2x3 setup.

1) 3.1.3 with default volume options
====================================
Did 3 runs:
426.198634 files/sec <--- This can be ignored since XFS was freshly formatted.
291.988836 files/sec
299.597521 files/sec

2) rhgs-3.2 with default volume options (HEAD @ "ganesha/scripts : restart pcs cluster during add node")
========================================================================================================
Did 3 runs:
300.858783 files/sec
278.304994 files/sec
278.231024 files/sec

3) rhgs-3.2 with the following options:
=======================================
gluster volume set testvol features.cache-invalidation on
gluster volume set testvol features.cache-invalidation-timeout 600
gluster volume set testvol performance.stat-prefetch on
gluster volume set testvol performance.cache-samba-metadata on
gluster volume set testvol performance.cache-invalidation on
gluster volume set testvol performance.md-cache-timeout 600

Did 1 run:
296.241372 files/sec

Between 1 and 2, the perf drop is ~7%. Between 1 and 3, there is no drop.

Will also test on a 12x3 setup tomorrow.
Ran the tests on a 12x3 dist-rep volume with the following options set:

features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
transport.address-family: inet
performance.readdir-ahead: on

1) rhgs-3.1.3: Ran 3 iterations:
1197.813533 files/sec
1101.542113 files/sec
1084.906356 files/sec
Avg = 1128

2) rhgs-3.2: Ran 3 iterations:
1162.215698 files/sec
1075.400852 files/sec
1028.074938 files/sec
Avg = 1089

3) rhgs-3.2 with the 2 extra md-cache options (2 iterations):
performance.cache-invalidation: on
performance.md-cache-timeout: 600
1053.610636 files/sec
1044.966673 files/sec
Avg = 1049

Perf drop between avg values of 1 and 2 = 3.5%
Perf drop between avg values of 1 and 3 = 7%
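The averages and drop percentages above can be checked with a few lines. The per-run numbers are taken from this comment; the helper names (`avg`, `pct_drop`) are my own:

```python
def avg(runs):
    """Arithmetic mean of a list of files/sec results."""
    return sum(runs) / len(runs)

def pct_drop(baseline, value):
    """Percentage regression of value relative to baseline."""
    return 100.0 * (baseline - value) / baseline

rhgs_313        = avg([1197.813533, 1101.542113, 1084.906356])  # ~1128
rhgs_32         = avg([1162.215698, 1075.400852, 1028.074938])  # ~1089
rhgs_32_mdcache = avg([1053.610636, 1044.966673])               # ~1049

print(round(pct_drop(rhgs_313, rhgs_32), 1))           # ~3.5 (%)
print(round(pct_drop(rhgs_313, rhgs_32_mdcache), 1))   # ~7.0 (%)
```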
Created attachment 1257821 [details] Results for comment #18