Description of problem:
-----------------------
With 3.2 bits, it seems a regression was introduced on large file sequential reads, specific to replica 3.

*3.1.3 GA bits*  : 1568891.33 kB/sec
*3.8.4-13 bits*  : 1093736.45 kB/sec

Regression : -30%

I do not see a problem with 2-way replica, though.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-server-3.8.4-13.el7rhgs.x86_64

How reproducible:
----------------
Every which way I try.

Actual results:
---------------
30% regression on large file reads on 3.2.

Expected results:
-----------------
Regression threshold is within 10%.

Additional info:
----------------
Volume Name: rep3
Type: Distributed-Replicate
Volume ID: 390622f1-9ff1-4d04-b342-5492254f25fe
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x 3 = 36
Transport-type: tcp
Bricks:
Brick1: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick5: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick6: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick7: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick8: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick9: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick10: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick11: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick12: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick13: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick14: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick15: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick16: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick17: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick18: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick19: gqas005.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick20: gqas006.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick21: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick22: gqas005.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick23: gqas006.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick24: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick25: gqas005.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick26: gqas006.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick27: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick28: gqas005.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick29: gqas006.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick30: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick31: gqas005.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick32: gqas006.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick33: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick34: gqas005.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick35: gqas006.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick36: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-samba-metadata: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
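For reference, the per-brick FOP counts and average latencies discussed in the comments below come from gluster's built-in volume profiling (the diagnostics.count-fop-hits / diagnostics.latency-measurement options above are what profiling toggles). A minimal sketch of how such a cumulative profile can be collected; the output file name is only an example:

$ gluster volume profile rep3 start
  ... run the large-file sequential read workload ...
$ gluster volume profile rep3 info cumulative > cumu-profile.txt   # per-brick FOP counts and avg/min/max latencies
$ gluster volume profile rep3 stop

Note: the 'cumulative' keyword to 'profile info' may not be present on all builds; plain 'gluster volume profile rep3 info' also reports the cumulative stats per brick.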
On switching off io-threads, the magnitude of the regression came down, but it is still not within acceptable limits:

*3.1.3*                    : 1568891.33 kB/sec
*3.8.4-13, io-threads off* : 1192683.07 kB/sec

Regression : -23%
I stripped the cumulative profile info numbers into separate files (cumu-3.1.3.txt, cumu-3.2-iot-on.txt and cumu-3.2-iot-off.txt).

Some observations:
-------------------------------------------------------------------------
1) The sum total of READ calls made on all bricks: 3.1.3 > 3.2-without-io-threads > 3.2-with-io-threads. Despite the same workload and file size, 3.1.3 made more reads yet still delivered better performance.

$ grep -re "No. of Reads" cumu-3.1.3.txt|cut -d ":" -f2|tr -d [:blank:]|awk '{ sum+=$0} END {print sum}'
691474
$ grep -re "No. of Reads" cumu-3.2-iot-off.txt|cut -d ":" -f2|tr -d [:blank:]|awk '{ sum+=$0} END {print sum}'
546177
$ grep -re "No. of Reads" cumu-3.2-iot-on.txt|cut -d ":" -f2|tr -d [:blank:]|awk '{ sum+=$0} END {print sum}'
491279
-------------------------------------------------------------------------
2) The average latency (in microseconds) of READs in the 3.2 setups appears to be higher, which could be what is reducing performance. 3.1.3 had a single occurrence above 2500 us, while the 3.2 runs have several.

$ grep -w READ cumu-3.1.3.txt | awk '{print $2}'|sort
2660.47
482.76
485.19
524.39
632.65
654.08
654.81
673.18
674.24
685.41
864.87
$ grep -w READ cumu-3.2-iot-off.txt | awk '{print $2}'|sort
2735.70
2801.96
2856.74
566.85
641.55
675.00
678.89
681.29
699.62
$ grep -w READ cumu-3.2-iot-on.txt | awk '{print $2}'|sort
2712.74
2737.09
2751.98
2764.30
560.78
583.90
617.89
788.29

We need to find out why READ latency is higher in the 3.2 runs.
-------------------------------------------------------------------------
Ambarish,

1. Could you provide the strace output of all the bricks while you run the workload (on 3.1.3 and 3.2)? This can give us some idea of the time spent in syscalls etc.
2. For the sake of easier debugging, is it possible to see if we get similar regressions on a smaller setup (say a 2x3 volume, single-reader workload)?
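A small consolidation of the two checks in the observations above, assuming the stripped files sit in the current directory and keep the "No. of Reads" / READ-row layout used in the one-liners; sort -n is used so the per-brick READ average latencies order numerically rather than lexically:

$ for f in cumu-3.1.3.txt cumu-3.2-iot-off.txt cumu-3.2-iot-on.txt; do
      echo "== $f =="
      # total READ calls across all bricks
      grep -e "No. of Reads" "$f" | cut -d ":" -f2 | awk '{ sum+=$1 } END { print "total READs:", sum }'
      # per-brick READ average latency (usec), numerically sorted
      grep -w READ "$f" | awk '{print $2}' | sort -n | tr '\n' ' '; echo
  done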
Created attachment 1246197 [details]
Stripped cumulative profile info from the original attachments.
Cleared needinfo on me by mistake.
A solid 27% regression on large file reads on 2x3:

3.1.3 : 801896 kB/sec
3.2   : 583641 kB/sec

IOT was off in both cases, FYI.
Observations from the attachments in comment #12 and comment #13:

0. volinfo was not available, but I've assumed the 2 replica-3 subvols are made up of the following bricks:

Brick: gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick: gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/brick

1. In this run also, the no. of READs (from the profile info) for the same test is higher in 3.1.3 than in 3.2:

$ grep -re "No. of Reads" 3.1.3-iot-off.txt |awk '{print $4}'|awk '{ sum+=$0} END {print sum}'
1118115
$ grep -re "No. of Reads" 3.2-iot-off.txt |awk '{print $4}'|awk '{ sum+=$0} END {print sum}'
674470

2. The time taken by the posix xlator for the pread syscall is significantly higher on gqas014 for 3.2 compared to 3.1.3.

Highest 10 values for 3.1.3:
$ grep -w pread 313/gqas010/*.txt*|awk '{print $8}'|tr '<' ' '|tr '>' ' '|sort|tail -10
0.032043
0.032882
0.033726
0.033868
0.034844
0.039520
0.039931
0.046005
0.055675
0.069946

Highest 10 values for 3.2:
$ grep -w pread 32/gqas010/*.txt*|awk '{print $8}'|tr '<' ' '|tr '>' ' '|sort|tail -10
0.417160
0.434892
0.436256
0.440567
0.457439
0.457820
0.494122
0.495383
0.503378
0.504748

The values on the other 2 nodes, gqas014 and gqas015, are in the same 'range', although a bit higher for 3.2:

$ grep -w pread 313/gqas014/*.txt*|awk '{print $8}'|tr '<' ' '|tr '>' ' '|sort|tail -10
0.247469
0.250882
0.266613
0.282624
0.283825
0.284675
0.304581
0.305631
0.323572
0.323947
$ grep -w pread 32/gqas014/*.txt*|awk '{print $8}'|tr '<' ' '|tr '>' ' '|sort|tail -10
0.531700
0.537332
0.549611
0.569431
0.576682
0.583793
0.587856
0.590194
0.590830
0.663191
$ grep -w pread 313/gqas015/*.txt*|awk '{print $8}'|tr '<' ' '|tr '>' ' '|sort|tail -10
0.640470
0.701599
0.704544
0.710213
0.710533
0.726115
0.726428
0.770418
0.976238
0.976581
$ grep -w pread 32/gqas015/*.txt*|awk '{print $8}'|tr '<' ' '|tr '>' ' '|sort|tail -10
0.800597
0.800757
0.881200
0.884361
0.986186
0.986745
1.005609
1.005773
1.011861
1.011967

----------------------------------------------------------------------
I have requested Ambarish to try out the test on a different set of machines, both with and without the perf xlators, for single- and multi-writer scenarios, to rule out possible hardware issues.
(In reply to Ravishankar N from comment #14)
> 2. The time taken by posix xlator for the pread syscall is significantly
> more on gqas014 for 3.2 compared to 3.1.3:

Sorry, typo. I meant gqas010 instead of gqas014.
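For anyone reproducing the pread timing comparison in comment #14: the <...> durations come from strace's -T option. A hedged sketch of capturing them on one brick; the pgrep pattern and output path are assumptions, not necessarily how the original traces were taken:

$ BRICK_PID=$(pgrep -f "glusterfsd.*bricks1/brick" | head -1)   # pick one brick process (hypothetical match pattern)
$ strace -f -T -e trace=pread64 -o /tmp/brick1.strace -p "$BRICK_PID" &
  ... run the read workload, then stop the strace (e.g. kill %1) ...
$ grep -w pread64 /tmp/brick1.strace | awk '{print $NF}' | tr -d '<>' | sort -n | tail -10   # ten slowest preads (seconds)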
I could not repro this on another setup.
Based on comment #16, I am closing the bug; please reopen if the issue is hit again.