Description of problem:
Regression seen on smallfile creates from RHGS 3.3.1 to RHGS 3.4.0 on a FUSE mount.

Version-Release number of selected component (if applicable):
3.12.2-5

How reproducible:
100%

Profile infos, sosreports and statedumps are placed at:
mount -t cifs -oguest //newtssuitcase.lab.eng.blr.redhat.com/newtssuitcase /BZ_sosreports/Karan/Porfile_info/

Steps to Reproduce:
1. Run smallfile creates on a 6-node setup using 6 client machines with the command below:
qeSyncDropCaches.sh; python /small-files/smallfile/smallfile_cli.py --operation create --threads 8 --file-size 64 --files 5000 --top /gluster-mount --host-set "`echo $CLIENT | tr ' ' ','`"
2. Compare the results with RHGS 3.3.1, i.e. 3.8.4.54.3.

Actual results:
Regression seen between the releases.

Expected results:
No regression should be observed.

Additional info:
Not a brick-mux setup.
The performance numbers weren't consistent, i.e. the regression grew with every iteration.
Regression seen on a 2x2 volume as well.
Profile infos at mount -t cifs -oguest //newtssuitcase.lab.eng.blr.redhat.com/newtssuitcase /BZ_sosreports/ksandha/Porfile_info
Hi Nithya,

In the profile infos (Create_3.3.1_7.5.profileinfo and Create_3.4_7.5.profileinfo), the number of STATFS calls for 3.4 is 3 times that of 3.3.1. Are there any changes in DHT between these two versions that cause these extra calls and possibly this regression?
(In reply to Karan Sandha from comment #0)
> Description of problem:
> Regression seen on smallfile creates from rhgs 3.3.1 to rhgs 3.4.0 on fuse
> mount.
>
> Version-Release number of selected component (if applicable):
> 3.12.2-5
>
> How reproducible:
> 100%
> profile infos are kept at sosreports are placed at sosreport and statedumps
> at mount -t cifs -oguest
> //newtssuitcase.lab.eng.blr.redhat.com/newtssuitcase
> /BZ_sosreports/Karan/Porfile_info/
>
> Steps to Reproduce:
> 1. Run small-file creates on 6 node setup using 6 client machines below
> command qeSyncDropCaches.sh; python /small-files/smallfile/smallfile_cli.py
> --operation create --threads 8 --file-size 64 --files 5000 --top
> /gluster-mount --host-set "`echo $CLIENT | tr ' ' ','`"
> 2. now compare the same with RHGS 3.3.1 i.e 3.8.4.54.3
>
> Actual results:
> Regression seen between release

It would help if actual numbers were provided as part of the description for performance bugs. For example, with 3.3.1 this test finished in x seconds; with 3.4 it takes x+y seconds. Percentage regressions are rather vague, and concrete numbers would allow us to visualise the impact better.

> Expected results:
> No regression should be observed
>
> Additional info:
> not a brick mux setup.
> The performance numbers weren't consistent i.e after every iterations the
> regression was more.
> regression seen on 2x2 volume also
From the strace at newtssuitcase.lab.eng.blr.redhat.com/newtssuitcase/BZ_sosreports/ksandha/Porfile_info/strace, we have the number of FOPs:

FOP                3.3.1     3.4.0   Difference (3.4.0 – 3.3.1)
===================================================================
chown              15603     14615      -988
clone                 46        42        -4
close              26327     24578     -1749
epoll_ctl         175807    163517    -12290
epoll_wait        175831    163540    -12291
fgetxattr         149342    142270     -7072
fsetxattr          28854     27270     -1584
fstat              33926     33190      -736
fsync                 24        23        -1
futex             518916    427435    -91481
getdents           18569     17821      -748
getegid             1158       965      -193
geteuid             1158       965      -193
getgroups           2316      1930      -386
lgetxattr         121896     62944    -58952
linkat             57723     54552     -3171
llistxattr          6432      2330     -4102
lseek                 33        26        -7
lsetxattr          16761     29230     12469
lstat             223104    171383    -51721
madvise                2                  -2
mkdir              32364     30195     -2169
mmap                  46        62        16
mprotect             534      7671      7137
nanosleep            345       327       -18
open               14535       119    -14416
openat             11792     24454     12662
pread64                7         8         1
pwrite64            9704      9732        28
read                  65        50       -15
readlink              28        35         7
readv             361316    336764    -24552
restart_syscall       48        58        10
rmdir                  6         5        -1
rt_sigaction          12                 -12
rt_sigprocmask       128       148        20
rt_sigtimedwait        6         6         0
select               981       819      -162
set_robust_list       46        42        -4
stat               44940     42397     -2543
statfs               504       530        26
symlink             1158       965      -193
unlink             43374     40991     -2383
write                 12        47        35
writev            176312    164042    -12270
munmap                10        10
=========================================================

It is seen that the number of lsetxattr and mprotect calls is higher for 3.4.0. In the strace it was seen that these lsetxattrs corresponded to setting "trusted.gfid2path"; this was not present in the 3.3.1 strace. We suspect this to be the cause of the regression, as gfid2path is enabled by default in 3.4.0 (https://review.gluster.org/#/c/18133/).
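For future runs, per-syscall deltas like the table above can be generated mechanically rather than by hand. A minimal sketch, assuming each `strace -c` summary has been reduced to sorted "syscall count" pairs; the filenames and the two sample counts below are illustrative (taken from the table), not the actual capture files:

```shell
# Sample per-syscall counts for the two builds (sorted by syscall name,
# as required by join). Real data would come from the strace summaries.
cat > /tmp/counts_331.txt <<'EOF'
lsetxattr 16761
open 14535
EOF
cat > /tmp/counts_340.txt <<'EOF'
lsetxattr 29230
open 119
EOF
# Join on syscall name and print the 3.4.0 - 3.3.1 difference per syscall
join /tmp/counts_331.txt /tmp/counts_340.txt | awk '{printf "%-12s %8d\n", $1, $3 - $2}'
```

Running this over the full summaries would reproduce the difference column above directly.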
I'm not sure yet whether setting gfid2path is a client-driven operation or whether it is handled entirely in posix (in which case we should not be seeing such a huge perf drop just for one extra syscall that sets this attribute per file).

Karan, as discussed:

1. Could you check whether disabling gfid2path improves the performance on 3.4.0?
`gluster v set volname gfid2path disable`

2. Since the earlier strace was started after the I/O was initiated, it might not have captured all the FOPs. So please capture the strace before starting the I/O. Also provide the corresponding profile info for correlation.

Thanks!
(all) In the next run, while you get the profile, please take the profile of the client side too (below is the way to get the profile info):

- On the client where the volume is mounted (via FUSE) and tests are to be performed:
  setfattr -n trusted.io-stats-dump -v <full-path-to-store-io-stats> /the/gluster/fuse/mountpoint

This should give us a complete picture of where we are stuck.
Created attachment 1428669 [details] reverted patch-1
Created attachment 1428670 [details] patch-2
Created attachment 1428671 [details] patch-3
Created attachment 1428672 [details] patch-4
Milind,

Let's measure the contention on rpcsvc->queue_lock using mutrace on the brick processes. We can use the following cmdline:

LD_PRELOAD=/usr/local/lib/libmutrace.so mutrace --max=30 --hash-size=50021 /usr/bin/glusterfs -N ....

If the contention is high, we can experiment with multiple queues instead of a single queue.

regards,
Raghavendra
Thanks Milind. Can you also capture the CPU utilization of each thread in the brick process (on all bricks)? You can use the following cmdline:

top -bHd $MON_INTERVAL > /tmp/top_thr.${HOSTNAME}.txt # CPU utilization by thread
From rpcsvc_program_register:

    if (newprog->ownthread) {
        newprog->eventthreadcount = 1;
        creates = rpcsvc_spawn_threads (svc, newprog);
        if (creates < 1) {
            goto out;
        }
    }

Why are we setting eventthreadcount to 1? When is that changed to the actual number of event threads in the process? Can that codepath be executed in this case? IOW, can this be the bug that prevents scaling of req-handler threads?
Looking at the stats provided in comment #133:

For rhgs 3.3.1:

$ cat top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep glusterfsd | head -n 20
25179 root 20 0 2304616 478932 4588 S 0.0 1.0 185:15.17 glusterfsd
25179 root 20 0 2304616 478932 4588 S 0.0 1.0 185:15.17 glusterfsd
25179 root 20 0 2304876 478932 4588 S 0.2 1.0 185:15.19 glusterfsd
25179 root 20 0 2307736 479560 4588 S 201.3 1.0 185:35.34 glusterfsd
25179 root 20 0 2307996 480292 4588 S 656.9 1.0 186:41.23 glusterfsd
25179 root 20 0 2308516 480348 4588 S 578.8 1.0 187:39.23 glusterfsd
25179 root 20 0 2308516 480348 4588 S 602.7 1.0 188:39.68 glusterfsd
25179 root 20 0 2308516 480348 4588 S 149.5 1.0 188:54.66 glusterfsd
25179 root 20 0 2308516 480348 4588 S 109.9 1.0 189:05.68 glusterfsd
25179 root 20 0 2308516 480352 4588 S 237.6 1.0 189:29.46 glusterfsd
25179 root 20 0 2308516 480388 4592 S 424.2 1.0 190:11.96 glusterfsd
25179 root 20 0 2308516 480388 4592 S 302.6 1.0 190:42.31 glusterfsd
25179 root 20 0 2308516 480388 4592 S 0.1 1.0 190:42.32 glusterfsd
25179 root 20 0 2308516 480388 4592 S 0.0 1.0 190:42.32 glusterfsd
25179 root 20 0 2308516 480388 4592 S 0.0 1.0 190:42.32 glusterfsd
25179 root 20 0 2308516 480388 4592 S 0.0 1.0 190:42.32 glusterfsd
25179 root 20 0 2308516 480388 4592 S 0.0 1.0 190:42.32 glusterfsd
25179 root 20 0 2308516 480388 4592 S 0.0 1.0 190:42.32 glusterfsd
25179 root 20 0 2308516 480388 4592 S 0.0 1.0 190:42.32 glusterfsd
25179 root 20 0 2308516 480388 4592 S 0.0 1.0 190:42.32 glusterfsd

For the rhgs 3.4 build:

$ cat top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep glusterfsd
16244 root 20 0 2386320 421152 4712 S 0.0 0.9 47:28.08 glusterfsd
16244 root 20 0 2386320 422396 4712 S 0.0 0.9 47:28.08 glusterfsd
16244 root 20 0 2386320 422396 4712 S 0.3 0.9 47:28.11 glusterfsd
16244 root 20 0 2386320 423200 4712 S 116.8 0.9 47:39.81 glusterfsd
16244 root 20 0 2386320 421928 4712 S 406.0 0.9 48:20.49 glusterfsd
16244 root 20 0 2386320 421956 4712 S 286.4 0.9 48:49.16 glusterfsd
16244 root 20 0 2386320 422060 4712 S 371.8 0.9 49:26.45 glusterfsd
16244 root 20 0 2386320 422176 4712 S 204.3 0.9 49:46.94 glusterfsd
16244 root 20 0 2386320 421204 4712 S 26.2 0.9 49:49.57 glusterfsd
16244 root 20 0 2386320 422116 4712 S 41.3 0.9 49:53.71 glusterfsd
16244 root 20 0 2386320 421196 4712 S 34.5 0.9 49:57.17 glusterfsd
16244 root 20 0 2386320 421660 4712 S 190.0 0.9 50:16.21 glusterfsd
16244 root 20 0 2386320 421988 4712 S 665.0 0.9 51:22.84 glusterfsd
16244 root 20 0 2386320 420952 4712 S 37.1 0.9 51:26.56 glusterfsd
16244 root 20 0 2386320 421828 4712 S 308.2 0.9 51:57.44 glusterfsd
16244 root 20 0 2386320 421924 4712 S 39.0 0.9 52:01.35 glusterfsd
16244 root 20 0 2386320 422056 4712 S 60.9 0.9 52:07.45 glusterfsd
16244 root 20 0 2386320 422116 4712 S 41.7 0.9 52:11.63 glusterfsd
16244 root 20 0 2386320 422116 4712 S 82.1 0.9 52:19.86 glusterfsd
16244 root 20 0 2386320 421736 4712 S 263.4 0.9 52:46.28 glusterfsd
16244 root 20 0 2386320 421736 4712 S 124.6 0.9 52:58.78 glusterfsd
16244 root 20 0 2386320 421736 4712 S 18.5 0.9 53:00.63 glusterfsd
16244 root 20 0 2386320 421736 4712 S 0.0 0.9 53:00.63 glusterfsd
16244 root 20 0 2386320 421736 4712 S 0.0 0.9 53:00.63 glusterfsd

For rhgs 3.3.1, glusterfsd activity is steady. For rhgs 3.4, glusterfsd seems to be working in fits and starts.
And it doesn't seem to be a case of requests stuck in some queue locally; network activity shows a similar picture.

rhgs 3.3.1:

$ cat sar.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep p2p1 | head -n 20
02:35:37 EDT p2p1 59.20 61.50 11.91 12.58 0.00 0.00 0.00
02:35:47 EDT p2p1 16.80 11.40 1.77 1.59 0.00 0.00 0.00
02:35:57 EDT p2p1 6288.10 6132.40 1713.01 1102.48 0.00 0.00 0.00
02:36:07 EDT p2p1 57303.50 36543.80 190471.20 6979.76 0.00 0.00 0.00
02:36:17 EDT p2p1 38808.10 24657.10 129105.82 4726.04 0.00 0.00 0.00
02:36:27 EDT p2p1 52537.60 33221.90 174648.95 6381.13 0.00 0.00 0.00
02:36:37 EDT p2p1 12996.20 8981.80 43115.53 1633.15 0.00 0.00 0.00
02:36:47 EDT p2p1 9019.50 6089.40 30016.74 1126.87 0.00 0.00 0.00
02:36:57 EDT p2p1 10708.00 6899.50 35689.65 1385.62 0.00 0.00 0.00
02:37:07 EDT p2p1 34070.40 22555.90 110824.64 4804.49 0.00 0.00 0.00
02:37:17 EDT p2p1 31056.30 20839.70 101216.36 4608.60 0.00 0.00 0.00
02:37:27 EDT p2p1 10.70 5.80 0.66 0.37 0.00 0.00 0.00
02:37:37 EDT p2p1 9.20 4.20 0.56 0.27 0.00 0.00 0.00
02:37:47 EDT p2p1 10.90 5.90 0.67 0.37 0.00 0.00 0.00
02:37:57 EDT p2p1 9.40 4.60 0.57 0.28 0.00 0.00 0.00
02:38:07 EDT p2p1 10.50 5.70 0.65 0.36 0.00 0.00 0.00
02:38:17 EDT p2p1 9.50 4.50 0.58 0.28 0.00 0.00 0.00
02:38:27 EDT p2p1 11.00 6.00 0.68 0.37 0.00 0.00 0.00
02:38:37 EDT p2p1 9.70 4.70 0.59 0.29 0.00 0.00 0.00
02:38:47 EDT p2p1 10.70 5.70 0.66 0.36 0.00 0.00 0.00

rhgs 3.4:

$ cat sar.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep p2p1
02:00:45 EDT p2p1 63.70 66.30 12.20 12.90 0.00 0.00 0.00
02:00:55 EDT p2p1 12.30 7.00 1.08 1.66 0.00 0.00 0.00
02:01:05 EDT p2p1 6618.10 6558.40 1785.05 1107.64 0.00 0.00 0.00
02:01:15 EDT p2p1 28556.70 20945.50 93880.71 3665.47 0.00 0.00 0.00
02:01:25 EDT p2p1 22270.20 16173.70 73892.16 2837.41 0.00 0.00 0.00
02:01:35 EDT p2p1 28223.80 20224.00 93681.53 3578.31 0.00 0.00 0.00
02:01:45 EDT p2p1 20560.50 14616.10 68284.64 2601.21 0.00 0.00 0.00
02:01:55 EDT p2p1 2048.40 1446.40 6742.59 258.02 0.00 0.00 0.00
02:02:05 EDT p2p1 2892.20 2146.40 9584.58 371.65 0.00 0.00 0.00
02:02:15 EDT p2p1 3196.90 2341.00 10485.10 404.89 0.00 0.00 0.00
02:02:25 EDT p2p1 9383.20 6659.90 31096.11 1185.47 0.00 0.00 0.00
02:02:35 EDT p2p1 49026.70 33658.50 162940.17 6126.67 0.00 0.00 0.00
02:02:45 EDT p2p1 8914.50 6045.60 29594.30 1107.27 0.00 0.00 0.00
02:02:55 EDT p2p1 24452.50 17308.70 79795.81 3457.23 0.00 0.00 0.00
02:03:05 EDT p2p1 3129.10 2333.70 10241.54 459.10 0.00 0.00 0.00
02:03:15 EDT p2p1 3902.30 2923.30 12736.39 573.44 0.00 0.00 0.00
02:03:25 EDT p2p1 3813.30 2822.00 12451.65 556.95 0.00 0.00 0.00
02:03:35 EDT p2p1 5214.30 3890.50 16871.65 761.33 0.00 0.00 0.00
02:03:45 EDT p2p1 12097.90 8813.30 39354.17 1759.69 0.00 0.00 0.00
02:03:55 EDT p2p1 17982.40 12382.90 58536.01 2568.12 0.00 0.00 0.00
02:04:05 EDT p2p1 1659.20 1287.50 5454.85 544.03 0.00 0.00 0.00
02:04:15 EDT p2p1 7.70 2.60 0.49 0.16 0.00 0.00 0.00
02:04:25 EDT p2p1 13.00 7.80 0.86 0.49 0.00 0.00 0.00

The difference could be within gluster or in the environment. In working with Milind and Raghavendra on this problem, we have had success steadying rhgs 3.4 runs with vm.dirty* parameter tuning, so I'm looking at whether there is something different in the way writeback from page cache to disk is happening in the rhgs 3.4 run that could explain the above top and sar stats. Let me make that a different comment...
(In reply to Manoj Pillai from comment #145)
<..>
> The difference could be within gluster or in the environment. In working
> with Milind and Raghavendra on this problem, we have had success steadying
> rhgs 3.4 runs with vm.dirty* parameter tuning, so I'm looking at whether
> there is something different in the way writeback from page cache to disk is
> happening in the rhgs 3.4 run that could explain the above top and sar
> stats. Let me make that a different comment...

In rhgs tuned profiles, we use the sysctl vm.dirty*ratio parameters. These parameters are interpreted as a % of available memory. So the question is: if the setup is the same, why is there a difference between the rhgs 3.3.1 and rhgs 3.4 runs? The kernel documentation stresses that available memory is not the same as total memory: https://www.kernel.org/doc/Documentation/sysctl/vm.txt. So is there a difference in available memory between the 2 runs?

rhgs 3.3.1:

$ cat top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep -i "avail mem"
KiB Swap: 24772604 total, 24772604 free, 0 used. 47603072 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47722300 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47725736 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47542716 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47416216 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47357588 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47329272 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47321900 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47304880 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47282664 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47261992 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47240808 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47236060 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47189004 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47203108 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47203584 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47203636 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214276 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214760 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214640 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214936 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47215084 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47215260 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47215612 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47215864 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47215740 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47215776 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47215784 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214432 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214492 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214500 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214684 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214656 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214712 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47214788 avail Mem

rhgs 3.4:

$ cat top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep -i "avail mem"
KiB Swap: 24772604 total, 24772604 free, 0 used. 46463020 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47753164 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47754292 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47655392 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47362680 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47244848 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 47119736 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46983184 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46891960 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46885396 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46881276 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46830848 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46721376 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46636676 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46562968 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46559552 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46533352 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46527992 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46516016 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46464728 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46458504 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46453892 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46433160 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46425540 avail Mem

The relevant column is the last numeric column. Unlike rhgs 3.3.1, available memory seems to be steadily dropping in rhgs 3.4. Memory leaks? There are stats for only 1 run, so it is hard to say for sure, but this seems like a promising direction of investigation.

We'll need multiple runs with an rhgs 3.4 build (and preferably with rhgs 3.3.1 as well for comparison), with stats collected for each run, to see if this pattern holds up and how bad it gets.
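The declining trend is easier to see if the available-memory column is extracted on its own. A minimal sketch (the sample file below stands in for the real top_proc.*.txt logs; "avail Mem" is the third-from-last field of each matching top summary line):

```shell
# Abbreviated stand-in for a saved top_proc.*.txt; real logs have many more lines
cat > /tmp/top_sample.txt <<'EOF'
KiB Swap: 24772604 total, 24772604 free, 0 used. 46463020 avail Mem
KiB Swap: 24772604 total, 24772604 free, 0 used. 46425540 avail Mem
EOF
# Print only the avail Mem values, one per sample, to eyeball the trend
grep -i 'avail mem' /tmp/top_sample.txt | awk '{print $(NF-2)}'
```

Feeding the output into a plotting tool (or just watching it scroll) makes a steady decline across iterations obvious.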
> The relevant column is the last numeric column. Unlike rhgs 3.3.1, available
> memory seems to be steadily dropping in rhgs 3.4. Memory leaks? There are
> stats only for 1 run, so it is hard say for sure, but this seems like a
> promising direction of investigation.
>
> We'll need multiple runs with an rhgs 3.4 build (and preferably with rhgs
> 3.3.1 as well for comparison), with stats collected for each run, to see if
> this pattern holds up and how bad it gets.

Yesterday Karan and I did the exercise of stopping and starting the volume for each run, and it still didn't help: available memory kept going down with each iteration till it reached a point where it kept fluctuating. The one thing I was not able to root-cause in yesterday's debugging session with Karan is the %util difference between 331 and 340 in the iostat numbers. So do let us know what you think about the iostat info.
(In reply to Pranith Kumar K from comment #148)
> > The relevant column is the last numeric column. Unlike rhgs 3.3.1, available
> > memory seems to be steadily dropping in rhgs 3.4. Memory leaks? There are
> > stats only for 1 run, so it is hard say for sure, but this seems like a
> > promising direction of investigation.
> >
> > We'll need multiple runs with an rhgs 3.4 build (and preferably with rhgs
> > 3.3.1 as well for comparison), with stats collected for each run, to see if
> > this pattern holds up and how bad it gets.
>
> Yesterday Karan and I did the exercise of stopping and starting the volume
> for each run and it still didn't help. It kept going down with each
> iteration till it reached a point where it kept fluctuating. The only thing
> that I was not able to conclude RC in yesterday's debugging session with
> Karan is the %util difference between 331 and 340 for iostat numbers. So do
> let us know what you think about the iostat info.

Yes. On 3.3.1 max %util is around 70%, but on 3.4.0 it hits 100% many times. Though 100% may not indicate that the device is maxed out (as explained by Shekhar), I am curious:

* What caused the difference?
* Can this difference in any way account for the fluctuation we are seeing?
(In reply to Pranith Kumar K from comment #148)

If the amount of dirty data buffered in the page cache reaches the vm.dirty_ratio threshold, the OS forces a writeback and the writer can stall. The hypothesis I'm working with is that, because available memory levels are dropping, this is happening, or happening more often, in the 3.4 runs. That would explain glusterfsd working in "fits and starts" (comment #145). It is also possible that I/O scheduling is less effective for this forced writeback, meaning more seeks and less optimization of disk head movement, which would result in a higher disk util % for the same amount of I/O.

I'd try the easier experiment first: runs with higher values of vm.dirty_ratio (use the setting from the throughput-performance profile, i.e. 40, but keep vm.dirty_background_ratio low). If these don't lead anywhere, then try to dig deeper into the iostat output.
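As a configuration sketch, the suggested experiment amounts to the following sysctl changes on the servers. 40 is the throughput-performance profile value mentioned above; the background ratio of 5 is an assumed "low" value, not one prescribed in this comment:

```shell
# Raise the foreground dirty threshold so writers stall less often
sysctl -w vm.dirty_ratio=40
# Keep background writeback kicking in early (5 is an assumed low value)
sysctl -w vm.dirty_background_ratio=5
```

To make the change persistent across reboots, the same keys would go into /etc/sysctl.d/ or a custom tuned profile.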
(In reply to Karan Sandha from comment #163)
[...]
>
> The numbers from create workloads:-
>
> run1
> 98.04% of requested files processed, minimum is 90.00
> elapsed time = 163.141
> files/sec = 1442.310719
> IOPS = 1442.310719
> MiB/sec = 90.144420
>
> run2
> 97.92% of requested files processed, minimum is 90.00
> elapsed time = 154.447
> files/sec = 1521.553301
> IOPS = 1521.553301
> MiB/sec = 95.097081
>
> run3
> 97.92% of requested files processed, minimum is 90.00
> elapsed time = 154.712
> files/sec = 1518.947534
> IOPS = 1518.947534
> MiB/sec = 94.934221
>
> run4
> 98.00% of requested files processed, minimum is 90.00
> elapsed time = 165.350
> files/sec = 1422.436377
> IOPS = 1422.436377
> MiB/sec = 88.902274
>
> Manoj let me know if you need any other information.

Thanks, Karan. In terms of the problem definition: comment #140 stated the problem as performance declining from run to run in 3.4. That doesn't seem to be the case here -- the numbers are bad starting from run 1. So unless refined by further tests, I'll use this as the problem statement: poor, and sometimes inconsistent, performance for this workload on 3.4 compared to 3.3.1. Does that sound OK to everyone?

In the stats collected for 3.4, I'm seeing only empty files for iostat and sar.
$ ls -1s run_1528784068
total 16576
   0 iostat.gqas001.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas004.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas010.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas012.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas014.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas015.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas001.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas004.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas010.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas012.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas014.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas015.sbu.lab.eng.bos.redhat.com.txt
1184 top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt
1272 top_proc.gqas004.sbu.lab.eng.bos.redhat.com.txt
1280 top_proc.gqas010.sbu.lab.eng.bos.redhat.com.txt
1196 top_proc.gqas012.sbu.lab.eng.bos.redhat.com.txt
1224 top_proc.gqas014.sbu.lab.eng.bos.redhat.com.txt
1208 top_proc.gqas015.sbu.lab.eng.bos.redhat.com.txt
1484 top_thr.gqas001.sbu.lab.eng.bos.redhat.com.txt
1576 top_thr.gqas004.sbu.lab.eng.bos.redhat.com.txt
1588 top_thr.gqas010.sbu.lab.eng.bos.redhat.com.txt
1504 top_thr.gqas012.sbu.lab.eng.bos.redhat.com.txt
1496 top_thr.gqas014.sbu.lab.eng.bos.redhat.com.txt
1516 top_thr.gqas015.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas001.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas004.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas010.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas012.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas014.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas015.sbu.lab.eng.bos.redhat.com.txt

Can you please rectify that? And while you're at it, for the new 3.4 runs you'll be doing, can you please collect sosreport at the end of the runs (not sure if that is when you're collecting it)?
(In reply to Raghavendra G from comment #149)
> (In reply to Pranith Kumar K from comment #148)
> > > The relevant column is the last numeric column. Unlike rhgs 3.3.1, available
> > > memory seems to be steadily dropping in rhgs 3.4. Memory leaks? There are
> > > stats only for 1 run, so it is hard say for sure, but this seems like a
> > > promising direction of investigation.
> > >
> > > We'll need multiple runs with an rhgs 3.4 build (and preferably with rhgs
> > > 3.3.1 as well for comparison), with stats collected for each run, to see if
> > > this pattern holds up and how bad it gets.
> >
> > Yesterday Karan and I did the exercise of stopping and starting the volume
> > for each run and it still didn't help. It kept going down with each
> > iteration till it reached a point where it kept fluctuating. The only thing
> > that I was not able to conclude RC in yesterday's debugging session with
> > Karan is the %util difference between 331 and 340 for iostat numbers. So do
> > let us know what you think about the iostat info.
>
> Yes. On 3.3.1 max %util is around 70%, but on 3.4.0 it hits 100% many times.
> Though, 100% may not indicate that the device is maxed out (as explained by
> Shekhar) I am curious,
>
> * what caused the difference?
> * Can this difference in anyways attribute to the fluctuation we are seeing?

Probing deeper into my initial observation that max disk utilization was around 70% on rhgs-3.3.1 compared to 100% on rhgs-3.4.0 (which Pranith has also reiterated), I find that average request sizes on rhgs-3.4.0 are fairly low compared to the rhgs-3.3.1 runs.
For example, following are snippets of higher utilization on rhgs-3.4.0:

gluster_vg-pool_tdata 0.00 0.00 1436.60 566.20 13345.10 35898.40 49.17 8.34 4.16 5.76 0.10 0.40 80.31
gluster_vg-pool-tpool 0.00 0.00 1436.60 566.20 13345.10 35898.40 49.17 8.35 4.16 5.76 0.10 0.40 80.25
gluster_vg-lv_bricks 0.00 0.00 1436.60 346.10 13345.10 35898.40 55.25 8.40 4.71 5.78 0.27 0.45 80.53
gluster_vg-pool_tdata 0.00 0.00 794.30 999.30 6764.00 62016.00 76.69 14.60 8.10 8.21 8.01 0.40 71.39
gluster_vg-pool-tpool 0.00 0.00 794.30 999.30 6764.00 62016.00 76.69 14.58 8.11 8.22 8.02 0.40 71.39
gluster_vg-lv_bricks 0.00 0.00 794.30 839.90 6764.00 62028.80 84.19 14.74 8.91 8.23 9.56 0.44 71.37
gluster_vg-pool_tdata 0.00 0.00 810.50 434.40 6698.40 27801.60 55.43 5.51 4.42 6.74 0.10 0.59 73.89
gluster_vg-pool-tpool 0.00 0.00 810.50 434.40 6698.40 27801.60 55.43 5.51 4.43 6.75 0.10 0.59 73.93
gluster_vg-lv_bricks 0.00 0.00 810.50 217.10 6698.40 27788.80 67.12 5.53 5.39 6.76 0.26 0.72 73.93
gluster_vg-pool_tdata 0.00 0.00 976.20 513.40 11926.40 32262.40 59.33 3.84 2.58 3.88 0.12 0.65 96.16
gluster_vg-pool-tpool 0.00 0.00 976.20 513.40 11926.40 32262.40 59.33 3.84 2.58 3.88 0.13 0.65 96.19
gluster_vg-lv_bricks 0.00 0.00 976.30 368.70 11926.40 32262.40 65.71 3.86 2.88 3.89 0.20 0.72 96.61
gluster_vg-pool_tdata 0.00 0.00 568.60 2766.00 8773.60 104288.40 67.81 35.24 10.04 2.84 11.52 0.29 95.43
gluster_vg-pool-tpool 0.00 0.00 568.60 2766.00 8773.60 104288.40 67.81 35.22 10.04 2.84 11.52 0.29 95.52
gluster_vg-lv_bricks 0.00 0.00 568.50 2758.30 8773.60 104288.40 67.97 35.45 10.07 2.85 11.56 0.29 95.97
gluster_vg-pool_tdata 0.00 0.00 92.20 4638.20 955.20 101431.60 43.29 185.77 39.47 92.59 38.41 0.18 87.23
gluster_vg-pool-tpool 0.00 0.00 92.20 4638.20 955.20 101431.60 43.29 185.95 39.47 92.59 38.41 0.18 87.23
gluster_vg-lv_bricks 0.00 0.00 92.20 4629.00 955.20 101431.60 43.37 186.24 39.55 92.61 38.50 0.18 87.28
gluster_vg-pool_tdata 0.00 0.00 79.90 2116.10 804.00 100513.20 92.27 100.58 42.63 49.67 42.36 0.40 88.39
gluster_vg-pool-tpool 0.00 0.00 79.90 2116.10 804.00 100513.20 92.27 100.46 42.63 49.67 42.36 0.40 88.39
gluster_vg-lv_bricks 0.00 0.00 79.90 2105.10 804.00 100513.20 92.74 100.60 42.85 49.68 42.59 0.40 88.42
gluster_vg-pool_tdata 0.00 0.00 260.40 1518.30 2600.00 33040.45 40.07 99.91 59.01 21.65 65.42 0.46 82.17
gluster_vg-pool-tpool 0.00 0.00 260.40 1518.30 2600.00 33040.45 40.07 98.23 59.02 21.66 65.42 0.46 82.17
gluster_vg-lv_bricks 0.00 0.00 260.40 1451.20 2600.00 33040.45 41.65 98.21 61.34 21.67 68.46 0.48 82.31
gluster_vg-pool_tdata 0.00 0.00 780.80 1033.70 6350.40 65785.60 79.51 6.87 3.77 6.54 1.68 0.52 94.12
gluster_vg-pool-tpool 0.00 0.00 780.90 1033.70 6350.40 65785.60 79.51 6.85 3.78 6.54 1.69 0.52 94.11
gluster_vg-lv_bricks 0.00 0.00 780.90 583.00 6351.20 65785.60 105.78 6.90 5.05 6.55 3.04 0.69 94.20
gluster_vg-pool_tdata 0.00 0.00 596.50 2399.80 8740.80 42188.40 33.99 71.08 22.87 3.44 27.70 0.29 85.65
gluster_vg-pool-tpool 0.00 0.00 596.40 2399.80 8740.80 42188.40 34.00 71.09 22.87 3.44 27.70 0.29 85.79
gluster_vg-lv_bricks 0.00 0.00 596.40 2340.50 8740.00 42188.40 34.68 71.11 23.34 3.45 28.41 0.29 86.32
gluster_vg-pool_tdata 0.00 0.00 423.60 2447.60 4549.60 53970.00 40.76 96.58 34.44 21.24 36.72 0.34 98.31
gluster_vg-pool-tpool 0.00 0.00 423.60 2447.60 4549.60 53970.00 40.76 96.33 34.44 21.25 36.72 0.34 98.20
gluster_vg-lv_bricks 0.00 0.00 423.60 2262.90 4549.60 53970.00 43.57 96.36 36.82 21.26 39.73 0.37 98.21
gluster_vg-pool_tdata 0.00 0.00 205.60 3846.90 2993.60 125500.05 63.41 134.11 31.60 10.46 32.73 0.22 89.74
gluster_vg-pool-tpool 0.00 0.00 205.60 3846.90 2993.60 125500.05 63.41 133.90 31.60 10.46 32.73 0.22 89.75
gluster_vg-lv_bricks 0.00 0.00 205.60 3830.30 2993.60 125500.05 63.68 134.62 32.17 10.48 33.33 0.22 89.99
gluster_vg-pool_tdata 0.00 0.00 223.00 2155.20 2652.00 45353.55 40.37 94.27 39.64 12.06 42.49 0.35 82.65
gluster_vg-pool-tpool 0.00 0.00 223.00 2155.20 2652.00 45353.55 40.37 94.27 39.64 12.06 42.49 0.35 82.67
gluster_vg-lv_bricks 0.00 0.00 223.00 2066.30 2652.00 45353.55 41.94 95.16 41.56 12.07 44.75 0.36 82.83

Compare this with higher utilization on rhgs-3.3.0:
==================================================

gluster_vg-pool_tdata 0.00 0.00 2.70 3667.60 43.20 225324.80 122.81 93.29 25.54 32.63 25.53 0.19 70.11
gluster_vg-pool-tpool 0.00 0.00 2.70 3667.60 43.20 225324.80 122.81 93.29 25.54 32.63 25.53 0.19 70.11
gluster_vg-lv_bricks 0.00 0.00 2.70 3590.10 43.20 225324.80 125.46 93.89 26.26 32.67 26.25 0.20 70.26

Note that most reads and writes happen at a larger request size on 3.3.0:

gluster_vg-pool_tdata 0.00 0.00 278.60 1183.30 4386.45 75065.60 108.70 0.47 0.33 1.26 0.11 0.23 33.01
gluster_vg-pool-tpool 0.00 0.00 278.60 1183.30 4386.45 75065.60 108.70 0.48 0.33 1.26 0.11 0.23 33.30
gluster_vg-lv_bricks 0.00 0.00 278.60 776.80 4386.45 75072.00 150.58 0.51 0.48 1.28 0.20 0.33 34.89
gluster_vg-pool_tdata 0.00 0.00 37.40 2539.10 598.40 126151.55 98.39 42.90 16.65 22.11 16.57 0.20 52.39
gluster_vg-pool-tpool 0.00 0.00 37.40 2539.10 598.40 126151.55 98.39 42.90 16.65 22.11 16.57 0.20 52.49
gluster_vg-lv_bricks 0.00 0.00 37.40 2259.40 598.40 126157.95 110.38 42.94 18.69 22.14 18.63 0.23 53.25
gluster_vg-pool_tdata 0.00 0.00 3.10 662.90 49.60 40953.60 123.13 0.46 0.70 6.87 0.67 0.11 7.24
gluster_vg-pool-tpool 0.00 0.00 3.10 662.90 49.60 40953.60 123.13 0.47 0.70 6.87 0.67 0.11 7.26
gluster_vg-lv_bricks 0.00 0.00 3.10 580.10 49.60 40953.60 140.61 0.48 0.81 6.90 0.78 0.13 7.74
gluster_vg-pool_tdata 0.00 0.00 4.30 2389.60 68.80 111680.05 93.36 55.02 22.77 5.37 22.80 0.16 38.21
gluster_vg-pool-tpool 0.00 0.00 4.30 2389.60 68.80 111680.05 93.36 54.97 22.78 5.37 22.81 0.16 38.24
gluster_vg-lv_bricks 0.00 0.00 4.30 2335.10 68.80 111680.05 95.54 54.99 23.32 5.37 23.35 0.16 38.60
gluster_vg-pool_tdata 0.00 0.00 2.70 3667.60 43.20 225324.80 122.81 93.29 25.54 32.63 25.53 0.19 70.11
gluster_vg-pool-tpool 0.00 0.00 2.70 3667.60 43.20 225324.80 122.81 93.29 25.54 32.63 25.53 0.19 70.11
gluster_vg-lv_bricks 0.00 0.00 2.70 3590.10 43.20 225324.80 125.46 93.89 26.26 32.67 26.25 0.20 70.26
gluster_vg-pool_tdata 0.00 0.00 4.70 1425.10 75.20 88896.40 124.45 14.57 10.19 18.45 10.16 0.16 22.17
gluster_vg-pool-tpool 0.00 0.00 4.70 1425.10 75.20 88896.40 124.45 14.57 10.19 18.45 10.16 0.16 22.29
gluster_vg-lv_bricks 0.00 0.00 4.70 1206.50 75.20 88896.40 146.91 15.90 13.11 18.45 13.09 0.19 23.32
gluster_vg-pool_tdata 0.00 0.00 16.80 1692.90 268.80 70428.25 82.70 40.94 23.94 4.27 24.14 0.15 25.32
gluster_vg-pool-tpool 0.00 0.00 16.80 1692.90 268.80 70428.25 82.70 40.94 23.95 4.28 24.14 0.15 25.48
gluster_vg-lv_bricks 0.00 0.00 16.80 1478.30 268.80 70428.25 94.57 69.20 46.29 4.30 46.76 0.22 32.56
gluster_vg-pool_tdata 0.00 0.00 0.00 279.00 0.00 17120.00 122.72 3.87 13.81 0.00 13.81 0.15 4.07
gluster_vg-pool-tpool 0.00 0.00 0.00 279.00 0.00 17120.00 122.72 3.85 13.82 0.00 13.82 0.15 4.07
gluster_vg-lv_bricks 0.00 0.00 0.00 279.00 0.00 17120.00 122.72 3.88 13.92 0.00 13.92 0.15 4.07
gluster_vg-pool_tdata 0.00 0.00 1.90 2374.50 30.40 137371.00 115.64 49.19 20.70 6.11 20.71 0.19 45.28
gluster_vg-pool-tpool 0.00 0.00 1.90 2374.50 30.40 137371.00 115.64 49.20 20.70 6.11 20.71 0.19 45.28
gluster_vg-lv_bricks 0.00 0.00 1.90 2368.00 30.40 137371.00 115.96 57.27 24.16 6.11 24.18 0.19 45.31

Another observation is that more reads and writes are happening at the same time on rhgs-3.4.0. On rhgs-3.3.0 the workload seems to be either read-heavy or write-heavy at any given moment, but seldom both.
For example, on rhgs-3.3.0 the maximum concurrency I found was (only one instance):

gluster_vg-pool_tdata 0.00 0.00 278.60 1183.30 4386.45 75065.60 108.70 0.47 0.33 1.26 0.11 0.23 33.01
gluster_vg-pool-tpool 0.00 0.00 278.60 1183.30 4386.45 75065.60 108.70 0.48 0.33 1.26 0.11 0.23 33.30
gluster_vg-lv_bricks 0.00 0.00 278.60 776.80 4386.45 75072.00 150.58 0.51 0.48 1.28 0.20 0.33 34.89

Compare this to the maximum concurrency on rhgs-3.4.0 (many more instances than on rhgs-3.3.0):

gluster_vg-pool_tdata 0.00 0.00 213.36 239.74 6405.30 9532.80 70.35 5.37 11.83 6.65 16.44 0.47 21.20
gluster_vg-pool-tpool 0.00 0.00 213.36 239.74 6405.30 9532.80 70.35 5.37 11.83 6.65 16.45 0.47 21.21
gluster_vg-lv_bricks 0.00 0.00 213.36 211.22 6405.34 9532.80 75.08 18.65 43.90 6.67 81.52 0.50 21.32
gluster_vg-pool_tdata 0.00 0.00 1436.60 566.20 13345.10 35898.40 49.17 8.34 4.16 5.76 0.10 0.40 80.31
gluster_vg-pool-tpool 0.00 0.00 1436.60 566.20 13345.10 35898.40 49.17 8.35 4.16 5.76 0.10 0.40 80.25
gluster_vg-lv_bricks 0.00 0.00 1436.60 346.10 13345.10 35898.40 55.25 8.40 4.71 5.78 0.27 0.45 80.53
gluster_vg-pool_tdata 0.00 0.00 794.30 999.30 6764.00 62016.00 76.69 14.60 8.10 8.21 8.01 0.40 71.39
gluster_vg-pool-tpool 0.00 0.00 794.30 999.30 6764.00 62016.00 76.69 14.58 8.11 8.22 8.02 0.40 71.39
gluster_vg-lv_bricks 0.00 0.00 794.30 839.90 6764.00 62028.80 84.19 14.74 8.91 8.23 9.56 0.44 71.37
gluster_vg-pool_tdata 0.00 0.00 810.50 434.40 6698.40 27801.60 55.43 5.51 4.42 6.74 0.10 0.59 73.89
gluster_vg-pool-tpool 0.00 0.00 810.50 434.40 6698.40 27801.60 55.43 5.51 4.43 6.75 0.10 0.59 73.93
gluster_vg-lv_bricks 0.00 0.00 810.50 217.10 6698.40 27788.80 67.12 5.53 5.39 6.76 0.26 0.72 73.93
gluster_vg-pool_tdata 0.00 0.00 976.20 513.40 11926.40 32262.40 59.33 3.84 2.58 3.88 0.12 0.65 96.16
gluster_vg-pool-tpool 0.00 0.00 976.20 513.40 11926.40 32262.40 59.33 3.84 2.58 3.88 0.13 0.65 96.19
gluster_vg-lv_bricks 0.00 0.00 976.30 368.70 11926.40 32262.40 65.71 3.86 2.88 3.89 0.20 0.72 96.61
gluster_vg-pool_tdata 0.00 0.00 568.60 2766.00 8773.60 104288.40 67.81 35.24 10.04 2.84 11.52 0.29 95.43
gluster_vg-pool-tpool 0.00 0.00 568.60 2766.00 8773.60 104288.40 67.81 35.22 10.04 2.84 11.52 0.29 95.52
gluster_vg-lv_bricks 0.00 0.00 568.50 2758.30 8773.60 104288.40 67.97 35.45 10.07 2.85 11.56 0.29 95.97
gluster_vg-pool_tdata 0.00 0.00 260.40 1518.30 2600.00 33040.45 40.07 99.91 59.01 21.65 65.42 0.46 82.17
gluster_vg-pool-tpool 0.00 0.00 260.40 1518.30 2600.00 33040.45 40.07 98.23 59.02 21.66 65.42 0.46 82.17
gluster_vg-lv_bricks 0.00 0.00 260.40 1451.20 2600.00 33040.45 41.65 98.21 61.34 21.67 68.46 0.48 82.31
gluster_vg-pool_tdata 0.00 0.00 780.80 1033.70 6350.40 65785.60 79.51 6.87 3.77 6.54 1.68 0.52 94.12
gluster_vg-pool-tpool 0.00 0.00 780.90 1033.70 6350.40 65785.60 79.51 6.85 3.78 6.54 1.69 0.52 94.11
gluster_vg-lv_bricks 0.00 0.00 780.90 583.00 6351.20 65785.60 105.78 6.90 5.05 6.55 3.04 0.69 94.20
gluster_vg-pool_tdata 0.00 0.00 596.50 2399.80 8740.80 42188.40 33.99 71.08 22.87 3.44 27.70 0.29 85.65
gluster_vg-pool-tpool 0.00 0.00 596.40 2399.80 8740.80 42188.40 34.00 71.09 22.87 3.44 27.70 0.29 85.79
gluster_vg-lv_bricks 0.00 0.00 596.40 2340.50 8740.00 42188.40 34.68 71.11 23.34 3.45 28.41 0.29 86.32
gluster_vg-pool_tdata 0.00 0.00 423.60 2447.60 4549.60 53970.00 40.76 96.58 34.44 21.24 36.72 0.34 98.31
gluster_vg-pool-tpool 0.00 0.00 423.60 2447.60 4549.60 53970.00 40.76 96.33 34.44 21.25 36.72 0.34 98.20
gluster_vg-lv_bricks 0.00 0.00 423.60 2262.90 4549.60 53970.00 43.57 96.36 36.82 21.26 39.73 0.37 98.21
gluster_vg-pool_tdata 0.00 0.00 205.60 3846.90 2993.60 125500.05 63.41 134.11 31.60 10.46 32.73 0.22 89.74
gluster_vg-pool-tpool 0.00 0.00 205.60 3846.90 2993.60 125500.05 63.41 133.90 31.60 10.46 32.73 0.22 89.75
gluster_vg-lv_bricks 0.00 0.00 205.60 3830.30 2993.60 125500.05 63.68 134.62 32.17 10.48 33.33 0.22 89.99
gluster_vg-pool_tdata 0.00 0.00 223.00 2155.20 2652.00 45353.55 40.37 94.27 39.64 12.06 42.49 0.35 82.65
gluster_vg-pool-tpool 0.00 0.00 223.00 2155.20 2652.00 45353.55 40.37 94.27 39.64 12.06 42.49 0.35 82.67
gluster_vg-lv_bricks 0.00 0.00 223.00 2066.30 2652.00 45353.55 41.94 95.16 41.56 12.07 44.75 0.36 82.83
gluster_vg-pool_tdata 0.00 0.00 130.60 506.80 1057.60 31814.80 103.15 0.80 1.25 5.53 0.15 0.46 29.11
gluster_vg-pool-tpool 0.00 0.00 130.60 506.80 1057.60 31814.80 103.15 0.80 1.26 5.53 0.16 0.46 29.13
gluster_vg-lv_bricks 0.00 0.00 130.60 362.90 1057.60 31814.80 133.22 0.87 1.77 5.55 0.41 0.61 30.19
gluster_vg-pool_tdata 0.00 0.00 132.50 2220.40 2120.00 45467.95 40.45 79.09 33.61 1.53 35.53 0.24 55.62
gluster_vg-pool-tpool 0.00 0.00 132.50 2220.40 2120.00 45467.95 40.45 79.09 33.62 1.53 35.53 0.24 55.66
gluster_vg-lv_bricks 0.00 0.00 132.50 2219.00 2120.00 45467.95 40.47 79.11 33.64 1.54 35.56 0.24 55.83

I also wonder whether more reads and writes are happening on rhgs-3.4.0, though I need to dig into more data to confirm this. @Manoj, do you think these findings are in any way significant? These findings are based on the stats from comment #133.
Data and observations in comment #170 were from gqas001. Bricks from the other nodes reveal a similar pattern.
(In reply to Raghavendra G from comment #169) > (In reply to Pranith Kumar K from comment #141) > > This is the profile after the revert, 4 digit avg latency for create even in > > this run, so io-threads patch is not the reason for regression: > > 3.54 185.92 us 68.00 us 602428.00 us 131624 > > LOOKUP > > 6.21 356.92 us 72.00 us 202841.00 us 120405 > > WRITE > > 38.45 1108.09 us 139.00 us 1698682.00 us 240114 > > FXATTROP > > 43.33 2497.64 us 189.00 us 1538022.00 us 120062 > > CREATE > > Would it make sense to capture strace of bricks? That way we can isolate the > problem to operations on fs or to glusterfs stack. I think getting strace output of 3.3.1 and 3.4.0 would be a good idea. Reasons are: * Shekhar pointed that there are more reads to disks in 3.4.0 compared to 3.3.1. So comparing the syscalls will give a fair idea on where those reads are happening. * On irc, Sunil pointed out that ec86167 is a commit where regression is not seen and commit 4700c5b has regression. On doing a git log of commits between these two, I found suspect patches like: commit 68f2192df570b5ee615d440c2e0c88d49a75a34f Author: Soumya Koduri <skoduri> Date: Thu Jun 8 22:19:17 2017 +0530 posix-acl: Whitelist virtual ACL xattrs Similar to system.posix_acl_* xattrs, all users should get permission to be able to read glusterfs.posix.acl* xattrs too. Change-Id: I1fc2b67c8a12113910e4ec57cd114e4baefe0d38 BUG: 1459971 Signed-off-by: Soumya Koduri <skoduri> Reviewed-on: https://review.gluster.org/17493 This patch lets operations on posix-acls reach bricks. If posix acls are used, the traffic can be huge. Note that this patch won't change the number of fops as witnessed by brick profile as fops used to be wound all the way till posix-acl, which used to fail them with EPERM. With this patch, these fops will hit backend fs. 
commit f0fb166078d59cab2a33583591b6448326247c40 Author: Kotresh HR <khiremat> Date: Thu Jun 15 08:31:06 2017 -0400 feature/bitrot: Fix ondemand scrub The flag which keeps tracks of whether the scrub frequency is changed from previous value should not be considered for on-demand scrubbing. It should be considered only for 'scrub-frequency' where it should not be re-scheduled if it is set to same value again. But in case ondemand scrub, it should start the scrub immediately no matter what the scrub-frequency. Reproducer: 1. Enable bitrot 2. Set scrub-throttle 3. Set ondemand scrub Make sure glusterd is not restarted while doing below steps Change-Id: Ice5feaece7fff1579fb009d1a59d2b8292e23e0b BUG: 1461845 Signed-off-by: Kotresh HR <khiremat> Reviewed-on: https://review.gluster.org/17552 Not sure whether this can increase operations on disk. Would leave it to Kotresh on the effect of this patch.
(In reply to Raghavendra G from comment #174) > (In reply to Raghavendra G from comment #169) > > (In reply to Pranith Kumar K from comment #141) > > > This is the profile after the revert, 4 digit avg latency for create even in > > > this run, so io-threads patch is not the reason for regression: > > > 3.54 185.92 us 68.00 us 602428.00 us 131624 > > > LOOKUP > > > 6.21 356.92 us 72.00 us 202841.00 us 120405 > > > WRITE > > > 38.45 1108.09 us 139.00 us 1698682.00 us 240114 > > > FXATTROP > > > 43.33 2497.64 us 189.00 us 1538022.00 us 120062 > > > CREATE > > > > Would it make sense to capture strace of bricks? That way we can isolate the > > problem to operations on fs or to glusterfs stack. > > I think getting strace output of 3.3.1 and 3.4.0 would be a good idea. > Reasons are: > > * Shekhar pointed that there are more reads to disks in 3.4.0 compared to > 3.3.1. So comparing the syscalls will give a fair idea on where those reads > are happening. > > * On irc, Sunil pointed out that ec86167 is a commit where regression is not > seen and commit 4700c5b has regression. On doing a git log of commits > between these two, I found suspect patches like: > > commit 68f2192df570b5ee615d440c2e0c88d49a75a34f > Author: Soumya Koduri <skoduri> > Date: Thu Jun 8 22:19:17 2017 +0530 > > posix-acl: Whitelist virtual ACL xattrs > > Similar to system.posix_acl_* xattrs, all users should get > permission to be able to read glusterfs.posix.acl* xattrs too. > > Change-Id: I1fc2b67c8a12113910e4ec57cd114e4baefe0d38 > BUG: 1459971 > Signed-off-by: Soumya Koduri <skoduri> > Reviewed-on: https://review.gluster.org/17493 > > This patch lets operations on posix-acls reach bricks. If posix acls are > used, the traffic can be huge. Note that this patch won't change the number > of fops as witnessed by brick profile as fops used to be wound all the way > till posix-acl, which used to fail them with EPERM. With this patch, these > fops will hit backend fs. 
This looks to be used only in NFS-Ganesha. Nevertheless leaving the comment just in case I missed out anything. > > commit f0fb166078d59cab2a33583591b6448326247c40 > Author: Kotresh HR <khiremat> > Date: Thu Jun 15 08:31:06 2017 -0400 > > feature/bitrot: Fix ondemand scrub > > The flag which keeps tracks of whether the scrub > frequency is changed from previous value should > not be considered for on-demand scrubbing. It > should be considered only for 'scrub-frequency' > where it should not be re-scheduled if it is > set to same value again. But in case ondemand > scrub, it should start the scrub immediately > no matter what the scrub-frequency. > > Reproducer: > 1. Enable bitrot > 2. Set scrub-throttle > 3. Set ondemand scrub > Make sure glusterd is not restarted while doing > below steps > > Change-Id: Ice5feaece7fff1579fb009d1a59d2b8292e23e0b > BUG: 1461845 > Signed-off-by: Kotresh HR <khiremat> > Reviewed-on: https://review.gluster.org/17552 > > Not sure whether this can increase operations on disk. Would leave it to > Kotresh on the effect of this patch.
Created attachment 1452822 [details] Comment#182_regression_free_run
Created attachment 1452823 [details] Comment#168_regression_run
Created attachment 1453015 [details] comment#185_regression_free_run
The stats in comment #133 also indicate better network utilization on rhgs-3.3.1. Note how network I/O is concentrated in a few consecutive intervals of higher rxKB/s on rhgs-3.3.1, compared to the more spread-out and lower rxKB/s on rhgs-3.4.0 (left is v3.12.2-7/rhgs-3.4.0 | right is rhgs-3.3.1):

(sar -n DEV columns: time IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s)

02:00:45 EDT p2p1 63.70 66.30 12.20 12.90 0.00 0.00 0.00 |02:35:37 EDT p2p1 59.20 61.50 11.91 12.58 0.00 0.00 0.00
02:00:55 EDT p2p1 12.30 7.00 1.08 1.66 0.00 0.00 0.00 |02:35:47 EDT p2p1 16.80 11.40 1.77 1.59 0.00 0.00 0.00
02:01:05 EDT p2p1 6618.10 6558.40 1785.05 1107.64 0.00 0.00 0.00 |02:35:57 EDT p2p1 6288.10 6132.40 1713.01 1102.48 0.00 0.00 0.00
02:01:15 EDT p2p1 28556.70 20945.50 93880.71 3665.47 0.00 0.00 0.00 |02:36:07 EDT p2p1 57303.50 36543.80 190471.20 6979.76 0.00 0.00 0.00
02:01:25 EDT p2p1 22270.20 16173.70 73892.16 2837.41 0.00 0.00 0.00 |02:36:17 EDT p2p1 38808.10 24657.10 129105.82 4726.04 0.00 0.00 0.00
02:01:35 EDT p2p1 28223.80 20224.00 93681.53 3578.31 0.00 0.00 0.00 |02:36:27 EDT p2p1 52537.60 33221.90 174648.95 6381.13 0.00 0.00 0.00
02:01:45 EDT p2p1 20560.50 14616.10 68284.64 2601.21 0.00 0.00 0.00 |02:36:37 EDT p2p1 12996.20 8981.80 43115.53 1633.15 0.00 0.00 0.00
02:01:55 EDT p2p1 2048.40 1446.40 6742.59 258.02 0.00 0.00 0.00 |02:36:47 EDT p2p1 9019.50 6089.40 30016.74 1126.87 0.00 0.00 0.00
02:02:05 EDT p2p1 2892.20 2146.40 9584.58 371.65 0.00 0.00 0.00 |02:36:57 EDT p2p1 10708.00 6899.50 35689.65 1385.62 0.00 0.00 0.00
02:02:15 EDT p2p1 3196.90 2341.00 10485.10 404.89 0.00 0.00 0.00 |02:37:07 EDT p2p1 34070.40 22555.90 110824.64 4804.49 0.00 0.00 0.00
02:02:25 EDT p2p1 9383.20 6659.90 31096.11 1185.47 0.00 0.00 0.00 |02:37:17 EDT p2p1 31056.30 20839.70 101216.36 4608.60 0.00 0.00 0.00
02:02:35 EDT p2p1 49026.70 33658.50 162940.17 6126.67 0.00 0.00 0.00 |02:37:27 EDT p2p1 10.70 5.80 0.66 0.37 0.00 0.00 0.00
02:02:45 EDT p2p1 8914.50 6045.60 29594.30 1107.27 0.00 0.00 0.00 |02:37:37 EDT p2p1 9.20 4.20 0.56 0.27 0.00 0.00 0.00
02:02:55 EDT p2p1 24452.50 17308.70 79795.81 3457.23 0.00 0.00 0.00 |02:37:47 EDT p2p1 10.90 5.90 0.67 0.37 0.00 0.00 0.00
02:03:05 EDT p2p1 3129.10 2333.70 10241.54 459.10 0.00 0.00 0.00 |02:37:57 EDT p2p1 9.40 4.60 0.57 0.28 0.00 0.00 0.00
02:03:15 EDT p2p1 3902.30 2923.30 12736.39 573.44 0.00 0.00 0.00 |02:38:07 EDT p2p1 10.50 5.70 0.65 0.36 0.00 0.00 0.00
02:03:25 EDT p2p1 3813.30 2822.00 12451.65 556.95 0.00 0.00 0.00 |02:38:17 EDT p2p1 9.50 4.50 0.58 0.28 0.00 0.00 0.00
02:03:35 EDT p2p1 5214.30 3890.50 16871.65 761.33 0.00 0.00 0.00 |02:38:27 EDT p2p1 11.00 6.00 0.68 0.37 0.00 0.00 0.00
02:03:45 EDT p2p1 12097.90 8813.30 39354.17 1759.69 0.00 0.00 0.00 |02:38:37 EDT p2p1 9.70 4.70 0.59 0.29 0.00 0.00 0.00
02:03:55 EDT p2p1 17982.40 12382.90 58536.01 2568.12 0.00 0.00 0.00 |02:38:47 EDT p2p1 10.70 5.70 0.66 0.36 0.00 0.00 0.00
02:04:05 EDT p2p1 1659.20 1287.50 5454.85 544.03 0.00 0.00 0.00 |02:38:57 EDT p2p1 9.90 4.70 0.66 0.29 0.00 0.00 0.00
02:04:15 EDT p2p1 7.70 2.60 0.49 0.16 0.00 0.00 0.00 |02:39:07 EDT p2p1 11.70 6.30 0.83 0.38 0.00 0.00 0.00
Trimming down the output of the previous comment to just rxpck/s, txpck/s, rxKB/s and txKB/s for better readability (v3.12.2-7 | rhgs-3.3.1):

63.70 66.30 12.20 12.90 | 59.20 61.50 11.91 12.58
12.30 7.00 1.08 1.66 | 16.80 11.40 1.77 1.59
6618.10 6558.40 1785.05 1107.64 | 6288.10 6132.40 1713.01 1102.48
28556.70 20945.50 93880.71 3665.47 | 57303.50 36543.80 190471.20 6979.76
22270.20 16173.70 73892.16 2837.41 | 38808.10 24657.10 129105.82 4726.04
28223.80 20224.00 93681.53 3578.31 | 52537.60 33221.90 174648.95 6381.13
20560.50 14616.10 68284.64 2601.21 | 12996.20 8981.80 43115.53 1633.15
2048.40 1446.40 6742.59 258.02 | 9019.50 6089.40 30016.74 1126.87
2892.20 2146.40 9584.58 371.65 | 10708.00 6899.50 35689.65 1385.62
3196.90 2341.00 10485.10 404.89 | 34070.40 22555.90 110824.64 4804.49
9383.20 6659.90 31096.11 1185.47 | 31056.30 20839.70 101216.36 4608.60
49026.70 33658.50 162940.17 6126.67 | 10.70 5.80 0.66 0.37
8914.50 6045.60 29594.30 1107.27 | 9.20 4.20 0.56 0.27
24452.50 17308.70 79795.81 3457.23 | 10.90 5.90 0.67 0.37
3129.10 2333.70 10241.54 459.10 | 9.40 4.60 0.57 0.28
3902.30 2923.30 12736.39 573.44 | 10.50 5.70 0.65 0.36
3813.30 2822.00 12451.65 556.95 | 9.50 4.50 0.58 0.28
5214.30 3890.50 16871.65 761.33 | 11.00 6.00 0.68 0.37
12097.90 8813.30 39354.17 1759.69 | 9.70 4.70 0.59 0.29
17982.40 12382.90 58536.01 2568.12 | 10.70 5.70 0.66 0.36
1659.20 1287.50 5454.85 544.03 | 9.90 4.70 0.66 0.29
7.70 2.60 0.49 0.16 | 11.70 6.30 0.83 0.38
13.00 7.80 0.86 0.49 | 9.40 4.30 0.60 0.27
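To put a number on "concentrated vs. spread out", here is a rough Python sketch over the rxKB/s columns above. The ~1 MB/s "busy" threshold is an arbitrary choice of mine; each sample covers a 10-second sar interval:

```python
# rxKB/s per 10s sample, transcribed from the trimmed sar output above
rx_340 = [12.20, 1.08, 1785.05, 93880.71, 73892.16, 93681.53, 68284.64,
          6742.59, 9584.58, 10485.10, 31096.11, 162940.17, 29594.30,
          79795.81, 10241.54, 12736.39, 12451.65, 16871.65, 39354.17,
          58536.01, 5454.85, 0.49, 0.86]
rx_331 = [11.91, 1.77, 1713.01, 190471.20, 129105.82, 174648.95, 43115.53,
          30016.74, 35689.65, 110824.64, 101216.36, 0.66, 0.56, 0.67, 0.57,
          0.65, 0.58, 0.68, 0.59, 0.66, 0.66, 0.83, 0.60]

# Count samples above ~1 MB/s: how many intervals carried real traffic
busy = lambda xs, thresh=1000.0: sum(x > thresh for x in xs)

# 3.3.1 finishes its receive traffic in far fewer busy intervals
print(busy(rx_331), busy(rx_340))  # 9 19
```

Nine busy intervals on 3.3.1 versus nineteen on 3.4.0 over the same window is consistent with the "sporadic" pattern discussed below.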
(In reply to Raghavendra G from comment #190)
> stats in comment #133 also indicate a better network utilization rhgs-3.3.1.
> Note that how I/O on network is concentrated in few consecutive lines of
> higher rxKB/s compared to more spread out and lesser rxKB/s on rhgs-3.4.1
> Trimming down output of previous comment to just rxpck/s txpck/s rxKB/s and
> txKB/s for better readability:

Does this indicate a problem in the client stack? It looks like the client is sending data to the bricks in relatively sporadic bursts.
(In reply to Raghavendra G from comment #191)
> (In reply to Raghavendra G from comment #190)
> > stats in comment #133 also indicate a better network utilization rhgs-3.3.1.
> > Note that how I/O on network is concentrated in few consecutive lines of
> > higher rxKB/s compared to more spread out and lesser rxKB/s on rhgs-3.4.1
> > Trimming down output of previous comment to just rxpck/s txpck/s rxKB/s and
> > txKB/s for better readability:
>
> Does this indicate a problem in the client stack? It looks like client is
> relatively sending data sporadically to bricks

Even the data written by the bricks is sporadic, so it could be the brick stack as well. Comment #145 has the same observation.
(In reply to Raghavendra G from comment #188)
> Restoring needinfo on Manoj for comment #170

I had looked at the last set of stats and sosreports that Karan provided, but didn't get a chance to update the bz. Some of my earlier suspicions turned out NOT to be true, and I haven't found time to dig deeper to form any new hypotheses.

I was under the impression that the rhgs-random-io tuned profile was in effect, but from the sosreports the profile used is throughput-performance. So the dirty*ratio limits are high enough that forced page-cache writeback should not happen. There is a significant difference in dirty data buildup between the two runs though (3.3.1 is accumulating more data before writing it out). You can see that by running this on the collected stats file:

cat sar.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep -A 1 kbdirty

kbdirty stays fairly low in the 3.4 runs. But that is probably the effect of something else, as is probably also the case with the top and sar network stats pointed out in comment #145 and comment #190. Planning to analyze the iostat results more for clues...
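As a side note, the kbdirty column can be pulled out programmatically instead of eyeballing the grep output; a minimal sketch (the sample text here is a hypothetical sar -r fragment for illustration, not taken from the actual sosreport):

```python
def kbdirty_series(sar_text):
    """Extract the kbdirty column from `sar -r` text output."""
    series, idx = [], None
    for line in sar_text.splitlines():
        cols = line.split()
        if "kbdirty" in cols:            # header line: remember column position
            idx = cols.index("kbdirty")
        elif idx is not None and len(cols) > idx and cols[idx].isdigit():
            series.append(int(cols[idx]))
    return series

# Hypothetical sample; column layout matches sar -r from sysstat 10.x
sample = """\
12:00:01 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
12:10:01 AM 1609864 14828224 90.21 2756 12873016 3538560 13.95 5792532 8286420 102408
12:20:01 AM 1598120 14839968 90.28 2756 12880112 3540104 13.96 5800216 8290104 8132
"""
print(max(kbdirty_series(sample)))  # 102408
```

Running this over the 3.3.1 and 3.4 sar files should make the dirty-data buildup difference directly comparable as two time series.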
I've had a discussion with Raghavendra G about the stats on 3.3.1 and 3.4.0, and we have found something that seems an anomaly.

The test consists of 6 clients, with 8 threads on each client. Each thread creates 5000 files of 64 KiB (I'm assuming this is KiB (1024) and not KB (1000)). This means a total of 6 * 8 * 5000 * 64 KiB = 14.65 GiB. Since this is a replica 2 + arbiter volume distributed over 2 replica sets, each data brick should receive 7.32 GiB.

Checking the data, we can see that bricks receive 7.85 GiB on 3.3.1 and 7.80 GiB on 3.4.0 for the same test. This seems ok. Transmitted data also seems ok: 350 MiB on 3.3.1 and 335 MiB on 3.4.0.

The differences appear in the data read from and written to disk:

Writes: 8.3 GiB on 3.3.1, 10.2 GiB on 3.4.0 (x1.2)
Reads: 94 MiB on 3.3.1, 671 MiB on 3.4.0 (x7.1)

This shows a big difference between versions. Since the amount of data received from and transmitted to the client side is the same, it seems that these extra reads and writes come from the brick stack.

Another interesting thing is that on 3.4.0, arbiter bricks read about 100 MiB more than regular bricks (i.e. ~770 MiB). On 3.3.1 they read the same amount.

However it's not clear what those extra reads and writes are. From the comments, it seems that strace didn't reveal any significant difference, so I can't explain how the same requests could cause this increase in the volume of data accessed.
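As a sanity check on the arithmetic above, a small Python sketch (only numbers quoted in this comment are used; the x1.2 / x7.1 factors fall out of them):

```python
# Sanity-check the data-volume arithmetic from this comment.
clients, threads, files, file_kib = 6, 8, 5000, 64

total_gib = clients * threads * files * file_kib / 1024**2  # KiB -> GiB
per_data_brick_gib = total_gib / 2  # 2 replica sets: each data brick gets half

# Observed on-disk traffic per version (from the iostat aggregation above)
write_amp = 10.2 / 8.3   # GiB written, 3.4.0 vs 3.3.1
read_amp = 671 / 94      # MiB read, 3.4.0 vs 3.3.1

print(round(total_gib, 2), round(per_data_brick_gib, 2))  # 14.65 7.32
print(round(write_amp, 1), round(read_amp, 1))            # 1.2 7.1
```

The 7x read amplification stands out far more than the 1.2x write amplification, which is why the extra reads are the main suspect.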
Continuing the analysis from previous comments, the following patches look suspect:

commit 0499970747a7897bd9190484b5ab6868b19f393f
Author: Poornima G <pgurusid>
Date: Thu Jan 4 19:38:05 2018 +0530

    posix: In getxattr, honor the wildcard '*'

    Currently, the posix_xattr_fill performas a sys_getxattr on all the keys requested, there are requirements where the keys could contain a wildcard, in which case sys_getxattr would return ENODATA, eg: if the xattr requested is user.* all the xattrs with prefix user. should be returned, with their values. This patch, changes posix_xattr_fill, to honor wildcard in the keys requested.

    Updates #297
    > Signed-off-by: Poornima G <pgurusid>
    > Change-Id: I3d52da2957ac386fca3c156e26ff4cdf0b2c79a9
    > Reviewed-on: https://review.gluster.org/19170
    > Smoke: Gluster Build System <jenkins.org>
    > Reviewed-by: Amar Tumballi
    > Tested-by: Poornima G <pgurusid>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > (cherry picked from commit 8fc9c6a8fc7c73b2b4c65a8ddbe988bca10e89b6)
    BUG: 1446125
    Change-Id: I3d52da2957ac386fca3c156e26ff4cdf0b2c79a9
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: https://code.engineering.redhat.com/gerrit/128480
    Tested-by: RHGS Build Bot <nigelb>
    Reviewed-by: Sunil Kumar Heggodu Gopala Acharya <sheggodu>

commit c8c56b9a53471792a126f8e46f32ce7190eac10c (tag: v3.12.0)
Author: Kotresh HR <khiremat>
Date: Tue Aug 1 07:51:44 2017 -0400

    feature/posix: Enabled gfid2path by default

    Enable gfid2path feature by default. The basic performance tests are carried out and it doesn't show significant depreciation. The results are updated in issue.
    Updates: #139
    Change-Id: I5f1949a608d0827018ef9d548d5d69f3bb7744fd
    > Signed-off-by: Kotresh HR <khiremat>
    > Reviewed-on: https://review.gluster.org/17950
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Aravinda VK <avishwan>
    > Reviewed-by: Amar Tumballi <amarts>
    (cherry picked from commit 3ec63650bb7fd874a5013e7be4a2def3b519c9b2)
    Reviewed-on: https://review.gluster.org/18133
    Reviewed-by: Amar Tumballi <amarts>
    Smoke: Gluster Build System <jenkins.org>
    Tested-by: Shyamsundar Ranganathan <srangana>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>

commit 32e366bad947926343fdb389f054ca3051e44774
Author: Kotresh HR <khiremat>
Date: Sun Jul 16 15:16:56 2017 -0400

    storage/posix: Add virtual xattr to fetch path from gfid

    The gfid2path infra stores the "pargfid/bname" as on xattr value for each non directory entry. Hardlinks would have a separate xattr.
    This xattr key is internal and is not exposed to applications. A virtual xattr is exposed for the applications to fetch the path from gfid.

    Internal xattr: trusted.gfid2path.<xxhash>
    Virtual xattr: glusterfs.gfidtopath

    getfattr -h -n glusterfs.gfidtopath /<aux-mnt>/.gfid/<gfid>

    If there are hardlinks, it returns all the paths separated by ':'.

    A volume set option is introduced to change the delimiter to required string of max length 7.

    gluster vol set gfid2path-separator ":::"

    > Updates: #139
    > Change-Id: Ie3b0c3fd8bd5333c4a27410011e608333918c02a
    > Signed-off-by: Kotresh HR <khiremat>
    > Reviewed-on: https://review.gluster.org/17785
    > Smoke: Gluster Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Krutika Dhananjay <kdhananj>
    Updates: #139
    Change-Id: Ie3b0c3fd8bd5333c4a27410011e608333918c02a
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/17921
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>

commit a62b16b72b03f2b7c25c24ea9ac5968453a92009
Author: Kotresh HR <khiremat>
Date: Thu Jun 29 05:45:34 2017 -0400

    storage/posix: New gfid2path infra

    With this infra, a new xattr is stored on each entry creation as below.

    trusted.gfid2path.<xxhash> = <pargfid>/<basename>

    If there are hardlinks, multiple xattrs would be present.

    Fops which are impacted: create, mknod, link, symlink, rename, unlink

    Option to enable: gluster vol set <VOLNAME> storage.gfid2path on

    Updates: #139
    Change-Id: I369974cd16703c45ee87f82e6c2ff5a987a6cc6a
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: https://review.gluster.org/17488
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Aravinda VK <avishwan>
    Reviewed-by: Amar Tumballi <amarts>

commit 960ecba01a057ff8a2fc356624720904419e7f49
Author: Ravishankar N <ravishankar>
Date: Tue Jan 16 10:16:41 2018 +0530

    posix: delete stale gfid handles in nameless lookup

    ..in order for self-heal of symlinks to work properly (see BZ for details). Backport of https://review.gluster.org/#/c/19070/

    Signed-off-by: Ravishankar N <ravishankar>
    Change-Id: I9a011d00b07a690446f7fd3589e96f840e8b7501
    BUG: 1527309
    Reviewed-on: https://code.engineering.redhat.com/gerrit/127739
    Tested-by: Ravishankar Narayanankutty <ravishankar>
    Tested-by: RHGS Build Bot <nigelb>
    Reviewed-by: Ashish Pandey <aspandey>
    Reviewed-by: Ravishankar Narayanankutty <ravishankar>

There are some gfid cleanup patches from Ravi too. We need to inspect them a bit deeper. My suspicion is on gfid2path being enabled by default and on commit 0499970747a7897b honouring wildcards in getxattr.
One consideration... we normally configure XFS with an inode size of 512 bytes. Is it possible that the addition of some extra extended attributes has exhausted the available inline space, so that an additional block is needed for all/many files? Maybe this could explain part of the increase in reads and writes without more requests being sent from user land.
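A back-of-the-envelope estimator for that idea. All sizes here are my assumptions for illustration: a v3 XFS inode core of 176 bytes leaving 336 bytes of literal area in a 512-byte inode, the short-form attr layout of a 4-byte header plus 3 bytes of per-entry overhead, and made-up xattr names/values roughly matching what gluster stores; none of this is measured from the actual bricks:

```python
def shortform_attr_bytes(xattrs):
    """Rough size of an XFS short-form attr fork: 4-byte header, then
    3 bytes (namelen/valuelen/flags) + name + value per entry."""
    return 4 + sum(3 + len(name) + vlen for name, vlen in xattrs)

# Hypothetical xattr set on a data-brick file (names shown without the
# "trusted." prefix, since XFS encodes the namespace in the entry flags)
xattrs = [
    ("gfid", 16),                        # trusted.gfid
    ("afr.testvol-client-0", 12),        # AFR changelog (x2 for replica 2)
    ("afr.testvol-client-1", 12),
    ("gfid2path.0123456789abcdef", 53),  # new in 3.4.0: <pargfid>/<bname>
]

literal_area = 512 - 176  # assumed inode size minus assumed v3 inode core
used = shortform_attr_bytes(xattrs)
print(used, literal_area, used <= literal_area)  # 179 336 True
```

In this rough model the attrs still fit inline even with gfid2path, so the data-fork extent records, SELinux labels, or longer volume/file names would also have to compete for the literal area before attrs spill to a separate block; worth verifying on a real brick (e.g. with xfs_db) before drawing conclusions.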
Created attachment 1453586 [details] comment#198_regression_free_run
I've compared straces from 3.3.1 and 3.4.0, and the system calls they send are the same and in the same order. The only difference is an additional lsetxattr to set gfid2path on 3.4.0. However it seems that disabling storage.gfid2path doesn't improve performance, based on previous comments.

However, the same set of operations seems to cause different workloads on disk. On 3.3.1 we see few reads, and almost all of them are concentrated at the beginning of the test. But on 3.4.0 we see reads constantly. Mixing reads with writes is generally bad for performance.

It seems as if the cache were flushed more frequently (causing multiple writes of the same blocks after being updated, increasing the amount of data written) and invalidated (causing additional reads to cache inodes).

If I understood it correctly, the environment (kernel, xfs, config, ...) of the test is the same for 3.3.1 and 3.4.0, so the difference must come from gluster.

Testing locally, I can reproduce the additional reads and the increase in writes when the system cache is full and the kernel starts flushing and invalidating cached data. At that point performance drops drastically, most probably caused by the mixed reads and writes.

I see two possibilities here:

* Some change related to inode invalidation
* Changes in memory management that could cause earlier flush/invalidation of inodes

I'll check commits related to these possibilities.
(In reply to Xavi Hernandez from comment #200) > I've compared straces from 3.3.1 and 3.4.0, and the system calls they send > are the same and in the same order. The only difference is an additional > lsetxattr to set gfid2path on 3.4.0. However it seems that disabling > storage.gfid2path doesn't improve performance, based on previous comments. From comment #28 <snip> While debugging with Karan, we also found disabling gfid to path improved the regression but not completely. (Note that the exercise done in comment #18 was not accurate due to BZ 1564027 which needs to be addressed). After properly disabling gfid2path, we got a throughput in one of the runs as 2311 files/sec for 3.4.0.(compared to 3743 files/sec in 3.3.1). </snip> That's around 12% improvement with gfid2path disabled. Note that with enabled, we get around 1858 files/sec (comment #18)
(In reply to Xavi Hernandez from comment #200) > I've compared straces from 3.3.1 and 3.4.0, and the system calls they send > are the same and in the same order. The only difference is an additional > lsetxattr to set gfid2path on 3.4.0. However it seems that disabling > storage.gfid2path doesn't improve performance, based on previous comments. > > However the same set of operations seem to cause different workloads on > disk. On 3.3.1 we can see few reads, and almost all of them concentrated at > the beginning of the test. But on 3.4.0 we see that there are reads > constantly. Mixing reads with writes is generally bad for performance. > > It seems as if cache were flushed more frequently (causing multiple writes > of the same blocks after being updated, increasing amount of data written) > and invalidated (causing additional reads to cache inodes). > > If I understood it correctly, the environment (kernel, xfs, config, ...) of > the test is the same for 3.3.1 and 3.4.0, so the difference must come from > gluster. Karan:- Yes Xavi, the tests performed on 3.3.1 and 3.4.0 are completely the same. We use the same RHEL iso on both the tests. > > Testing locally I can reproduce the additional reads and the increment of > writes when system cache is full and kernel starts flushing and invalidating > cached data. At this time performance drops drastically, most probably > caused by the mixed reads and writes that happen. > > I see two possibilities here: > > * Some change related to inode invalidation > * Changes in memory management that could cause earlier flush/invalidation > of inodes. > > I'll check commits related with these possibilities.
Upstream patch: https://review.gluster.org/#/c/20392/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607