Bug 1558989 - 60% regression on small-file creates from 3.3.1
Summary: 60% regression on small-file creates from 3.3.1
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: rhgs-3.4
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Milind Changire
QA Contact: Karan Sandha
URL:
Whiteboard:
Depends On:
Blocks: 1503137 1577750 1610751 1651508
 
Reported: 2018-03-21 13:09 UTC by Karan Sandha
Modified: 2018-11-20 09:04 UTC (History)
21 users

Fixed In Version: glusterfs-3.12.2-13
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:44:20 UTC


Attachments
reverted patch-1 (13.30 KB, patch)
2018-04-30 05:58 UTC, Pranith Kumar K
no flags Details | Diff
patch-2 (1.66 KB, application/mbox)
2018-04-30 05:59 UTC, Pranith Kumar K
no flags Details
patch-3 (9.06 KB, application/mbox)
2018-04-30 05:59 UTC, Pranith Kumar K
no flags Details
patch-4 (8.46 KB, application/mbox)
2018-04-30 06:00 UTC, Pranith Kumar K
no flags Details
Comment#182_regression_free_run (304.55 KB, text/plain)
2018-06-19 06:42 UTC, Karan Sandha
no flags Details
Comment#168_regression_run (304.57 KB, text/plain)
2018-06-19 06:46 UTC, Karan Sandha
no flags Details
comment#185_regression_free_run (182.89 KB, text/plain)
2018-06-19 19:13 UTC, Karan Sandha
no flags Details
comment#198_regression_free_run (365.50 KB, text/plain)
2018-06-21 19:42 UTC, Karan Sandha
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:45:35 UTC

Description Karan Sandha 2018-03-21 13:09:01 UTC
Description of problem:
Regression seen on smallfile creates from rhgs 3.3.1 to rhgs 3.4.0 on fuse mount. 

Version-Release number of selected component (if applicable):
3.12.2-5

How reproducible:
100%
Profile info, sosreports, and statedumps are placed at the CIFS share:
mount -t cifs -oguest //newtssuitcase.lab.eng.blr.redhat.com/newtssuitcase
/BZ_sosreports/Karan/Porfile_info/

Steps to Reproduce:
1. Run small-file creates on a 6-node setup using 6 client machines with the command below:
   qeSyncDropCaches.sh; python /small-files/smallfile/smallfile_cli.py --operation create --threads 8 --file-size 64 --files 5000 --top /gluster-mount --host-set "`echo $CLIENT | tr ' ' ','`"
2. Compare the results with RHGS 3.3.1, i.e. 3.8.4.54.3.


Actual results:
Regression seen between releases.

Expected results:
No regression should be observed


Additional info:
Not a brick-mux setup.
The performance numbers weren't consistent, i.e. after every iteration the regression grew.
Regression seen on a 2x2 volume also.

Comment 3 Karan Sandha 2018-03-22 07:20:42 UTC
Profile infos at

mount -t cifs -oguest //newtssuitcase.lab.eng.blr.redhat.com/newtssuitcase

/BZ_sosreports/ksandha/Porfile_info

Comment 4 Ravishankar N 2018-03-22 07:24:00 UTC
Hi Nithya,
In the profile infos (Create_3.3.1_7.5.profileinfo and Create_3.4_7.5.profileinfo), the number of STATFS calls for 3.4 is 3 times that of 3.3.1. Are there any changes in DHT between these two versions that cause these extra calls and possibly this regression?

Comment 14 Nithya Balachandran 2018-03-23 15:17:18 UTC
(In reply to Karan Sandha from comment #0)
> Description of problem:
> Regression seen on smallfile creates from rhgs 3.3.1 to rhgs 3.4.0 on fuse
> mount. 
> 
> Version-Release number of selected component (if applicable):
> 3.12.2-5
> 
> How reproducible:
> 100%
> profile infos are kept at sosreports are placed at sosreport and statedumps
> at mount -t cifs -oguest
> //newtssuitcase.lab.eng.blr.redhat.com/newtssuitcase
> /BZ_sosreports/Karan/Porfile_info/
> 
> Steps to Reproduce:
> 1. Run small-file creates on 6 node setup using 6 client machines below
> command qeSyncDropCaches.sh; python /small-files/smallfile/smallfile_cli.py
> --operation create --threads 8 --file-size 64 --files 5000 --top
> /gluster-mount  --host-set "`echo $CLIENT | tr ' ' ','`"
> 2. now compare the same with RHGS 3.3.1 i.e 3.8.4.54.3 
> 
> 
> Actual results:
> Regression seen between release  
> 

It would help if actual numbers were provided as part of the description for performance bugs. For example: with 3.3.1 this test finished in x seconds; with 3.4 it takes x+y seconds. Percentage regressions are rather vague, and concrete numbers would let us visualise the impact better.


> Expected results:
> No regression should be observed
> 
> 
> Additional info:
> not a brick mux setup.
> The performance numbers weren't consistent i.e after every iterations the
> regression was more. 
> regression seen on 2x2 volume also

Comment 16 Ravishankar N 2018-03-26 11:10:45 UTC
From the strace at newtssuitcase.lab.eng.blr.redhat.com/newtssuitcase/BZ_sosreports/ksandha/Porfile_info/strace, we have the following FOP counts.

FOP                  3.3.1    3.4.0       Difference (3.4.0 – 3.3.1)
===================================================================
chown                15603    14615       -988
clone                46       42          -4
close                26327    24578       -1749
epoll_ctl            175807   163517      -12290
epoll_wait           175831   163540      -12291
fgetxattr            149342   142270      -7072
fsetxattr            28854    27270       -1584
fstat                33926    33190       -736
fsync                24       23          -1
futex                518916   427435      -91481
getdents             18569    17821       -748
getegid              1158     965         -193
geteuid              1158     965         -193
getgroups            2316     1930        -386
lgetxattr            121896   62944       -58952
linkat               57723    54552       -3171
llistxattr           6432     2330        -4102
lseek                33       26          -7
lsetxattr            16761    29230       12469
lstat                223104   171383      -51721
madvise              2        0           -2
mkdir                32364    30195       -2169
mmap                 46       62          16
mprotect             534      7671        7137
nanosleep            345      327         -18
open                 14535    119         -14416
openat               11792    24454       12662
pread64              7        8           1
pwrite64             9704     9732        28
read                 65       50          -15
readlink             28       35          7
readv                361316   336764      -24552
restart_syscall      48       58          10
rmdir                6        5           -1
rt_sigaction         12       0           -12
rt_sigprocmask       128      148         20
rt_sigtimedwait      6        6           0
select               981      819         -162
set_robust_list      46       42          -4
stat                 44940    42397       -2543
statfs               504      530         26
symlink              1158     965         -193
unlink               43374    40991       -2383
write                12       47          35
writev               176312   164042      -12270
munmap               0        10          10
=========================================================
The number of lsetxattr and mprotect calls is notably higher for 3.4.0.

In the strace, these lsetxattr calls corresponded to setting "trusted.gfid2path"; they were absent in the 3.3.1 strace. We suspect this to be the cause of the regression, as gfid2path is enabled by default in 3.4.0 (https://review.gluster.org/#/c/18133/).
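Per-syscall counts like the table above can be produced from a raw strace log with a short awk pass. The sketch below runs it on a tiny inline sample; the file path and the sample log lines are illustrative, not taken from the actual capture:

```shell
# Build a tiny illustrative strace log (the real analysis used the full
# capture files from the share).
cat > /tmp/sample.strace <<'EOF'
lsetxattr("/bricks/b1/f1", "trusted.gfid2path.97", "...", 42, XATTR_CREATE) = 0
lstat("/bricks/b1/f1", {st_mode=S_IFREG|0644, ...}) = 0
lsetxattr("/bricks/b1/f2", "trusted.gfid2path.3a", "...", 42, XATTR_CREATE) = 0
EOF

# Count occurrences of each syscall: the syscall name is everything
# before the opening parenthesis.
awk -F'[(]' '{ counts[$1]++ } END { for (s in counts) print s, counts[s] }' \
    /tmp/sample.strace | sort
```

On this sample it prints "lsetxattr 2" and "lstat 1"; comparing two such listings side by side is how diffs like the one above are obtained.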

Comment 17 Ravishankar N 2018-03-26 12:01:30 UTC
I'm not sure yet whether setting gfid2path is a client-driven operation or is handled entirely in posix (in which case we should not be seeing such a huge perf drop just for one extra syscall that sets this attribute per file).

Karan, as discussed,

1. Could you try if disabling gfid2path improves the performance on 3.4.0?
`gluster v set volname gfid2path disable`

2. Since the earlier strace was started after the I/O was initiated, it might not have captured all the FOPs. So please capture the strace before starting the I/O. Also provide the corresponding profile info for correlation.

Thanks!

Comment 33 Amar Tumballi 2018-04-13 05:52:50 UTC
(all) In the next run, while you collect profiles, please take the profile of the client side too (below is the way to get the profile info):

   - On the client where the volume is mounted (via FUSE) and tests are to be performed:
      - setfattr -n trusted.io-stats-dump -v <full-path-to-store-io-stats> /the/gluster/fuse/mountpoint


This should give us a complete picture of where we are stuck.

Comment 68 Pranith Kumar K 2018-04-30 05:58:50 UTC
Created attachment 1428669 [details]
reverted patch-1

Comment 69 Pranith Kumar K 2018-04-30 05:59:14 UTC
Created attachment 1428670 [details]
patch-2

Comment 70 Pranith Kumar K 2018-04-30 05:59:37 UTC
Created attachment 1428671 [details]
patch-3

Comment 71 Pranith Kumar K 2018-04-30 06:00:08 UTC
Created attachment 1428672 [details]
patch-4

Comment 74 Raghavendra G 2018-05-02 15:02:48 UTC
Milind,

Let's measure the contention on rpcsvc->queue_lock using mutrace on the brick processes. We can use the following cmdline:

LD_PRELOAD=/usr/local/lib/libmutrace.so mutrace --max=30 --hash-size=50021 /usr/bin/glusterfs -N ....

If the contention is high, we can experiment with multiple queues instead of a single queue.

regards,
Raghavendra

Comment 78 Raghavendra G 2018-05-08 13:47:54 UTC
Thanks, Milind. Can you also capture the CPU utilization of each thread in the brick process (on all bricks)?

You can use the following cmdline:

top -bHd $MON_INTERVAL > /tmp/top_thr.${HOSTNAME}.txt # CPU utilization by thread

Comment 85 Raghavendra G 2018-05-09 04:53:51 UTC
From rpcsvc_program_register:

        if (newprog->ownthread) {
                newprog->eventthreadcount = 1;
                creates = rpcsvc_spawn_threads (svc, newprog);

                if (creates < 1) {
                        goto out;
                }
        }

Why are we setting eventthreadcount to 1? When is it changed to the actual number of event threads in the process? Can that codepath be executed in this case? IOW, can this be the bug that prevents scaling of the req-handler threads?

Comment 145 Manoj Pillai 2018-06-02 03:26:34 UTC
Looking at the stats provided in comment #133:

For rhgs 3.3.1:
$ cat top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep glusterfsd | head -n 20
25179 root      20   0 2304616 478932   4588 S   0.0  1.0 185:15.17 glusterfsd
25179 root      20   0 2304616 478932   4588 S   0.0  1.0 185:15.17 glusterfsd
25179 root      20   0 2304876 478932   4588 S   0.2  1.0 185:15.19 glusterfsd
25179 root      20   0 2307736 479560   4588 S 201.3  1.0 185:35.34 glusterfsd
25179 root      20   0 2307996 480292   4588 S 656.9  1.0 186:41.23 glusterfsd
25179 root      20   0 2308516 480348   4588 S 578.8  1.0 187:39.23 glusterfsd
25179 root      20   0 2308516 480348   4588 S 602.7  1.0 188:39.68 glusterfsd
25179 root      20   0 2308516 480348   4588 S 149.5  1.0 188:54.66 glusterfsd
25179 root      20   0 2308516 480348   4588 S 109.9  1.0 189:05.68 glusterfsd
25179 root      20   0 2308516 480352   4588 S 237.6  1.0 189:29.46 glusterfsd
25179 root      20   0 2308516 480388   4592 S 424.2  1.0 190:11.96 glusterfsd
25179 root      20   0 2308516 480388   4592 S 302.6  1.0 190:42.31 glusterfsd
25179 root      20   0 2308516 480388   4592 S   0.1  1.0 190:42.32 glusterfsd
25179 root      20   0 2308516 480388   4592 S   0.0  1.0 190:42.32 glusterfsd
25179 root      20   0 2308516 480388   4592 S   0.0  1.0 190:42.32 glusterfsd
25179 root      20   0 2308516 480388   4592 S   0.0  1.0 190:42.32 glusterfsd
25179 root      20   0 2308516 480388   4592 S   0.0  1.0 190:42.32 glusterfsd
25179 root      20   0 2308516 480388   4592 S   0.0  1.0 190:42.32 glusterfsd
25179 root      20   0 2308516 480388   4592 S   0.0  1.0 190:42.32 glusterfsd
25179 root      20   0 2308516 480388   4592 S   0.0  1.0 190:42.32 glusterfsd

For the rhgs3.4 build:
$ cat top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep glusterfsd 
16244 root      20   0 2386320 421152   4712 S   0.0  0.9  47:28.08 glusterfsd
16244 root      20   0 2386320 422396   4712 S   0.0  0.9  47:28.08 glusterfsd
16244 root      20   0 2386320 422396   4712 S   0.3  0.9  47:28.11 glusterfsd
16244 root      20   0 2386320 423200   4712 S 116.8  0.9  47:39.81 glusterfsd
16244 root      20   0 2386320 421928   4712 S 406.0  0.9  48:20.49 glusterfsd
16244 root      20   0 2386320 421956   4712 S 286.4  0.9  48:49.16 glusterfsd
16244 root      20   0 2386320 422060   4712 S 371.8  0.9  49:26.45 glusterfsd
16244 root      20   0 2386320 422176   4712 S 204.3  0.9  49:46.94 glusterfsd
16244 root      20   0 2386320 421204   4712 S  26.2  0.9  49:49.57 glusterfsd
16244 root      20   0 2386320 422116   4712 S  41.3  0.9  49:53.71 glusterfsd
16244 root      20   0 2386320 421196   4712 S  34.5  0.9  49:57.17 glusterfsd
16244 root      20   0 2386320 421660   4712 S 190.0  0.9  50:16.21 glusterfsd
16244 root      20   0 2386320 421988   4712 S 665.0  0.9  51:22.84 glusterfsd
16244 root      20   0 2386320 420952   4712 S  37.1  0.9  51:26.56 glusterfsd
16244 root      20   0 2386320 421828   4712 S 308.2  0.9  51:57.44 glusterfsd
16244 root      20   0 2386320 421924   4712 S  39.0  0.9  52:01.35 glusterfsd
16244 root      20   0 2386320 422056   4712 S  60.9  0.9  52:07.45 glusterfsd
16244 root      20   0 2386320 422116   4712 S  41.7  0.9  52:11.63 glusterfsd
16244 root      20   0 2386320 422116   4712 S  82.1  0.9  52:19.86 glusterfsd
16244 root      20   0 2386320 421736   4712 S 263.4  0.9  52:46.28 glusterfsd
16244 root      20   0 2386320 421736   4712 S 124.6  0.9  52:58.78 glusterfsd
16244 root      20   0 2386320 421736   4712 S  18.5  0.9  53:00.63 glusterfsd
16244 root      20   0 2386320 421736   4712 S   0.0  0.9  53:00.63 glusterfsd
16244 root      20   0 2386320 421736   4712 S   0.0  0.9  53:00.63 glusterfsd

For rhgs 3.3.1, glusterfsd activity is steady. For rhgs 3.4, glusterfsd seems to be working in fits and starts. And it doesn't seem to be a case of requests stuck in some queue locally; network activity shows a similar picture:

rhgs 3.3.1:
$ cat sar.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep p2p1 | head -n 20
02:35:37  EDT      p2p1     59.20     61.50     11.91     12.58      0.00      0.00      0.00
02:35:47  EDT      p2p1     16.80     11.40      1.77      1.59      0.00      0.00      0.00
02:35:57  EDT      p2p1   6288.10   6132.40   1713.01   1102.48      0.00      0.00      0.00
02:36:07  EDT      p2p1  57303.50  36543.80 190471.20   6979.76      0.00      0.00      0.00
02:36:17  EDT      p2p1  38808.10  24657.10 129105.82   4726.04      0.00      0.00      0.00
02:36:27  EDT      p2p1  52537.60  33221.90 174648.95   6381.13      0.00      0.00      0.00
02:36:37  EDT      p2p1  12996.20   8981.80  43115.53   1633.15      0.00      0.00      0.00
02:36:47  EDT      p2p1   9019.50   6089.40  30016.74   1126.87      0.00      0.00      0.00
02:36:57  EDT      p2p1  10708.00   6899.50  35689.65   1385.62      0.00      0.00      0.00
02:37:07  EDT      p2p1  34070.40  22555.90 110824.64   4804.49      0.00      0.00      0.00
02:37:17  EDT      p2p1  31056.30  20839.70 101216.36   4608.60      0.00      0.00      0.00
02:37:27  EDT      p2p1     10.70      5.80      0.66      0.37      0.00      0.00      0.00
02:37:37  EDT      p2p1      9.20      4.20      0.56      0.27      0.00      0.00      0.00
02:37:47  EDT      p2p1     10.90      5.90      0.67      0.37      0.00      0.00      0.00
02:37:57  EDT      p2p1      9.40      4.60      0.57      0.28      0.00      0.00      0.00
02:38:07  EDT      p2p1     10.50      5.70      0.65      0.36      0.00      0.00      0.00
02:38:17  EDT      p2p1      9.50      4.50      0.58      0.28      0.00      0.00      0.00
02:38:27  EDT      p2p1     11.00      6.00      0.68      0.37      0.00      0.00      0.00
02:38:37  EDT      p2p1      9.70      4.70      0.59      0.29      0.00      0.00      0.00
02:38:47  EDT      p2p1     10.70      5.70      0.66      0.36      0.00      0.00      0.00

rhgs 3.4:
$ cat sar.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep p2p1
02:00:45  EDT      p2p1     63.70     66.30     12.20     12.90      0.00      0.00      0.00
02:00:55  EDT      p2p1     12.30      7.00      1.08      1.66      0.00      0.00      0.00
02:01:05  EDT      p2p1   6618.10   6558.40   1785.05   1107.64      0.00      0.00      0.00
02:01:15  EDT      p2p1  28556.70  20945.50  93880.71   3665.47      0.00      0.00      0.00
02:01:25  EDT      p2p1  22270.20  16173.70  73892.16   2837.41      0.00      0.00      0.00
02:01:35  EDT      p2p1  28223.80  20224.00  93681.53   3578.31      0.00      0.00      0.00
02:01:45  EDT      p2p1  20560.50  14616.10  68284.64   2601.21      0.00      0.00      0.00
02:01:55  EDT      p2p1   2048.40   1446.40   6742.59    258.02      0.00      0.00      0.00
02:02:05  EDT      p2p1   2892.20   2146.40   9584.58    371.65      0.00      0.00      0.00
02:02:15  EDT      p2p1   3196.90   2341.00  10485.10    404.89      0.00      0.00      0.00
02:02:25  EDT      p2p1   9383.20   6659.90  31096.11   1185.47      0.00      0.00      0.00
02:02:35  EDT      p2p1  49026.70  33658.50 162940.17   6126.67      0.00      0.00      0.00
02:02:45  EDT      p2p1   8914.50   6045.60  29594.30   1107.27      0.00      0.00      0.00
02:02:55  EDT      p2p1  24452.50  17308.70  79795.81   3457.23      0.00      0.00      0.00
02:03:05  EDT      p2p1   3129.10   2333.70  10241.54    459.10      0.00      0.00      0.00
02:03:15  EDT      p2p1   3902.30   2923.30  12736.39    573.44      0.00      0.00      0.00
02:03:25  EDT      p2p1   3813.30   2822.00  12451.65    556.95      0.00      0.00      0.00
02:03:35  EDT      p2p1   5214.30   3890.50  16871.65    761.33      0.00      0.00      0.00
02:03:45  EDT      p2p1  12097.90   8813.30  39354.17   1759.69      0.00      0.00      0.00
02:03:55  EDT      p2p1  17982.40  12382.90  58536.01   2568.12      0.00      0.00      0.00
02:04:05  EDT      p2p1   1659.20   1287.50   5454.85    544.03      0.00      0.00      0.00
02:04:15  EDT      p2p1      7.70      2.60      0.49      0.16      0.00      0.00      0.00
02:04:25  EDT      p2p1     13.00      7.80      0.86      0.49      0.00      0.00      0.00

The difference could be within gluster or in the environment. In working with Milind and Raghavendra on this problem, we have had success steadying rhgs 3.4 runs with vm.dirty* parameter tuning, so I'm looking at whether there is something different in the way writeback from page cache to disk is happening in the rhgs 3.4 run that could explain the above top and sar stats. Let me make that a different comment...

Comment 146 Manoj Pillai 2018-06-02 03:49:04 UTC
(In reply to Manoj Pillai from comment #145)
<..>
> The difference could be within gluster or in the environment. In working
> with Milind and Raghavendra on this problem, we have had success steadying
> rhgs 3.4 runs with vm.dirty* parameter tuning, so I'm looking at whether
> there is something different in the way writeback from page cache to disk is
> happening in the rhgs 3.4 run that could explain the above top and sar
> stats. Let me make that a different comment...

In the rhgs tuned profiles, we use the sysctl vm.dirty*ratio parameters. These parameters are interpreted as a percentage of available memory. So the question is: if the setup is the same, why is there a difference between the rhgs 3.3.1 and rhgs 3.4 runs? The kernel documentation stresses that available memory is not the same as total memory: https://www.kernel.org/doc/Documentation/sysctl/vm.txt. So is there a difference in available memory between the 2 runs?
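To make the distinction concrete, here is a sketch of the dirty-data ceiling these ratios imply, using the first 3.3.1 avail-Mem value from the top output below (47603072 kB) and the throughput-performance vm.dirty_ratio of 40 mentioned later in this bug; the background ratio of 10 is an assumed illustrative value:

```shell
# avail_kb: first "avail Mem" value from the rhgs 3.3.1 top output (kB).
avail_kb=47603072
dirty_ratio=40              # throughput-performance profile value
dirty_background_ratio=10   # assumed value for illustration only

# The kernel stalls writers once dirty pages exceed dirty_ratio% of
# available memory; background writeback kicks in at dirty_background_ratio%.
echo "writer stall threshold:      $(( avail_kb * dirty_ratio / 100 )) kB"
echo "background writeback starts: $(( avail_kb * dirty_background_ratio / 100 )) kB"
```

That is roughly a 19 GB dirty-data budget at 40%; as available memory drops, the budget shrinks with it, which is why the avail-Mem trend matters.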

rhgs 3.3.1:
$ cat top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep -i "avail mem"
KiB Swap: 24772604 total, 24772604 free,        0 used. 47603072 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47722300 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47725736 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47542716 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47416216 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47357588 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47329272 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47321900 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47304880 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47282664 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47261992 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47240808 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47236060 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47189004 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47203108 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47203584 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47203636 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214276 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214760 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214640 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214936 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47215084 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47215260 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47215612 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47215864 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47215740 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47215776 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47215784 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214432 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214492 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214500 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214684 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214656 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214712 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47214788 avail Mem

rhgs 3.4:
$ cat top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep -i "avail mem"
KiB Swap: 24772604 total, 24772604 free,        0 used. 46463020 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47753164 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47754292 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47655392 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47362680 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47244848 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 47119736 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46983184 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46891960 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46885396 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46881276 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46830848 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46721376 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46636676 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46562968 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46559552 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46533352 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46527992 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46516016 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46464728 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46458504 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46453892 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46433160 avail Mem 
KiB Swap: 24772604 total, 24772604 free,        0 used. 46425540 avail Mem

The relevant column is the last numeric column. Unlike rhgs 3.3.1, available memory seems to be steadily dropping in rhgs 3.4. Memory leaks? There are stats for only 1 run, so it is hard to say for sure, but this seems like a promising direction of investigation.

We'll need multiple runs with an rhgs 3.4 build (and preferably with rhgs 3.3.1 as well for comparison), with stats collected for each run, to see if this pattern holds up and how bad it gets.
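A per-run collection harness along these lines would make that comparison mechanical. Everything here (run count, file names, the placeholder where the workload goes) is illustrative, not the actual QE tooling:

```shell
# Sketch: one stats directory per run, so avail-Mem trends can be diffed
# across iterations. The real runs sampled continuously with top -bHd and sar.
for i in 1 2 3; do
    dir="run_$i"
    mkdir -p "$dir"
    # Snapshot available memory before and after each run.
    grep '^MemAvailable' /proc/meminfo > "$dir/mem_before.txt" 2>/dev/null || true
    :   # placeholder: the smallfile create workload would run here
    grep '^MemAvailable' /proc/meminfo > "$dir/mem_after.txt" 2>/dev/null || true
done
ls -d run_1 run_2 run_3
```

Diffing mem_after.txt across run_1..run_N then shows directly whether available memory recovers between iterations or keeps declining.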

Comment 148 Pranith Kumar K 2018-06-02 07:32:55 UTC
> The relevant column is the last numeric column. Unlike rhgs 3.3.1, available
> memory seems to be steadily dropping in rhgs 3.4. Memory leaks? There are
> stats only for 1 run, so it is hard say for sure, but this seems like a
> promising direction of investigation.
> 
> We'll need multiple runs with an rhgs 3.4 build (and preferably with rhgs
> 3.3.1 as well for comparison), with stats collected for each run, to see if
> this pattern holds up and how bad it gets.

Yesterday Karan and I did the exercise of stopping and starting the volume for each run and it still didn't help. Performance kept going down with each iteration until it reached a point where it kept fluctuating. The only thing for which I could not conclude a root cause in yesterday's debugging session with Karan is the %util difference between 331 and 340 in the iostat numbers. So do let us know what you think about the iostat info.

Comment 149 Raghavendra G 2018-06-02 08:37:18 UTC
(In reply to Pranith Kumar K from comment #148)
> > The relevant column is the last numeric column. Unlike rhgs 3.3.1, available
> > memory seems to be steadily dropping in rhgs 3.4. Memory leaks? There are
> > stats only for 1 run, so it is hard say for sure, but this seems like a
> > promising direction of investigation.
> > 
> > We'll need multiple runs with an rhgs 3.4 build (and preferably with rhgs
> > 3.3.1 as well for comparison), with stats collected for each run, to see if
> > this pattern holds up and how bad it gets.
> 
> Yesterday Karan and I did the exercise of stopping and starting the volume
> for each run and it still didn't help. It kept going down with each
> iteration till it reached a point where it kept fluctuating. The only thing
> that I was not able to conclude RC in yesterday's debugging session with
> Karan is the %util difference between 331 and 340 for iostat numbers. So do
> let us know what you think about the iostat info.

Yes. On 3.3.1 max %util is around 70%, but on 3.4.0 it hits 100% many times. Though 100% may not indicate that the device is maxed out (as explained by Shekhar), I am curious:

* What caused the difference?
* Can this difference in any way account for the fluctuation we are seeing?

Comment 151 Manoj Pillai 2018-06-02 14:07:12 UTC
(In reply to Pranith Kumar K from comment #148)

If the amount of dirty data buffered in the page cache reaches the vm.dirty_ratio threshold, the OS forces a writeback and the writer can stall. The hypothesis I'm working with is that, because available memory levels are dropping, this is happening (or happening more often) in the 3.4 runs. That would explain glusterfsd working in "fits and starts" (comment #145).

It is possible that IO scheduling is less effective for this writeback (meaning more seeks and less optimization of disk head movement), which will result in a higher disk util % for the same amount of I/O.

I'd try the easier experiment first: runs with higher values of vm.dirty_ratio (use the setting in the throughput-performance profile, i.e. 40; but we want to keep vm.dirty_background_ratio low).

If these don't lead anywhere, then try digging deeper into the iostat output.
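A config-fragment sketch of that experiment (the file path and the background-ratio value are assumptions; the comment above only says to keep it low):

```
# /etc/sysctl.d/99-dirty-experiment.conf  (hypothetical path)
vm.dirty_ratio = 40             # throughput-performance value, per comment above
vm.dirty_background_ratio = 5   # assumed "low" value; not specified in the bug
```

Applied with sysctl -p on the fragment, this raises the writer-stall threshold while still letting background writeback start early.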

Comment 164 Manoj Pillai 2018-06-12 11:37:39 UTC
(In reply to Karan Sandha from comment #163)

[...]

> 
> The numbers from create workloads:-
> 
> run1
>  98.04% of requested files processed, minimum is  90.00
> elapsed time =   163.141
> files/sec = 1442.310719
> IOPS = 1442.310719
> MiB/sec = 90.144420
> 
> run2 
>  97.92% of requested files processed, minimum is  90.00
> elapsed time =   154.447
> files/sec = 1521.553301
> IOPS = 1521.553301
> MiB/sec = 95.097081
> 
> run3
>  97.92% of requested files processed, minimum is  90.00
> elapsed time =   154.712
> files/sec = 1518.947534
> IOPS = 1518.947534
> MiB/sec = 94.934221
> 
> run4
>  98.00% of requested files processed, minimum is  90.00
> elapsed time =   165.350
> files/sec = 1422.436377
> IOPS = 1422.436377
> MiB/sec = 88.902274
> 
> 
> Manoj let me know if you need any other information.

Thanks, Karan.

In terms of the problem definition, from comment #140 the problem was stated as declining performance from run to run in 3.4. That doesn't seem to be the case here -- the numbers are bad starting from run 1. So unless refined by further tests, I'll use this as the problem statement: poor, and sometimes inconsistent, performance for this workload on 3.4 compared to 3.3.1.
Does that sound OK to everyone?

In the stats collected, for 3.4 I'm seeing only empty files for iostat and sar.

$ ls -1s run_1528784068
total 16576
   0 iostat.gqas001.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas004.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas010.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas012.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas014.sbu.lab.eng.bos.redhat.com.txt
   0 iostat.gqas015.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas001.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas004.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas010.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas012.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas014.sbu.lab.eng.bos.redhat.com.txt
   0 sar.gqas015.sbu.lab.eng.bos.redhat.com.txt
1184 top_proc.gqas001.sbu.lab.eng.bos.redhat.com.txt
1272 top_proc.gqas004.sbu.lab.eng.bos.redhat.com.txt
1280 top_proc.gqas010.sbu.lab.eng.bos.redhat.com.txt
1196 top_proc.gqas012.sbu.lab.eng.bos.redhat.com.txt
1224 top_proc.gqas014.sbu.lab.eng.bos.redhat.com.txt
1208 top_proc.gqas015.sbu.lab.eng.bos.redhat.com.txt
1484 top_thr.gqas001.sbu.lab.eng.bos.redhat.com.txt
1576 top_thr.gqas004.sbu.lab.eng.bos.redhat.com.txt
1588 top_thr.gqas010.sbu.lab.eng.bos.redhat.com.txt
1504 top_thr.gqas012.sbu.lab.eng.bos.redhat.com.txt
1496 top_thr.gqas014.sbu.lab.eng.bos.redhat.com.txt
1516 top_thr.gqas015.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas001.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas004.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas010.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas012.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas014.sbu.lab.eng.bos.redhat.com.txt
   8 vmstat.gqas015.sbu.lab.eng.bos.redhat.com.txt

Can you please rectify that? And while you're at it, for the new 3.4 runs you'll be doing, can you please collect sosreports at the end of the runs (I'm not sure if that is when you're collecting them)?

Comment 170 Raghavendra G 2018-06-17 15:33:31 UTC
(In reply to Raghavendra G from comment #149)
> (In reply to Pranith Kumar K from comment #148)
> > > The relevant column is the last numeric column. Unlike rhgs 3.3.1, available
> > > memory seems to be steadily dropping in rhgs 3.4. Memory leaks? There are
> > > stats only for 1 run, so it is hard say for sure, but this seems like a
> > > promising direction of investigation.
> > > 
> > > We'll need multiple runs with an rhgs 3.4 build (and preferably with rhgs
> > > 3.3.1 as well for comparison), with stats collected for each run, to see if
> > > this pattern holds up and how bad it gets.
> > 
> > Yesterday Karan and I did the exercise of stopping and starting the volume
> > for each run and it still didn't help. It kept going down with each
> > iteration till it reached a point where it kept fluctuating. The only thing
> > that I was not able to conclude RC in yesterday's debugging session with
> > Karan is the %util difference between 331 and 340 for iostat numbers. So do
> > let us know what you think about the iostat info.
> 
> Yes. On 3.3.1 max %util is around 70%, but on 3.4.0 it hits 100% many times.
> Though, 100% may not indicate that the device is maxed out (as explained by
> Shekhar) I am curious,
> 
> * what caused the difference?
> * Can this difference in anyways attribute to the fluctuation we are seeing?

Probing deeper into my initial observation that max disk utilization was around 70% on rhgs-3.3.1 compared to 100% on rhgs-3.4.0 (which Pranith has also reiterated), I find that average request sizes on rhgs-3.4.0 are fairly low compared to the rhgs-3.3.1 runs.

For example, the following are snippets of high-utilization intervals on rhgs-3.4.0:
gluster_vg-pool_tdata     0.00     0.00 1436.60  566.20 13345.10 35898.40    49.17     8.34    4.16    5.76    0.10   0.40  80.31
gluster_vg-pool-tpool     0.00     0.00 1436.60  566.20 13345.10 35898.40    49.17     8.35    4.16    5.76    0.10   0.40  80.25
gluster_vg-lv_bricks     0.00     0.00 1436.60  346.10 13345.10 35898.40    55.25     8.40    4.71    5.78    0.27   0.45  80.53


gluster_vg-pool_tdata     0.00     0.00  794.30  999.30  6764.00 62016.00    76.69    14.60    8.10    8.21    8.01   0.40  71.39
gluster_vg-pool-tpool     0.00     0.00  794.30  999.30  6764.00 62016.00    76.69    14.58    8.11    8.22    8.02   0.40  71.39
gluster_vg-lv_bricks     0.00     0.00  794.30  839.90  6764.00 62028.80    84.19    14.74    8.91    8.23    9.56   0.44  71.37

gluster_vg-pool_tdata     0.00     0.00  810.50  434.40  6698.40 27801.60    55.43     5.51    4.42    6.74    0.10   0.59  73.89
gluster_vg-pool-tpool     0.00     0.00  810.50  434.40  6698.40 27801.60    55.43     5.51    4.43    6.75    0.10   0.59  73.93
gluster_vg-lv_bricks     0.00     0.00  810.50  217.10  6698.40 27788.80    67.12     5.53    5.39    6.76    0.26   0.72  73.93

gluster_vg-pool_tdata     0.00     0.00  976.20  513.40 11926.40 32262.40    59.33     3.84    2.58    3.88    0.12   0.65  96.16
gluster_vg-pool-tpool     0.00     0.00  976.20  513.40 11926.40 32262.40    59.33     3.84    2.58    3.88    0.13   0.65  96.19
gluster_vg-lv_bricks     0.00     0.00  976.30  368.70 11926.40 32262.40    65.71     3.86    2.88    3.89    0.20   0.72  96.61

gluster_vg-pool_tdata     0.00     0.00  568.60 2766.00  8773.60 104288.40    67.81    35.24   10.04    2.84   11.52   0.29  95.43
gluster_vg-pool-tpool     0.00     0.00  568.60 2766.00  8773.60 104288.40    67.81    35.22   10.04    2.84   11.52   0.29  95.52
gluster_vg-lv_bricks     0.00     0.00  568.50 2758.30  8773.60 104288.40    67.97    35.45   10.07    2.85   11.56   0.29  95.97

gluster_vg-pool_tdata     0.00     0.00   92.20 4638.20   955.20 101431.60    43.29   185.77   39.47   92.59   38.41   0.18  87.23
gluster_vg-pool-tpool     0.00     0.00   92.20 4638.20   955.20 101431.60    43.29   185.95   39.47   92.59   38.41   0.18  87.23
gluster_vg-lv_bricks     0.00     0.00   92.20 4629.00   955.20 101431.60    43.37   186.24   39.55   92.61   38.50   0.18  87.28

gluster_vg-pool_tdata     0.00     0.00   79.90 2116.10   804.00 100513.20    92.27   100.58   42.63   49.67   42.36   0.40  88.39
gluster_vg-pool-tpool     0.00     0.00   79.90 2116.10   804.00 100513.20    92.27   100.46   42.63   49.67   42.36   0.40  88.39
gluster_vg-lv_bricks     0.00     0.00   79.90 2105.10   804.00 100513.20    92.74   100.60   42.85   49.68   42.59   0.40  88.42

gluster_vg-pool_tdata     0.00     0.00  260.40 1518.30  2600.00 33040.45    40.07    99.91   59.01   21.65   65.42   0.46  82.17
gluster_vg-pool-tpool     0.00     0.00  260.40 1518.30  2600.00 33040.45    40.07    98.23   59.02   21.66   65.42   0.46  82.17
gluster_vg-lv_bricks     0.00     0.00  260.40 1451.20  2600.00 33040.45    41.65    98.21   61.34   21.67   68.46   0.48  82.31

gluster_vg-pool_tdata     0.00     0.00  780.80 1033.70  6350.40 65785.60    79.51     6.87    3.77    6.54    1.68   0.52  94.12
gluster_vg-pool-tpool     0.00     0.00  780.90 1033.70  6350.40 65785.60    79.51     6.85    3.78    6.54    1.69   0.52  94.11
gluster_vg-lv_bricks     0.00     0.00  780.90  583.00  6351.20 65785.60   105.78     6.90    5.05    6.55    3.04   0.69  94.20

gluster_vg-pool_tdata     0.00     0.00  596.50 2399.80  8740.80 42188.40    33.99    71.08   22.87    3.44   27.70   0.29  85.65
gluster_vg-pool-tpool     0.00     0.00  596.40 2399.80  8740.80 42188.40    34.00    71.09   22.87    3.44   27.70   0.29  85.79
gluster_vg-lv_bricks     0.00     0.00  596.40 2340.50  8740.00 42188.40    34.68    71.11   23.34    3.45   28.41   0.29  86.32

gluster_vg-pool_tdata     0.00     0.00  423.60 2447.60  4549.60 53970.00    40.76    96.58   34.44   21.24   36.72   0.34  98.31
gluster_vg-pool-tpool     0.00     0.00  423.60 2447.60  4549.60 53970.00    40.76    96.33   34.44   21.25   36.72   0.34  98.20
gluster_vg-lv_bricks     0.00     0.00  423.60 2262.90  4549.60 53970.00    43.57    96.36   36.82   21.26   39.73   0.37  98.21

gluster_vg-pool_tdata     0.00     0.00  205.60 3846.90  2993.60 125500.05    63.41   134.11   31.60   10.46   32.73   0.22  89.74
gluster_vg-pool-tpool     0.00     0.00  205.60 3846.90  2993.60 125500.05    63.41   133.90   31.60   10.46   32.73   0.22  89.75
gluster_vg-lv_bricks     0.00     0.00  205.60 3830.30  2993.60 125500.05    63.68   134.62   32.17   10.48   33.33   0.22  89.99

gluster_vg-pool_tdata     0.00     0.00  223.00 2155.20  2652.00 45353.55    40.37    94.27   39.64   12.06   42.49   0.35  82.65
gluster_vg-pool-tpool     0.00     0.00  223.00 2155.20  2652.00 45353.55    40.37    94.27   39.64   12.06   42.49   0.35  82.67
gluster_vg-lv_bricks     0.00     0.00  223.00 2066.30  2652.00 45353.55    41.94    95.16   41.56   12.07   44.75   0.36  82.83


Compare this with the higher utilization on rhgs-3.3.1:
==================================================
gluster_vg-pool_tdata     0.00     0.00    2.70 3667.60    43.20 225324.80   122.81    93.29   25.54   32.63   25.53   0.19  70.11
gluster_vg-pool-tpool     0.00     0.00    2.70 3667.60    43.20 225324.80   122.81    93.29   25.54   32.63   25.53   0.19  70.11
gluster_vg-lv_bricks     0.00     0.00    2.70 3590.10    43.20 225324.80   125.46    93.89   26.26   32.67   26.25   0.20  70.26


Note that most of the reads and writes happen at larger request sizes on rhgs-3.3.1:

gluster_vg-pool_tdata     0.00     0.00  278.60 1183.30  4386.45 75065.60   108.70     0.47    0.33    1.26    0.11   0.23  33.01
gluster_vg-pool-tpool     0.00     0.00  278.60 1183.30  4386.45 75065.60   108.70     0.48    0.33    1.26    0.11   0.23  33.30
gluster_vg-lv_bricks     0.00     0.00  278.60  776.80  4386.45 75072.00   150.58     0.51    0.48    1.28    0.20   0.33  34.89

gluster_vg-pool_tdata     0.00     0.00   37.40 2539.10   598.40 126151.55    98.39    42.90   16.65   22.11   16.57   0.20  52.39
gluster_vg-pool-tpool     0.00     0.00   37.40 2539.10   598.40 126151.55    98.39    42.90   16.65   22.11   16.57   0.20  52.49
gluster_vg-lv_bricks     0.00     0.00   37.40 2259.40   598.40 126157.95   110.38    42.94   18.69   22.14   18.63   0.23  53.25

gluster_vg-pool_tdata     0.00     0.00    3.10  662.90    49.60 40953.60   123.13     0.46    0.70    6.87    0.67   0.11   7.24
gluster_vg-pool-tpool     0.00     0.00    3.10  662.90    49.60 40953.60   123.13     0.47    0.70    6.87    0.67   0.11   7.26
gluster_vg-lv_bricks     0.00     0.00    3.10  580.10    49.60 40953.60   140.61     0.48    0.81    6.90    0.78   0.13   7.74

gluster_vg-pool_tdata     0.00     0.00    4.30 2389.60    68.80 111680.05    93.36    55.02   22.77    5.37   22.80   0.16  38.21
gluster_vg-pool-tpool     0.00     0.00    4.30 2389.60    68.80 111680.05    93.36    54.97   22.78    5.37   22.81   0.16  38.24
gluster_vg-lv_bricks     0.00     0.00    4.30 2335.10    68.80 111680.05    95.54    54.99   23.32    5.37   23.35   0.16  38.60

gluster_vg-pool_tdata     0.00     0.00    2.70 3667.60    43.20 225324.80   122.81    93.29   25.54   32.63   25.53   0.19  70.11
gluster_vg-pool-tpool     0.00     0.00    2.70 3667.60    43.20 225324.80   122.81    93.29   25.54   32.63   25.53   0.19  70.11
gluster_vg-lv_bricks     0.00     0.00    2.70 3590.10    43.20 225324.80   125.46    93.89   26.26   32.67   26.25   0.20  70.26

gluster_vg-pool_tdata     0.00     0.00    4.70 1425.10    75.20 88896.40   124.45    14.57   10.19   18.45   10.16   0.16  22.17
gluster_vg-pool-tpool     0.00     0.00    4.70 1425.10    75.20 88896.40   124.45    14.57   10.19   18.45   10.16   0.16  22.29
gluster_vg-lv_bricks     0.00     0.00    4.70 1206.50    75.20 88896.40   146.91    15.90   13.11   18.45   13.09   0.19  23.32

gluster_vg-pool_tdata     0.00     0.00   16.80 1692.90   268.80 70428.25    82.70    40.94   23.94    4.27   24.14   0.15  25.32
gluster_vg-pool-tpool     0.00     0.00   16.80 1692.90   268.80 70428.25    82.70    40.94   23.95    4.28   24.14   0.15  25.48
gluster_vg-lv_bricks     0.00     0.00   16.80 1478.30   268.80 70428.25    94.57    69.20   46.29    4.30   46.76   0.22  32.56

gluster_vg-pool_tdata     0.00     0.00    0.00  279.00     0.00 17120.00   122.72     3.87   13.81    0.00   13.81   0.15   4.07
gluster_vg-pool-tpool     0.00     0.00    0.00  279.00     0.00 17120.00   122.72     3.85   13.82    0.00   13.82   0.15   4.07
gluster_vg-lv_bricks     0.00     0.00    0.00  279.00     0.00 17120.00   122.72     3.88   13.92    0.00   13.92   0.15   4.07

gluster_vg-pool_tdata     0.00     0.00    1.90 2374.50    30.40 137371.00   115.64    49.19   20.70    6.11   20.71   0.19  45.28
gluster_vg-pool-tpool     0.00     0.00    1.90 2374.50    30.40 137371.00   115.64    49.20   20.70    6.11   20.71   0.19  45.28
gluster_vg-lv_bricks     0.00     0.00    1.90 2368.00    30.40 137371.00   115.96    57.27   24.16    6.11   24.18   0.19  45.31


Another observation is that more reads and writes are happening at the same time on rhgs-3.4.0. On rhgs-3.3.1, by contrast, the workload seems to be either mostly reads or mostly writes, but seldom both. For example, on rhgs-3.3.1 the maximum read/write concurrency I found was this (only one instance):
gluster_vg-pool_tdata     0.00     0.00  278.60 1183.30  4386.45 75065.60   108.70     0.47    0.33    1.26    0.11   0.23  33.01
gluster_vg-pool-tpool     0.00     0.00  278.60 1183.30  4386.45 75065.60   108.70     0.48    0.33    1.26    0.11   0.23  33.30
gluster_vg-lv_bricks     0.00     0.00  278.60  776.80  4386.45 75072.00   150.58     0.51    0.48    1.28    0.20   0.33  34.89

Compare this to the maximum concurrency on rhgs-3.4.0, where there are many more such instances than on rhgs-3.3.1:

gluster_vg-pool_tdata     0.00     0.00  213.36  239.74  6405.30  9532.80    70.35     5.37   11.83    6.65   16.44   0.47  21.20
gluster_vg-pool-tpool     0.00     0.00  213.36  239.74  6405.30  9532.80    70.35     5.37   11.83    6.65   16.45   0.47  21.21
gluster_vg-lv_bricks     0.00     0.00  213.36  211.22  6405.34  9532.80    75.08    18.65   43.90    6.67   81.52   0.50  21.32


gluster_vg-pool_tdata     0.00     0.00 1436.60  566.20 13345.10 35898.40    49.17     8.34    4.16    5.76    0.10   0.40  80.31
gluster_vg-pool-tpool     0.00     0.00 1436.60  566.20 13345.10 35898.40    49.17     8.35    4.16    5.76    0.10   0.40  80.25
gluster_vg-lv_bricks     0.00     0.00 1436.60  346.10 13345.10 35898.40    55.25     8.40    4.71    5.78    0.27   0.45  80.53

gluster_vg-pool_tdata     0.00     0.00  794.30  999.30  6764.00 62016.00    76.69    14.60    8.10    8.21    8.01   0.40  71.39
gluster_vg-pool-tpool     0.00     0.00  794.30  999.30  6764.00 62016.00    76.69    14.58    8.11    8.22    8.02   0.40  71.39
gluster_vg-lv_bricks     0.00     0.00  794.30  839.90  6764.00 62028.80    84.19    14.74    8.91    8.23    9.56   0.44  71.37

gluster_vg-pool_tdata     0.00     0.00  810.50  434.40  6698.40 27801.60    55.43     5.51    4.42    6.74    0.10   0.59  73.89
gluster_vg-pool-tpool     0.00     0.00  810.50  434.40  6698.40 27801.60    55.43     5.51    4.43    6.75    0.10   0.59  73.93
gluster_vg-lv_bricks     0.00     0.00  810.50  217.10  6698.40 27788.80    67.12     5.53    5.39    6.76    0.26   0.72  73.93

gluster_vg-pool_tdata     0.00     0.00  976.20  513.40 11926.40 32262.40    59.33     3.84    2.58    3.88    0.12   0.65  96.16
gluster_vg-pool-tpool     0.00     0.00  976.20  513.40 11926.40 32262.40    59.33     3.84    2.58    3.88    0.13   0.65  96.19
gluster_vg-lv_bricks     0.00     0.00  976.30  368.70 11926.40 32262.40    65.71     3.86    2.88    3.89    0.20   0.72  96.61

gluster_vg-pool_tdata     0.00     0.00  568.60 2766.00  8773.60 104288.40    67.81    35.24   10.04    2.84   11.52   0.29  95.43
gluster_vg-pool-tpool     0.00     0.00  568.60 2766.00  8773.60 104288.40    67.81    35.22   10.04    2.84   11.52   0.29  95.52
gluster_vg-lv_bricks     0.00     0.00  568.50 2758.30  8773.60 104288.40    67.97    35.45   10.07    2.85   11.56   0.29  95.97

gluster_vg-pool_tdata     0.00     0.00  260.40 1518.30  2600.00 33040.45    40.07    99.91   59.01   21.65   65.42   0.46  82.17
gluster_vg-pool-tpool     0.00     0.00  260.40 1518.30  2600.00 33040.45    40.07    98.23   59.02   21.66   65.42   0.46  82.17
gluster_vg-lv_bricks     0.00     0.00  260.40 1451.20  2600.00 33040.45    41.65    98.21   61.34   21.67   68.46   0.48  82.31

gluster_vg-pool_tdata     0.00     0.00  780.80 1033.70  6350.40 65785.60    79.51     6.87    3.77    6.54    1.68   0.52  94.12
gluster_vg-pool-tpool     0.00     0.00  780.90 1033.70  6350.40 65785.60    79.51     6.85    3.78    6.54    1.69   0.52  94.11
gluster_vg-lv_bricks     0.00     0.00  780.90  583.00  6351.20 65785.60   105.78     6.90    5.05    6.55    3.04   0.69  94.20

gluster_vg-pool_tdata     0.00     0.00  596.50 2399.80  8740.80 42188.40    33.99    71.08   22.87    3.44   27.70   0.29  85.65
gluster_vg-pool-tpool     0.00     0.00  596.40 2399.80  8740.80 42188.40    34.00    71.09   22.87    3.44   27.70   0.29  85.79
gluster_vg-lv_bricks     0.00     0.00  596.40 2340.50  8740.00 42188.40    34.68    71.11   23.34    3.45   28.41   0.29  86.32

gluster_vg-pool_tdata     0.00     0.00  423.60 2447.60  4549.60 53970.00    40.76    96.58   34.44   21.24   36.72   0.34  98.31
gluster_vg-pool-tpool     0.00     0.00  423.60 2447.60  4549.60 53970.00    40.76    96.33   34.44   21.25   36.72   0.34  98.20
gluster_vg-lv_bricks     0.00     0.00  423.60 2262.90  4549.60 53970.00    43.57    96.36   36.82   21.26   39.73   0.37  98.21

gluster_vg-pool_tdata     0.00     0.00  205.60 3846.90  2993.60 125500.05    63.41   134.11   31.60   10.46   32.73   0.22  89.74
gluster_vg-pool-tpool     0.00     0.00  205.60 3846.90  2993.60 125500.05    63.41   133.90   31.60   10.46   32.73   0.22  89.75
gluster_vg-lv_bricks     0.00     0.00  205.60 3830.30  2993.60 125500.05    63.68   134.62   32.17   10.48   33.33   0.22  89.99

gluster_vg-pool_tdata     0.00     0.00  223.00 2155.20  2652.00 45353.55    40.37    94.27   39.64   12.06   42.49   0.35  82.65
gluster_vg-pool-tpool     0.00     0.00  223.00 2155.20  2652.00 45353.55    40.37    94.27   39.64   12.06   42.49   0.35  82.67
gluster_vg-lv_bricks     0.00     0.00  223.00 2066.30  2652.00 45353.55    41.94    95.16   41.56   12.07   44.75   0.36  82.83

gluster_vg-pool_tdata     0.00     0.00  130.60  506.80  1057.60 31814.80   103.15     0.80    1.25    5.53    0.15   0.46  29.11
gluster_vg-pool-tpool     0.00     0.00  130.60  506.80  1057.60 31814.80   103.15     0.80    1.26    5.53    0.16   0.46  29.13
gluster_vg-lv_bricks     0.00     0.00  130.60  362.90  1057.60 31814.80   133.22     0.87    1.77    5.55    0.41   0.61  30.19

gluster_vg-pool_tdata     0.00     0.00  132.50 2220.40  2120.00 45467.95    40.45    79.09   33.61    1.53   35.53   0.24  55.62
gluster_vg-pool-tpool     0.00     0.00  132.50 2220.40  2120.00 45467.95    40.45    79.09   33.62    1.53   35.53   0.24  55.66
gluster_vg-lv_bricks     0.00     0.00  132.50 2219.00  2120.00 45467.95    40.47    79.11   33.64    1.54   35.56   0.24  55.83


I also wonder whether more reads and writes are happening on rhgs-3.4.0, though I need to dig into more data to confirm this.
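For reference when reading the iostat snippets above: the numeric columns are rrqm/s, wrqm/s, r/s, w/s, rkB/s, wkB/s, avgrq-sz, avgqu-sz, await, r_await, w_await, svctm and %util, and avgrq-sz is reported in 512-byte sectors. A minimal sketch re-deriving avgrq-sz from the rate columns of the first rhgs-3.4.0 line above (line transcribed verbatim; the formula simply restates how iostat computes that column):

```python
# One iostat -x line from the rhgs-3.4.0 runs above.
line = ("gluster_vg-pool_tdata     0.00     0.00 1436.60  566.20 "
        "13345.10 35898.40    49.17     8.34    4.16    5.76    0.10"
        "   0.40  80.31")
f = line.split()
r_s, w_s = float(f[3]), float(f[4])        # reads/s, writes/s
rkb_s, wkb_s = float(f[5]), float(f[6])    # KB read/s, KB written/s
reported_avgrq_sz = float(f[7])            # in 512-byte sectors

# avgrq-sz = total sectors transferred / total requests issued
derived = 2 * (rkb_s + wkb_s) / (r_s + w_s)
print(round(derived, 2))                   # 49.17
```

At 49.17 sectors the average request is ~24.6 KB, whereas the rhgs-3.3.1 line with avgrq-sz 122.81 corresponds to ~61.4 KB per request, i.e. roughly 2.5x larger average requests.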

@Manoj,

Do you think these findings are in any way significant? These findings are based on the stats from comment #133.

Comment 171 Raghavendra G 2018-06-17 17:21:41 UTC
Data and observations in comment #170 were from gqas001. Bricks from other nodes reveal a similar pattern.

Comment 174 Raghavendra G 2018-06-18 07:40:55 UTC
(In reply to Raghavendra G from comment #169)
> (In reply to Pranith Kumar K from comment #141)
> > This is the profile after the revert, 4 digit avg latency for create even in
> > this run, so io-threads patch is not the reason for regression:
> >       3.54     185.92 us      68.00 us  602428.00 us         131624     
> > LOOKUP
> >       6.21     356.92 us      72.00 us  202841.00 us         120405      
> > WRITE
> >      38.45    1108.09 us     139.00 us 1698682.00 us         240114   
> > FXATTROP
> >      43.33    2497.64 us     189.00 us 1538022.00 us         120062     
> > CREATE
> 
> Would it make sense to capture strace of bricks? That way we can isolate the
> problem to operations on fs or to glusterfs stack.

I think getting strace output of 3.3.1 and 3.4.0 would be a good idea. Reasons are:

* Shekhar pointed out that there are more reads to disks in 3.4.0 compared to 3.3.1. Comparing the syscalls will give a fair idea of where those reads are happening.

* On irc, Sunil pointed out that ec86167 is a commit where regression is not seen and commit 4700c5b has regression. On doing a git log of commits between these two, I found suspect patches like:

commit 68f2192df570b5ee615d440c2e0c88d49a75a34f
Author: Soumya Koduri <skoduri@redhat.com>
Date:   Thu Jun 8 22:19:17 2017 +0530

    posix-acl: Whitelist virtual ACL xattrs
    
    Similar to system.posix_acl_* xattrs, all users should get
    permission to be able to read glusterfs.posix.acl* xattrs too.
    
    Change-Id: I1fc2b67c8a12113910e4ec57cd114e4baefe0d38
    BUG: 1459971
    Signed-off-by: Soumya Koduri <skoduri@redhat.com>
    Reviewed-on: https://review.gluster.org/17493

This patch lets operations on posix ACLs reach the bricks. If posix ACLs are used, the traffic can be huge. Note that this patch won't change the number of fops witnessed by the brick profile, since fops were already wound all the way down to posix-acl, which used to fail them with EPERM. With this patch, these fops hit the backend fs.

commit f0fb166078d59cab2a33583591b6448326247c40
Author: Kotresh HR <khiremat@redhat.com>
Date:   Thu Jun 15 08:31:06 2017 -0400

    feature/bitrot: Fix ondemand scrub
    
    The flag which keeps tracks of whether the scrub
    frequency is changed from previous value should
    not be considered for on-demand scrubbing. It
    should be considered only for 'scrub-frequency'
    where it should not be re-scheduled if it is
    set to same value again. But in case ondemand
    scrub, it should start the scrub immediately
    no matter what the scrub-frequency.
    
    Reproducer:
    1. Enable bitrot
    2. Set scrub-throttle
    3. Set ondemand scrub
    Make sure glusterd is not restarted while doing
    below steps
    
    Change-Id: Ice5feaece7fff1579fb009d1a59d2b8292e23e0b
    BUG: 1461845
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: https://review.gluster.org/17552

Not sure whether this can increase operations on disk. I'll leave it to Kotresh to comment on the effect of this patch.

Comment 176 Raghavendra G 2018-06-18 07:47:55 UTC
(In reply to Raghavendra G from comment #174)
> (In reply to Raghavendra G from comment #169)
> > (In reply to Pranith Kumar K from comment #141)
> > > This is the profile after the revert, 4 digit avg latency for create even in
> > > this run, so io-threads patch is not the reason for regression:
> > >       3.54     185.92 us      68.00 us  602428.00 us         131624     
> > > LOOKUP
> > >       6.21     356.92 us      72.00 us  202841.00 us         120405      
> > > WRITE
> > >      38.45    1108.09 us     139.00 us 1698682.00 us         240114   
> > > FXATTROP
> > >      43.33    2497.64 us     189.00 us 1538022.00 us         120062     
> > > CREATE
> > 
> > Would it make sense to capture strace of bricks? That way we can isolate the
> > problem to operations on fs or to glusterfs stack.
> 
> I think getting strace output of 3.3.1 and 3.4.0 would be a good idea.
> Reasons are:
> 
> * Shekhar pointed that there are more reads to disks in 3.4.0 compared to
> 3.3.1. So comparing the syscalls will give a fair idea on where those reads
> are happening.
> 
> * On irc, Sunil pointed out that ec86167 is a commit where regression is not
> seen and commit 4700c5b has regression. On doing a git log of commits
> between these two, I found suspect patches like:
> 
> commit 68f2192df570b5ee615d440c2e0c88d49a75a34f
> Author: Soumya Koduri <skoduri@redhat.com>
> Date:   Thu Jun 8 22:19:17 2017 +0530
> 
>     posix-acl: Whitelist virtual ACL xattrs
>     
>     Similar to system.posix_acl_* xattrs, all users should get
>     permission to be able to read glusterfs.posix.acl* xattrs too.
>     
>     Change-Id: I1fc2b67c8a12113910e4ec57cd114e4baefe0d38
>     BUG: 1459971
>     Signed-off-by: Soumya Koduri <skoduri@redhat.com>
>     Reviewed-on: https://review.gluster.org/17493
> 
> This patch lets operations on posix-acls reach bricks. If posix acls are
> used, the traffic can be huge. Note that this patch won't change the number
> of fops as witnessed by brick profile as fops used to be wound all the way
> till posix-acl, which used to fail them with EPERM. With this patch, these
> fops will hit backend fs.

This looks to be used only in NFS-Ganesha. Nevertheless, leaving the comment just in case I missed anything.

> 
> commit f0fb166078d59cab2a33583591b6448326247c40
> Author: Kotresh HR <khiremat@redhat.com>
> Date:   Thu Jun 15 08:31:06 2017 -0400
> 
>     feature/bitrot: Fix ondemand scrub
>     
>     The flag which keeps tracks of whether the scrub
>     frequency is changed from previous value should
>     not be considered for on-demand scrubbing. It
>     should be considered only for 'scrub-frequency'
>     where it should not be re-scheduled if it is
>     set to same value again. But in case ondemand
>     scrub, it should start the scrub immediately
>     no matter what the scrub-frequency.
>     
>     Reproducer:
>     1. Enable bitrot
>     2. Set scrub-throttle
>     3. Set ondemand scrub
>     Make sure glusterd is not restarted while doing
>     below steps
>     
>     Change-Id: Ice5feaece7fff1579fb009d1a59d2b8292e23e0b
>     BUG: 1461845
>     Signed-off-by: Kotresh HR <khiremat@redhat.com>
>     Reviewed-on: https://review.gluster.org/17552
> 
> Not sure whether this can increase operations on disk. Would leave it to
> Kotresh on the effect of this patch.

Comment 183 Karan Sandha 2018-06-19 06:42:39 UTC
Created attachment 1452822 [details]
Comment#182_regression_free_run

Comment 184 Karan Sandha 2018-06-19 06:46:36 UTC
Created attachment 1452823 [details]
Comment#168_regression_run

Comment 186 Karan Sandha 2018-06-19 19:13:37 UTC
Created attachment 1453015 [details]
comment#185_regression_free_run

Comment 189 Raghavendra G 2018-06-21 07:21:17 UTC
Stats in comment #133 also indicate better network utilization on rhgs-3.3.1. Note how network I/O is concentrated in a few consecutive intervals of high rxKB/s on rhgs-3.3.1, compared to the more spread-out and lower rxKB/s on rhgs-3.4.0.

(left is v3.12.2-7/rhgs-3.4.0 | right is rhgs-3.3.1)

02:00:45  EDT      p2p1     63.70     66.30     12.20     12.90      0.00      0.00      0.00            |02:35:37  EDT      p2p1     59.20     61.50     11.91     12.58      0.00      0.00      0.00
02:00:55  EDT      p2p1     12.30      7.00      1.08      1.66      0.00      0.00      0.00            |02:35:47  EDT      p2p1     16.80     11.40      1.77      1.59      0.00      0.00      0.00
02:01:05  EDT      p2p1   6618.10   6558.40   1785.05   1107.64      0.00      0.00      0.00            |02:35:57  EDT      p2p1   6288.10   6132.40   1713.01   1102.48      0.00      0.00      0.00
02:01:15  EDT      p2p1  28556.70  20945.50  93880.71   3665.47      0.00      0.00      0.00            |02:36:07  EDT      p2p1  57303.50  36543.80 190471.20   6979.76      0.00      0.00      0.00
02:01:25  EDT      p2p1  22270.20  16173.70  73892.16   2837.41      0.00      0.00      0.00            |02:36:17  EDT      p2p1  38808.10  24657.10 129105.82   4726.04      0.00      0.00      0.00
02:01:35  EDT      p2p1  28223.80  20224.00  93681.53   3578.31      0.00      0.00      0.00            |02:36:27  EDT      p2p1  52537.60  33221.90 174648.95   6381.13      0.00      0.00      0.00
02:01:45  EDT      p2p1  20560.50  14616.10  68284.64   2601.21      0.00      0.00      0.00            |02:36:37  EDT      p2p1  12996.20   8981.80  43115.53   1633.15      0.00      0.00      0.00
02:01:55  EDT      p2p1   2048.40   1446.40   6742.59    258.02      0.00      0.00      0.00            |02:36:47  EDT      p2p1   9019.50   6089.40  30016.74   1126.87      0.00      0.00      0.00
02:02:05  EDT      p2p1   2892.20   2146.40   9584.58    371.65      0.00      0.00      0.00            |02:36:57  EDT      p2p1  10708.00   6899.50  35689.65   1385.62      0.00      0.00      0.00
02:02:15  EDT      p2p1   3196.90   2341.00  10485.10    404.89      0.00      0.00      0.00            |02:37:07  EDT      p2p1  34070.40  22555.90 110824.64   4804.49      0.00      0.00      0.00
02:02:25  EDT      p2p1   9383.20   6659.90  31096.11   1185.47      0.00      0.00      0.00            |02:37:17  EDT      p2p1  31056.30  20839.70 101216.36   4608.60      0.00      0.00      0.00
02:02:35  EDT      p2p1  49026.70  33658.50 162940.17   6126.67      0.00      0.00      0.00            |02:37:27  EDT      p2p1     10.70      5.80      0.66      0.37      0.00      0.00      0.00
02:02:45  EDT      p2p1   8914.50   6045.60  29594.30   1107.27      0.00      0.00      0.00            |02:37:37  EDT      p2p1      9.20      4.20      0.56      0.27      0.00      0.00      0.00
02:02:55  EDT      p2p1  24452.50  17308.70  79795.81   3457.23      0.00      0.00      0.00            |02:37:47  EDT      p2p1     10.90      5.90      0.67      0.37      0.00      0.00      0.00
02:03:05  EDT      p2p1   3129.10   2333.70  10241.54    459.10      0.00      0.00      0.00            |02:37:57  EDT      p2p1      9.40      4.60      0.57      0.28      0.00      0.00      0.00
02:03:15  EDT      p2p1   3902.30   2923.30  12736.39    573.44      0.00      0.00      0.00            |02:38:07  EDT      p2p1     10.50      5.70      0.65      0.36      0.00      0.00      0.00
02:03:25  EDT      p2p1   3813.30   2822.00  12451.65    556.95      0.00      0.00      0.00            |02:38:17  EDT      p2p1      9.50      4.50      0.58      0.28      0.00      0.00      0.00
02:03:35  EDT      p2p1   5214.30   3890.50  16871.65    761.33      0.00      0.00      0.00            |02:38:27  EDT      p2p1     11.00      6.00      0.68      0.37      0.00      0.00      0.00
02:03:45  EDT      p2p1  12097.90   8813.30  39354.17   1759.69      0.00      0.00      0.00            |02:38:37  EDT      p2p1      9.70      4.70      0.59      0.29      0.00      0.00      0.00
02:03:55  EDT      p2p1  17982.40  12382.90  58536.01   2568.12      0.00      0.00      0.00            |02:38:47  EDT      p2p1     10.70      5.70      0.66      0.36      0.00      0.00      0.00
02:04:05  EDT      p2p1   1659.20   1287.50   5454.85    544.03      0.00      0.00      0.00            |02:38:57  EDT      p2p1      9.90      4.70      0.66      0.29      0.00      0.00      0.00
02:04:15  EDT      p2p1      7.70      2.60      0.49      0.16      0.00      0.00      0.00            |02:39:07  EDT      p2p1     11.70      6.30      0.83      0.38      0.00      0.00      0.00

Comment 190 Raghavendra G 2018-06-21 07:31:05 UTC
Trimming down the output of the previous comment to just rxpck/s, txpck/s, rxKB/s and txKB/s for better readability:

(v3.12.2-7 | rhgs-3.3.1)

    63.70     66.30     12.20     12.90 |    59.20     61.50     11.91     12.58
    12.30      7.00      1.08      1.66 |    16.80     11.40      1.77      1.59
  6618.10   6558.40   1785.05   1107.64 |  6288.10   6132.40   1713.01   1102.48
 28556.70  20945.50  93880.71   3665.47 | 57303.50  36543.80 190471.20   6979.76
 22270.20  16173.70  73892.16   2837.41 | 38808.10  24657.10 129105.82   4726.04
 28223.80  20224.00  93681.53   3578.31 | 52537.60  33221.90 174648.95   6381.13
 20560.50  14616.10  68284.64   2601.21 | 12996.20   8981.80  43115.53   1633.15
  2048.40   1446.40   6742.59    258.02 |  9019.50   6089.40  30016.74   1126.87
  2892.20   2146.40   9584.58    371.65 | 10708.00   6899.50  35689.65   1385.62
  3196.90   2341.00  10485.10    404.89 | 34070.40  22555.90 110824.64   4804.49
  9383.20   6659.90  31096.11   1185.47 | 31056.30  20839.70 101216.36   4608.60
 49026.70  33658.50 162940.17   6126.67 |    10.70      5.80      0.66      0.37
  8914.50   6045.60  29594.30   1107.27 |     9.20      4.20      0.56      0.27
 24452.50  17308.70  79795.81   3457.23 |    10.90      5.90      0.67      0.37
  3129.10   2333.70  10241.54    459.10 |     9.40      4.60      0.57      0.28
  3902.30   2923.30  12736.39    573.44 |    10.50      5.70      0.65      0.36
  3813.30   2822.00  12451.65    556.95 |     9.50      4.50      0.58      0.28
  5214.30   3890.50  16871.65    761.33 |    11.00      6.00      0.68      0.37
 12097.90   8813.30  39354.17   1759.69 |     9.70      4.70      0.59      0.29
 17982.40  12382.90  58536.01   2568.12 |    10.70      5.70      0.66      0.36
  1659.20   1287.50   5454.85    544.03 |     9.90      4.70      0.66      0.29
     7.70      2.60      0.49      0.16 |    11.70      6.30      0.83      0.38
    13.00      7.80      0.86      0.49 |     9.40      4.30      0.60      0.27
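As a rough consistency check, the left-hand (v3.12.2-7/rhgs-3.4.0) rxKB/s column above can be integrated over the 10-second sar sampling interval (the interval is visible in the timestamps of the previous comment). A sketch, with values transcribed from the left column:

```python
# rxKB/s samples for p2p1 on the rhgs-3.4.0 run, taken 10 s apart.
rx_kbps = [12.20, 1.08, 1785.05, 93880.71, 73892.16, 93681.53, 68284.64,
           6742.59, 9584.58, 10485.10, 31096.11, 162940.17, 29594.30,
           79795.81, 10241.54, 12736.39, 12451.65, 16871.65, 39354.17,
           58536.01, 5454.85, 0.49, 0.86]
total_mib = sum(rx_kbps) * 10 / 1024        # KB/s * 10 s -> KB -> MiB
busy = sum(1 for v in rx_kbps if v > 0.1 * max(rx_kbps))
print(round(total_mib), "MiB;", busy, "of", len(rx_kbps),
      "samples above 10% of peak")          # 7983 MiB; 11 of 23 samples
```

That is ~7.8 GiB received on this server for the window shown, which seems consistent with the ~7.8 GiB per data brick derived in comment #194 (assuming this server's traffic is dominated by one data brick); roughly half the samples carry almost no traffic, in line with the sporadic-transfer observation.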

Comment 191 Raghavendra G 2018-06-21 07:35:50 UTC
(In reply to Raghavendra G from comment #190)

> stats in comment #133 also indicate a better network utilization rhgs-3.3.1.
> Note that how I/O on network is concentrated in few consecutive lines of higher  > rxKB/s compared to more spread out and lesser rxKB/s on rhgs-3.4.1
> Trimming down output of previous comment to just rxpck/s txpck/s rxKB/s and
> txKB/s for better readability:
> 

Does this indicate a problem in the client stack? It looks like the client is sending data to the bricks only sporadically.

Comment 192 Raghavendra G 2018-06-21 08:16:04 UTC
(In reply to Raghavendra G from comment #191)
> (In reply to Raghavendra G from comment #190)
> 
> > stats in comment #133 also indicate a better network utilization rhgs-3.3.1.
> > Note that how I/O on network is concentrated in few consecutive lines of higher  > rxKB/s compared to more spread out and lesser rxKB/s on rhgs-3.4.1
> > Trimming down output of previous comment to just rxpck/s txpck/s rxKB/s and
> > txKB/s for better readability:
> > 
> 
> Does this indicate a problem in client stack? It looks like client is
> relatively sending data sporadically to bricks

Even the data written by the bricks is sporadic, so it could be the brick stack as well. Comment #145 has the same observation.

Comment 193 Manoj Pillai 2018-06-21 10:58:55 UTC
(In reply to Raghavendra G from comment #188)
> Restoring needinfo on Manoj for comment #170

I had looked at the last set of stats and sosreports that Karan provided, but didn't get a chance to update the bz. Some of my earlier suspicions turned out NOT to be true, and I haven't found time to dig deeper to form any new hypotheses.

I was under the impression that the rhgs-random-io tuned profile was in effect, but from the sosreports the profile used is throughput-performance. So the dirty*ratio limits are high enough that forced page-cache writeback should not happen.
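
For reference, where those writeback thresholds land can be computed quickly. This is a sketch: the ratios below (vm.dirty_ratio=40, vm.dirty_background_ratio=10) are what the throughput-performance tuned profile typically sets, and 64 GiB is an assumed server RAM size; both should be verified against the sysctl values in the sosreport.

```python
# Sketch: dirty-page thresholds under assumed throughput-performance settings.
GIB = 1024 ** 3

def writeback_thresholds(ram_bytes, dirty_ratio=40, dirty_background_ratio=10):
    """Return (background, foreground) dirty-page thresholds in bytes.

    Background: kernel flusher threads start writing back asynchronously.
    Foreground: writers themselves get throttled.
    """
    background = ram_bytes * dirty_background_ratio // 100
    foreground = ram_bytes * dirty_ratio // 100
    return background, foreground

bg, fg = writeback_thresholds(64 * GIB)
print(f"background writeback starts at ~{bg / GIB:.1f} GiB dirty")
print(f"writers are throttled at ~{fg / GIB:.1f} GiB dirty")
```

With roughly 7.3 GiB written per brick over the whole run, neither threshold should be hit on a 64 GiB box, consistent with forced page-cache writeback not being the cause.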

There is a significant difference in dirty data buildup between the two runs though (3.3.1 is accumulating more data before writing it out). You can see that by running this on the collected stats file: 
cat sar.gqas001.sbu.lab.eng.bos.redhat.com.txt | grep -A 1 kbdirty

kbdirty stays fairly low in the 3.4 runs. But that is probably the effect of something else, as is likely the case with the top and sar network stats pointed out in comment #145 and comment #190.

Planning to analyze the iostat results more for clues...

Comment 194 Xavi Hernandez 2018-06-21 13:23:44 UTC
I've had a discussion with Raghavendra G about the stats on 3.3.1 and 3.4.0, and we have found something that looks like an anomaly:

The test consists of 6 clients, with 8 threads on each client. Each thread creates 5000 files of 64 KiB (I'm assuming this is KiB (1024) and not KB (1000)). This means a total of 6 * 8 * 5000 * 64 KiB = 14.65 GiB. Since this is a replica 2 + arbiter volume distributed over 2 replica sets, each data brick should receive 7.32 GiB.

Checking the data, we can see that bricks receive 7.85 GiB on 3.3.1, and 7.80 GiB on 3.4.0 for the same test. This seems ok.

Transmitted data also seems ok: 350 MiB on 3.3.1 and 335 MiB on 3.4.0.
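
The expected volumes above can be sanity-checked with a quick calculation (a sketch; the workload parameters are taken from this comment):

```python
# Expected data volumes for the smallfile create workload described above.
GIB = 1024 ** 3
KIB = 1024

clients, threads, files_per_thread, file_kib = 6, 8, 5000, 64
total = clients * threads * files_per_thread * file_kib * KIB  # bytes written

replica_sets = 2                      # distribute over 2 x (replica 2 + arbiter)
per_data_brick = total / replica_sets  # each data brick stores a full copy of
                                       # its replica set's share

print(f"total written by workload: {total / GIB:.2f} GiB")
print(f"expected per data brick:   {per_data_brick / GIB:.2f} GiB")
```

The ~7.85 GiB / ~7.80 GiB received per brick is the expected 7.32 GiB plus RPC and xattr overhead, which is why the network side looks fine.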

The differences appear on data read and written to disk:

Writes: 8.3 GiB on 3.3.1, 10.2 GiB on 3.4.0 (x1.2)
Reads: 94 MiB on 3.3.1, 671 MiB on 3.4.0 (x7.1)

This shows a big difference between versions. Since the amount of data received from and transmitted to the client side is the same, it seems that these extra reads and writes come from the brick stack.

Another interesting thing is that on 3.4.0, arbiter bricks read about 100 MiB more than regular bricks (i.e. ~770 MiB). On 3.3.1 they read the same amount.

However, it's not clear what those extra reads and writes are. From the comments, it seems that strace didn't reveal any significant difference, so I can't explain how the same requests could cause this increase in the volume of data accessed.

Comment 195 Raghavendra G 2018-06-21 13:51:49 UTC
Continuing the analysis from previous comments, following patches look suspect:

commit 0499970747a7897bd9190484b5ab6868b19f393f
Author: Poornima G <pgurusid@redhat.com>
Date:   Thu Jan 4 19:38:05 2018 +0530

    posix: In getxattr, honor the wildcard '*'
    
    Currently, the posix_xattr_fill performas a sys_getxattr
    on all the keys requested, there are requirements where
    the keys could contain a wildcard, in which case sys_getxattr
    would return ENODATA, eg: if the xattr requested is user.*
    all the xattrs with prefix user. should be returned, with their
    values.
    
    This patch, changes posix_xattr_fill, to honor wildcard in the keys
    requested.
    
    Updates #297
    
    > Signed-off-by: Poornima G <pgurusid@redhat.com>
    > Change-Id: I3d52da2957ac386fca3c156e26ff4cdf0b2c79a9
    > Reviewed-on: https://review.gluster.org/19170
    > Smoke: Gluster Build System <jenkins@build.gluster.org>
    > Reviewed-by: Amar Tumballi
    > Tested-by: Poornima G <pgurusid@redhat.com>
    > CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    > (cherry picked from commit 8fc9c6a8fc7c73b2b4c65a8ddbe988bca10e89b6)
    
    BUG: 1446125
    Change-Id: I3d52da2957ac386fca3c156e26ff4cdf0b2c79a9
    Signed-off-by: Poornima G <pgurusid@redhat.com>
    Reviewed-on: https://code.engineering.redhat.com/gerrit/128480
    Tested-by: RHGS Build Bot <nigelb@redhat.com>
    Reviewed-by: Sunil Kumar Heggodu Gopala Acharya <sheggodu@redhat.com>

commit c8c56b9a53471792a126f8e46f32ce7190eac10c (tag: v3.12.0)
Author: Kotresh HR <khiremat@redhat.com>
Date:   Tue Aug 1 07:51:44 2017 -0400

    feature/posix: Enabled gfid2path by default
    
    Enable gfid2path feature by default. The basic
    performance tests are carried out and it doesn't
    show significant depreciation. The results are
    updated in issue.
    
    Updates: #139
    Change-Id: I5f1949a608d0827018ef9d548d5d69f3bb7744fd
    > Signed-off-by: Kotresh HR <khiremat@redhat.com>
    > Reviewed-on: https://review.gluster.org/17950
    > Smoke: Gluster Build System <jenkins@build.gluster.org>
    > CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    > Reviewed-by: Aravinda VK <avishwan@redhat.com>
    > Reviewed-by: Amar Tumballi <amarts@redhat.com>
    (cherry picked from commit 3ec63650bb7fd874a5013e7be4a2def3b519c9b2)
    Reviewed-on: https://review.gluster.org/18133
    Reviewed-by: Amar Tumballi <amarts@redhat.com>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    Tested-by: Shyamsundar Ranganathan <srangana@redhat.com>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>

commit 32e366bad947926343fdb389f054ca3051e44774
Author: Kotresh HR <khiremat@redhat.com>
Date:   Sun Jul 16 15:16:56 2017 -0400

    storage/posix: Add virtual xattr to fetch path from gfid
    
    The gfid2path infra stores the "pargfid/bname" as on xattr
    value for each non directory entry. Hardlinks would have a
    separate xattr. This xattr key is internal and is not
    exposed to applications.  A virtual xattr is exposed for
    the applications to fetch the path from gfid.
    
    Internal xattr:
    trusted.gfid2path.<xxhash>
    
    Virtual xattr:
    glusterfs.gfidtopath
    
    getfattr -h -n glusterfs.gfidtopath /<aux-mnt>/.gfid/<gfid>
    
    If there are hardlinks, it returns all the paths separated
    by ':'.
    
    A volume set option is introduced to change the delimiter
    to required string of max length 7.
    
    gluster vol set gfid2path-separator ":::"
    
    > Updates: #139
    > Change-Id: Ie3b0c3fd8bd5333c4a27410011e608333918c02a
    > Signed-off-by: Kotresh HR <khiremat@redhat.com>
    > Reviewed-on: https://review.gluster.org/17785
    > Smoke: Gluster Build System <jenkins@build.gluster.org>
    > CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    > Reviewed-by: Krutika Dhananjay <kdhananj@redhat.com>
    
    Updates: #139
    Change-Id: Ie3b0c3fd8bd5333c4a27410011e608333918c02a
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: https://review.gluster.org/17921
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana@redhat.com>

commit a62b16b72b03f2b7c25c24ea9ac5968453a92009
Author: Kotresh HR <khiremat@redhat.com>
Date:   Thu Jun 29 05:45:34 2017 -0400

    storage/posix: New gfid2path infra
    
    With this infra, a new xattr is stored on each entry
    creation as below.
    
    trusted.gfid2path.<xxhash> = <pargfid>/<basename>
    
    If there are hardlinks, multiple xattrs would be present.
    
    Fops which are impacted:
    create, mknod, link, symlink, rename, unlink
    
    Option to enable:
    gluster vol set <VOLNAME> storage.gfid2path on
    
    Updates: #139
    Change-Id: I369974cd16703c45ee87f82e6c2ff5a987a6cc6a
    Signed-off-by: Kotresh HR <khiremat@redhat.com>
    Reviewed-on: https://review.gluster.org/17488
    Smoke: Gluster Build System <jenkins@build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
    Reviewed-by: Aravinda VK <avishwan@redhat.com>
    Reviewed-by: Amar Tumballi <amarts@redhat.com>

commit 960ecba01a057ff8a2fc356624720904419e7f49
Author: Ravishankar N <ravishankar@redhat.com>
Date:   Tue Jan 16 10:16:41 2018 +0530

    posix: delete stale gfid handles in nameless lookup
    
    ..in order for self-heal of symlinks to work properly (see BZ for
    details).
    
    Backport of https://review.gluster.org/#/c/19070/
    Signed-off-by: Ravishankar N <ravishankar@redhat.com>
    
    Change-Id: I9a011d00b07a690446f7fd3589e96f840e8b7501
    BUG: 1527309
    Reviewed-on: https://code.engineering.redhat.com/gerrit/127739
    Tested-by: Ravishankar Narayanankutty <ravishankar@redhat.com>
    Tested-by: RHGS Build Bot <nigelb@redhat.com>
    Reviewed-by: Ashish Pandey <aspandey@redhat.com>
    Reviewed-by: Ravishankar Narayanankutty <ravishankar@redhat.com>

There are also some gfid cleanup patches from Ravi. We need to inspect them a bit more deeply.

My suspicion is on gfid2path being enabled by default, and on commit 0499970747a7897b, which honours wildcards in getxattr.

Comment 197 Xavi Hernandez 2018-06-21 15:14:21 UTC
One consideration... we normally configure XFS with an inode size of 512 bytes. Is it possible that the addition of some extra extended attributes has exhausted the available space, and a second inode is being used for all/many files?

Maybe this could explain part of the increase in reads and writes without more requests being sent from user land.
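
A back-of-the-envelope check of this hypothesis (a sketch only). The constants here are assumptions from memory of the XFS on-disk format and should be verified with xfs_db on a real brick: a v5 inode core of 176 bytes leaves ~336 bytes of "literal area" in a 512-byte inode, shared between the data fork and inline (shortform) xattrs; shortform xattrs cost a 4-byte header plus ~3 bytes per entry; the "trusted." namespace prefix is encoded in a flag rather than stored in the name.

```python
# Estimate of inline xattr space in a 512-byte XFS inode (assumed constants).
INODE_SIZE = 512
INODE_CORE = 176          # assumed v5 inode core size
SF_HDR, SF_ENTRY = 4, 3   # assumed shortform attr header / per-entry overhead

def attr_bytes(xattrs):
    """Estimate inline space used by a list of (name, value_length) xattrs."""
    return SF_HDR + sum(SF_ENTRY + len(name) + vlen for name, vlen in xattrs)

basename = "file_00001"   # hypothetical 10-character smallfile name
xattrs = [
    ("gfid", 16),                                        # trusted.gfid
    ("gfid2path." + "0" * 16, 36 + 1 + len(basename)),   # "<pargfid>/<name>"
]

literal_area = INODE_SIZE - INODE_CORE
used = attr_bytes(xattrs)
print(f"literal area: {literal_area} B, xattrs: ~{used} B, "
      f"left for data fork: ~{literal_area - used} B")
```

Under these assumptions the gluster xattrs alone fit inline, but the data-fork extent records and any further xattrs (AFR changelogs, selinux, etc.) compete for the same space, so a spill to an out-of-line attr block on some files is plausible and worth checking.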

Comment 199 Karan Sandha 2018-06-21 19:42:42 UTC
Created attachment 1453586 [details]
comment#198_regression_free_run

Comment 200 Xavi Hernandez 2018-06-22 22:51:12 UTC
I've compared straces from 3.3.1 and 3.4.0, and the system calls they send are the same, in the same order. The only difference is an additional lsetxattr to set gfid2path on 3.4.0. However, it seems that disabling storage.gfid2path doesn't improve performance, based on previous comments.

However, the same set of operations seems to cause different workloads on disk. On 3.3.1 we can see a few reads, almost all of them concentrated at the beginning of the test. But on 3.4.0 we see reads happening constantly. Mixing reads with writes is generally bad for performance.

It seems as if the cache were being flushed more frequently (causing multiple writes of the same blocks after each update, increasing the amount of data written) and invalidated (causing additional reads to re-cache inodes).

If I understood it correctly, the environment (kernel, xfs, config, ...) of the test is the same for 3.3.1 and 3.4.0, so the difference must come from gluster.

Testing locally, I can reproduce the additional reads and the increase in writes when the system cache is full and the kernel starts flushing and invalidating cached data. At that point performance drops drastically, most probably because of the mixed reads and writes.

I see two possibilities here:

* Some change related to inode invalidation
* Changes in memory management that could cause earlier flush/invalidation of inodes.

I'll check commits related to these possibilities.

Comment 201 Raghavendra G 2018-06-23 09:22:31 UTC
(In reply to Xavi Hernandez from comment #200)
> I've compared straces from 3.3.1 and 3.4.0, and the system calls they send
> are the same and in the same order. The only difference is an additional
> lsetxattr to set gfid2path on 3.4.0. However it seems that disabling
> storage.gfid2path doesn't improve performance, based on previous comments.

From comment #28

<snip>

While debugging with Karan, we also found that disabling gfid2path improved the regression, but not completely. (Note that the exercise done in comment #18 was not accurate due to BZ 1564027, which needs to be addressed.) After properly disabling gfid2path, we got a throughput of 2311 files/sec for 3.4.0 in one of the runs (compared to 3743 files/sec on 3.3.1).

</snip>

That's around a 12% improvement with gfid2path disabled. Note that with it enabled, we get around 1858 files/sec (comment #18).
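
The relative numbers work out as follows when computed against the 3.3.1 baseline (a sketch; throughputs in files/sec are taken from comments #18 and #28):

```python
# Regression arithmetic for the smallfile create throughputs quoted above.
baseline = 3743.0   # rhgs-3.3.1
enabled = 1858.0    # rhgs-3.4.0, gfid2path enabled (comment #18)
disabled = 2311.0   # rhgs-3.4.0, gfid2path properly disabled (comment #28)

regression = (baseline - enabled) / baseline * 100   # with gfid2path enabled
remaining = (baseline - disabled) / baseline * 100   # with gfid2path disabled
improvement = (disabled - enabled) / baseline * 100  # gained by disabling it

print(f"regression with gfid2path enabled:  ~{regression:.0f}%")
print(f"regression with gfid2path disabled: ~{remaining:.0f}%")
print(f"improvement from disabling it:      ~{improvement:.0f}%")
```

So disabling gfid2path recovers about 12 points of the roughly 50% regression, leaving a ~38% regression still unexplained.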

Comment 202 Karan Sandha 2018-06-24 09:24:00 UTC
(In reply to Xavi Hernandez from comment #200)
> I've compared straces from 3.3.1 and 3.4.0, and the system calls they send
> are the same and in the same order. The only difference is an additional
> lsetxattr to set gfid2path on 3.4.0. However it seems that disabling
> storage.gfid2path doesn't improve performance, based on previous comments.
> 
> However the same set of operations seem to cause different workloads on
> disk. On 3.3.1 we can see few reads, and almost all of them concentrated at
> the beginning of the test. But on 3.4.0 we see that there are reads
> constantly. Mixing reads with writes is generally bad for performance.
> 
> It seems as if cache were flushed more frequently (causing multiple writes
> of the same blocks after being updated, increasing amount of data written)
> and invalidated (causing additional reads to cache inodes).
> 
> If I understood it correctly, the environment (kernel, xfs, config, ...) of
> the test is the same for 3.3.1 and 3.4.0, so the difference must come from
> gluster.

Karan:- Yes Xavi, the tests performed on 3.3.1 and 3.4.0 are exactly the same. We use the same RHEL ISO for both tests.
> 
> Testing locally I can reproduce the additional reads and the increment of
> writes when system cache is full and kernel starts flushing and invalidating
> cached data. At this time performance drops drastically, most probably
> caused by the mixed reads and writes that happen.
> 
> I see two possibilities here:
> 
> * Some change related to inode invalidation
> * Changes in memory management that could cause earlier flush/invalidation
> of inodes.
> 
> I'll check commits related with these possibilities.

Comment 219 Sunil Kumar Acharya 2018-06-27 03:05:55 UTC
Upstream patch: https://review.gluster.org/#/c/20392/

Comment 229 errata-xmlrpc 2018-09-04 06:44:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607