Bug 1406723

Summary: [Perf] : significant Performance regression seen with disperse volume when compared with 3.1.3

Product: [Red Hat Storage] Red Hat Gluster Storage
Component: disperse
Version: rhgs-3.2
Reporter: Nag Pavan Chilakam <nchilaka>
Assignee: Ashish Pandey <aspandey>
QA Contact: Ambarish <asoman>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Keywords: Regression
Target Release: RHGS 3.2.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.8.4-17
CC: amukherj, asoman, aspandey, pkarampu, rcyriac, rhinduja, rhs-bugs, storage-qa-internal
Clones: 1408809 (view as bug list)
Bug Blocks: 1351528, 1408809
Type: Bug
Last Closed: 2017-03-23 05:58:45 UTC

Attachments: 3.2 numbers, 3.1.3 numbers

Description Nag Pavan Chilakam 2016-12-21 10:37:56 UTC
I did a comparison between 3.1.3's 3.7.9-12 build and 3.2's 3.8.4-9 build on an EC volume.
I observed a significant drop in performance; the numbers are below.

Setup info:
Creation of 10k zero-byte files:
3.1.3 ==> took 1 min 12 sec
3.2   ==> took 3 min 9 sec
---> that is almost a 60% drop


More numbers will be published soon

Comment 2 Nag Pavan Chilakam 2016-12-21 10:59:23 UTC
Below are the numbers
(in both cases, quota was enabled and USS was turned on)
2 x (4+2) volume on RHEL 7.3

File operation                               3.1.3           3.2
touch to create 10000 new files              1 min 12 sec    3 min 9 sec
linux untar of kernel 4.9                    25 min 23 sec   43 min 15 sec
ls -lRt of untarred directory                51 sec          59.3 sec
rm -rf of the 10k files                      50 sec          1 min 15 sec
stat * of the folder hosting 10000 files     8 sec           14 sec
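A minimal sketch of how the small-file operations above can be timed from the FUSE mount (the mount point /mnt/disperse and directory name dir1 below are placeholders, not necessarily the exact paths used in this run):

    # paths below are placeholders for the FUSE mount used in the test
    cd /mnt/disperse && mkdir dir1 && cd dir1
    time touch file{1..10000}        # create 10k zero-byte files
    time stat * > /dev/null          # metadata lookup of every file
    cd .. && time rm -rf dir1        # recursive delete of the 10k files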

Comment 3 Nag Pavan Chilakam 2016-12-21 11:26:16 UTC
Operation   3.1.3 (sec)   3.2 (sec)   Drop in performance (%)
touch       72            189         61.9
untar       1523          2595        41.3
ls          51            59          13.6
rm -rf      50            75          33.3
stat        8             14          42.9
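The "Drop in performance" column above is the slowdown measured against the 3.2 time; for example, for touch:

    # drop % = (t_3.2 - t_3.1.3) / t_3.2 * 100
    awk 'BEGIN { printf "%.1f\n", (189 - 72) / 189 * 100 }'   # prints 61.9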

Comment 4 Nag Pavan Chilakam 2016-12-21 11:26:53 UTC
3.1.3 numbers:
Run in the 3.2 time frame.

Setup info:
EC volume, build 3.7.9-12 (2nd async build after GA)
RHEL 7.3
6 VMs, 8 GB each

Client:
16 GB, RHEL 7.3

[root@dhcp35-37 ~]# gluster v info
 
Volume Name: disperse
Type: Distributed-Disperse
Volume ID: ccede272-2cde-4b55-be94-51581289eb56
Status: Started
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick2/disperse
Brick2: 10.70.35.116:/rhs/brick2/disperse
Brick3: 10.70.35.239:/rhs/brick2/disperse
Brick4: 10.70.35.135:/rhs/brick2/disperse
Brick5: 10.70.35.8:/rhs/brick2/disperse
Brick6: 10.70.35.196:/rhs/brick2/disperse
Brick7: 10.70.35.37:/rhs/brick3/disperse
Brick8: 10.70.35.116:/rhs/brick3/disperse
Brick9: 10.70.35.239:/rhs/brick3/disperse
Brick10: 10.70.35.135:/rhs/brick3/disperse
Brick11: 10.70.35.8:/rhs/brick3/disperse
Brick12: 10.70.35.196:/rhs/brick3/disperse
Options Reconfigured:
features.uss: enable
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
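
For completeness, a sketch of commands that would produce a volume with this layout and these options; the exact creation commands are not recorded in this bug, so treat this as an approximation:

    # brick order matches the Brick1..Brick12 listing above
    gluster volume create disperse disperse-data 4 redundancy 2 \
        10.70.35.{37,116,239,135,8,196}:/rhs/brick2/disperse \
        10.70.35.{37,116,239,135,8,196}:/rhs/brick3/disperse
    gluster volume start disperse
    gluster volume quota disperse enable                 # turns on features.quota / features.inode-quota
    gluster volume set disperse features.quota-deem-statfs on
    gluster volume set disperse features.uss enable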
[root@dhcp35-37 ~]# gluster v status
Status of volume: disperse
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick2/disperse      49154     0          Y       22200
Brick 10.70.35.116:/rhs/brick2/disperse     49154     0          Y       21974
Brick 10.70.35.239:/rhs/brick2/disperse     49154     0          Y       21982
Brick 10.70.35.135:/rhs/brick2/disperse     49154     0          Y       21966
Brick 10.70.35.8:/rhs/brick2/disperse       49154     0          Y       21998
Brick 10.70.35.196:/rhs/brick2/disperse     49154     0          Y       21999
Brick 10.70.35.37:/rhs/brick3/disperse      49155     0          Y       22219
Brick 10.70.35.116:/rhs/brick3/disperse     49155     0          Y       21993
Brick 10.70.35.239:/rhs/brick3/disperse     49155     0          Y       22001
Brick 10.70.35.135:/rhs/brick3/disperse     49155     0          Y       21985
Brick 10.70.35.8:/rhs/brick3/disperse       49155     0          Y       22017
Brick 10.70.35.196:/rhs/brick3/disperse     49155     0          Y       22018
Snapshot Daemon on localhost                49156     0          Y       22343
NFS Server on localhost                     2049      0          Y       22353
Self-heal Daemon on localhost               N/A       N/A        Y       22244
Quota Daemon on localhost                   N/A       N/A        Y       22298
Snapshot Daemon on 10.70.35.135             49156     0          Y       22089
NFS Server on 10.70.35.135                  2049      0          Y       22097
Self-heal Daemon on 10.70.35.135            N/A       N/A        Y       22012
Quota Daemon on 10.70.35.135                N/A       N/A        Y       22055
Snapshot Daemon on 10.70.35.196             49156     0          Y       22123
NFS Server on 10.70.35.196                  2049      0          Y       22131
Self-heal Daemon on 10.70.35.196            N/A       N/A        Y       22045
Quota Daemon on 10.70.35.196                N/A       N/A        Y       22086
Snapshot Daemon on 10.70.35.239             49156     0          Y       22105
NFS Server on 10.70.35.239                  2049      0          Y       22114
Self-heal Daemon on 10.70.35.239            N/A       N/A        Y       22028
Quota Daemon on 10.70.35.239                N/A       N/A        Y       22069
Snapshot Daemon on 10.70.35.116             49156     0          Y       22097
NFS Server on 10.70.35.116                  2049      0          Y       22105
Self-heal Daemon on 10.70.35.116            N/A       N/A        Y       22020
Quota Daemon on 10.70.35.116                N/A       N/A        Y       22061
Snapshot Daemon on 10.70.35.8               49156     0          Y       22121
NFS Server on 10.70.35.8                    2049      0          Y       22129
Self-heal Daemon on 10.70.35.8              N/A       N/A        Y       22048
Quota Daemon on 10.70.35.8                  N/A       N/A        Y       22088
 
Task Status of Volume disperse
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@dhcp35-37 ~]# cat /etc/redhat-*
cat: /etc/redhat-access-insights: Is a directory
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Red Hat Gluster Storage Server 3.1 Update 3
[root@dhcp35-37 ~]# rpm -qa|grep gluster
glusterfs-libs-3.7.9-12.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-fuse-3.7.9-12.el7rhgs.x86_64
glusterfs-client-xlators-3.7.9-12.el7rhgs.x86_64
glusterfs-server-3.7.9-12.el7rhgs.x86_64
python-gluster-3.7.9-12.el7rhgs.noarch
glusterfs-api-3.7.9-12.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-12.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.el7rhgs.noarch
glusterfs-3.7.9-12.el7rhgs.x86_64
glusterfs-cli-3.7.9-12.el7rhgs.x86_64
gluster-nagios-addons-0.2.7-1.el7rhgs.x86_64
[root@dhcp35-37 ~]# 

####### FINDINGS ###########################
Fuse mount:
(all bricks are up)
touch of 10k files in /rootOFvol/dir1/ ===> took 1 min 12 sec
[root@rhs-client45 dir1]# date;touch file{1..10000};date
Wed Dec 21 15:40:31 IST 2016
Wed Dec 21 15:41:43 IST 2016

Immediately after the above step, ls -lRt on the volume root ===> took less than 1 sec to start displaying and the output completed in about 1 sec
real	0m0.920s
user	0m0.084s
sys	0m0.126s

Immediately after the above step, find * on the volume root ===> took less than 1 sec to start displaying and the output completed in about 1 sec
real	0m0.863s
user	0m0.014s
sys	0m0.021s


Immediately after the above step, stat * on /rootOfvol/dir1/ ===> took about 1 sec to respond and 8 sec to complete the full output
real	0m8.358s
user	0m0.482s
sys	0m0.651s

Immediately after the above step, rm -rf on the volume root ===> took about 50.7 sec
[root@rhs-client45 disperse]# ls
dir1
[root@rhs-client45 disperse]# time rm -rf *
real	0m50.711s
user	0m0.045s
sys	0m0.841s


Linux untar:
Downloaded the 4.9 kernel tarball (89 MB) into /rootOFvol/dir2/ ===> the untarred folder size was 695 MB ===> untar took about 25 min 23 sec, as below
real	25m23.259s
user	0m14.129s
sys	0m23.510s


ls -lRt of the untarred folder took the following time to complete:
real	0m50.975s
user	0m0.701s
sys	0m1.292s
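
A sketch of how the untar and listing numbers can be reproduced (the tarball file name below is an assumption; only its 89 MB size is recorded above):

    cd /rootOFvol/dir2
    time tar xf linux-4.9.tar.xz          # ~89 MB tarball, ~695 MB once extracted (file name assumed)
    time ls -lRt linux-4.9 > /dev/null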

Comment 5 Nag Pavan Chilakam 2016-12-21 11:27:08 UTC
3.2 numbers:
Setup info
EC volume, build 3.8.4-9
RHEL 7.3
6 VMs, 8 GB each

Client:
16 GB, RHEL 7.3
 
Volume Name: disperse
Type: Distributed-Disperse
Volume ID: ef4f768e-4b10-4c81-8053-adafaa1183db
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick2/disperse
Brick2: 10.70.35.116:/rhs/brick2/disperse
Brick3: 10.70.35.239:/rhs/brick2/disperse
Brick4: 10.70.35.135:/rhs/brick2/disperse
Brick5: 10.70.35.8:/rhs/brick2/disperse
Brick6: 10.70.35.196:/rhs/brick2/disperse
Brick7: 10.70.35.37:/rhs/brick3/disperse
Brick8: 10.70.35.116:/rhs/brick3/disperse
Brick9: 10.70.35.239:/rhs/brick3/disperse
Brick10: 10.70.35.135:/rhs/brick3/disperse
Brick11: 10.70.35.8:/rhs/brick3/disperse
Brick12: 10.70.35.196:/rhs/brick3/disperse
Options Reconfigured:
features.uss: enable
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

####### FINDINGS ###########################
Fuse mount:
(all bricks are up)
touch of 10k files in /rootOFvol/dir1/ ===> took 3 min 9 sec
Immediately after the above step, ls -lRt on the volume root ===> took less than 1 sec to start displaying and the output completed in about 1 sec
Immediately after the above step, find * on the volume root ===> took less than 1 sec to start displaying and the output completed in about 1 sec
Immediately after the above step, stat * on /rootOfvol/dir1/ ===> took about 1 sec to respond and 14 sec to complete the full output
Immediately after the above step, rm -rf on the volume root ===> took about 1 min 15.7 sec
Linux untar:
Downloaded the 4.9 kernel tarball (89 MB) into /rootOFvol/dir2/ ===> the untarred folder size was 695 MB ===> untar took about 43 min 15 sec, as below
real    43m15.552s
user    0m16.014s
sys     0m35.983s

ls -lRt of the untarred folder took the following time to complete:
real    0m59.373s

Comment 7 Nag Pavan Chilakam 2016-12-21 13:15:47 UTC
For actual numbers you can refer to 
https://docs.google.com/spreadsheets/d/1T0pqXuL8mnIMMATwNGVWVTuJLSwziQvwi7LGr0LSgnk/edit#gid=0

Also attaching the 3.2 and 3.1.3 numbers separately.

Comment 8 Nag Pavan Chilakam 2016-12-21 13:16:12 UTC
Created attachment 1234395 [details]
3.2 numbers

Comment 9 Nag Pavan Chilakam 2016-12-21 13:16:44 UTC
Created attachment 1234396 [details]
3.1.3 numbers

Comment 10 Nag Pavan Chilakam 2016-12-21 13:29:04 UTC
I have rerun the measurements without quota/USS enabled and updated the results in
https://docs.google.com/spreadsheets/d/1T0pqXuL8mnIMMATwNGVWVTuJLSwziQvwi7LGr0LSgnk/edit#gid=0
I still see the degradation.
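
For reference, one way to reach a configuration without quota/USS from the volume described earlier (assumed commands; the rerun may equally have used a freshly created volume):

    gluster volume quota disperse disable
    gluster volume set disperse features.uss disable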

Comment 11 Ambarish 2016-12-26 09:32:09 UTC
**Perf Data on physical machines** :

*Setup/Environment details*: 


Testbed: 12 x (4+2), 6 servers, 6 workload-generating clients.

Benchmark: 3.1.3 with io-threads enabled.

3.2 testing was done with io-threads enabled and md-cache parameters set (see the sketch below).

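The exact md-cache parameters are not listed in this comment; a sketch of the md-cache/io-threads settings commonly used for such runs (option values below are assumptions):

    # <volname> is a placeholder; timeout/limit values are assumed, not taken from this run
    gluster volume set <volname> performance.client-io-threads on
    gluster volume set <volname> features.cache-invalidation on
    gluster volume set <volname> features.cache-invalidation-timeout 600
    gluster volume set <volname> performance.stat-prefetch on
    gluster volume set <volname> performance.cache-invalidation on
    gluster volume set <volname> performance.md-cache-timeout 600
    gluster volume set <volname> network.inode-lru-limit 50000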

***********
OBSERVATION
***********

**FUSE** :

---------
Creates 
---------

3.1.3 : 3445 files/sec
3.2   : 1841 files/sec 

Regression: -46%

--------
Renames 
--------

3.1.3 : 724 files/sec
3.2   : 592 files/sec

Regression: -18%

mkdir regression on FUSE and gNFS is tracked via https://bugzilla.redhat.com/show_bug.cgi?id=1408655

Comment 14 Atin Mukherjee 2016-12-27 12:10:30 UTC
Upstream patch : http://review.gluster.org/#/c/16298/

Comment 15 Atin Mukherjee 2017-03-03 06:48:11 UTC
A new upstream patch https://review.gluster.org/#/c/16821/ is posted with a different alternative.

Comment 18 Ambarish 2017-03-09 15:42:22 UTC
Small-file workloads are well within the regression threshold allowed for my runs.

Verified on 3.8.4-18.

Comment 20 errata-xmlrpc 2017-03-23 05:58:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html