Description of problem:
------------------------
Testbed : 12*(4+2), 6 servers, 6 workload-generating clients.
Benchmark : 3.1.3 with io-threads enabled. 3.2 testing was done with io-threads enabled and mdcache parameters set.

It looks like we have regressed with 3.2 on large-file writes/random writes:

****************** Sequential Writes ******************
3.1.3 : 2838601.16 kB/sec
3.2   : 2506687.55 kB/sec
Regression : ~12%

****************** Random Writes ******************
3.1.3 : 617384.17 kB/sec
3.2   : 480226.17 kB/sec
Regression : ~22%

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.8.4-10.el7rhgs.x86_64

How reproducible:
------------------
100%

Actual results:
----------------
Regressions on sequential and random large-file writes.

Expected results:
-----------------
Regression threshold is within +-10%.

Additional info:
----------------
> Server profile will be attached.
> Client and server OS : RHEL 7.3
> *Vol config* :

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 4a377ad9-0c87-4553-b45f-95ab0590c055
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas001.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas016.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick8: gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick9: gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick10: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick11: gqas001.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick12: gqas016.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick13: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick14: gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick15: gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick16: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick17: gqas001.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick18: gqas016.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick19: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick20: gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick21: gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick22: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick23: gqas001.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick24: gqas016.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick25: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick26: gqas009.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick27: gqas010.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick28: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick29: gqas001.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick30: gqas016.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick31: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick32: gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick33: gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick34: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick35: gqas001.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick36: gqas016.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick37: gqas008.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick38: gqas009.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick39: gqas010.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick40: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick41: gqas001.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick42: gqas016.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick43: gqas008.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick44: gqas009.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick45: gqas010.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick46: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick47: gqas001.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick48: gqas016.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick49: gqas008.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick50: gqas009.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick51: gqas010.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick52: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick53: gqas001.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick54: gqas016.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick55: gqas008.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick56: gqas009.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick57: gqas010.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick58: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick59: gqas001.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick60: gqas016.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick61: gqas008.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick62: gqas009.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick63: gqas010.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas001.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas016.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas008.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick68: gqas009.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick69: gqas010.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas001.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas016.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
cluster.lookup-optimize: on
performance.client-io-threads: on
nfs.disable: off
performance.readdir-ahead: on
transport.address-family: inet
server.event-threads: 4
client.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-samba-metadata: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
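As a quick sanity check, the quoted regression percentages can be recomputed from the raw throughputs in the description (plain Python, just arithmetic on the numbers above):

```python
def regression_pct(baseline_kbs, new_kbs):
    """Percent throughput drop of the new build relative to the baseline."""
    return (baseline_kbs - new_kbs) / baseline_kbs * 100.0

seq_drop = regression_pct(2838601.16, 2506687.55)   # sequential writes, kB/sec
rand_drop = regression_pct(617384.17, 480226.17)    # random writes, kB/sec
print(f"sequential: {seq_drop:.1f}%  random: {rand_drop:.1f}%")
# sequential comes out at ~11.7% (reported as ~12%), random at ~22.2%
```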
After setting io-threads to off, I do not see any improvement (unlike on a *2 deployment).

Seq writes (io-threads off)    : 2581707.91 kB/sec
Random writes (io-threads off) : 485993.3 kB/sec
I tried on my laptop and could see a regression of around 22% for sequential writes using the dd command.

======================================
master -

[root@apandey testvol]# time dd if=/dev/urandom of=file count=1024000 bs=1024 conv=fdatasync
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 117.527 s, 8.9 MB/s

real    1m57.591s
user    0m0.680s
sys     0m18.235s
======================================
3.1.3 -

[root@apandey glusterfs]# cd /mnt/testvol/
[root@apandey testvol]# time dd if=/dev/urandom of=file count=1024000 bs=1024 conv=fdatasync
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 92.1061 s, 11.4 MB/s

real    1m32.116s
user    0m0.602s
sys     0m16.018s
======================================
profile for master -

   Block Size:    32768b+    65536b+
 No. of Reads:          0          0
No. of Writes:      19179          8

 %-latency   Avg-latency     Min-Latency     Max-Latency     No. of calls         Fop
 ---------   -----------     -----------     -----------     ------------        ----
      0.00       0.00 us         0.00 us         0.00 us                3      FORGET
      0.00       0.00 us         0.00 us         0.00 us               27     RELEASE
      0.00       0.00 us         0.00 us         0.00 us               50  RELEASEDIR
      0.00       2.17 us         2.00 us         3.00 us                6     OPENDIR
      0.00      31.00 us        31.00 us        31.00 us                1      STATFS
      0.00      42.67 us        28.00 us        58.00 us                3       FLUSH
      0.00     102.00 us        61.00 us       143.00 us                2     XATTROP
      0.00     270.00 us       270.00 us       270.00 us                1      CREATE
      0.00      30.08 us        23.00 us        45.00 us               12       FSTAT
      0.00     143.46 us        66.00 us       800.00 us               13        OPEN
      0.00      94.31 us        18.00 us       163.00 us               26    GETXATTR
      0.01     141.17 us        19.00 us       573.00 us               36      LOOKUP
      0.01     100.73 us        35.00 us       658.00 us               62    FXATTROP
      0.02     811.00 us       508.00 us      2052.00 us               12     READDIR
      0.10   58962.00 us     58962.00 us     58962.00 us                1    TRUNCATE
      0.63    2355.08 us        20.00 us     30162.00 us              158     INODELK
      5.82      51.46 us        31.00 us      5417.00 us            66437        READ
     18.37      51.48 us        28.00 us    433140.00 us           209666       WRITE
     75.03 22041646.50 us 4844215.00 us 39239078.00 us                  2       FSYNC

Profile for 3.1.3 -

   Block Size:    32768b+
 No. of Reads:          0
No. of Writes:         81

 %-latency   Avg-latency     Min-Latency     Max-Latency     No. of calls         Fop
 ---------   -----------     -----------     -----------     ------------        ----
      0.00       0.00 us         0.00 us         0.00 us                1     RELEASE
      0.00       0.00 us         0.00 us         0.00 us                3  RELEASEDIR
      0.00       2.00 us         2.00 us         2.00 us                2     OPENDIR
      0.00      35.00 us        35.00 us        35.00 us                1      STATFS
      0.00      52.00 us        52.00 us        52.00 us                1     XATTROP
      0.00      39.00 us        33.00 us        45.00 us                2       FLUSH
      0.00      42.75 us        27.00 us        51.00 us                4     INODELK
      0.00      88.50 us        40.00 us       137.00 us                2    FXATTROP
      0.00      48.25 us        19.00 us        96.00 us                4     READDIR
      0.00     349.00 us       349.00 us       349.00 us                1      CREATE
      0.00     103.25 us        25.00 us       238.00 us                4    GETXATTR
      0.00     102.45 us         9.00 us       291.00 us               11      LOOKUP
      5.51      46.41 us        28.00 us      5835.00 us            41210        READ
     23.69 8223891.00 us  8223891.00 us  8223891.00 us                  1       FSYNC
     70.79     196.77 us        25.00 us  19247646.00 us            124899       WRITE

Duration: 159 seconds
Data Read: 21098496 bytes
Data Written: 293858304 bytes

Clearly, the number of INODELK, FXATTROP and other fops has increased considerably on master.
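For reference, the ~22% figure follows directly from the two dd runs (1048576000 bytes copied in 117.527 s on master vs 92.1061 s on 3.1.3); a quick recomputation:

```python
BYTES_COPIED = 1048576000  # from the dd output above

master_mbs = BYTES_COPIED / 117.527 / 1e6   # dd reports decimal MB/s
v313_mbs = BYTES_COPIED / 92.1061 / 1e6
drop_pct = (v313_mbs - master_mbs) / v313_mbs * 100.0
print(f"master: {master_mbs:.1f} MB/s  3.1.3: {v313_mbs:.1f} MB/s  drop: {drop_pct:.1f}%")
# master lands at ~8.9 MB/s, 3.1.3 at ~11.4 MB/s, a drop of ~21.6%
```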
Possible RCA -

After implementing patch http://review.gluster.org/#/c/13733/, we set a dirty flag before writing to a file and remove it at the end. This creates an index entry in .glusterfs/indices/xattrop/ which remains there throughout the write fop. Every 60 seconds the shd wakes up, scans this entry and starts a heal. The heal in turn takes a lot of locks to find and heal the file, which raises the INODELK fop count and could be a possible culprit.

I disabled the shd and wrote a file -

time dd if=/dev/urandom of=a1 count=1024 bs=1M conv=fdatasync

The profile shows only 4 INODELK calls.

Brick: apandey:/brick/gluster/testvol-6
---------------------------------------
Cumulative Stats:
   Block Size:    32768b+    65536b+
 No. of Reads:          0          0
No. of Writes:       8188          2

 %-latency   Avg-latency     Min-Latency     Max-Latency     No. of calls         Fop
 ---------   -----------     -----------     -----------     ------------        ----
      0.00       0.00 us         0.00 us         0.00 us                1     RELEASE
      0.00      47.00 us        47.00 us        47.00 us                1      STATFS
      0.00      49.50 us        46.00 us        53.00 us                2       FLUSH
      0.00      38.00 us        26.00 us        52.00 us                4     INODELK
      0.00      92.50 us        85.00 us       100.00 us                2     XATTROP
      0.00     305.00 us       305.00 us       305.00 us                1      CREATE
      0.00     138.00 us        32.00 us       395.00 us                4    FXATTROP
      0.00     164.14 us       119.00 us       212.00 us                7      LOOKUP
      0.92      72.73 us        43.00 us      8431.00 us             8190       WRITE
     99.08 64142355.00 us 64142355.00 us 64142355.00 us                 1       FSYNC

With shd enabled it is around 54 -

Brick: apandey:/brick/gluster/testvol-1
---------------------------------------
Cumulative Stats:
   Block Size:    32768b+    65536b+
 No. of Reads:          0          0
No. of Writes:       8190          1

 %-latency   Avg-latency     Min-Latency     Max-Latency     No. of calls         Fop
 ---------   -----------     -----------     -----------     ------------        ----
      0.00       0.00 us         0.00 us         0.00 us                7     RELEASE
      0.00       0.00 us         0.00 us         0.00 us               21  RELEASEDIR
      0.00      30.00 us        30.00 us        30.00 us                1      STATFS
      0.00       5.76 us         2.00 us         9.00 us               21     OPENDIR
      0.00      64.50 us        30.00 us        99.00 us                2       FLUSH
      0.00      23.17 us        20.00 us        27.00 us                6       FSTAT
      0.00      95.50 us        89.00 us       102.00 us                2     XATTROP
      0.00     272.00 us       272.00 us       272.00 us                1      CREATE
      0.00      61.67 us        42.00 us        85.00 us                6        OPEN
      0.00      98.94 us        31.00 us       428.00 us               16    FXATTROP
      0.00      79.92 us        22.00 us       190.00 us               38      LOOKUP
      0.12    2379.48 us      1376.00 us      4600.00 us               42     READDIR
      0.74      74.70 us        42.00 us     49556.00 us             8191       WRITE
     10.29  163490.19 us        19.00 us   1405941.00 us               52     INODELK
     19.02  320668.04 us        26.00 us  15705174.00 us               49    GETXATTR
     69.83 57700430.00 us 57700430.00 us 57700430.00 us                 1       FSYNC
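The %-latency column in these profiles is simply each fop's share of the total brick-side latency (avg latency x call count). A minimal sketch over the four dominant fops from the master profile shown earlier; the shares come out slightly higher than the table's because the minor fops are omitted:

```python
# (avg latency in us, call count) for the dominant fops in the master profile
fops = {
    "READ": (51.46, 66437),
    "WRITE": (51.48, 209666),
    "INODELK": (2355.08, 158),
    "FSYNC": (22041646.50, 2),
}
total_us = sum(avg * n for avg, n in fops.values())
shares = {name: avg * n / total_us * 100.0 for name, (avg, n) in fops.items()}
for name, pct in shares.items():
    print(f"{name:8s} {pct:6.2f}%")   # FSYNC dominates at ~75%
```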
upstream patch : http://review.gluster.org/16377
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/95870/
Sequential write on 3.8.4-13 - 2709973.61 kB/sec, which is well within the perf regression threshold; verified.

Raised a separate BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1419549) for the random-write regression after discussion with Atin, Ashish and Rahul.
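Checking the verified build against the 3.1.3 baseline from the description confirms the drop is inside the +-10% threshold:

```python
baseline = 2838601.16   # 3.1.3 sequential writes, kB/sec
fixed = 2709973.61      # 3.8.4-13 sequential writes, kB/sec
drop_pct = (baseline - fixed) / baseline * 100.0
print(f"drop vs 3.1.3: {drop_pct:.2f}% (threshold: 10%)")
assert drop_pct < 10.0  # ~4.5%, well within threshold
```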
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html