Bug 1559084

Summary:	[EC] Read performance of EC volume exported over gNFS is significantly lower than write performance
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Ashish Pandey <aspandey>
Component:	disperse	Assignee:	Ashish Pandey <aspandey>
Status:	CLOSED ERRATA	QA Contact:	Nag Pavan Chilakam <nchilaka>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.4	CC:	bturner, bugs, jahernan, pkarampu, rhinduja, rhs-bugs, sheggodu, storage-qa-internal, ubansal
Target Milestone:	---
Target Release:	RHGS 3.4.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-3.12.2-6	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1557906	Environment:
Last Closed:	2018-09-04 06:44:20 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1554743, 1557906, 1558352
Bug Blocks:	1503137, 1557904

Description Ashish Pandey 2018-03-21 16:39:25 UTC

+++ This bug was initially created as a clone of Bug #1557906 +++

+++ This bug was initially created as a clone of Bug #1554743 +++

Description of problem:
Reads are only at 47MB/s while writes are at 219MB/s:



dd if=/dev/zero of=/media1/results/results/test-toberemoved/test.bin bs=1M count=1000 conv=fdatasync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.785 s, 219 MB/s

echo 3 > /proc/sys/vm/drop_caches

dd if=/media1/results/results/test-toberemoved/test.bin of=/dev/null bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 22.1433 s, 47.4 MB/s
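
The rates dd reports can be re-derived from the byte counts and elapsed times above, which also gives the write-to-read ratio. A small sketch follows; the helper name `mbps` is made up for illustration, and it assumes decimal megabytes (1 MB = 1000000 bytes), which is what dd prints here.

```shell
# Throughput in MB/s (decimal megabytes, as dd reports) from bytes and seconds.
# "mbps" is a hypothetical helper name, not part of any tool.
mbps() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

write_rate=$(mbps 1048576000 4.785)    # figures from the write test above
read_rate=$(mbps 1048576000 22.1433)   # figures from the read test above

# Ratio of write to read throughput.
awk -v w="$write_rate" -v r="$read_rate" 'BEGIN { printf "write/read ratio: %.1fx\n", w / r }'
```

With the figures above, writes come out roughly 4.6 times faster than reads.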
================================================================================


--- Additional comment from Worker Ant on 2018-03-13 05:47:14 EDT ---

REVIEW: https://review.gluster.org/19703 (cluster/ec: Change default read policy to gfid-hash) posted (#2) for review on master by Ashish Pandey

--- Additional comment from Worker Ant on 2018-03-14 06:10:44 EDT ---

COMMIT: https://review.gluster.org/19703 committed in master by "Ashish Pandey" <aspandey> with a commit message- cluster/ec: Change default read policy to gfid-hash

Problem:
Whenever we read data from a file over NFS, NFS reads
more data than requested and caches it. It then uses
the stat information to check whether the cached/pre-read
data is still valid.

Consider a 4 + 2 EC volume with all the bricks on
different nodes.

In EC, with the round-robin read policy, successive reads
are sent to different sets of data bricks. This balances
the read fops across all the bricks and avoids heating up
(overloading) the same set of bricks.

Due to small clock differences between nodes, it is
possible that atime, mtime or ctime differ slightly
between bricks. A different stat might therefore be
returned to NFS, based on which NFS will discard
cached/pre-read data that has not actually changed and
could have been used.

Solution:
Change the default read policy for EC to gfid-hash. That
forces all reads of a file to go to the same set of bricks.

Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
BUG: 1554743
Signed-off-by: Ashish Pandey <aspandey>
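
One way to observe the timestamp skew described in the commit message is to compare the atime, mtime and ctime of the same file's fragment on several bricks. The sketch below assumes GNU `stat` and that the paths passed in are readable from where it runs. For bricks on other nodes, the same `stat` call would have to be run on each node, for example over ssh. The function names and the example paths are hypothetical.

```shell
# Print atime, mtime and ctime (seconds since the epoch) and the name
# of each path given. Requires GNU stat for the -c format option.
brick_times() {
    for f in "$@"; do
        stat -c '%X %Y %Z %n' "$f"
    done
}

# Succeeds only when every path reports identical atime/mtime/ctime.
times_match() {
    [ "$(brick_times "$@" | awk '{ print $1, $2, $3 }' | sort -u | wc -l)" -eq 1 ]
}

# Example with hypothetical brick paths:
#   times_match /bricks/b1/test.bin /bricks/b2/test.bin || echo "timestamps differ"
```

If the timestamps differ, reads served by different bricks can plausibly hand NFS different stat data for an unchanged file, which is the situation the commit message describes.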

--- Additional comment from Worker Ant on 2018-03-19 04:53:29 EDT ---

REVIEW: https://review.gluster.org/19739 (cluster/ec: Change default read policy to gfid-hash) posted (#1) for review on release-4.0 by Ashish Pandey

--- Additional comment from Worker Ant on 2018-03-20 07:00:03 EDT ---

COMMIT: https://review.gluster.org/19739 committed in release-4.0 by "Ashish Pandey" <aspandey> with a commit message- cluster/ec: Change default read policy to gfid-hash

Problem:
Whenever we read data from a file over NFS, NFS reads
more data than requested and caches it. It then uses
the stat information to check whether the cached/pre-read
data is still valid.

Consider a 4 + 2 EC volume with all the bricks on
different nodes.

In EC, with the round-robin read policy, successive reads
are sent to different sets of data bricks. This balances
the read fops across all the bricks and avoids heating up
(overloading) the same set of bricks.

Due to small clock differences between nodes, it is
possible that atime, mtime or ctime differ slightly
between bricks. A different stat might therefore be
returned to NFS, based on which NFS will discard
cached/pre-read data that has not actually changed and
could have been used.

Solution:
Change the default read policy for EC to gfid-hash. That
forces all reads of a file to go to the same set of bricks.

>Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
>BUG: 1554743
>Signed-off-by: Ashish Pandey <aspandey>

Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
BUG: 1557906
Signed-off-by: Ashish Pandey <aspandey>

Comment 8 Nag Pavan Chilakam 2018-05-29 15:46:09 UTC

ON_QA validation on 3.12.2-11.
As the P0 cases (that is, the cases developed for testing this bug) are passing, moving to VERIFIED.

##### TEST PLAN ###########
After changing read-policy to gfid-hash, read performance has improved
(the timings below are for a 10mb file)


tc#1-->PASS (P0)
md5sum took the times below for identical files (they are copies of each other)

[root@dhcp35-72 dd]# time md5sum file.3  --------->with the now-default gfid-hash policy
e84853d61440dada29a64406f17de488  file.3

real	0m7.080s
user	0m0.195s
sys	0m0.080s
[root@dhcp35-72 dd]# time md5sum file.4  ---->with round-robin
e84853d61440dada29a64406f17de488  file.4

real	0m43.652s
user	0m0.207s
sys	0m0.297s


tc#2:PASS (P0)
check the default value of read-policy; it must be gfid-hash

tc#3:PASS  (P1)
try setting read-policy to different values; only round-robin and gfid-hash must be accepted

[root@dhcp35-9 glusterfs]# gluster v get general all|grep gfid
cluster.randomize-hash-range-by-gfid    off                                     
storage.build-pgfid                     off                                     
storage.gfid2path                       on                                      
storage.gfid2path-separator             :                                       
disperse.read-policy                    gfid-hash                               
[root@dhcp35-9 glusterfs]# gluster v gset general disperse.read-policy
unrecognized word: gset (position 1)
[root@dhcp35-9 glusterfs]# gluster v gset general disperse.read-policy add
unrecognized word: gset (position 1)
[root@dhcp35-9 glusterfs]# gluster v set general disperse.read-policy add
volume set: failed: option read-policy add: 'add' is not valid (possible options are round-robin, gfid-hash.)
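
The checks in tc#2 and tc#3 can be scripted. This is a minimal sketch, assuming the `gluster volume get <volume> <option>` and `gluster volume set <volume> <option> <value>` forms of the CLI (the transcript above uses the `v` abbreviation). The function names and the `GLUSTER` override variable are hypothetical and exist only so the command can be swapped for a stub.

```shell
# Command to run; override GLUSTER to point at a different binary or a stub.
GLUSTER=${GLUSTER:-gluster}

# Print the current disperse.read-policy value of volume $1.
get_read_policy() {
    "$GLUSTER" volume get "$1" disperse.read-policy |
        awk '$1 == "disperse.read-policy" { print $2 }'
}

# Set disperse.read-policy on volume $1 to $2, accepting only the two
# values the CLI lists as valid in the error message above.
set_read_policy() {
    case "$2" in
        round-robin|gfid-hash) "$GLUSTER" volume set "$1" disperse.read-policy "$2" ;;
        *) echo "invalid read policy: $2" >&2; return 1 ;;
    esac
}
```

For the volume in the transcript, `get_read_policy general` would report the active policy, and `set_read_policy general round-robin` would switch it back to the old default for comparison runs.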



tc#4: ->PASS (P1), however raised an RFE BZ#1583662 - RFE: load-balance reads even when the read-policy is set to gfid-hash when multiple clients read same file
read the same file from multiple clients; there should be no impact, and both clients read from the same set of bricks


tc#5-->PASS (P2)
create a softlink to a file and read through it
no problem, as it still reads from the source file

tc#6->PASS (P0)
while a file is being read, bring down one of the hashed bricks; no EIO must be seen, as a non-hashed brick must start to serve the data
Also tested with the NFS client cache disabled (passed)
Also checked with 2 bricks down

tc#7->Pass but can be improved (P2)
Once the hashed brick comes up, check whether it starts to serve the data again
Result->yes; for this reason, I raised bz#1583643 - avoid switching back to the gfid-hashed brick once it is online(up) and instead continue reads from non-hashed brick

[root@dhcp35-126 dispersevol1]# dd if=big-dd//10mb of=/dev/null bs=1024 count=10000000
10000000+0 records in
10000000+0 records out
10240000000 bytes (10 GB) copied, 570.747 s, 17.9 MB/s


tc#8:->PASS (P2)
bringing down a brick which is not hashed should not impact the read

tc#9:
raised bz#1583643 - avoid switching back to the gfid-hashed brick once it is online(up) and instead continue reads from non-hashed brick
	
Also raised the BZ below:
1583667 - nfs logs flooded with "Connection refused); disconnecting socket" even after the brick is up due to stale sockets

Comment 10 errata-xmlrpc 2018-09-04 06:44:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607