+++ This bug was initially created as a clone of Bug #1557906 +++
+++ This bug was initially created as a clone of Bug #1554743 +++

Description of problem:

Reads are only at 47 MB/s while writes are at 219 MB/s:

dd if=/dev/zero of=/media1/results/results/test-toberemoved/test.bin bs=1M count=1000 conv=fdatasync
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.785 s, 219 MB/s

echo 3 > /proc/sys/vm/drop_caches

dd if=/media1/results/results/test-toberemoved/test.bin of=/dev/null bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 22.1433 s, 47.4 MB/s

================================================================================

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Worker Ant on 2018-03-13 05:47:14 EDT ---

REVIEW: https://review.gluster.org/19703 (cluster/ec: Change default read policy to gfid-hash) posted (#2) for review on master by Ashish Pandey

--- Additional comment from Worker Ant on 2018-03-14 06:10:44 EDT ---

COMMIT: https://review.gluster.org/19703 committed in master by "Ashish Pandey" <aspandey> with a commit message:

cluster/ec: Change default read policy to gfid-hash

Problem:
Whenever data is read from a file over NFS, NFS reads more data than requested and caches it. Based on the stat information, it decides whether the cached/pre-read data is still valid. Consider a 4 + 2 EC volume with all bricks on different nodes. In EC, with the round-robin read policy, reads are sent to different sets of data bricks. This balances the read fops across all the bricks and avoids overloading the same set of bricks. However, due to small differences in clock speed, the bricks can report slightly different atime, mtime or ctime values. That can result in a different stat being returned to NFS, which then discards cached/pre-read data that has not actually changed and could have been reused.

Solution:
Change the default read policy for EC to gfid-hash.

Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
BUG: 1554743
Signed-off-by: Ashish Pandey <aspandey>

--- Additional comment from Worker Ant on 2018-03-19 04:53:29 EDT ---

REVIEW: https://review.gluster.org/19739 (cluster/ec: Change default read policy to gfid-hash) posted (#1) for review on release-4.0 by Ashish Pandey

--- Additional comment from Worker Ant on 2018-03-20 07:00:03 EDT ---

COMMIT: https://review.gluster.org/19739 committed in release-4.0 by "Ashish Pandey" <aspandey> with a commit message:

cluster/ec: Change default read policy to gfid-hash

Problem:
Whenever data is read from a file over NFS, NFS reads more data than requested and caches it. Based on the stat information, it decides whether the cached/pre-read data is still valid. Consider a 4 + 2 EC volume with all bricks on different nodes. In EC, with the round-robin read policy, reads are sent to different sets of data bricks. This balances the read fops across all the bricks and avoids overloading the same set of bricks. However, due to small differences in clock speed, the bricks can report slightly different atime, mtime or ctime values. That can result in a different stat being returned to NFS, which then discards cached/pre-read data that has not actually changed and could have been reused.

Solution:
Change the default read policy for EC to gfid-hash.
That forces all reads of a given file to go to the same set of bricks.

>Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
>BUG: 1554743
>Signed-off-by: Ashish Pandey <aspandey>

Change-Id: I825441cc519e94bf3dc3aa0bd4cb7c6ae6392c84
BUG: 1557906
Signed-off-by: Ashish Pandey <aspandey>
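Note: for checking or overriding the policy on a test setup, the standard gluster volume CLI can be used. A minimal sketch, assuming a started disperse (EC) volume with the hypothetical name "ecvol"; per the validation below, only round-robin and gfid-hash are accepted values:

# Show the current EC read policy; should report gfid-hash with this fix in place
gluster volume get ecvol disperse.read-policy

# Explicitly pick a policy, e.g. to compare against the old round-robin behaviour
gluster volume set ecvol disperse.read-policy round-robin
gluster volume set ecvol disperse.read-policy gfid-hash

# Reset the option so the shipped default applies again
gluster volume reset ecvol disperse.read-policy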
On_QA validation on 3.12.2-11: the P0 cases (i.e. the cases developed for testing this bug) are passing, so moving to VERIFIED.

##### TEST PLAN ###########

Due to changing the read-policy to gfid-hash, the read perf has improved (below is for a 10mb file).

tc#1 --> PASS (P0)
md5sum took the below times for congruent files (they are copies of each other):

[root@dhcp35-72 dd]# time md5sum file.3        ---------> with the now-default gfid-hash policy
e84853d61440dada29a64406f17de488  file.3

real    0m7.080s
user    0m0.195s
sys     0m0.080s

[root@dhcp35-72 dd]# time md5sum file.4        ----> with round-robin
e84853d61440dada29a64406f17de488  file.4

real    0m43.652s
user    0m0.207s
sys     0m0.297s

tc#2: PASS (P0)
Check the default of read-policy; it must be gfid-hash.

tc#3: PASS (P1)
Try setting read-policy to different values; it must allow only round-robin or gfid-hash.

[root@dhcp35-9 glusterfs]# gluster v get general all | grep gfid
cluster.randomize-hash-range-by-gfid    off
storage.build-pgfid                     off
storage.gfid2path                       on
storage.gfid2path-separator             :
disperse.read-policy                    gfid-hash
[root@dhcp35-9 glusterfs]# gluster v gset general disperse.read-policy
unrecognized word: gset (position 1)
[root@dhcp35-9 glusterfs]# gluster v gset general disperse.read-policy add
unrecognized word: gset (position 1)
[root@dhcp35-9 glusterfs]# gluster v set general disperse.read-policy add
volume set: failed: option read-policy add: 'add' is not valid (possible options are round-robin, gfid-hash.)

tc#4 --> PASS (P1)
Read the same file from multiple clients; this should have no impact, and both clients read from the same set of bricks. However, raised an RFE: BZ#1583662 - RFE: load-balance reads even when the read-policy is set to gfid-hash when multiple clients read the same file.

tc#5 --> PASS (P2)
Softlink to a file and read it? No problem, as it still reads from the source file.

tc#6 --> PASS (P0)
Have a file being read and bring one of the hashed bricks down; no EIO must be seen, as the non-hashed brick must start to serve data. Tested the above even with the NFS client cache disabled (passed). Also checked with 2 bricks down.

tc#7 --> PASS, but can be improved (P2)
Once the hashed brick comes up, check whether the hashed brick starts to serve the data again.
Result -> yes; for this reason, I raised bz#1583643 - avoid switching back to the gfid-hashed brick once it is online (up) and instead continue reads from the non-hashed brick.

[root@dhcp35-126 dispersevol1]# dd if=big-dd//10mb of=/dev/null bs=1024 count=10000000
10000000+0 records in
10000000+0 records out
10240000000 bytes (10 GB) copied, 570.747 s, 17.9 MB/s

tc#8: PASS (P2)
Bringing down a brick which is not hashed should not impact the read.

tc#9:
Raised bz#1583643 - avoid switching back to the gfid-hashed brick once it is online (up) and instead continue reads from the non-hashed brick.

Also raised the below BZ:
1583667 - nfs logs flooded with "Connection refused); disconnecting socket" even after the brick is up, due to stale sockets
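Note: the read numbers can be re-checked on any EC mount with the same plain dd sequence as in the original description. A minimal sketch, assuming a hypothetical client mount point /mnt/ecvol with room for a 1 GB test file; absolute throughput will of course differ per setup:

# Write a 1 GB file and flush it to the bricks so the write figure is meaningful
dd if=/dev/zero of=/mnt/ecvol/test.bin bs=1M count=1000 conv=fdatasync

# Drop the client page cache so the read actually goes to the bricks
echo 3 > /proc/sys/vm/drop_caches

# Read the file back and compare the reported MB/s with the write run
dd if=/mnt/ecvol/test.bin of=/dev/null bs=1M count=1000

# Optionally time a checksum of the file, as in tc#1 above
time md5sum /mnt/ecvol/test.bin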
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607