Description of problem:
When a brick is killed in a replica and `dd` is run, we see a lot of FSTATs being sent over the network, with a small (but very real) reduction in write throughput.

Throughput and profile info on a random brick when all bricks are up:
======================================================================
root@tuxpad fuse_mnt$ dd if=/dev/zero of=FILE bs=1024 count=10240
10240+0 records in
10240+0 records out
10485760 bytes (10 MB) copied, 2.9847 s, 3.5 MB/s

Brick: 127.0.0.2:/home/ravi/bricks/brick1
-----------------------------------------
Cumulative Stats:
   Block Size:               1024b+
 No. of Reads:                    0
No. of Writes:                10240

 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us              2     RELEASE
      0.00       0.00 us       0.00 us       0.00 us              6  RELEASEDIR
      0.01      30.50 us      27.00 us      34.00 us              2     INODELK
      0.01      70.00 us      70.00 us      70.00 us              1        OPEN
      0.01      35.50 us      19.00 us      52.00 us              2       FLUSH
      0.03      96.50 us      83.00 us     110.00 us              2    GETXATTR
      0.04     253.00 us     253.00 us     253.00 us              1    TRUNCATE
      0.07     225.50 us     202.00 us     249.00 us              2    FXATTROP
      0.15     153.17 us      47.00 us     656.00 us              6      STATFS
      0.17     537.00 us     207.00 us     867.00 us              2     XATTROP
      0.22     685.00 us      22.00 us    1348.00 us              2    FINODELK
      0.62     255.67 us     104.00 us     928.00 us             15      LOOKUP
     98.66      59.72 us      35.00 us    4772.00 us          10240       WRITE

    Duration: 673 seconds
   Data Read: 0 bytes
Data Written: 10485760 bytes

Throughput and profile info of one of the 'up' bricks when one brick is down:
==============================================================================
root@tuxpad fuse_mnt$ dd if=/dev/zero of=FILE bs=1024 count=10240
10240+0 records in
10240+0 records out
10485760 bytes (10 MB) copied, 4.24494 s, 2.5 MB/s

Brick: 127.0.0.2:/home/ravi/bricks/brick1
-----------------------------------------
Cumulative Stats:
   Block Size:               1024b+
 No. of Reads:                    0
No. of Writes:                10240

 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us              2     RELEASE
      0.00       0.00 us       0.00 us       0.00 us              5  RELEASEDIR
      0.01      98.00 us      98.00 us      98.00 us              1        OPEN
      0.01      57.50 us      43.00 us      72.00 us              2     INODELK
      0.01     126.00 us     126.00 us     126.00 us              1    GETXATTR
      0.02     184.00 us     184.00 us     184.00 us              1    TRUNCATE
      0.02     113.00 us     109.00 us     117.00 us              2    FXATTROP
      0.02     122.00 us      16.00 us     228.00 us              2       FLUSH
      0.02     132.00 us      38.00 us     226.00 us              2    FINODELK
      0.08     418.00 us     283.00 us     553.00 us              2     XATTROP
      0.21     763.00 us     122.00 us    1630.00 us              3      LOOKUP
     41.23      44.83 us      36.00 us     490.00 us          10240       WRITE
     58.38      63.47 us      46.00 us     888.00 us          10240       FSTAT

    Duration: 75 seconds
   Data Read: 0 bytes
Data Written: 10485760 bytes
REVIEW: http://review.gluster.org/16309 (afr: Avoid resetting event_gen when brick is always down) posted (#1) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/16309 (afr: Avoid resetting event_gen when brick is always down) posted (#2) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/16309 (afr: Avoid resetting event_gen when brick is always down) posted (#3) for review on master by Ravishankar N (ravishankar)
REVIEW: http://review.gluster.org/16309 (afr: Avoid resetting event_gen when brick is always down) posted (#4) for review on master by Ravishankar N (ravishankar)
COMMIT: http://review.gluster.org/16309 committed in master by Jeff Darcy (jdarcy)
------
commit 522640be476a3f97dac932f7046f0643ec0ec2f2
Author: Ravishankar N <ravishankar>
Date:   Fri Dec 30 14:57:17 2016 +0530

    afr: Avoid resetting event_gen when brick is always down

    Problem:
    __afr_set_in_flight_sb_status(), which resets event_gen to zero, is
    called if failed_subvols[i] is non-zero for any brick. But
    failed_subvols[i] is true even if the brick was down *before* the
    transaction started. Hence, say, if 1 brick is down in a replica-3,
    every writev that comes will trigger an inode refresh because of this
    resetting, as seen from the no. of FSTATs in the profile info in the BZ.

    Fix:
    Reset event_gen only if the brick was previously a valid read child and
    the FOP failed on it the first time.

    Also `s/afr_inode_read_subvol_reset/afr_inode_event_gen_reset` because
    the function only resets the event gen and not the data/metadata
    readable.

    Change-Id: I603ae646cbde96995c35db77916e2ed80b602a91
    BUG: 1409206
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: http://review.gluster.org/16309
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
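The gist of the fix can be illustrated with a small, self-contained C model of the event_gen reset decision. The struct and function names below are illustrative only and do not mirror the actual AFR code; see the patch at http://review.gluster.org/16309 for the real change.

/* Simplified model of the event_gen reset decision described in the commit
 * message above. Names here are illustrative, not the actual AFR symbols. */
#include <stdbool.h>
#include <stdio.h>

#define REPLICA_COUNT 3

struct brick_state {
    bool failed_in_fop; /* FOP failed on this brick (failed_subvols[i]) */
    bool was_readable;  /* brick was a valid read child before the FOP */
};

/* Old behaviour: any failed subvol triggers the reset, even if that brick
 * was already down before the transaction started. */
static bool old_should_reset_event_gen(const struct brick_state *b, int n)
{
    for (int i = 0; i < n; i++)
        if (b[i].failed_in_fop)
            return true;
    return false;
}

/* Fixed behaviour: reset only when a previously readable brick fails for
 * the first time, so a brick that was always down does not force an inode
 * refresh (and hence an FSTAT) on every write. */
static bool new_should_reset_event_gen(const struct brick_state *b, int n)
{
    for (int i = 0; i < n; i++)
        if (b[i].failed_in_fop && b[i].was_readable)
            return true;
    return false;
}

int main(void)
{
    /* Brick 0 was down before dd started: it fails the FOP but was never
     * a valid read child for this inode. */
    struct brick_state bricks[REPLICA_COUNT] = {
        { .failed_in_fop = true,  .was_readable = false },
        { .failed_in_fop = false, .was_readable = true  },
        { .failed_in_fop = false, .was_readable = true  },
    };

    printf("old logic resets event_gen: %s\n",
           old_should_reset_event_gen(bricks, REPLICA_COUNT) ? "yes" : "no");
    printf("new logic resets event_gen: %s\n",
           new_should_reset_event_gen(bricks, REPLICA_COUNT) ? "yes" : "no");
    return 0;
}

With the old check, the permanently-down brick trips the reset on every WRITE, forcing an inode refresh and the extra FSTAT per WRITE seen in the profile output above; with the new check it does not.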
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/