+++ This bug was initially created as a clone of Bug #980548 +++

Description of problem:
If we run tests/bugs/bug-888174.t of test framework in a while loop it fails intermittently.

Results of failed run:
Test Summary Report
-------------------
./tests/bugs/bug-888174.t (Wstat: 0 Tests: 25 Failed: 4)
  Failed tests: 22-25

Version-Release number of selected component (if applicable):

How reproducible:
intermittent

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

--- Additional comment from Anand Avati on 2013-07-02 13:08:38 EDT ---

REVIEW: http://review.gluster.org/5274 (cluster/afr: Make sure flush is unwound after post-op) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2013-07-02 20:30:35 EDT ---

REVIEW: http://review.gluster.org/5274 (cluster/afr: post-op should complete before starting flush) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

--- Additional comment from Anand Avati on 2013-07-03 03:40:48 EDT ---

COMMIT: http://review.gluster.org/5274 committed in master by Vijay Bellur (vbellur)
------
commit 29619b4ee78926160435da82f9db213161e040d4
Author: Pranith Kumar K <pkarampu>
Date:   Wed Jul 3 05:23:46 2013 +0530

    cluster/afr: post-op should complete before starting flush

    Problem:
    At the moment afr-flush makes sure that a delayed post-op is woken up
    but it does not wait for it to complete the post-op before flush
    unwinds. These are the steps that are happening:
    1) flush fop comes on an fd which wakes up a delayed post-op and
       continues with the flush fop.
    2) post-op sends fsync on the wire.
    3) flush completes and unwinds to fuse.
    4) graph switch happens on the fuse mount disconnecting the old
       graph's client connections to bricks.
    5) xattrop after fsync fails with ENOTCONN because the connections
       from old graph are taken down now.

    Fix:
    Wait for post-op to complete before starting to flush.
    We could make flush act similar to fsync (i.e.) wind flush as is but
    wait for post-op to complete before unwinding flush, but it is better
    to send flush as the final fop. So wind of flush will start after
    post-op is complete. Had to change fsync to accommodate this change.

    Change-Id: I93aa642647751969511718b0e137afbd067b388a
    BUG: 980548
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/5274
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Vijay Bellur <vbellur>
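The ordering problem described in the commit message can be illustrated with a toy model. This is only a sketch and not AFR code: `Brick`, `delayed_post_op` and `simulate` are hypothetical names, the post-op is reduced to "fsync, then xattrop", and the graph switch is reduced to dropping the connection.

```python
import errno


class Brick:
    """Hypothetical stand-in for an old-graph client connection to a brick."""

    def __init__(self):
        self.connected = True
        self.pending = 1  # one change-log entry still to be cleared

    def xattrop_clear(self):
        # Clearing the change-log needs a live connection.
        if not self.connected:
            return errno.ENOTCONN
        self.pending = 0
        return 0


def delayed_post_op(brick, log):
    """Toy post-op: fsync first, then the xattrop that clears the change-log."""
    log.append("fsync sent")
    yield  # fsync is on the wire; the xattrop runs when we are resumed
    rc = brick.xattrop_clear()
    log.append("xattrop ok" if rc == 0 else "xattrop failed: " + errno.errorcode[rc])


def simulate(wait_for_post_op):
    brick, log = Brick(), []
    op = delayed_post_op(brick, log)
    next(op)  # flush arrives and wakes up the delayed post-op
    if wait_for_post_op:
        for _ in op:  # fixed ordering: let the post-op finish first
            pass
    log.append("flush unwound")
    log.append("graph switch")
    brick.connected = False  # old graph's connections are taken down
    for _ in op:  # whatever is left of the post-op runs only now
        pass
    return brick.pending, log


print(simulate(wait_for_post_op=False))
print(simulate(wait_for_post_op=True))
```

In this model the first call leaves `pending` at 1 because the xattrop runs after the connection is gone, while the second call clears it because the post-op completes before flush is unwound.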
Found this issue on the build:
=============================
glusterfs 3.4.0.18rhs built on Aug 7 2013 08:02:45

Steps to recreate the bug:
==========================
1. Create 1 x 2 replicate volume.

2. Set write-behind "off" , eager-lock "on"

3. Start the volume

4. Create 2 fuse mount from a client. {RHEL 6.4}

5. Open fd for a file from each mount point :

exec 5>>testfile1 {fuse1}
exec 5>>testfile2 {fuse2}

6. from each of the mount point execute the following :{ execute 10 times from each mount point }

for i in `seq 1 10000`; do echo "$(($(date +%s%N)/1000000))" >&5 ; sleep 1 ; done &

7. After some time set write-behind to "on"

Actual result:
===============
The change-logs are not cleared. File on both the bricks are in fool-fool state.

Brick0:
=======
root@king [Aug-13-2013-11:27:31] >getfattr -d -e hex -m . /rhs/bricks/b0/testfile1
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b0/testfile1
trusted.afr.vol_rep-client-0=0x000000400000000000000000
trusted.afr.vol_rep-client-1=0x000000400000000000000000
trusted.gfid=0xf7f1d74d3e63447289a245a0659c5bd4

root@king [Aug-13-2013-11:27:33] >getfattr -d -e hex -m . /rhs/bricks/b0/testfile2
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b0/testfile2
trusted.afr.vol_rep-client-0=0x0000003c0000000000000000
trusted.afr.vol_rep-client-1=0x0000003c0000000000000000
trusted.gfid=0x2c687c491eb0416ca44c3586a1c579db

Brick1:
=======
root@hicks [Aug-13-2013-11:26:51] >getfattr -d -e hex -m . /rhs/bricks/b1/testfile1
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/testfile1
trusted.afr.vol_rep-client-0=0x000000400000000000000000
trusted.afr.vol_rep-client-1=0x000000400000000000000000
trusted.gfid=0xf7f1d74d3e63447289a245a0659c5bd4

root@hicks [Aug-13-2013-11:26:53] >getfattr -d -e hex -m . /rhs/bricks/b1/testfile2
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/b1/testfile2
trusted.afr.vol_rep-client-0=0x0000003c0000000000000000
trusted.afr.vol_rep-client-1=0x0000003c0000000000000000
trusted.gfid=0x2c687c491eb0416ca44c3586a1c579db

root@king [Aug-13-2013-11:27:49] >gluster volume info vol_rep

Volume Name: vol_rep
Type: Replicate
Volume ID: 7b421b27-1cdd-4b93-b486-e8d0c9968f4d
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: king:/rhs/bricks/b0
Brick2: hicks:/rhs/bricks/b1
Options Reconfigured:
cluster.eager-lock: on
performance.write-behind: on
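For reading the change-log values in the getfattr output above, here is a minimal sketch. It assumes the usual AFR layout, in which each trusted.afr.* value holds three big-endian 32-bit pending counters in the order data, metadata, entry. The helper name `decode_afr_changelog` is hypothetical and is not part of any Gluster tool.

```python
import struct


def decode_afr_changelog(hex_value):
    """Split a trusted.afr.* value, as printed by `getfattr -e hex`, into counters.

    Assumption: the value is 12 bytes, three big-endian unsigned 32-bit
    integers in the order data, metadata, entry.
    """
    text = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# Values copied from the getfattr output in this report.
testfile1 = decode_afr_changelog("0x000000400000000000000000")
testfile2 = decode_afr_changelog("0x0000003c0000000000000000")
print("testfile1:", testfile1)
print("testfile2:", testfile2)
```

Under that assumption, testfile1 shows 64 pending data operations and testfile2 shows 60, with no pending metadata or entry operations. The same non-zero data counter appears for both clients on both bricks, which is consistent with the change-logs not having been cleared.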
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you asked us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/ If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.