Steps to Reproduce: (Not sure of the exact steps that led to this issue; here are the steps performed.)
===============================================================================
1. Create a 2 x 2 distribute-replicate volume and start the volume.
2. Create 2 FUSE and 2 NFS mounts from 2 client machines.
3. Start the script on the FUSE mounts from each client:
   "./create_dirs_files_multi_thread.py --number-of-threads 100 --num-files-per-dir 25 --min-file-size 1024 --max-file-size 10240 --starting-dir-num 1 --dir-depth 5 --dir-width 4"
   "./create_dirs_files_multi_thread.py --number-of-threads 100 --num-files-per-dir 25 --min-file-size 1024 --max-file-size 10240 --starting-dir-num 101 --dir-depth 5 --dir-width 4"
4. While creation is in progress, bring down brick1 and brick4 (the brick devices were crashed using the godown xfstests utility).
5. Start the script on the NFS mounts from each client:
   "./create_dirs_files_multi_thread.py --number-of-threads 100 --num-files-per-dir 25 --min-file-size 1024 --max-file-size 10240 --starting-dir-num 1 --dir-depth 5 --dir-width 4"
   "./create_dirs_files_multi_thread.py --number-of-threads 100 --num-files-per-dir 25 --min-file-size 1024 --max-file-size 10240 --starting-dir-num 101 --dir-depth 5 --dir-width 4"
6. On the other FUSE mount from both clients execute: "find . -type d | wc"
7. On the other NFS mount from both clients execute: "find . -type f | wc"
8. Add 4 bricks to the volume to make it a 4 x 2 distribute-replicate volume.
9. Bring back bricks brick1 and brick4.
10. Start rebalance.
11. While rebalance is in progress, rename files.
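The volume operations above can be sketched with the gluster CLI roughly as follows. This is a non-authoritative sketch: the server hostnames, brick paths, and volume name are hypothetical placeholders, and the file-creation script and godown steps are omitted.

```shell
# Step 1: 2 x 2 distribute-replicate volume (hypothetical hosts/paths)
gluster volume create testvol replica 2 \
    server1:/bricks/brick1 server2:/bricks/brick2 \
    server3:/bricks/brick3 server4:/bricks/brick4
gluster volume start testvol

# Step 2: FUSE and NFS (v3) mounts on the clients
mount -t glusterfs server1:/testvol /mnt/fuse1
mount -t nfs -o vers=3 server1:/testvol /mnt/nfs1

# Step 8: add 4 bricks to go from 2 x 2 to 4 x 2
gluster volume add-brick testvol replica 2 \
    server1:/bricks/brick5 server2:/bricks/brick6 \
    server3:/bricks/brick7 server4:/bricks/brick8

# Step 10: start rebalance
gluster volume rebalance testvol start
```

These commands only run against a live gluster cluster and are included here for orientation, not as an exact transcript of the original setup.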
RCA from https://bugzilla.redhat.com/show_bug.cgi?id=1141750#c3, which has the same RCA but a different manifestation.

Simpler test to re-create the bug:
0) Create a 1x2 replicate volume, start it and mount it.
1) Open a file 'a' from the mount and keep writing to it.
2) Bring one of the bricks down.
3) Rename the file '<mnt>/a' to '<mnt>/b'.
4) Wait for at least one write to complete while the brick is still down.
5) Restart the brick.
6) Wait until self-heal completes, then stop the writing from the mount point.

Root cause:
When the rename happens while the brick is down, then after the brick comes back up, entry self-heal is triggered on the parent directory where the rename happened, in this case <mnt>. As part of this entry self-heal, 1) file 'a' is deleted and 2) file 'b' is re-created. In parallel, 0) the writing fd needs to be re-opened on the file from the mount point. If the re-opening of the file in step 0) happens before step 1) of self-heal, this issue is observed: writes from the mount keep going to the file that was deleted, whereas the self-heal happens on the file created at step 2), so the checksums mismatch.

One more manifestation of this issue is https://bugzilla.redhat.com/show_bug.cgi?id=1139599, where writes from the mount only grow the file on the 'always up' brick while the file on the other brick does not grow. This leads to split-brain because of the size mismatch combined with an all-zero pending changelog.

It is a day-1 bug in the link-self-heal part of entry-self-heal; the bug has existed since 2012.
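The fd-versus-unlink behaviour underlying the root cause can be illustrated on any POSIX filesystem, independent of gluster: writes through an fd opened before the file was unlinked land on the orphaned inode, not on a newly created file of the same name. A minimal local sketch (plain local files, not a gluster mount; names are illustrative only):

```python
import os
import tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "a")

# Analogue of the writing mount: open 'a' and hold the fd.
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.write(fd, b"before heal\n")

# Analogue of entry self-heal: delete the file and re-create it fresh.
os.unlink(path)
new_fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.close(new_fd)

# Writes through the stale fd go to the unlinked inode...
os.write(fd, b"written after unlink\n")
os.close(fd)

# ...so the re-created file at 'path' stays empty: the same kind of
# size/checksum mismatch the RCA describes.
print(os.path.getsize(path))  # → 0
```

In the bug, the brick-side fd that the client re-opens plays the role of the stale fd here, and the file re-created by entry self-heal plays the role of the new, never-written file.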
This is a bug in the implementation of the following commit:

commit 1936e29c3ac3d6466d391545d761ad8e60ae2e03
Author: Pranith Kumar K <pranithk>
Date:   Wed Feb 29 16:31:18 2012 +0530

    cluster/afr: Hardlink Self-heal

    Change-Id: Iea0b38011edbc7fc9d75a95d1775139a5234e514
    BUG: 765391
    Signed-off-by: Pranith Kumar K <pranithk>
    Reviewed-on: http://review.gluster.com/2841
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Amar Tumballi <amarts>
    Reviewed-by: Vijay Bellur <vijay>

Very good testing, shwetha! Thanks a lot for the help in re-creating and condensing the test case, kritika!

Pranith
http://review.gluster.org/#/c/13001/