Bug 1136718

Summary: DHT + AFR: File is truncated to a smaller size while rename and self-heal are in progress and the file is copied/renamed to another directory
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rachana Patel <racpatel>
Component: distribute
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED WONTFIX
QA Contact: Prasad Desala <tdesala>
Severity: urgent
Priority: high
Version: rhgs-3.0
CC: ashetty, asriram, mzywusko, nbalacha, rgowdapp, rhs-bugs, smohan, tdesala
Keywords: ZStream
Hardware: x86_64
OS: Linux
Whiteboard: dht-data-loss, dht-3.2.0-proposed
Doc Type: Known Issue
Doc Text:
AFR self-heal can leave behind a partially healed file if the brick containing the self-heal source file goes down in the middle of the heal operation. If this partially healed file is migrated before the brick that went down comes back online, the migrated file will have incorrect data and the original file will be deleted.
Last Closed: 2018-04-16 18:05:58 UTC
Type: Bug
Bug Blocks: 1087818

Description Rachana Patel 2014-09-03 07:09:08 UTC
Description of problem:
=======================
A file is truncated to a smaller size while rename and self-heal are in progress and the file is copied/renamed to another directory.

Version-Release number of selected component (if applicable):
=============================================================
3.6.0.27-6.el6rhs.x86_64

How reproducible:
=================
intermittent

Steps to Reproduce:
==================
1. Created a 9x2 distributed-replicate volume using 4 servers.
2. FUSE mounted the volume on 4 servers and NFS mounted it on 2 servers.
3. Killed one brick from each replica pair and created a few files from the FUSE mounts:
   62 zero-byte files and 62 files with sizes in the MB range,
   and created a hard link for each file.
4. Started all bricks using the start force option.
5. Started renaming all files (from all 6 mounts) in a loop.
6. Added bricks and ran rebalance.
7. Once rebalance was done, killed a brick on one of the servers using pkill glusterfsd.
8. Started all bricks using the start force option.
9. Terminated the rename loop, then started the rename loop again.
10. Repeated steps 6 to 8 two or three times, killing bricks on different servers
(usually I killed one brick from each replica pair; only once did I bring a whole replica pair down)
-- no data loss so far
11. Now brought down one brick from each replica pair (dht16 and dht19).
12. Kept renaming files from the mount points.
13. Copied all files from all four FUSE mounts to one directory and then moved the files to another directory (a sketch of the setup commands behind steps 1-11 follows the commands below).
[root@dht16 ks]# cp -f * test1/ ; for j in {1..62}; do mv -f test1/zero$j-* test2/ ; mv -f test1/mb$j-* test2/ ; mv -f test1/zeroln$j-* test2/ ; mv -f test1/mbln$j-* test2/ ; done
After a while, run the same command again from all FUSE mounts:
cp -f * test1/ ; for j in {1..62}; do mv -f test1/zero$j-* test2/ ; mv -f test1/mb$j-* test2/ ; mv -f test1/zeroln$j-* test2/ ; mv -f test1/mbln$j-* test2/ ; done
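
For reference, a minimal sketch of the setup and fault-injection commands behind steps 1-11, assuming hypothetical server names (server1..server4), brick paths (/bricks/bN), volume name testvol, and mount point /mnt/ks; the exact names used in this run are not recorded in the report:

# Step 1: create a 9x2 distributed-replicate volume (18 bricks across 4
# servers; consecutive bricks form replica pairs, brace expansion yields
# the hostnames)
gluster volume create testvol replica 2 \
    server{1,2}:/bricks/b1 server{3,4}:/bricks/b1 \
    server{1,2}:/bricks/b2 server{3,4}:/bricks/b2 \
    server{1,2}:/bricks/b3 server{3,4}:/bricks/b3 \
    server{1,2}:/bricks/b4 server{3,4}:/bricks/b4 \
    server{1,2}:/bricks/b5
gluster volume start testvol

# Step 2: FUSE mount on the 4 servers, NFSv3 mount on the 2 clients
# (both shown against the same mount point for brevity)
mount -t glusterfs server1:/testvol /mnt/ks
mount -t nfs -o vers=3 server1:/testvol /mnt/ks

# Step 3: create the test files and hard links from a FUSE mount
# (40 MB matches the 41943040-byte size seen in the output below)
for j in {1..62}; do
    touch /mnt/ks/zero$j-0
    dd if=/dev/urandom of=/mnt/ks/mb$j-0 bs=1M count=40
    ln /mnt/ks/zero$j-0 /mnt/ks/zeroln$j-0
    ln /mnt/ks/mb$j-0 /mnt/ks/mbln$j-0
done

# Steps 3/7/11: kill a brick process (PIDs are listed by volume status)
gluster volume status testvol
kill -9 <brick-pid>          # or: pkill glusterfsd

# Steps 4/8: restart the killed bricks
gluster volume start testvol force

# Steps 5/9: rename loop, run from every mount point; replaces the
# numeric suffix after the last dash, matching names like mb1-85
while true; do
    for f in /mnt/ks/{zero,mb,zeroln,mbln}*; do
        mv -f "$f" "${f%-*}-$RANDOM"
    done
done

# Step 6: add a replica pair and rebalance
gluster volume add-brick testvol server{1,2}:/bricks/b6
gluster volume rebalance testvol start
gluster volume rebalance testvol status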

one file was truncated from 41943040 bytes to 31981568 bytes:
[root@dht16 test2]# md5sum /mnt/ks/mb1-* /mnt/ks/test1/mb1-* /mnt/ks/test2/mb1-*
b76bd9df3e5998d4959b56dcb7602a6e  /mnt/ks/mb1-85
md5sum: /mnt/ks/test1/mb1-*: No such file or directory
ffb6ac6fb57543748103f77b4df93215  /mnt/ks/test2/mb1-30
b76bd9df3e5998d4959b56dcb7602a6e  /mnt/ks/test2/mb1-31
[root@dht16 test2]# ls -li  /mnt/ks/mb1-* /mnt/ks/test1/mb1-* /mnt/ks/test2/mb1-*
ls: cannot access /mnt/ks/test1/mb1-*: No such file or directory
11123811650566741328 -rw-r--r-- 2 root root 41943040 Aug 31 09:34 /mnt/ks/mb1-85
10308136675813588637 -rw-r--r-- 1 root root 31981568 Aug 31 10:20 /mnt/ks/test2/mb1-30   <-----------
11266794036804784411 -rw-r--r-- 1 root root 41943040 Aug 31 10:21 /mnt/ks/test2/mb1-31
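
To pin down which brick holds the truncated copy, the file can be inspected directly on the backend bricks. A sketch, assuming the hypothetical brick paths (/bricks/bN) from above; trusted.glusterfs.dht.linkto marks a DHT linkto file, and non-zero trusted.afr.* changelog xattrs indicate a heal that never completed:

# Run on each server; compare size and checksum of every brick copy
ls -li /bricks/b*/test2/mb1-30 2>/dev/null
md5sum /bricks/b*/test2/mb1-30 2>/dev/null

# Dump xattrs of a brick copy: look for trusted.glusterfs.dht.linkto
# and pending trusted.afr.<volname>-client-N changelog counters
getfattr -d -m . -e hex /bricks/b1/test2/mb1-30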

Actual results:
===============
One file was truncated from 41943040 bytes to 31981568 bytes.

Comment 4 Nithya Balachandran 2016-07-21 13:16:09 UTC
Assigning this to Pranith as it is dependent on having some AFR fixes.