Bug 987422

Summary: Rebalance : NFS mount : few files missing on mount point after rebalance process
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: senaik
Component: distribute Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED DEFERRED QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high Docs Contact:
Priority: high    
Version: 2.1 CC: nbalacha, nsathyan, rhs-bugs, rwheeler, spalai, surs, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: dht-file-access
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Renaming files while a rebalance or remove-brick operation is running might lead to files missing from mount points. Consequence: A few renamed files are not listed on the mount, so applications might see failures. The files remain available on the backend. Fix: RCA in progress. Result:
Story Points: ---
Clone Of:
: 1286059 Environment:
Last Closed: 2015-11-27 10:25:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 969298, 1130888, 1138395, 1139998, 1140348, 1146895    
Bug Blocks: 1286059    

Description senaik 2013-07-23 11:15:14 UTC
Description of problem:
=========================
Renaming some files while the rebalance process was running resulted in a few files missing on the mount point after the rebalance completed.

Version-Release number of selected component (if applicable):
============================================================= 
3.4.0.12rhs.beta5-2.el6rhs.x86_64


How reproducible:
=================== 

Steps to Reproduce:
=================== 
1. Create a distribute volume with 4 bricks and start it

2. NFS-mount the volume and create some files
for i in {1..500} ; do dd if=/dev/urandom of=f"$i" bs=10M count=1; done

3. Calculate the arequal checksum on the mount point before starting rebalance

[root@RHEL6 vOL5]# /opt/qa/tools/arequal-checksum /mnt/vOL5/

Entry counts
Regular files   : 500
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 501

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : e9e46cb6bd7afeb5e1695e1113e67b5
Directories     : 30312a00
Symbolic links  : 0
Other           : 0
Total           : e7f2f9579c75b300

4. Add brick and start rebalance 

5. While rebalance is running, rename some files

Node          Rebalanced-files  size    scanned  failures  status       run time in secs
-----------   ----------------  ------  -------  --------  -----------  ----------------
localhost     6                 60.0MB  7        0         in progress  3.00
10.70.34.88   0                 0Bytes  163      0         in progress  3.00
10.70.34.86   0                 0Bytes  144      0         in progress  3.00
10.70.34.87   0                 0Bytes  166      0         in progress  3.00

volume rebalance: Vol5: success:
On the mount point : 
-------------------- 
for i in {11..400} ; do mv f"$i" files"$i" ; done

6. After rebalance is completed, calculate the arequal checksum on the mount point

[root@RHEL6 vOL5]# /opt/qa/tools/arequal-checksum /mnt/vOL5/

Entry counts
Regular files   : 496
Directories     : 1
Symbolic links  : 0
Other           : 0
Total           : 497

Metadata checksums
Regular files   : 3e9
Directories     : 24d74c
Symbolic links  : 3e9
Other           : 3e9

Checksums
Regular files   : a372879cf5eadea220d2d1b5c0cd3dcc
Directories     : 3c0b060000302f00
Symbolic links  : 0
Other           : 0
Total           : bfab50293517cc6e

The regular file count changed from 500 to 496 after the rebalance process.

Files missing on mount point :
--------------------------------- 
files193 files244 files233 files248
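
The missing names can be found by comparing the expected post-rename listing with what the mount point actually shows. Below is a minimal Python sketch, assuming the file names from the steps above (f1 to f500, with f11 to f400 renamed to files11 to files400). The helper names expected_names and missing_on_mount are illustrative and not part of any Gluster or QA tooling.

```python
import re


def expected_names(total=500, renamed=range(11, 401)):
    """Names expected on the mount after `mv f$i files$i` for each i in renamed."""
    renamed = set(renamed)
    return {
        ("files%d" if i in renamed else "f%d") % i
        for i in range(1, total + 1)
    }


def missing_on_mount(listing, total=500, renamed=range(11, 401)):
    """Return expected names that are absent from `listing`, ordered by number."""
    missing = expected_names(total, renamed) - set(listing)
    # Sort by the numeric part of the name (f12 -> 12, files193 -> 193).
    return sorted(missing, key=lambda name: int(re.sub(r"\D", "", name)))
```

On the client, the listing could come from os.listdir("/mnt/vOL5"). For the run in this report the result should be the four names listed above.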

Actual results:
================
A few files are missing on the mount point after the rebalance process

Expected results:
================== 
No files should be missing on the mount point; the regular file count should remain 500 after the rebalance process.


Additional info:
================= 
 gluster v i Vol5
 
Volume Name: Vol5
Type: Distribute
Volume ID: 73357b50-2e5d-4902-8564-d9f09404da46
Status: Started
Number of Bricks: 6
Transport-type: tcp
Bricks:
Brick1: 10.70.34.85:/rhs/brick1/F1
Brick2: 10.70.34.86:/rhs/brick1/F2
Brick3: 10.70.34.87:/rhs/brick1/F3
Brick4: 10.70.34.88:/rhs/brick1/F4
Brick5: 10.70.34.85:/rhs/brick1/F5
Brick6: 10.70.34.86:/rhs/brick1/F6

Comment 4 shishir gowda 2013-07-23 12:44:54 UTC
Could you please provide a dump of all the xattrs (from all the bricks) of the files that are missing from the mount?

Comment 5 senaik 2013-07-24 06:23:53 UTC
[root@kori brick1]#  ls -l */files193

---------T. 2 root root        0 Jul 23 13:30 F4/files193

# file: F4/files193
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0xf07281d9849348a9926f810cb96deef1
trusted.glusterfs.dht.linkto=0x566f6c352d636c69656e742d3000

============================================================== 
[root@boost brick1]# ls -l */files244

---------T. 2 root root        0 Jul 23 15:53 F1/files244

# file: F1/files244
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x0159a793be6744599f4206c39996925b
trusted.glusterfs.dht.linkto=0x566f6c352d636c69656e742d3400

================================================================== 
[root@boost brick1]# ls -l */files233

---------T. 2 root root        0 Jul 23 15:53 F1/files233
# file: F1/files233
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x948eebd63ef14baabd51846517f7f223
trusted.glusterfs.dht.linkto=0x566f6c352d636c69656e742d3400


===================================================================== 
[root@boost brick1]# ls -l */files248
---------T. 2 root root        0 Jul 23 15:53 F1/files248

# file: F1/files248
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.gfid=0x6082534e6ea5492793b64f076df87c65
trusted.glusterfs.dht.linkto=0x566f6c352d636c69656e742d3400

====================================================================
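
Every entry above has mode ---------T and size 0, which together with the trusted.glusterfs.dht.linkto xattr is the signature of a DHT link (pointer) file rather than a data file. Below is a minimal Python sketch of that mode-and-size check. The helper name looks_like_dht_linkto is illustrative, and the check is only a heuristic because it does not read the xattr itself.

```python
import stat


def looks_like_dht_linkto(mode, size):
    """Heuristic: regular file, only the sticky bit set, zero length.

    `mode` and `size` are st_mode and st_size from os.lstat() on a brick path.
    """
    return (
        stat.S_ISREG(mode)
        and stat.S_IMODE(mode) == stat.S_ISVTX  # permissions are exactly ---------T
        and size == 0
    )
```

A fuller check would also confirm that trusted.glusterfs.dht.linkto is present on the file. Reading xattrs in the trusted namespace requires root on the brick host.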

Comment 6 shishir gowda 2013-07-24 07:00:44 UTC
Please dump the linkto xattrs as text.

Comment 7 senaik 2013-07-24 07:25:25 UTC
The dht.linkto value is the same across all the missing files.

Below is the value in text format:
Vol5-client-4
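
The hex values in the xattr dump from comment 5 can also be converted to text offline. Below is a minimal Python sketch; the helper name decode_xattr_hex is illustrative. Decoded this way, the dumps for files244, files233 and files248 read Vol5-client-4, while the dump for files193 (the value ending in 3000) reads Vol5-client-0.

```python
def decode_xattr_hex(value):
    """Decode a hex xattr value ("0x...") to text, dropping trailing NUL bytes."""
    if not value.lower().startswith("0x"):
        raise ValueError("expected a value starting with 0x")
    return bytes.fromhex(value[2:]).rstrip(b"\x00").decode("utf-8")


# linkto values copied from the dump in comment 5
print(decode_xattr_hex("0x566f6c352d636c69656e742d3000"))  # files193
print(decode_xattr_hex("0x566f6c352d636c69656e742d3400"))  # files244, files233, files248
```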

Comment 8 Amar Tumballi 2013-07-30 11:53:51 UTC
When a rebalance (or remove-brick start) operation is in progress and a 'rename' (i.e., the mv command) is performed on files that are being rebalanced, certain race conditions cause this particular bug.

This bug has existed since day 0 of the rebalance process and is not a regression. Development needs time to find the root cause and then take appropriate action. Hence, requesting that the 'blocker' flag be taken down (please set it back if this comment is not sufficient justification).

Comment 11 Susant Kumar Palai 2015-11-27 10:25:37 UTC
Cloning this bug in 3.1. To be fixed in a future release.