Bug 1115428

Summary: [DHT-REBALANCE]: Few files are missing after add-brick and rebalance
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: shylesh <shmohan>
Component: distribute Assignee: Raghavendra G <rgowdapp>
Status: CLOSED ERRATA QA Contact: shylesh <shmohan>
Severity: high Docs Contact:
Priority: high    
Version: rhgs-3.0 CC: lmohanty, nbalacha, nsathyan, rhs-bugs, ssamanta, vagarwal
Target Milestone: ---   
Target Release: RHGS 3.0.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.6.0.25-1 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1116236 (view as bug list) Environment:
Last Closed: 2014-09-22 19:43:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1116236    
Attachments:
Description Flags
rebalance logs none

Description shylesh 2014-07-02 10:51:20 UTC
Created attachment 914098
rebalance logs

Description of problem:
Created a few hidden files; add-brick followed by rebalance causes some of the files to go missing from the mount point.

Version-Release number of selected component (if applicable):

3.6.0.22-1.el6rhs.x86_64

How reproducible:
Not reproducible manually; only through automation.

Steps to Reproduce:
1. Create a 2-brick distribute volume.
2. Create some hidden files on the mount point.
3. Add one more brick and rebalance.

Actual results:
One file is missing from the mount point.

 
From rebalance logs
===================
From node-0
===========

[2014-07-01 10:14:29.139736] I [dht-common.c:1113:dht_lookup_everywhere_cbk] 0-testvol-dht: deleting stale linkfile /hidden/.16_hidden on testvol-client-2




From node-2
===========
[2014-07-01 10:14:29.144045] I [dht-rebalance.c:823:dht_migrate_file] 0-testvol-dht: /hidden/.16_hidden: attempting to move from testvol-client-0 to testvol-client-2
[2014-07-01 10:14:29.166731] I [MSGID: 109022] [dht-rebalance.c:1067:dht_migrate_file] 0-testvol-dht: completed migration of /hidden/.16_hidden from subvolume testvol-client-0 to testvol-client-2




Attaching the complete logs.

Comment 2 Vivek Agarwal 2014-07-03 06:15:04 UTC
Per discussion, marking it as a blocker for Denali.

Comment 4 Raghavendra G 2014-07-03 15:24:19 UTC
Currently, lookup-everywhere deletes any linkfile it finds. To avoid deleting linkfiles that are under migration, it checks the number of open fds and deletes the file only if the count is zero. However, this check is not foolproof, and results can vary because of a race condition. Consider the following scenario:

1. Rebalance process p1's lookup-everywhere returns success on a file.
2. Rebalance process p2 identifies the file for migration and initiates migration, opening an fd on the dst node.
3. p1 goes ahead with deleting the file, since it is a linkfile AND there are no open fds.
4. p2 completes migration without any errors, since the fd was opened before p1 deleted the file.
5. Although we do a lookup on the file after migration, the result is logged only at DEBUG log level, so it does not appear in the logs attached here. Also, the current code doesn't treat the lookup failure as a rebalance failure.
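
The interleaving in steps 1-5 can be replayed deterministically with a toy model. This is only a sketch, not GlusterFS code: `ToyBrick` and `replay_race` are hypothetical names, the bricks are plain in-memory dictionaries, and it assumes p1's fd-count check happens before p2 opens its fd while the unlink happens after.

```python
class ToyBrick:
    """Hypothetical in-memory stand-in for one brick; not GlusterFS code."""

    def __init__(self):
        self.names = {}      # path -> "linkfile" or "data"
        self.open_fds = {}   # path -> count of open fds

    def open_fd(self, path):
        self.open_fds[path] = self.open_fds.get(path, 0) + 1

    def fd_count(self, path):
        return self.open_fds.get(path, 0)

    def unlink(self, path):
        # Removes the name only; an already-open fd keeps working (POSIX-like).
        self.names.pop(path, None)


def replay_race(path="/hidden/.16_hidden"):
    src, dst = ToyBrick(), ToyBrick()
    src.names[path] = "data"       # real file on the source brick
    dst.names[path] = "linkfile"   # linkfile on the destination brick
    events = []

    # Step 1 (assumed to include the fd-count check): p1 sees a linkfile, no fds.
    p1_may_delete = dst.names.get(path) == "linkfile" and dst.fd_count(path) == 0
    events.append("p1: lookup succeeded")

    # Step 2: p2 starts migration and opens an fd on the destination.
    dst.open_fd(path)
    events.append("p2: opened fd on dst")

    # Step 3: p1 acts on its now-stale check and unlinks the linkfile.
    if p1_may_delete:
        dst.unlink(path)
        events.append("p1: deleted linkfile")

    # Step 4: p2 finishes through its fd and removes the source copy.
    src.unlink(path)
    events.append("p2: migration completed")

    # Step 5: a lookup after migration finds the name on neither brick.
    visible = path in src.names or path in dst.names
    return events, visible


events, visible = replay_race()
print(events)
print("visible after migration:", visible)
```

In this model neither process sees an error, yet the name is gone from both bricks at the end, which matches the symptom of a file missing from the mount point.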

The correct fix should make the check for the open-fd count and the unlink of the linkfile a single atomic operation.
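
As an illustration of what "atomic" means here, the sketch below performs the fd-count check and the unlink inside one critical section. It is not the patch that shipped for this bug; `LockedBrick` and its methods are hypothetical, and a single in-process lock stands in for whatever serialization the brick would actually provide.

```python
import threading


class LockedBrick:
    """Hypothetical brick model where check-and-unlink is one atomic step."""

    def __init__(self):
        self._lock = threading.Lock()
        self._names = set()     # paths that currently exist on this brick
        self._open_fds = {}     # path -> count of open fds

    def create_linkfile(self, path):
        with self._lock:
            self._names.add(path)

    def exists(self, path):
        with self._lock:
            return path in self._names

    def open_fd(self, path):
        # Opening fails loudly if the linkfile was already removed.
        with self._lock:
            if path not in self._names:
                raise FileNotFoundError(path)
            self._open_fds[path] = self._open_fds.get(path, 0) + 1

    def unlink_if_unused(self, path):
        # The fd-count check and the unlink share one critical section,
        # so no open_fd() can slip in between them.
        with self._lock:
            if self._open_fds.get(path, 0) > 0:
                return False
            if path not in self._names:
                return False
            self._names.remove(path)
            return True


brick = LockedBrick()
brick.create_linkfile("/hidden/.16_hidden")
brick.open_fd("/hidden/.16_hidden")                   # migration got there first
print(brick.unlink_if_unused("/hidden/.16_hidden"))   # False: linkfile is kept
```

With this shape, either the migration's open wins and the linkfile survives, or the unlink wins and the later open raises an error that the migrating process can report, instead of completing silently.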

Comment 6 shylesh 2014-09-19 12:19:58 UTC
Not seen on the latest build glusterfs-3.6.0.28-1

Comment 8 errata-xmlrpc 2014-09-22 19:43:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html