Created attachment 914098
rebalance logs

Description of problem:
Created a few hidden files. add-brick followed by rebalance causes some of the files to go missing from the mount point.

Version-Release number of selected component (if applicable):
3.6.0.22-1.el6rhs.x86_64

How reproducible:
Not reproducible manually, only through automation.

Steps to Reproduce:
1. Create a 2-brick distribute volume.
2. Create some hidden files on the mount point.
3. Add one more brick and rebalance.

Actual results:
One file is missing from the mount point.

From the rebalance logs
===============

From Node-0
================
[2014-07-01 10:14:29.139736] I [dht-common.c:1113:dht_lookup_everywhere_cbk] 0-testvol-dht: deleting stale linkfile /hidden/.16_hidden on testvol-client-2

From node-2
===========
[2014-07-01 10:14:29.144045] I [dht-rebalance.c:823:dht_migrate_file] 0-testvol-dht: /hidden/.16_hidden: attempting to move from testvol-client-0 to testvol-client-2
[2014-07-01 10:14:29.166731] I [MSGID: 109022] [dht-rebalance.c:1067:dht_migrate_file] 0-testvol-dht: completed migration of /hidden/.16_hidden from subvolume testvol-client-0 to testvol-client-2

Attaching the complete logs.
Per discussion, marking this as a blocker for Denali.
Currently, lookup-everywhere deletes any linkfile it finds. To avoid deleting linkfiles that are under migration, it checks the number of open fds and deletes the file only if the count is zero. However, even this check is not foolproof: the check and the unlink are separate steps, so the result can vary due to a race condition. Consider the following scenario:

1. Rebalance process p1: lookup-everywhere returns success on a file.
2. Rebalance process p2 identifies the file for migration and initiates migration, opening an fd on the dst node.
3. p1 goes ahead and deletes the file, since it is a linkfile AND its check found no open fds.
4. p2 completes migration without any errors, since its fd was opened before p1 deleted the file.
5. Although we do a lookup on the file after migration, the result is logged at DEBUG log-level, so it does not appear in the logs attached here. Also, the current code doesn't treat the lookup failure as a rebalance failure.

The correct fix should make the open-fd count check and the unlink of the linkfile a single atomic operation.
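To illustrate the race described in this comment, here is a minimal sketch in Python that replays the interleaving deterministically against a toy in-memory model. It is not GlusterFS code: `ToyBrick`, `unlink_if_unopened`, and both replay functions are hypothetical names invented for this example, and the lock only stands in for whatever mechanism a real fix would use to make the fd-count check and the unlink atomic.

```python
import threading


class ToyBrick:
    """Hypothetical stand-in for the destination brick (not GlusterFS code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.files = set()   # names currently present on the brick
        self.open_fds = {}   # path -> number of open fds

    def create(self, path):
        with self._lock:
            self.files.add(path)

    def open_fd(self, path):
        with self._lock:
            if path not in self.files:
                raise FileNotFoundError(path)
            self.open_fds[path] = self.open_fds.get(path, 0) + 1

    def fd_count(self, path):
        with self._lock:
            return self.open_fds.get(path, 0)

    def unlink(self, path):
        # Removes the name only; an fd opened earlier stays usable,
        # which is why the migrating process sees no error.
        with self._lock:
            self.files.discard(path)

    def unlink_if_unopened(self, path):
        # Check and unlink under one lock: the atomic variant.
        with self._lock:
            if self.open_fds.get(path, 0) > 0:
                return False
            self.files.discard(path)
            return True


def racy_replay():
    """Replay steps 1-4 with a separate check and unlink."""
    brick = ToyBrick()
    path = "/hidden/.16_hidden"
    brick.create(path)                         # linkfile on the dst brick
    p1_saw_no_fds = brick.fd_count(path) == 0  # p1 checks the fd count
    brick.open_fd(path)                        # p2 opens an fd for migration
    if p1_saw_no_fds:
        brick.unlink(path)                     # p1 acts on a stale check
    return path in brick.files                 # is the file still visible?


def atomic_replay():
    """Same ordering, but p1 uses the combined check-and-unlink."""
    brick = ToyBrick()
    path = "/hidden/.16_hidden"
    brick.create(path)
    brick.open_fd(path)                        # p2 opens an fd for migration
    brick.unlink_if_unopened(path)             # p1's unlink is refused
    return path in brick.files


if __name__ == "__main__":
    print("racy: file visible after migration:", racy_replay())
    print("atomic: file visible after migration:", atomic_replay())
```

In the racy replay the name is gone even though migration "succeeded"; in the atomic replay the unlink is refused because the fd count is re-read under the same lock that guards the unlink.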
Not seen on the latest build, glusterfs-3.6.0.28-1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHEA-2014-1278.html