Bug 1734251

Summary: Files inaccessible if one rebalance process is killed in a multinode volume
Product: [Community] GlusterFS
Reporter: Nithya Balachandran <nbalacha>
Component: distribute
Assignee: Barak Sason Rofman <bsasonro>
Status: CLOSED UPSTREAM
Severity: unspecified
Priority: unspecified
Version: 6
CC: bugs, spalai
Hardware: Unspecified   
OS: Unspecified   
Clone Of: 1711764
Last Closed: 2020-03-12 12:56:46 UTC
Bug Depends On: 1711764    
Bug Blocks: 1714124    

Description Nithya Balachandran 2019-07-30 04:56:58 UTC
+++ This bug was initially created as a clone of Bug #1711764 +++

Description of problem:

This is a consequence of https://review.gluster.org/#/c/glusterfs/+/17239/ and lookup-optimize being enabled.


Rebalance directory processing steps on each node:

1. Set new layout on directory without the commit hash
2. List the files on the local subvol and migrate those that fall into this process's bucket. A lookup is performed on a file only if the process determines that it is the one responsible for migrating it.
3. When done, update the layout on the local subvol with the layout containing the commit hash.

When multiple rebalance processes work on the same directory, they finish at different times, so one process can update the layout with the commit hash before the others have finished listing and migrating their files.
Clients then see a layout that appears complete before all files have been looked up according to the new layout. With lookup-optimize enabled, a client that sees the commit hash trusts the layout and does not search the other subvols, so access to files that have not yet been processed fails.
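
The race described above can be illustrated with a small toy model. This is only a sketch, not GlusterFS code: the hash function, the node and subvol names, the commit hash value and the single shared layout are all assumptions made for illustration, and owner() is merely a stand-in for gf_defrag_should_i_migrate.

```python
NODES = ["node-a", "node-b"]
OLD_SUBVOLS = ["sv0", "sv1"]
NEW_SUBVOLS = ["sv0", "sv1", "sv2"]
VOLUME_COMMIT_HASH = 2  # hypothetical value for the current rebalance


def toy_hash(name):
    """Deterministic stand-in for the DHT filename hash (not the real one)."""
    return sum(name.encode())


def hashed_subvol(name, subvols):
    """Subvol a file name hashes to under the given layout."""
    return subvols[toy_hash(name) % len(subvols)]


def owner(name):
    """Stand-in for gf_defrag_should_i_migrate: the node that handles a file."""
    return NODES[(toy_hash(name) // 3) % len(NODES)]


def new_directory(files):
    """Directory under the old layout; each file sits on its old hashed subvol."""
    return {
        "layout": OLD_SUBVOLS,
        "commit_hash": None,
        "found_on": {f: {hashed_subvol(f, OLD_SUBVOLS)} for f in files},
    }


def rebalance_process(directory, node):
    """One node's pass over the directory, following steps 1 to 3."""
    directory["layout"] = NEW_SUBVOLS  # step 1: new layout, no commit hash
    for name in directory["found_on"]:
        if owner(name) == node:  # step 2: only files this process owns
            # Lookup plus migration leaves the file on its new hashed subvol.
            directory["found_on"][name] = {hashed_subvol(name, NEW_SUBVOLS)}
    directory["commit_hash"] = VOLUME_COMMIT_HASH  # step 3


def client_stat(directory, name):
    """True if a client with lookup-optimize enabled can find the file."""
    target = hashed_subvol(name, directory["layout"])
    if target in directory["found_on"][name]:
        return True
    if directory["commit_hash"] == VOLUME_COMMIT_HASH:
        return False  # layout is trusted as complete: ENOENT, no fallback
    return bool(directory["found_on"][name])  # fallback: search every subvol


def simulate_killed_process():
    """node-a completes, node-b is killed before it processes anything."""
    files = [f"file{i}" for i in range(10)]
    directory = new_directory(files)
    rebalance_process(directory, "node-a")
    return [f for f in files if not client_stat(directory, f)]


if __name__ == "__main__":
    print(simulate_killed_process())
```

In this toy run the files that can no longer be found are exactly those assigned to node-b whose hashed subvol changed under the new layout: node-a has already written the commit hash, so the client never falls back to searching the other subvols.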

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a 2x2 volume spanning 2 nodes. Create some directories and files on it.
2. Add 2 bricks to convert it to a 3x2 volume.
3. Start a rebalance on the volume and attach gdb to one rebalance process to pause it before it starts processing the directories.
4. Allow the second rebalance process to complete, then kill the process that is paused in gdb.
5. Mount the volume and try to stat the files without listing the directories.


Actual results:

The stat fails for several files with the error:

stat: cannot stat ‘<filename>’: No such file or directory


Expected results:


Additional info:

--- Additional comment from Nithya Balachandran on 2019-05-20 05:05:30 UTC ---

The easiest solution is to have each node do the file lookups before the call to gf_defrag_should_i_migrate.


Pros:  Simple
Cons: Introduces more lookups, although the number should be roughly the same as before https://review.gluster.org/#/c/glusterfs/+/17239/
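
A rough sketch of why the proposed ordering helps, using a toy model rather than GlusterFS code. The hash, the node and subvol names and the helper functions are hypothetical. The model assumes that a lookup issued before the commit hash is set finds the file on its old subvol and leaves a link to it on the new hashed subvol; should_i_migrate() is only a stand-in for gf_defrag_should_i_migrate.

```python
NODES = ["node-a", "node-b"]
OLD_SUBVOLS = ["sv0", "sv1"]
NEW_SUBVOLS = ["sv0", "sv1", "sv2"]


def toy_hash(name):
    """Deterministic stand-in for the DHT filename hash (not the real one)."""
    return sum(name.encode())


def hashed_subvol(name, subvols):
    """Subvol a file name hashes to under the given layout."""
    return subvols[toy_hash(name) % len(subvols)]


def should_i_migrate(name, node):
    """Stand-in for gf_defrag_should_i_migrate (hypothetical assignment rule)."""
    return NODES[(toy_hash(name) // 3) % len(NODES)] == node


def lookup(found_on, name):
    """Toy lookup under the new layout, before the commit hash is set.

    Assumption of this model: the lookup finds the file wherever it lives
    and leaves a link to it on the new hashed subvol.
    """
    found_on[name].add(hashed_subvol(name, NEW_SUBVOLS))


def process_directory(found_on, node, lookup_all):
    """One node's pass; lookup_all=True models the proposed ordering."""
    for name in found_on:
        mine = should_i_migrate(name, node)
        if lookup_all or mine:
            lookup(found_on, name)
        if mine:
            # Migration: the data now lives only on the new hashed subvol.
            found_on[name] = {hashed_subvol(name, NEW_SUBVOLS)}
    # The process would now write the layout with the commit hash.


def unreachable_after_kill(lookup_all):
    """node-a completes and sets the commit hash; node-b is killed."""
    files = [f"file{i}" for i in range(10)]
    found_on = {f: {hashed_subvol(f, OLD_SUBVOLS)} for f in files}
    process_directory(found_on, "node-a", lookup_all)
    # With the commit hash set, a lookup-optimize client checks only the
    # hashed subvol and does not search the others.
    return [f for f in files
            if hashed_subvol(f, NEW_SUBVOLS) not in found_on[f]]


if __name__ == "__main__":
    print("lookup only own files:", unreachable_after_kill(lookup_all=False))
    print("lookup all files:     ", unreachable_after_kill(lookup_all=True))
```

With lookup_all=False the model reproduces the failure for files assigned to the killed node; with lookup_all=True the surviving process has looked up every file before writing the commit hash, so none are left unreachable. The cost is the extra lookup per listed file noted under Cons.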

--- Additional comment from Worker Ant on 2019-05-20 10:01:20 UTC ---

REVIEW: https://review.gluster.org/22746 (cluster/dht: Lookup all files when processing directory) posted (#1) for review on master by N Balachandran

Comment 2 Worker Ant 2020-03-12 12:56:46 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/973, and will be tracked there from now on. Visit GitHub issues URL for further details