Bug 1711764

Summary: Files inaccessible if one rebalance process is killed in a multinode volume
Product: [Community] GlusterFS Reporter: Nithya Balachandran <nbalacha>
Component: distribute Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED NEXTRELEASE QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: mainline CC: atumball, bugs
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-7.0 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1714124 1734251 Environment:
Last Closed: 2019-07-02 03:18:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1714124, 1734251    

Description Nithya Balachandran 2019-05-20 04:54:54 UTC
Description of problem:

This is a consequence of https://review.gluster.org/#/c/glusterfs/+/17239/ and lookup-optimize being enabled.


Rebalance directory processing steps on each node:

1. Set new layout on directory without the commit hash
2. List the files on the local subvol and migrate those that fall into this process's bucket. A lookup is performed on a file only if the process determines that it is the one that must migrate that file.
3. When done, update the layout on the local subvol with the layout containing the commit hash.

When there are multiple rebalance processes processing the same directory, they finish at different times and one process can update the layout with the commit hash before the others are done listing and migrating their files.
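Clients will therefore see a complete layout even before all files have been looked up according to the new layout, causing file access to fail. With lookup-optimize enabled, a client that sees the commit hash trusts the layout and does not search the other subvols when a file is missing from its hashed subvol.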
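
The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not GlusterFS code: the modulo-based placement, the bucket function, the `Directory` class and the single `commit_hash_set` flag are all hypothetical simplifications, and both nodes are assumed to list the same subvols.

```python
NUM_OLD_SUBVOLS = 2  # distribute subvols before add-brick (2x2)
NUM_NEW_SUBVOLS = 3  # distribute subvols after add-brick (3x2)
NUM_NODES = 2


def hashed_subvol(fid):
    """Toy stand-in for the new layout: where a client expects file `fid`."""
    return fid % NUM_NEW_SUBVOLS


def migrating_node(fid):
    """Toy stand-in for the bucket check: which node must migrate `fid`."""
    return (fid // 2) % NUM_NODES


class Directory:
    def __init__(self, num_files):
        # File id -> subvol that currently holds the data (old placement).
        self.data = {fid: fid % NUM_OLD_SUBVOLS for fid in range(num_files)}
        # True once a layout containing the commit hash has been written.
        self.commit_hash_set = False


def rebalance(directory, node):
    """One rebalance process: handle only the files in this node's bucket."""
    for fid in sorted(directory.data):
        if migrating_node(fid) != node:
            continue  # not in this bucket: no lookup and no migration
        directory.data[fid] = hashed_subvol(fid)  # lookup, then migrate
    directory.commit_hash_set = True  # step 3: layout with the commit hash


def client_stat(directory, fid):
    """Client lookup with lookup-optimize enabled."""
    if directory.data[fid] == hashed_subvol(fid):
        return True
    # Miss on the hashed subvol: the client searches the other subvols
    # only if the layout does not carry the commit hash.
    return not directory.commit_hash_set


d = Directory(12)
rebalance(d, node=0)  # the process on node 1 was killed before it started
missing = [fid for fid in sorted(d.data) if not client_stat(d, fid)]
print(missing)  # [2, 3, 10, 11]
```

In this model the files left inaccessible are exactly those in the killed process's bucket whose hashed subvol changed under the new layout.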

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a 2x2 volume spanning 2 nodes. Create some directories and files on it.
2. Add 2 bricks to convert it to a 3x2 volume.
3. Start a rebalance on the volume and use gdb to break into one rebalance process before it starts processing the directories.
4. Allow the second rebalance process to complete. Kill the process that is blocked by gdb.
5. Mount the volume and try to stat the files without listing the directories.


Actual results:

The stat fails for several files with the error:

stat: cannot stat ‘<filename>’: No such file or directory


Expected results:


Additional info:

Comment 1 Nithya Balachandran 2019-05-20 05:05:30 UTC
The easiest solution is to have each node do the file lookups before the call to gf_defrag_should_i_migrate.


Pros: Simple
Cons: Introduces more lookups, but the number is roughly the same as what was seen before https://review.gluster.org/#/c/glusterfs/+/17239/
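
A rough sketch of the trade-off, using a toy model rather than the real DHT code: the `run` helper, the modulo placement and the bucket expression standing in for gf_defrag_should_i_migrate are hypothetical, and the model assumes that a lookup issued by a rebalance process leaves a linkto pointer on the new hashed subvol when the data lives elsewhere.

```python
def run(lookup_all, num_files=12, surviving_node=0):
    """Simulate the one rebalance process that completes; the other is killed.

    Returns (lookups issued, files a lookup-optimize client cannot find).
    """
    data = {fid: fid % 2 for fid in range(num_files)}  # old 2-subvol placement
    linkto = set()  # files with a linkto pointer on their new hashed subvol
    lookups = 0
    for fid in sorted(data):
        hashed = fid % 3  # toy new layout over 3 subvols
        mine = (fid // 2) % 2 == surviving_node  # toy bucket check
        if lookup_all or mine:
            # Proposed change: look up before the bucket check, so files
            # in other buckets also get a linkto on the hashed subvol.
            lookups += 1
            if data[fid] != hashed:
                linkto.add(fid)
        if mine:
            data[fid] = hashed  # migrate; the linkto is no longer needed
            linkto.discard(fid)
    # The commit hash is now set, so clients check only the hashed subvol.
    missing = [fid for fid in sorted(data)
               if data[fid] != fid % 3 and fid not in linkto]
    return lookups, missing


print(run(lookup_all=False))  # (6, [2, 3, 10, 11])
print(run(lookup_all=True))   # (12, [])
```

In this model the surviving process issues twice as many lookups with the change, and no file is left without either its data or a linkto on the hashed subvol.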

Comment 2 Worker Ant 2019-05-20 10:01:20 UTC
REVIEW: https://review.gluster.org/22746 (cluster/dht: Lookup all files when processing directory) posted (#1) for review on master by N Balachandran