Bug 1711764 - Files inaccessible if one rebalance process is killed in a multinode volume
Summary: Files inaccessible if one rebalance process is killed in a multinode volume
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nithya Balachandran
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1714124 1734251
 
Reported: 2019-05-20 04:54 UTC by Nithya Balachandran
Modified: 2019-07-30 04:56 UTC (History)

Fixed In Version: glusterfs-7.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1714124 1734251 (view as bug list)
Environment:
Last Closed: 2019-07-02 03:18:17 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Gluster.org Gerrit | 22746 | 0 | None | Open | cluster/dht: Lookup all files when processing directory | 2019-05-20 10:01:18 UTC

Description Nithya Balachandran 2019-05-20 04:54:54 UTC
Description of problem:

This is a consequence of https://review.gluster.org/#/c/glusterfs/+/17239/ and lookup-optimize being enabled.


Rebalance directory processing steps on each node:

1. Set the new layout on the directory without the commit hash.
2. List the files on the local subvol and migrate those that fall into this process's bucket. A file is looked up only if this process determines that it is the one that must migrate it.
3. When done, update the layout on the local subvol with the layout containing the commit hash.

When multiple rebalance processes are processing the same directory, they finish at different times, and one process can update the layout with the commit hash before the others have finished listing and migrating their files.
Clients will therefore see a complete layout before all files have been looked up according to the new layout, which causes file access to fail.
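
The race can be illustrated with a toy model. This is only a sketch, not GlusterFS code: the dictionaries, node names, subvol names and the "owner" field (a stand-in for the decision made by gf_defrag_should_i_migrate) are all hypothetical. It also assumes that a client with lookup-optimize enabled does not search the other subvols once the layout carries the commit hash.

```python
def make_dir():
    """Four files on an old subvol; the new layout hashes them all to subvol-2."""
    owners = [("f1", "node-a"), ("f2", "node-b"), ("f3", "node-a"), ("f4", "node-b")]
    files = {
        name: {"owner": owner, "subvol": "subvol-0",
               "new_hashed": "subvol-2", "looked_up": False}
        for name, owner in owners
    }
    return {"files": files, "commit_hash": False}


def rebalance(directory, node):
    """Steps 2 and 3 for one rebalance process (hypothetical model)."""
    for f in directory["files"].values():
        # Only files this process decides to migrate are looked up.
        if f["owner"] == node and f["subvol"] != f["new_hashed"]:
            f["looked_up"] = True             # lookup under the new layout
            f["subvol"] = f["new_hashed"]     # migrate
    directory["commit_hash"] = True           # step 3: layout now looks complete


def client_stat(directory, name):
    """Return True if a client with lookup-optimize can find the file."""
    f = directory["files"][name]
    if f["subvol"] == f["new_hashed"] or f["looked_up"]:
        return True
    # Assumption: without the commit hash the client searches every subvol;
    # with it, the client trusts the layout and reports the file as missing.
    return not directory["commit_hash"]


d = make_dir()
rebalance(d, "node-a")                        # node-b's process was killed
failures = [n for n in d["files"] if not client_stat(d, n)]
print(failures)
```

In this model the files that only the killed process would have handled are the ones that become inaccessible.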

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a 2x2 volume spanning 2 nodes. Create some directories and files on it.
2. Add 2 bricks to convert it to a 3x2 volume.
3. Start a rebalance on the volume and break into one rebalance process with gdb before it starts processing the directories.
4. Allow the second rebalance process to complete, then kill the process that is stopped in gdb.
5. Mount the volume and try to stat the files without listing the directories.
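
Step 5 can be scripted along these lines. This is a minimal sketch: the function name, the mount point and the file names in the usage note are hypothetical, and the caller has to supply the list of paths created in step 1, because the script deliberately never lists a directory.

```python
import os


def stat_without_listing(mount_point, relative_paths):
    """Stat each known path directly and return the ones that do not exist.

    No directory is listed here; every path is joined to the mount point
    and passed straight to os.stat().
    """
    missing = []
    for rel in relative_paths:
        try:
            os.stat(os.path.join(mount_point, rel))
        except FileNotFoundError:
            # Same condition as "stat: cannot stat ...: No such file or directory"
            missing.append(rel)
    return missing
```

For example, stat_without_listing("/mnt/glustervol", ["dir1/file1", "dir1/file2"]) would return the subset of those paths that the client cannot find.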


Actual results:

The stat fails for several files with the error:

stat: cannot stat ‘<filename>’: No such file or directory


Expected results:


Additional info:

Comment 1 Nithya Balachandran 2019-05-20 05:05:30 UTC
The easiest solution is to have each node do the file lookups before the call to gf_defrag_should_i_migrate.


Pros: Simple
Cons: Introduces more lookups, but the number is about the same as what was seen before https://review.gluster.org/#/c/glusterfs/+/17239/
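
The change in ordering, and its cost in lookups, can be sketched as follows. This is not the actual patch: process_directory and its callable parameters are hypothetical stand-ins, with should_i_migrate playing the role of gf_defrag_should_i_migrate.

```python
def process_directory(local_files, should_i_migrate, lookup, migrate, lookup_all_first):
    """Process one directory listing and return the number of lookups issued."""
    lookups = 0
    for name in local_files:
        if lookup_all_first:
            # Proposed order: look up every listed file before deciding.
            lookup(name)
            lookups += 1
        if not should_i_migrate(name):
            continue
        if not lookup_all_first:
            # Current order: look up only the files this process migrates.
            lookup(name)
            lookups += 1
        migrate(name)
    return lookups


local_files = ["f1", "f2", "f3", "f4"]
mine = {"f1", "f3"}                      # files this process would migrate

seen_current, seen_proposed = set(), set()
n_current = process_directory(local_files, mine.__contains__,
                              seen_current.add, lambda name: None, False)
n_proposed = process_directory(local_files, mine.__contains__,
                               seen_proposed.add, lambda name: None, True)
print(n_current, sorted(seen_current))
print(n_proposed, sorted(seen_proposed))
```

With the proposed order, every file in the local listing is looked up regardless of which process ends up migrating it, at the cost of one lookup per listed file.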

Comment 2 Worker Ant 2019-05-20 10:01:20 UTC
REVIEW: https://review.gluster.org/22746 (cluster/dht: Lookup all files when processing directory) posted (#1) for review on master by N Balachandran

